

DuckDB is an in-process analytical database designed to provide high-performance data analytics directly embedded within applications. Unlike traditional client-server databases, DuckDB runs entirely within the host process, eliminating network overhead and simplifying deployment.

In-Process Architecture

DuckDB is an embedded database system, similar to SQLite but optimized for analytical workloads (OLAP) rather than transactional workloads (OLTP). This means:
  • No separate server process: DuckDB runs directly in your application’s process space
  • Zero-copy integration: Direct memory access to data structures without serialization
  • Single-file storage: Database stored in a single file (or in-memory)
  • Thread-safe: Multiple threads can query the same database concurrently
// Example: opening a DuckDB database via the C API
#include "duckdb.h"

duckdb_database db;
duckdb_connection con;

// Opens the single-file database, creating it if it does not exist
if (duckdb_open("my_database.duckdb", &db) == DuckDBError) { /* handle error */ }
if (duckdb_connect(db, &con) == DuckDBError) { /* handle error */ }

// ... run queries ...

duckdb_disconnect(&con);
duckdb_close(&db);

Core Components

DuckDB’s architecture consists of several key components that work together to process queries efficiently:

Parser

Transforms SQL text into an Abstract Syntax Tree (AST)

Planner

Converts AST into a logical query plan

Optimizer

Optimizes the logical plan for efficient execution

Execution Engine

Executes the physical query plan using vectorized processing

Query Flow

When you execute a SQL query in DuckDB, it flows through the following stages:

Parser

Location: src/parser/
The parser is the entry point for all SQL queries. DuckDB uses a modified version of the PostgreSQL parser (libpg_query) to ensure broad SQL compatibility. Key responsibilities:
  • Tokenize SQL text
  • Build Abstract Syntax Tree (AST)
  • Perform initial syntax validation
  • Transform PostgreSQL AST into DuckDB’s internal representation
The parser produces statements, expressions, and table references that are passed to the planner.
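To make the tokenize-and-build-AST step concrete, here is a toy sketch in C++ (an illustration only, not DuckDB's libpg_query-based parser) that parses a tiny `col + col` expression grammar into an AST:

```cpp
#include <cctype>
#include <memory>
#include <string>

// Toy AST: a node is either a column reference or an addition.
// This is a simplified illustration, not DuckDB's parse tree.
struct AstNode {
    std::string kind;                     // "column" or "add"
    std::string name;                     // column name, for "column" nodes
    std::unique_ptr<AstNode> left, right; // children, for "add" nodes
};

// Tokenize as we go: read one identifier, skipping leading whitespace.
static std::string next_ident(const std::string &sql, size_t &pos) {
    while (pos < sql.size() && std::isspace((unsigned char)sql[pos])) pos++;
    size_t start = pos;
    while (pos < sql.size() && std::isalnum((unsigned char)sql[pos])) pos++;
    return sql.substr(start, pos - start);
}

// Parse "ident (+ ident)*" into a left-deep AST.
std::unique_ptr<AstNode> parse_expr(const std::string &sql) {
    size_t pos = 0;
    auto node = std::make_unique<AstNode>();
    node->kind = "column";
    node->name = next_ident(sql, pos);
    for (;;) {
        while (pos < sql.size() && std::isspace((unsigned char)sql[pos])) pos++;
        if (pos >= sql.size() || sql[pos] != '+') break;
        pos++; // consume '+'
        auto add = std::make_unique<AstNode>();
        add->kind = "add";
        add->left = std::move(node);
        add->right = std::make_unique<AstNode>();
        add->right->kind = "column";
        add->right->name = next_ident(sql, pos);
        node = std::move(add);
    }
    return node;
}
```

A real parser additionally handles precedence, literals, and error reporting; the point here is only the shape of the text-to-tree transformation.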

Planner

Location: src/planner/
The planner converts the parsed AST into a Logical Query Plan - a tree of LogicalOperator nodes representing the query’s semantic meaning. Key responsibilities:
  • Resolve table and column names via the Catalog
  • Type checking and inference
  • Bind expressions to actual table columns
  • Convert parsed structures to logical operators
The planner uses the Binder component to resolve symbols (table names, column names, functions) against the database catalog.
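As a sketch of what binding means (hypothetical names, not DuckDB's actual Binder API), resolving a column reference amounts to looking the name up in the catalog and replacing it with a bound positional index:

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy catalog: table name -> ordered list of column names.
using Catalog = std::map<std::string, std::vector<std::string>>;

// Resolve a column reference to its positional index within the table,
// raising a binder-style error when the name cannot be resolved.
size_t bind_column(const Catalog &catalog, const std::string &table,
                   const std::string &column) {
    auto entry = catalog.find(table);
    if (entry == catalog.end()) {
        throw std::runtime_error("Catalog Error: unknown table " + table);
    }
    const std::vector<std::string> &columns = entry->second;
    for (size_t i = 0; i < columns.size(); i++) {
        if (columns[i] == column) {
            return i; // the logical plan now refers to this index, not the name
        }
    }
    throw std::runtime_error("Binder Error: unknown column " + column);
}
```

After binding, later stages never deal with raw names; they operate on resolved references whose types and storage locations are known.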

Optimizer

Location: src/optimizer/
The optimizer transforms the logical plan into a more efficient equivalent plan. DuckDB employs both rule-based and cost-based optimization strategies. Key optimizations include:
  • Constant folding: 2 + 3 → 5
  • Arithmetic simplification: x * 1 → x
  • Case simplification
  • Common subexpression elimination (CSE)
  • Filter pushdown: Moves filter predicates as close to the data source as possible to reduce the amount of data processed.
-- Both filters pushed down to the table scan
SELECT * FROM users 
WHERE age > 18 AND country = 'US'
  • Join order optimization: Determines the optimal join order using cost-based analysis. Implemented in src/optimizer/join_order/.
  • Projection pushdown: Removes unused columns early in the query plan to reduce memory usage and I/O.
  • Statistics propagation: Uses table and column statistics to make informed optimization decisions.
From src/optimizer/optimizer.cpp:49-78, the optimizer applies multiple optimization rules:
Optimizer::Optimizer(Binder &binder, ClientContext &context) : context(context), binder(binder), rewriter(context) {
    rewriter.rules.push_back(make_uniq<ConstantFoldingRule>(rewriter));
    rewriter.rules.push_back(make_uniq<DistributivityRule>(rewriter));
    rewriter.rules.push_back(make_uniq<ArithmeticSimplificationRule>(rewriter));
    rewriter.rules.push_back(make_uniq<CaseSimplificationRule>(rewriter));
    rewriter.rules.push_back(make_uniq<ConjunctionSimplificationRule>(rewriter));
    // ... and many more
}
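As a minimal illustration of what a rule like ConstantFoldingRule accomplishes (a toy sketch, not the actual rule implementation), consider folding constant additions in an expression tree bottom-up:

```cpp
#include <memory>

// Toy expression tree: a node is either a constant or an addition.
struct Expr {
    bool is_constant;
    int value = 0;                     // set when is_constant
    std::unique_ptr<Expr> left, right; // set for additions
};

std::unique_ptr<Expr> make_constant(int value) {
    auto e = std::make_unique<Expr>();
    e->is_constant = true;
    e->value = value;
    return e;
}

std::unique_ptr<Expr> make_add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->is_constant = false;
    e->left = std::move(l);
    e->right = std::move(r);
    return e;
}

// Constant folding: rewrite add(const, const) -> const, applied bottom-up
// so that nested constant subtrees collapse, e.g. (2 + 3) + x -> 5 + x.
std::unique_ptr<Expr> fold_constants(std::unique_ptr<Expr> e) {
    if (e->is_constant) return e;
    e->left = fold_constants(std::move(e->left));
    e->right = fold_constants(std::move(e->right));
    if (e->left->is_constant && e->right->is_constant) {
        return make_constant(e->left->value + e->right->value);
    }
    return e;
}
```

The rewriter applies its registered rules repeatedly until no rule matches, so folding one subtree can enable further simplifications elsewhere in the plan.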

Execution Engine

Location: src/execution/
The execution layer converts the optimized logical plan into a Physical Query Plan consisting of PhysicalOperator nodes, then executes it using a push-based vectorized execution model. Key characteristics:
  • Vectorized processing: Operates on batches of rows (default 2048 rows) for CPU efficiency
  • Push-based model: Data flows from operators to their parents
  • Parallel execution: Automatic parallelization across multiple threads
  • Pipelined execution: Minimizes materialization of intermediate results
See Query Execution for detailed information on vectorized execution.

Catalog

Location: src/catalog/
The catalog manages database metadata:
  • Tables and their schemas
  • Views
  • Indexes
  • User-defined functions
  • Sequences
  • Schemas and databases
The catalog is used by the planner’s binder to resolve symbols during query planning.

Storage Layer

Location: src/storage/
The storage component manages physical data on disk and in memory:
  • Columnar storage format: Optimized for analytical queries
  • Single-file database: All data in one .duckdb file
  • Buffer manager: Intelligent memory management and caching
  • Compression: Multiple compression algorithms per column
  • ACID compliance: Full transaction support with Write-Ahead Log (WAL)
See Storage System for detailed information.
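To illustrate why the columnar layout matters for analytics (a schematic sketch, not DuckDB's on-disk format), an aggregate over one column touches only that column's contiguous array:

```cpp
#include <cstdint>
#include <vector>

// Schematic columnar table: each column lives in its own contiguous array.
// A row-oriented layout would interleave all three fields per row instead.
struct ColumnarTable {
    std::vector<int64_t> id;
    std::vector<int64_t> age;
    std::vector<int64_t> balance;
};

// SELECT sum(age) FROM t -- reads only the age column, sequentially,
// leaving id and balance untouched on disk and out of cache.
int64_t sum_age(const ColumnarTable &t) {
    int64_t total = 0;
    for (int64_t age : t.age) {
        total += age;
    }
    return total;
}
```

The same layout is what makes per-column compression effective: values within one column are homogeneous and often highly compressible.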

Transaction Management

Location: src/transaction/
DuckDB provides full ACID transaction support:
  • Snapshot isolation: Each transaction sees a consistent snapshot of the database
  • Multi-Version Concurrency Control (MVCC): Readers don’t block writers
  • Write-Ahead Logging (WAL): Ensures durability and crash recovery
From src/transaction/duck_transaction_manager.cpp:36-78, each transaction receives unique identifiers:
DuckTransactionManager::DuckTransactionManager(AttachedDatabase &db) : TransactionManager(db) {
    // Start timestamp starts at two
    current_start_timestamp = 2;
    // Transaction ID starts very high
    current_transaction_id = TRANSACTION_ID_START;
    // ...
}
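The high starting value for transaction IDs enables a simple visibility test under snapshot isolation. The following is a conceptual sketch, not DuckDB's actual MVCC code, and the constant used is an assumed value for illustration: because in-flight writes are tagged with a transaction ID far above any commit timestamp, an ordinary timestamp comparison automatically hides other transactions' uncommitted changes:

```cpp
#include <cstdint>

// Conceptual MVCC version tag: commit timestamps are small numbers
// (starting at 2), while in-flight writes are tagged with the writer's
// transaction ID, which starts very high (2^62 here is an assumed
// illustrative value, not DuckDB's actual constant).
constexpr uint64_t TRANSACTION_ID_START = 1ULL << 62;

// A row version is visible to a transaction if it was committed at or
// before the transaction's start timestamp, or written by the transaction itself.
bool version_visible(uint64_t version_id, uint64_t start_timestamp, uint64_t txn_id) {
    return version_id == txn_id || version_id <= start_timestamp;
}
```

Because every uncommitted version ID exceeds every start timestamp, readers skip other transactions' in-flight writes without any locking, which is how readers avoid blocking writers.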

Summary

DuckDB’s in-process architecture provides several advantages:

Performance

No network overhead, zero-copy data access, and efficient memory usage

Simplicity

Single file deployment, no server administration, embedded in applications

Portability

Runs anywhere your application runs, cross-platform support

SQL Compatibility

PostgreSQL-compatible SQL with advanced analytical features

The modular architecture allows each component to focus on its specific task while maintaining clean interfaces between stages of query processing.
