DuckDB is an in-process analytical database designed to provide high-performance data analytics directly embedded within applications. Unlike traditional client-server databases, DuckDB runs entirely within the host process, eliminating network overhead and simplifying deployment.
Documentation Index
Fetch the complete documentation index at: https://mintlify.com/duckdb/duckdb/llms.txt
Use this file to discover all available pages before exploring further.
In-Process Architecture
DuckDB is an embedded database system, similar to SQLite but optimized for analytical workloads (OLAP) rather than transactional workloads (OLTP). This means:
- No separate server process: DuckDB runs directly in your application’s process space
- Zero-copy integration: Direct memory access to data structures without serialization
- Single-file storage: Database stored in a single file (or in-memory)
- Thread-safe: Multiple threads can query the same database concurrently
Core Components
DuckDB’s architecture consists of several key components that work together to process queries efficiently:
Parser
Transforms SQL text into an Abstract Syntax Tree (AST)
Planner
Converts AST into a logical query plan
Optimizer
Optimizes the logical plan for efficient execution
Execution Engine
Executes the physical query plan using vectorized processing
Query Flow
When you execute a SQL query in DuckDB, it flows through the following stages:
Parser
Location: src/parser/
The parser is the entry point for all SQL queries. DuckDB uses a modified version of the PostgreSQL parser (libpg_query) to ensure broad SQL compatibility.
Key responsibilities:
- Tokenize SQL text
- Build Abstract Syntax Tree (AST)
- Perform initial syntax validation
- Transform PostgreSQL AST into DuckDB’s internal representation
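To make the tokenize-then-build-AST steps concrete, here is a deliberately toy sketch of the idea. DuckDB’s actual parser is the C-based libpg_query, not this; the function names and the dict-shaped AST are illustrative assumptions only.

```python
# Toy sketch of the parse step: tokenize SQL text, then build a
# minimal AST for "SELECT <cols> FROM <table>". Not DuckDB's parser.
import re

def tokenize(sql):
    # Split SQL text into keyword/identifier/number/symbol tokens.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", sql)

def parse_select(sql):
    # Build a minimal AST (a plain dict) from the token stream.
    tokens = tokenize(sql)
    assert tokens[0].upper() == "SELECT", "syntax error"
    from_idx = [t.upper() for t in tokens].index("FROM")
    cols = [t for t in tokens[1:from_idx] if t != ","]
    return {"type": "select", "columns": cols, "table": tokens[from_idx + 1]}

ast = parse_select("SELECT id, name FROM users")
# ast == {'type': 'select', 'columns': ['id', 'name'], 'table': 'users'}
```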
Planner
Location: src/planner/
The planner converts the parsed AST into a Logical Query Plan - a tree of LogicalOperator nodes representing the query’s semantic meaning.
Key responsibilities:
- Resolve table and column names via the Catalog
- Type checking and inference
- Bind expressions to actual table columns
- Convert parsed structures to logical operators
The planner uses the Binder component to resolve symbols (table names, column names, functions) against the database catalog.
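The binding step can be sketched as a lookup against catalog metadata. This is a conceptual illustration only; DuckDB’s real Binder is considerably richer, and the dictionary catalog and error messages here are invented for the example.

```python
# Conceptual sketch of binding: resolve a (table, column) reference
# against catalog metadata, or fail with a binder-style error.
CATALOG = {"users": {"id": "BIGINT", "name": "VARCHAR"}}

def bind_column(table, column):
    # Resolve the table name first, then the column within it.
    if table not in CATALOG:
        raise KeyError(f"Table '{table}' does not exist")
    schema = CATALOG[table]
    if column not in schema:
        raise KeyError(f"Column '{column}' not found in table '{table}'")
    # A bound reference carries its resolved type, enabling type checking.
    return {"table": table, "column": column, "type": schema[column]}

bound = bind_column("users", "name")
# bound == {'table': 'users', 'column': 'name', 'type': 'VARCHAR'}
```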
Optimizer
Location: src/optimizer/
The optimizer transforms the logical plan into a more efficient equivalent plan. DuckDB employs both rule-based and cost-based optimization strategies.
Key optimizations include:
Expression Rewriting
- Constant folding: 2 + 3 → 5
- Arithmetic simplification: x * 1 → x
- Case simplification
- Common subexpression elimination (CSE)
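The first two rewrites above can be sketched as a recursive pass over a small expression tree. This is an illustrative toy, not DuckDB’s rewrite rules (which live under src/optimizer/); the tuple encoding of expressions is an assumption of the example.

```python
# Conceptual sketch of expression rewriting: constant folding and
# arithmetic simplification over a tiny expression tree.
def fold(expr):
    # expr is a number, a column name (str), or a tuple (op, left, right).
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return {"+": left + right, "*": left * right}[op]  # 2 + 3 -> 5
    if op == "*" and right == 1:
        return left  # x * 1 -> x
    return (op, left, right)

folded = fold(("*", ("+", 2, 3), 1))   # (2 + 3) * 1  ->  5
simplified = fold(("*", "x", 1))       # x * 1        ->  "x"
```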
Filter Pushdown
Moves filter predicates as close to the data source as possible to reduce the amount of data processed.
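A toy sketch of why pushdown helps: applying a predicate before a join means the join processes fewer rows, while the result is unchanged. The tables and predicate here are invented for illustration.

```python
# Conceptual sketch of filter pushdown: filter-before-join vs
# filter-after-join produce the same rows, but the pushed-down plan
# feeds far less data into the join.
orders = [{"oid": 1, "user": 1, "amount": 10},
          {"oid": 2, "user": 2, "amount": 500}]
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

def join(left, right):
    # Simple nested-loop join on orders.user = users.id.
    return [{**l, **r} for l in left for r in right if l["user"] == r["id"]]

# Naive plan: join everything, then filter.
naive = [row for row in join(orders, users) if row["amount"] > 100]
# Pushed-down plan: filter orders first, then join the survivors.
pushed = join([o for o in orders if o["amount"] > 100], users)
```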
Join Ordering
Determines optimal join order using cost-based analysis. Implemented in src/optimizer/join_order/.
Column Pruning
Removes unused columns early in the query plan to reduce memory usage and I/O.
Statistics Propagation
Uses table and column statistics to make informed optimization decisions.
In src/optimizer/optimizer.cpp:49-78, the optimizer applies these optimization rules in sequence.
Execution Engine
Location: src/execution/
The execution layer converts the optimized logical plan into a Physical Query Plan consisting of PhysicalOperator nodes, then executes it using a push-based vectorized execution model.
Key characteristics:
- Vectorized processing: Operates on batches of rows (default 2048 rows) for CPU efficiency
- Push-based model: Data flows from operators to their parents
- Parallel execution: Automatic parallelization across multiple threads
- Pipelined execution: Minimizes materialization of intermediate results
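The vectorized, push-style flow can be sketched with generators that pass batches between operators. This is a conceptual illustration only, not DuckDB’s C++ operator interface; only the default batch size of 2048 comes from the text above.

```python
# Conceptual sketch of vectorized execution: operators exchange
# fixed-size batches of values instead of single rows.
VECTOR_SIZE = 2048  # DuckDB's default vector size

def scan(column):
    # Source operator: emit the column in vector-sized chunks.
    for i in range(0, len(column), VECTOR_SIZE):
        yield column[i:i + VECTOR_SIZE]

def multiply(batches, factor):
    # Vectorized projection: one tight loop per batch, which is far
    # friendlier to the CPU than a per-row virtual call.
    for batch in batches:
        yield [v * factor for v in batch]

data = list(range(5000))  # 5000 rows -> batches of 2048, 2048, 904
result = [v for batch in multiply(scan(data), 2) for v in batch]
```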
Catalog
Location: src/catalog/
The catalog manages database metadata:
- Tables and their schemas
- Views
- Indexes
- User-defined functions
- Sequences
- Schemas and databases
Storage Layer
Location: src/storage/
The storage component manages physical data on disk and in memory:
- Columnar storage format: Optimized for analytical queries
- Single-file database: All data in one .duckdb file
- Buffer manager: Intelligent memory management and caching
- Compression: Multiple compression algorithms per column
- ACID compliance: Full transaction support with Write-Ahead Log (WAL)
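The benefit of the columnar format can be sketched by contrasting row-oriented and column-oriented layouts of the same data. The table is invented for illustration; DuckDB’s actual on-disk format is more sophisticated (compressed row groups, not plain Python lists).

```python
# Conceptual sketch of columnar storage: each column is stored
# contiguously, so an analytical query touches only the columns it needs.
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)]

# Row-oriented -> column-oriented ("struct of arrays") layout.
columns = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# SUM(price) scans one contiguous array; the id and name columns
# are never read at all.
total = sum(columns["price"])
```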
Transaction Management
Location: src/transaction/
DuckDB provides full ACID transaction support:
- Snapshot isolation: Each transaction sees a consistent snapshot of the database
- Multi-Version Concurrency Control (MVCC): Readers don’t block writers
- Write-Ahead Logging (WAL): Ensures durability and crash recovery
In src/transaction/duck_transaction_manager.cpp:36-78, each transaction receives unique identifiers.
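Snapshot reads under MVCC can be sketched as version lists filtered by a transaction’s snapshot identifier. This is a conceptual toy, not DuckDB’s implementation; the version encoding and identifiers are assumptions of the example.

```python
# Conceptual sketch of MVCC snapshot isolation: each key holds
# (commit_id, value) versions, and a transaction sees only versions
# committed at or before its snapshot.
versions = {"balance": [(1, 100), (5, 250)]}  # sorted by commit_id

def read(key, snapshot_id):
    # Return the newest version visible to this snapshot.
    visible = [v for cid, v in versions[key] if cid <= snapshot_id]
    return visible[-1]

old_txn = read("balance", snapshot_id=3)  # began before commit 5 -> 100
new_txn = read("balance", snapshot_id=7)  # began after commit 5  -> 250
```

Because old_txn keeps seeing its snapshot while new_txn sees the later commit, readers never block writers.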
Summary
DuckDB’s in-process architecture provides several advantages:
Performance
No network overhead, zero-copy data access, and efficient memory usage
Simplicity
Single file deployment, no server administration, embedded in applications
Portability
Runs anywhere your application runs, cross-platform support
SQL Compatibility
PostgreSQL-compatible SQL with advanced analytical features
Next Steps
- Learn about Data Types in DuckDB
- Explore the Storage System
- Understand Query Execution