
DuckDB uses a vectorized execution model combined with push-based pipelining and automatic parallelization to achieve high performance on analytical queries. This execution strategy is fundamentally different from traditional row-at-a-time execution.

Vectorized Execution

Location: src/execution/

What is Vectorization?

Instead of processing one row at a time, DuckDB processes data in vectors - batches of rows stored in columnar format. This approach provides significant performance benefits.

Traditional Row-at-a-Time:
for each row {
    value = row.column;
    result = function(value);
    output.append(result);
}
// Many function calls, poor CPU cache usage
Vectorized Processing:
for each vector {
    values[] = vector.column;  // Array of 2048 values
    results[] = function(values[]);  // Process entire array
    output.append(results[]);
}
// Fewer function calls, better CPU cache usage, SIMD-friendly
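The contrast can be made concrete with a runnable sketch. This is plain Python with a made-up `add_one` function, counting invocations to show the amortization; real DuckDB vectors are typed C++ arrays, not Python lists:

```python
# Sketch only: counts function invocations to show amortized call overhead.
# add_one / add_one_batch are hypothetical stand-ins for a scalar function.

calls = {"row": 0, "vector": 0}

def add_one(x):
    calls["row"] += 1          # one call per row
    return x + 1

def add_one_batch(vec):
    calls["vector"] += 1       # one call per vector of up to 2048 values
    return [v + 1 for v in vec]

data = list(range(5000))

row_result = [add_one(x) for x in data]

VECTOR_SIZE = 2048
vec_result = []
for i in range(0, len(data), VECTOR_SIZE):
    vec_result.extend(add_one_batch(data[i:i + VECTOR_SIZE]))

assert row_result == vec_result
print(calls)   # {'row': 5000, 'vector': 3}
```

5,000 rows cost 5,000 calls row-at-a-time but only 3 calls vectorized; inside each batch call, a tight loop over a contiguous array is what makes SIMD and cache-friendly access possible.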

Vector Size

From src/include/duckdb/common/vector_size.hpp:15-21:
// The default standard vector size
#define DEFAULT_STANDARD_VECTOR_SIZE 2048U

// The vector size used in the execution engine
#ifndef STANDARD_VECTOR_SIZE
#define STANDARD_VECTOR_SIZE DEFAULT_STANDARD_VECTOR_SIZE
#endif
Default vector size: 2,048 rows.

This size is chosen to:
  • Fit comfortably in CPU cache (L1/L2)
  • Enable SIMD (Single Instruction Multiple Data) operations
  • Balance between batch size and memory pressure
  • Be a power of 2 for efficient bit operations
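A quick back-of-the-envelope check of the cache claim (the 32 KiB L1 data cache size is an assumption; actual sizes vary by CPU):

```python
VECTOR_SIZE = 2048            # STANDARD_VECTOR_SIZE
INT64_BYTES = 8

vector_bytes = VECTOR_SIZE * INT64_BYTES      # bytes for one int64 vector
l1_data_cache = 32 * 1024                     # assumed typical L1d size

print(vector_bytes)                           # 16384, i.e. 16 KiB
assert vector_bytes <= l1_data_cache          # one vector fits in L1
assert VECTOR_SIZE & (VECTOR_SIZE - 1) == 0   # power of 2
```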

Data Chunks

Location: src/common/types/data_chunk.cpp

Vectors are organized into DataChunks - the fundamental unit of data flow in DuckDB. From src/common/types/data_chunk.cpp:23-35:
DataChunk::DataChunk() : count(0), capacity(STANDARD_VECTOR_SIZE) {
}

void DataChunk::InitializeEmpty(const vector<LogicalType> &types) {
    D_ASSERT(data.empty());
    capacity = STANDARD_VECTOR_SIZE;
    for (idx_t i = 0; i < types.size(); i++) {
        data.emplace_back(types[i], nullptr);
    }
}
A DataChunk contains:
  • Multiple column vectors (one per column)
  • Count: Number of valid rows (≤ 2048)
  • Capacity: Maximum rows (typically 2048)
DataChunk:
  Column 0: [val₁, val₂, ..., val₂₀₄₈]  ← Vector of integers
  Column 1: [str₁, str₂, ..., str₂₀₄₈]  ← Vector of strings  
  Column 2: [ts₁,  ts₂,  ..., ts₂₀₄₈]   ← Vector of timestamps
  Count: 2048
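The same shape can be modeled in a few lines of Python. This is illustrative only; the names mirror the C++ class, but none of this is DuckDB's API:

```python
class DataChunk:
    """Columnar batch: one value list per column, one shared row count.
    Sketch only - not DuckDB's C++ class."""

    STANDARD_VECTOR_SIZE = 2048

    def __init__(self, columns):
        self.capacity = self.STANDARD_VECTOR_SIZE   # max rows per chunk
        self.count = 0                              # valid rows so far
        self.data = {name: [] for name in columns}  # one vector per column

    def append_row(self, values):
        assert self.count < self.capacity, "chunk full - start a new one"
        for name, value in zip(self.data, values):
            self.data[name].append(value)
        self.count += 1

chunk = DataChunk(["id", "name", "created_at"])
chunk.append_row([1, "alice", "2024-01-01"])
chunk.append_row([2, "bob", "2024-01-02"])
print(chunk.count, chunk.capacity)   # 2 2048
```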

Benefits of Vectorization

CPU Cache Efficiency

Consecutive memory access patterns maximize CPU cache hits

SIMD Instructions

Process multiple values with a single CPU instruction (e.g. AVX2, AVX-512)

Branch Prediction

Reduce branch mispredictions with batch-oriented logic

Function Call Overhead

Amortize function call costs across 2048 rows instead of 1

SIMD Example

Modern CPUs can process multiple values simultaneously:
// Traditional: 8 separate additions
result[0] = a[0] + b[0];
result[1] = a[1] + b[1];
// ... 6 more

// SIMD (AVX2, 256-bit registers): 8 x 32-bit additions in parallel
__m256i va = _mm256_loadu_si256((const __m256i *)a);
__m256i vb = _mm256_loadu_si256((const __m256i *)b);
__m256i vr = _mm256_add_epi32(va, vb);
_mm256_storeu_si256((__m256i *)result, vr);
DuckDB automatically uses SIMD when available, without manual intervention.

Physical Operators

Location: src/execution/physical_operator.cpp

The execution engine consists of physical operators that implement actual query operations. From src/execution/physical_operator.cpp:19-27:
PhysicalOperator::PhysicalOperator(PhysicalPlan &physical_plan, 
                                   PhysicalOperatorType type, 
                                   vector<LogicalType> types,
                                   idx_t estimated_cardinality)
    : children(physical_plan.ArenaRef()), type(type), 
      types(std::move(types)),
      estimated_cardinality(estimated_cardinality) {
}

Common Physical Operators

Operator        Description                 Example
TableScan       Read data from storage      FROM users
Filter          Apply WHERE predicates      WHERE age > 18
Projection      Select specific columns     SELECT name, email
HashJoin        Join using hash table       INNER JOIN orders ON ...
HashAggregate   GROUP BY with aggregates    GROUP BY country
Order           Sort results                ORDER BY created_at DESC
Limit           Restrict output rows        LIMIT 100
Window          Window functions            ROW_NUMBER() OVER (...)

Operator Interface

From src/execution/physical_operator.cpp:97-100:
OperatorResultType PhysicalOperator::Execute(ExecutionContext &context, 
                                             DataChunk &input, 
                                             DataChunk &chunk,
                                             GlobalOperatorState &gstate, 
                                             OperatorState &state) const {
    throw InternalException("Calling Execute on a node that is not an operator!");
}
Each operator:
  1. Takes input DataChunk (or pulls from child operators)
  2. Processes the data
  3. Produces output DataChunk
  4. Returns status (more data, finished, etc.)

Push-Based Execution Model

DuckDB uses a push-based (also called “data-centric”) execution model where data flows from operators to their parents.

Pull vs Push

Pull-Based (Volcano/Iterator):
// Parent asks for data
while (chunk = child.GetNext()) {
    process(chunk);
}
Push-Based (DuckDB):
// Child pushes data to parent
child.Execute(chunk) {
    process_data();
    parent.Execute(output_chunk);  // Push to parent
}
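A minimal push-based chain, sketched in Python with hypothetical Filter/Projection/Sink operators. DuckDB's real operators exchange DataChunks and return status codes rather than calling the parent directly, but the flow of data upward through the pipeline is the same idea:

```python
class Operator:
    """Sketch of a push-based operator: each node pushes output to its parent."""
    def __init__(self, parent=None):
        self.parent = parent

    def push(self, chunk):
        raise NotImplementedError

class Filter(Operator):
    def __init__(self, predicate, parent):
        super().__init__(parent)
        self.predicate = predicate

    def push(self, chunk):
        kept = [row for row in chunk if self.predicate(row)]
        if kept:
            self.parent.push(kept)       # push surviving rows upward

class Projection(Operator):
    def __init__(self, columns, parent):
        super().__init__(parent)
        self.columns = columns

    def push(self, chunk):
        self.parent.push([{c: row[c] for c in self.columns} for row in chunk])

class Sink(Operator):
    """Terminal operator: collects final results."""
    def __init__(self):
        super().__init__()
        self.rows = []

    def push(self, chunk):
        self.rows.extend(chunk)

# Scan -> Filter(age > 18) -> Projection(name) -> Sink, one chunk at a time
sink = Sink()
pipeline = Filter(lambda r: r["age"] > 18, Projection(["name"], sink))
for chunk in [[{"name": "ann", "age": 30}, {"name": "bob", "age": 12}],
              [{"name": "cia", "age": 45}]]:
    pipeline.push(chunk)
print(sink.rows)   # [{'name': 'ann'}, {'name': 'cia'}]
```

Each chunk is filtered and projected in a single pass with no intermediate materialization: by the time the scan produces the next chunk, the previous one has already reached the sink.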

Benefits of Push-Based

  • Operator fusion: data flows through multiple operators without materialization - a chunk can be filtered, projected, and aggregated in a single pass.
  • Cache locality: data stays hot in CPU cache as it flows through the pipeline.
  • Parallelism: multiple pipelines can run concurrently on different cores.
  • Morsel-friendly: work is divided into small chunks (morsels) that can be distributed across threads.

Query Optimization

Location: src/optimizer/

Before execution, queries are optimized through multiple passes. See Architecture for details.

Key Optimizations for Execution

1. Predicate Pushdown
-- Original query
SELECT name FROM (SELECT * FROM users) WHERE age > 30;

-- Optimized: Filter pushed to table scan
TableScan(users) WITH FILTER age > 30 → Project(name)
2. Projection Pushdown
-- Only read 'name' and 'age' columns, skip others
SELECT name FROM users WHERE age > 30;
3. Join Ordering
-- Small table (countries) as build side, large table (users) as probe side
SELECT * FROM users 
JOIN countries ON users.country_id = countries.id;
4. Filter Ordering
-- Apply the most selective filters first
WHERE cheap_filter(...) AND expensive_filter(...)  -- cheap one evaluated first

Parallel Execution

DuckDB automatically parallelizes query execution across available CPU cores.

Thread Count

From src/execution/physical_operator.cpp:56-73:
idx_t PhysicalOperator::EstimatedThreadCount() const {
    idx_t result = 0;
    if (children.empty()) {
        // Terminal operator (base table scan)
        // Determines degree of parallelism
        result = MaxValue<idx_t>(estimated_cardinality / (DEFAULT_ROW_GROUP_SIZE * 2), 1);
    } else if (type == PhysicalOperatorType::UNION) {
        // UNION: sum children's thread counts
        for (auto &child : children) {
            result += child.get().EstimatedThreadCount();
        }
    } else {
        // Other operators: max of children
        for (auto &child : children) {
            result = MaxValue(child.get().EstimatedThreadCount(), result);
        }
    }
    return result;
}
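The same logic transcribed into Python for readability. The operator dicts are a stand-in for the C++ operator tree, and the DEFAULT_ROW_GROUP_SIZE constant assumes DuckDB's usual 122,880-row row groups (the ~122K figure cited for morsels):

```python
DEFAULT_ROW_GROUP_SIZE = 122_880   # rows per row group (assumed constant)

def estimated_thread_count(op):
    """op: dict with 'children' (list of ops), 'type' (str), 'cardinality' (int)."""
    if not op["children"]:
        # Leaf operator (base table scan) determines the degree of parallelism
        return max(op["cardinality"] // (DEFAULT_ROW_GROUP_SIZE * 2), 1)
    if op["type"] == "UNION":
        # UNION: children run independently, so sum their thread counts
        return sum(estimated_thread_count(c) for c in op["children"])
    # Other operators: bounded by the most parallel child
    return max(estimated_thread_count(c) for c in op["children"])

scan = {"children": [], "type": "TABLE_SCAN", "cardinality": 1_000_000}
filt = {"children": [scan], "type": "FILTER", "cardinality": 500_000}
print(estimated_thread_count(filt))   # 4
```

A 1M-row scan yields 1,000,000 / (122,880 * 2) = 4 threads: parallelism is driven by how much base data there is to scan, not by the operators above it.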

Morsel-Driven Parallelism

Work is divided into morsels (typically row groups of ~122K rows) that can be processed independently.
Table with 1M rows → 8 row groups

Thread 1: Process row groups 0, 1
Thread 2: Process row groups 2, 3
Thread 3: Process row groups 4, 5  
Thread 4: Process row groups 6, 7

→ All results combined
Benefits:
  • Load balancing: Fast threads can pick up more morsels
  • No synchronization overhead during processing
  • Cache-friendly: Each thread works on contiguous data
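Morsel scheduling can be sketched with a thread pool. This is pure Python with an arbitrary morsel size and a made-up per-morsel filter count; note that CPython's GIL means the threads here illustrate scheduling, not true CPU parallelism:

```python
from concurrent.futures import ThreadPoolExecutor

MORSEL_SIZE = 250_000   # stand-in for a row group (~122K rows in DuckDB)

def process_morsel(morsel):
    # Per-thread work: count rows matching a filter, no shared state
    return sum(1 for v in morsel if v % 2 == 0)

table = list(range(1_000_000))
morsels = [table[i:i + MORSEL_SIZE] for i in range(0, len(table), MORSEL_SIZE)]

# Idle workers pick up the next unprocessed morsel (load balancing for free)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_morsel, morsels))

print(sum(partials))   # 500000 - partial results combined at the end
```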

Configuring Parallelism

-- Set number of threads
SET threads = 8;

-- Check current setting
SELECT current_setting('threads');

-- Disable parallelism (for debugging)
SET threads = 1;

Pipeline Execution

Operators are organized into pipelines - sequences of operators that can run without materializing intermediate results.

Pipeline Breakers

Some operators require materialization and break the pipeline:
Operator               Why It Breaks the Pipeline
HashJoin (build side)  Must build the complete hash table before probing
HashAggregate          Must see all rows before finalizing groups
Order                  Must collect all rows before sorting
Window (some)          Depends on the window frame specification
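What makes an operator a pipeline breaker is the split between consuming input and producing output. A hypothetical Python sketch (not DuckDB's interface) of a hash aggregate shows it:

```python
class HashAggregate:
    """Pipeline breaker sketch: consumes every input chunk (sink phase)
    before it can emit any results (finalize phase)."""

    def __init__(self):
        self.groups = {}

    def sink(self, chunk):
        # Sink phase: runs once per incoming chunk, emits nothing
        for key in chunk:
            self.groups[key] = self.groups.get(key, 0) + 1

    def finalize(self):
        # Only correct after ALL input has been sunk - a count emitted
        # earlier could still grow when later chunks arrive
        return sorted(self.groups.items())

agg = HashAggregate()
for chunk in [["us", "de", "us"], ["fr", "us"]]:
    agg.sink(chunk)
print(agg.finalize())   # [('de', 1), ('fr', 1), ('us', 3)]
```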

Example Query Execution

SELECT country, COUNT(*) as user_count
FROM users
WHERE age > 18
GROUP BY country
ORDER BY user_count DESC
LIMIT 10;
Execution plan:
1. TableScan(users)
   ↓ [Pipeline 1]
2. Filter(age > 18)
   ↓
3. HashAggregate(GROUP BY country)   ← Pipeline breaker
   ↓ [Pipeline 2]
4. Order(user_count DESC)            ← Pipeline breaker
   ↓ [Pipeline 3]
5. Limit(10)
Pipeline 1: Scan → Filter → Aggregate (streaming)
  • Each vector flows from scan through filter to aggregator
  • No intermediate materialization
  • Parallel: Multiple threads scan different row groups
Pipeline 2: Aggregate → Order
  • Wait for all aggregation to complete
  • Stream results to sort
Pipeline 3: Order → Limit
  • Sort completes, stream sorted results
  • Stop after 10 rows (early termination)

Adaptive Execution

Location: src/execution/adaptive_filter.cpp

DuckDB adapts execution strategies based on runtime statistics:

Adaptive Filtering

WHERE expensive_function(x) > 10 AND cheap_column = 'value'
The executor tracks filter selectivity and may reorder predicates dynamically:
  • If cheap_column = 'value' eliminates 99% of rows → evaluate it first
  • If both are selective → use cost-based ordering
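The idea can be sketched as selectivity tracking over a predicate list. This is a simplification; DuckDB's actual AdaptiveFilter implementation also weighs evaluation cost and works differently in detail:

```python
class AdaptiveFilter:
    """Sketch: track per-predicate pass rates, put the most selective first."""

    def __init__(self, predicates):
        self.predicates = list(predicates)
        self.seen = [0] * len(predicates)
        self.passed = [0] * len(predicates)

    def matches(self, row):
        for i, pred in enumerate(self.predicates):
            self.seen[i] += 1
            if pred(row):
                self.passed[i] += 1
            else:
                return False            # short-circuit on first failing predicate
        return True

    def reorder(self):
        # Lower pass rate = more selective = evaluate earlier
        def rate(i):
            return self.passed[i] / max(self.seen[i], 1)
        order = sorted(range(len(self.predicates)), key=rate)
        self.predicates = [self.predicates[i] for i in order]
        self.seen = [0] * len(self.predicates)
        self.passed = [0] * len(self.predicates)

f = AdaptiveFilter([lambda r: r["x"] > 0,          # passes almost always
                    lambda r: r["y"] == "rare"])   # passes ~1% of the time
for row in [{"x": 1, "y": "common"}] * 99 + [{"x": 1, "y": "rare"}]:
    f.matches(row)
f.reorder()   # the rarely-passing predicate now runs first
```

After reordering, most rows are rejected by the first (cheap, highly selective) check and the second predicate is rarely evaluated at all.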

Adaptive Hash Joins

Hash joins adapt based on data size:
  • Small data: In-memory hash table
  • Medium data: Partitioned hash table
  • Large data: Disk-based partitioning with spilling to disk

Performance Monitoring

EXPLAIN ANALYZE

See actual execution statistics:
EXPLAIN ANALYZE
SELECT country, COUNT(*) 
FROM users 
WHERE age > 18 
GROUP BY country;
Output includes:
  • Actual row counts vs estimates
  • Execution time per operator
  • Memory usage
  • Parallelism degree

Profiling

-- Enable profiling
PRAGMA enable_profiling;
PRAGMA profiling_output = 'query_profile.json';

-- Run query
SELECT ...;

-- Profiling data written to file

Execution Best Practices

1. Filter Early

Apply WHERE clauses to reduce data volume as early as possible.
-- Good: the filter reduces large_table before the join
SELECT * FROM large_table
JOIN small_table ON ...
WHERE large_table.date > '2026-01-01';
2. Project Only Needed Columns

Avoid SELECT * - specify only required columns.
-- Bad: Reads all 50 columns
SELECT * FROM wide_table WHERE id = 1;

-- Good: Reads only 2 columns
SELECT name, email FROM wide_table WHERE id = 1;
3. Use Appropriate Joins

Consider join order and types based on table sizes.
-- Small table (build side) first in hash join
SELECT * FROM small_table 
JOIN large_table ON ...;
4. Leverage Statistics

DuckDB keeps min/max statistics per row group and uses them to skip row groups at scan time - loading data roughly ordered on frequently filtered columns makes this pruning effective.
5. Tune Thread Count

More threads ≠ faster. Optimal thread count depends on data size and query.
SET threads = 4;  -- For typical analytical queries

Execution Summary

Vectorized

Process 2,048 rows at a time for CPU efficiency

Push-Based

Data flows through pipelines without materialization

Parallel

Automatic parallelization across CPU cores

Adaptive

Runtime adaptation based on actual data characteristics
