The query executor is the runtime engine of PostgreSQL. It receives a plan tree
from the optimizer and produces result tuples by recursively pulling data through
a pipeline of interconnected plan nodes. Every SQL statement – whether a simple
SELECT, an INSERT ... SELECT, or a complex analytical query with window
functions – ultimately runs through the same executor framework.
After the planner builds an optimal PlannedStmt, the executor takes over.
Execution proceeds in four phases, each exposed through a hook-able interface:
| Phase | Entry Point | Purpose |
|---|---|---|
| Start | ExecutorStart() | Build the PlanState tree, open relations, allocate memory |
| Run | ExecutorRun() | Pull tuples through the plan tree via ExecutePlan() |
| Finish | ExecutorFinish() | Fire AFTER triggers, run post-processing |
| End | ExecutorEnd() | Close relations, free resources |
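Each entry point checks a hook variable before falling through to the corresponding standard_ExecutorXxx() implementation, which is how extensions such as pg_stat_statements observe every query. A minimal sketch of the pattern, assuming the four-argument ExecutorRun signature shown later on this page (it varies across major versions); the function names prefixed with my_ are invented for illustration:

```c
#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

/* Chain to any previously installed hook so extensions stack cleanly. */
static ExecutorRun_hook_type prev_ExecutorRun = NULL;

static void
my_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
               uint64 count, bool execute_once)
{
    /* ... e.g. note the start time or bump a counter here ... */
    if (prev_ExecutorRun)
        prev_ExecutorRun(queryDesc, direction, count, execute_once);
    else
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
}

void
_PG_init(void)
{
    prev_ExecutorRun = ExecutorRun_hook;
    ExecutorRun_hook = my_ExecutorRun;
}
```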
The heart of the executor is the Volcano iterator model: every plan node
exposes an ExecProcNode function pointer that, when called, returns the next
tuple (or NULL to signal completion). The root node calls its children, which
call their children, forming a demand-driven data pipeline.
```
        ExecutorRun
             |
        ExecutePlan
             |
      ExecProcNode (root)
        /          \
ExecProcNode    ExecProcNode
   (scan)          (join)
                  /        \
          ExecProcNode   ExecProcNode
             (scan)         (sort)
```
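The control flow is easy to model outside PostgreSQL. The toy sketch below (all names invented for illustration, not the real executor API) captures the essence of the pull model: every node is just a next() callback plus child pointers, and the driver loops until next() reports exhaustion:

```c
#include <stddef.h>

/* Toy Volcano-model node: a next() callback plus child pointers. */
typedef struct Tuple Tuple;          /* opaque stand-in for a real tuple   */
typedef struct Node Node;

struct Node {
    Tuple *(*next)(Node *self);      /* returns next tuple, NULL when done */
    Node   *left;                    /* outer (left) input, if any         */
    Node   *right;                   /* inner (right) input, if any        */
};

/* Demand-driven driver: the root pulls, and its children pull in turn. */
static void
run(Node *root)
{
    Tuple *t;
    while ((t = root->next(root)) != NULL)
    {
        /* hand t to the destination: client, trigger queue, ... */
        (void) t;
    }
}
```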
**Plan vs. PlanState.** The planner produces immutable Plan nodes. During
ExecutorStart, ExecInitNode() walks the plan tree and creates a parallel
tree of mutable PlanState nodes that carry runtime state – open scan
descriptors, hash tables, sort states, expression evaluation machinery.
**TupleTableSlots.** Tuples flow between nodes as TupleTableSlot references.
Slots are a uniform abstraction over heap tuples, minimal tuples, and virtual
tuples (column arrays). This avoids copying data unnecessarily.
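For instance, a node that computes a row typically fills a virtual slot in place rather than forming a heap tuple. A sketch of that idiom, assuming a slot whose tuple descriptor has an int4 first column (the function name return_answer is hypothetical):

```c
#include "postgres.h"
#include "executor/tuptable.h"

/* Fill and return a one-column virtual tuple; no heap tuple is formed
 * unless a downstream node materializes the slot later. */
static TupleTableSlot *
return_answer(TupleTableSlot *slot)
{
    ExecClearTuple(slot);                    /* drop the slot's old contents */
    slot->tts_values[0] = Int32GetDatum(42); /* per-column Datum array       */
    slot->tts_isnull[0] = false;             /* per-column null flags        */
    return ExecStoreVirtualTuple(slot);      /* mark the slot as valid       */
}
```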
**Expression evaluation.** Qual filters, projection lists, and computed columns
are compiled into flat ExprEvalStep arrays during initialization and executed
via a fast interpreted dispatch loop (or JIT-compiled native code).
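The toy interpreter below (invented opcodes, not the real ExprEvalStep set) shows why the flat layout is fast: evaluation is one tight loop over a contiguous opcode array instead of a recursive walk over an expression tree:

```c
typedef enum { STEP_CONST, STEP_ADD, STEP_DONE } StepOp;

typedef struct {
    StepOp op;
    long   arg;                     /* payload for STEP_CONST */
} Step;

/* Evaluate a flattened expression: for example the array
 * {CONST 2, CONST 3, ADD, DONE} returns 5. */
static long
eval(const Step *steps)
{
    long stack[16];
    int  sp = 0;

    for (int ip = 0;; ip++)
    {
        switch (steps[ip].op)
        {
            case STEP_CONST: stack[sp++] = steps[ip].arg;          break;
            case STEP_ADD:   sp--; stack[sp - 1] += stack[sp];     break;
            case STEP_DONE:  return stack[0];
        }
    }
}
```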
**Memory management.** Each node typically has a per-tuple ExprContext whose
memory is reset between tuples, plus query-lifespan memory for hash tables and
sort state. This two-tier approach keeps memory usage predictable.
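The reset happens once per pulled tuple. A sketch of the idiom, assuming econtext and outerPlan were set up during node initialization (consume_input is a hypothetical name):

```c
#include "postgres.h"
#include "executor/executor.h"

/* Hypothetical node loop illustrating the per-tuple reset idiom. */
static void
consume_input(PlanState *outerPlan, ExprContext *econtext)
{
    for (;;)
    {
        TupleTableSlot *slot;

        ResetExprContext(econtext);      /* free last tuple's transient memory */

        slot = ExecProcNode(outerPlan);  /* pull the next input tuple */
        if (TupIsNull(slot))
            break;

        /* Quals and projections run here; anything they palloc() in the
         * per-tuple context is reclaimed wholesale by the next reset. */
    }
}
```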
| File | Purpose |
|---|---|
| src/backend/executor/execMain.c | Top-level executor interface: ExecutorStart/Run/Finish/End |
| src/backend/executor/execProcnode.c | ExecInitNode() / ExecProcNode() / ExecEndNode() dispatch |
| src/backend/executor/execScan.c | Generic scan loop shared by all scan nodes |
| src/backend/executor/execExpr.c | Expression “compilation” into ExprEvalStep arrays |
| src/backend/executor/execExprInterp.c | Interpreted expression evaluation (switch/computed-goto) |
| src/backend/executor/execTuples.c | TupleTableSlot management |
| src/backend/executor/execUtils.c | EState and ExprContext setup utilities |
| src/backend/executor/execParallel.c | Parallel query infrastructure |
| src/include/nodes/execnodes.h | All PlanState struct definitions |
| src/include/nodes/plannodes.h | All Plan struct definitions |
The global execution state for a query, shared across all plan nodes:
```c
typedef struct EState
{
    NodeTag         type;
    ScanDirection   es_direction;          /* forward or backward scan */
    Snapshot        es_snapshot;           /* visibility snapshot */
    Snapshot        es_crosscheck_snapshot;
    List           *es_range_table;        /* RT entries */
    Index           es_range_table_size;
    Relation       *es_relations;          /* opened relations array */
    ...
    PlannedStmt    *es_plannedstmt;        /* the plan being executed */
    ParamListInfo   es_param_list_info;    /* external parameters */
    ParamExecData  *es_param_exec_vals;    /* internal parameters */
    MemoryContext   es_query_cxt;          /* query-lifespan memory */
    List           *es_tupleTable;         /* TupleTableSlots */
    ...
} EState;
```
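Code that drives the executor machinery by hand (trigger support, SPI, and so on) obtains this structure through helpers in execUtils.c. A brief sketch of the lifecycle (the wrapper function is hypothetical):

```c
#include "postgres.h"
#include "executor/executor.h"

/* Hypothetical standalone use of the executor's memory plumbing. */
static void
evaluate_something(void)
{
    EState      *estate = CreateExecutorState();        /* fresh es_query_cxt */
    ExprContext *econtext = GetPerTupleExprContext(estate);

    /* ... evaluate expressions, one ResetExprContext(econtext) per row ... */
    (void) econtext;

    FreeExecutorState(estate);  /* destroys es_query_cxt and everything in it */
}
```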
Every plan node's runtime state begins with this common header; node-specific fields follow in each XxxState:

```c
typedef struct PlanState
{
    NodeTag           type;
    Plan             *plan;               /* associated Plan node */
    EState           *state;              /* link to per-query EState */
    ExecProcNodeMtd   ExecProcNode;       /* function to get next tuple */
    ExecProcNodeMtd   ExecProcNodeReal;   /* actual impl (ExecProcNode may wrap) */
    Instrumentation  *instrument;         /* optional EXPLAIN ANALYZE stats */
    ExprState        *qual;               /* qual condition */
    struct PlanState *lefttree;           /* outer (left) input */
    struct PlanState *righttree;          /* inner (right) input */
    TupleTableSlot   *ps_ResultTupleSlot; /* result slot */
    ExprContext      *ps_ExprContext;     /* expression eval context */
    ...
} PlanState;
```
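Node-specific states embed this header as their first member, so generic dispatch code can treat any XxxState* as a PlanState*. An abbreviated sketch of the pattern from execnodes.h (the real structs carry many more fields):

```c
typedef struct ScanState
{
    PlanState        ps;                  /* first field: generic header */
    Relation         ss_currentRelation;  /* relation being scanned      */
    TupleTableSlot  *ss_ScanTupleSlot;    /* slot for the current tuple  */
    /* ... */
} ScanState;

typedef struct SeqScanState
{
    ScanState ss;                         /* embeds PlanState via ScanState */
    /* ... */
} SeqScanState;
```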
```
ExecutorStart(queryDesc, eflags)
  --> InitPlan(queryDesc, eflags)
        --> ExecInitNode(plannedstmt->planTree, estate, eflags)
              --> switch (nodeTag(node))
                    T_SeqScan  --> ExecInitSeqScan()
                    T_HashJoin --> ExecInitHashJoin()
                    T_Sort     --> ExecInitSort()
                    ...
              --> recursively init children
              --> set result->ExecProcNode = ExecXxx
```
ExecInitNode() is a large switch statement in execProcnode.c that dispatches
to the appropriate ExecInitXxx() function for each of the ~40 node types. Each
init function:
- allocates its XxxState struct,
- compiles expressions with ExecInitExpr(),
- recursively calls ExecInitNode() on its child plans, and
- sets the ExecProcNode function pointer.

The run phase then drives the tree from the top:

```
ExecutorRun(queryDesc, direction, count, execute_once)
  --> ExecutePlan(...)
        loop:
          slot = ExecProcNode(planstate)   /* pull one tuple */
          if TupIsNull(slot) break
          dest->receiveSlot(slot, dest)    /* send to client */
          count--
```
The first call to ExecProcNode() goes through ExecProcNodeFirst(), which
sets up instrumentation if needed, then replaces itself with the real function
pointer for subsequent calls. This avoids a branch on every tuple.
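The trick is a self-replacing function pointer; roughly what execProcnode.c does, abridged here:

```c
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    /* On the very first call, pick the permanent entry point ... */
    if (node->instrument)
        node->ExecProcNode = ExecProcNodeInstr;  /* wrapper that records stats */
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    /* ... so every later ExecProcNode(node) call skips this function. */
    return node->ExecProcNode(node);
}
```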
ExecutorFinish() fires deferred triggers and calls ExecPostprocessPlan().
ExecutorEnd() calls ExecEndNode() recursively, which closes scan descriptors,
destroys hash tables, and frees sort state.
```
Plan Nodes (~40 types)
  |
  +-- Control Nodes
  |     Result, ProjectSet, ModifyTable, Append, MergeAppend,
  |     RecursiveUnion, BitmapAnd, BitmapOr
  |
  +-- Scan Nodes
  |     SeqScan, IndexScan, IndexOnlyScan, BitmapIndexScan,
  |     BitmapHeapScan, TidScan, TidRangeScan, SubqueryScan,
  |     FunctionScan, ValuesScan, TableFuncScan, CteScan,
  |     NamedTuplestoreScan, WorkTableScan, ForeignScan, CustomScan
  |
  +-- Join Nodes
  |     NestLoop, MergeJoin, HashJoin
  |
  +-- Materialization Nodes
  |     Material, Sort, IncrementalSort, Memoize, Group, Agg,
  |     WindowAgg, Unique, Gather, GatherMerge, SetOp, LockRows,
  |     Limit, Hash
  |
  +-- DML Nodes
        ModifyTable (INSERT/UPDATE/DELETE/MERGE)
```
```
Client sends query
        |
        v
Parse --> Analyze --> Rewrite --> Plan
        |
        v
   PlannedStmt
        |
        +----------------------------+
        |                            |
        v                            v
 ExecutorStart()              QueryDesc created
        |                     (plan + snapshot +
        v                      dest receiver)
 ExecInitNode()
 (build PlanState tree)
        |
        v
 ExecutorRun()
        |
        v
 ExecutePlan() loop <-----------------+
        |                             |
        v                             |
 ExecProcNode(root)                   |
        |                             |
 tuple returned? --yes--> send to client
        |
        no
        |
        v
 ExecutorFinish()
 (after triggers)
        |
        v
 ExecutorEnd()
 (cleanup)
```
| Topic | Link |
|---|---|
| How plans are produced | Query Optimizer |
| Volcano iterator model details | Volcano Model |
| Scan node internals | Scan Nodes |
| Join algorithms | Join Nodes |
| Aggregation and window functions | Aggregation |
| Sorting and materialization | Sort and Materialize |
| Parallel query execution | Parallel Query |
| Expression compilation and JIT | Expression Evaluation |
| Buffer and storage layer | Storage Engine |
| Transaction visibility (snapshots) | Transactions |
| Memory allocation contexts | Memory Management |