The query executor is the runtime engine of PostgreSQL. It receives a plan tree
from the optimizer and produces result tuples by recursively pulling data through
a pipeline of interconnected plan nodes. Every SQL statement – whether a simple
SELECT, an INSERT ... SELECT, or a complex analytical query with window
functions – ultimately runs through the same executor framework.
After the planner builds an optimal PlannedStmt, the executor takes over.
Execution proceeds in four phases, each exposed through a hook-able interface:
| Phase | Entry Point | Purpose |
|---|---|---|
| Start | ExecutorStart() | Build the PlanState tree, open relations, allocate memory |
| Run | ExecutorRun() | Pull tuples through the plan tree via ExecutePlan() |
| Finish | ExecutorFinish() | Fire AFTER triggers, run post-processing |
| End | ExecutorEnd() | Close relations, free resources |
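Each entry point checks a hook variable before falling through to the corresponding standard_ExecutorXxx() implementation, which is how extensions such as pg_stat_statements observe every query. A minimal sketch of the pattern, assuming the four-argument ExecutorRun signature shown later on this page (it varies across major versions); the function names prefixed with my_ are invented for illustration:

```c
#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

/* Chain to any previously installed hook so extensions stack cleanly. */
static ExecutorRun_hook_type prev_ExecutorRun = NULL;

static void
my_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
               uint64 count, bool execute_once)
{
    /* ... e.g. note the start time or bump a counter here ... */
    if (prev_ExecutorRun)
        prev_ExecutorRun(queryDesc, direction, count, execute_once);
    else
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
}

void
_PG_init(void)
{
    prev_ExecutorRun = ExecutorRun_hook;
    ExecutorRun_hook = my_ExecutorRun;
}
```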
The heart of the executor is the Volcano iterator model: every plan node
exposes an ExecProcNode function pointer that, when called, returns the next
tuple (or NULL to signal completion). The root node calls its children, which
call their children, forming a demand-driven data pipeline.
```
        ExecutorRun
             |
        ExecutePlan
             |
      ExecProcNode (root)
        /          \
ExecProcNode    ExecProcNode
   (scan)          (join)
                  /        \
          ExecProcNode   ExecProcNode
             (scan)         (sort)
```
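The control flow is easy to model outside PostgreSQL. The toy sketch below (all names invented for illustration, not the real executor API) captures the essence of the pull model: every node is just a next() callback plus child pointers, and the driver loops until next() reports exhaustion:

```c
#include <stddef.h>

/* Toy Volcano-model node: a next() callback plus child pointers. */
typedef struct Tuple Tuple;          /* opaque stand-in for a real tuple   */
typedef struct Node Node;

struct Node {
    Tuple *(*next)(Node *self);      /* returns next tuple, NULL when done */
    Node   *left;                    /* outer (left) input, if any         */
    Node   *right;                   /* inner (right) input, if any        */
};

/* Demand-driven driver: the root pulls, and its children pull in turn. */
static void
run(Node *root)
{
    Tuple *t;
    while ((t = root->next(root)) != NULL)
    {
        /* hand t to the destination: client, trigger queue, ... */
        (void) t;
    }
}
```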
**Plan vs. PlanState.** The planner produces immutable Plan nodes. During
ExecutorStart, ExecInitNode() walks the plan tree and creates a parallel
tree of mutable PlanState nodes that carry runtime state – open scan
descriptors, hash tables, sort states, expression evaluation machinery.
**TupleTableSlots.** Tuples flow between nodes as TupleTableSlot references.
Slots are a uniform abstraction over heap tuples, minimal tuples, and virtual
tuples (column arrays). This avoids copying data unnecessarily.
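For instance, a node that computes a row typically fills a virtual slot in place rather than forming a heap tuple. A sketch of that idiom, assuming a slot whose tuple descriptor has an int4 first column (the function name return_answer is hypothetical):

```c
#include "postgres.h"
#include "executor/tuptable.h"

/* Fill and return a one-column virtual tuple; no heap tuple is formed
 * unless a downstream node materializes the slot later. */
static TupleTableSlot *
return_answer(TupleTableSlot *slot)
{
    ExecClearTuple(slot);                    /* drop the slot's old contents */
    slot->tts_values[0] = Int32GetDatum(42); /* per-column Datum array       */
    slot->tts_isnull[0] = false;             /* per-column null flags        */
    return ExecStoreVirtualTuple(slot);      /* mark the slot as valid       */
}
```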
**Expression evaluation.** Qual filters, projection lists, and computed columns
are compiled into flat ExprEvalStep arrays during initialization and executed
via a fast interpreted dispatch loop (or JIT-compiled native code).
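The toy interpreter below (invented opcodes, not the real ExprEvalStep set) shows why the flat layout is fast: evaluation is one tight loop over a contiguous opcode array instead of a recursive walk over an expression tree:

```c
typedef enum { STEP_CONST, STEP_ADD, STEP_DONE } StepOp;

typedef struct {
    StepOp op;
    long   arg;                     /* payload for STEP_CONST */
} Step;

/* Evaluate a flattened expression: for example the array
 * {CONST 2, CONST 3, ADD, DONE} returns 5. */
static long
eval(const Step *steps)
{
    long stack[16];
    int  sp = 0;

    for (int ip = 0;; ip++)
    {
        switch (steps[ip].op)
        {
            case STEP_CONST: stack[sp++] = steps[ip].arg;          break;
            case STEP_ADD:   sp--; stack[sp - 1] += stack[sp];     break;
            case STEP_DONE:  return stack[0];
        }
    }
}
```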
**Memory management.** Each node typically has a per-tuple ExprContext whose
memory is reset between tuples, plus query-lifespan memory for hash tables and
sort state. This two-tier approach keeps memory usage predictable.
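The reset happens once per pulled tuple. A sketch of the idiom, assuming econtext and outerPlan were set up during node initialization (consume_input is a hypothetical name):

```c
#include "postgres.h"
#include "executor/executor.h"

/* Hypothetical node loop illustrating the per-tuple reset idiom. */
static void
consume_input(PlanState *outerPlan, ExprContext *econtext)
{
    for (;;)
    {
        TupleTableSlot *slot;

        ResetExprContext(econtext);      /* free last tuple's transient memory */

        slot = ExecProcNode(outerPlan);  /* pull the next input tuple */
        if (TupIsNull(slot))
            break;

        /* Quals and projections run here; anything they palloc() in the
         * per-tuple context is reclaimed wholesale by the next reset. */
    }
}
```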
| File | Purpose |
|---|---|
| src/backend/executor/execMain.c | Top-level executor interface: ExecutorStart/Run/Finish/End |
| src/backend/executor/execProcnode.c | ExecInitNode() / ExecProcNode() / ExecEndNode() dispatch |
| src/backend/executor/execScan.c | Generic scan loop shared by all scan nodes |
| src/backend/executor/execExpr.c | Expression “compilation” into ExprEvalStep arrays |
| src/backend/executor/execExprInterp.c | Interpreted expression evaluation (switch/computed-goto) |
| src/backend/executor/execTuples.c | TupleTableSlot management |
| src/backend/executor/execUtils.c | EState and ExprContext setup utilities |
| src/backend/executor/execParallel.c | Parallel query infrastructure |
| src/include/nodes/execnodes.h | All PlanState struct definitions |
| src/include/nodes/plannodes.h | All Plan struct definitions |
The global execution state for a query, shared across all plan nodes:
```c
typedef struct EState
{
    NodeTag         type;
    ScanDirection   es_direction;          /* forward or backward scan */
    Snapshot        es_snapshot;           /* visibility snapshot */
    Snapshot        es_crosscheck_snapshot;
    List           *es_range_table;        /* RT entries */
    Index           es_range_table_size;
    Relation       *es_relations;          /* opened relations array */
    ...
    PlannedStmt    *es_plannedstmt;        /* the plan being executed */
    ParamListInfo   es_param_list_info;    /* external parameters */
    ParamExecData  *es_param_exec_vals;    /* internal parameters */
    MemoryContext   es_query_cxt;          /* query-lifespan memory */
    List           *es_tupleTable;         /* TupleTableSlots */
    ...
} EState;
```
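Code that drives the executor machinery by hand (trigger support, SPI, and so on) obtains this structure through helpers in execUtils.c. A brief sketch of the lifecycle (the wrapper function is hypothetical):

```c
#include "postgres.h"
#include "executor/executor.h"

/* Hypothetical standalone use of the executor's memory plumbing. */
static void
evaluate_something(void)
{
    EState      *estate = CreateExecutorState();        /* fresh es_query_cxt */
    ExprContext *econtext = GetPerTupleExprContext(estate);

    /* ... evaluate expressions, one ResetExprContext(econtext) per row ... */
    (void) econtext;

    FreeExecutorState(estate);  /* destroys es_query_cxt and everything in it */
}
```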
Every plan node's runtime state begins with this common header; node-specific fields follow in each XxxState:

```c
typedef struct PlanState
{
    NodeTag           type;
    Plan             *plan;               /* associated Plan node */
    EState           *state;              /* link to per-query EState */
    ExecProcNodeMtd   ExecProcNode;       /* function to get next tuple */
    ExecProcNodeMtd   ExecProcNodeReal;   /* actual impl (ExecProcNode may wrap) */
    Instrumentation  *instrument;         /* optional EXPLAIN ANALYZE stats */
    ExprState        *qual;               /* qual condition */
    struct PlanState *lefttree;           /* outer (left) input */
    struct PlanState *righttree;          /* inner (right) input */
    TupleTableSlot   *ps_ResultTupleSlot; /* result slot */
    ExprContext      *ps_ExprContext;     /* expression eval context */
    ...
} PlanState;
```
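Node-specific states embed this header as their first member, so generic dispatch code can treat any XxxState* as a PlanState*. An abbreviated sketch of the pattern from execnodes.h (the real structs carry many more fields):

```c
typedef struct ScanState
{
    PlanState        ps;                  /* first field: generic header */
    Relation         ss_currentRelation;  /* relation being scanned      */
    TupleTableSlot  *ss_ScanTupleSlot;    /* slot for the current tuple  */
    /* ... */
} ScanState;

typedef struct SeqScanState
{
    ScanState ss;                         /* embeds PlanState via ScanState */
    /* ... */
} SeqScanState;
```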
```
ExecutorStart(queryDesc, eflags)
  --> InitPlan(queryDesc, eflags)
        --> ExecInitNode(plannedstmt->planTree, estate, eflags)
              --> switch (nodeTag(node))
                    T_SeqScan  --> ExecInitSeqScan()
                    T_HashJoin --> ExecInitHashJoin()
                    T_Sort     --> ExecInitSort()
                    ...
              --> recursively init children
              --> set result->ExecProcNode = ExecXxx
```
ExecInitNode() is a large switch statement in execProcnode.c that dispatches
to the appropriate ExecInitXxx() function for each of the ~40 node types. Each
init function:
- allocates its XxxState struct,
- compiles expressions with ExecInitExpr(),
- recursively calls ExecInitNode() on its child plans, and
- sets the ExecProcNode function pointer.

The run phase then drives the tree from the top:

```
ExecutorRun(queryDesc, direction, count, execute_once)
  --> ExecutePlan(...)
        loop:
          slot = ExecProcNode(planstate)   /* pull one tuple */
          if TupIsNull(slot) break
          dest->receiveSlot(slot, dest)    /* send to client */
          count--
```
The first call to ExecProcNode() goes through ExecProcNodeFirst(), which
sets up instrumentation if needed, then replaces itself with the real function
pointer for subsequent calls. This avoids a branch on every tuple.
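The trick is a self-replacing function pointer; roughly what execProcnode.c does, abridged here:

```c
static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    /* On the very first call, pick the permanent entry point ... */
    if (node->instrument)
        node->ExecProcNode = ExecProcNodeInstr;  /* wrapper that records stats */
    else
        node->ExecProcNode = node->ExecProcNodeReal;

    /* ... so every later ExecProcNode(node) call skips this function. */
    return node->ExecProcNode(node);
}
```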
ExecutorFinish() fires deferred triggers and calls ExecPostprocessPlan().
ExecutorEnd() calls ExecEndNode() recursively, which closes scan descriptors,
destroys hash tables, and frees sort state.
```
Plan Nodes (~40 types)
  |
  +-- Control Nodes
  |     Result, ProjectSet, ModifyTable, Append, MergeAppend,
  |     RecursiveUnion, BitmapAnd, BitmapOr
  |
  +-- Scan Nodes
  |     SeqScan, IndexScan, IndexOnlyScan, BitmapIndexScan,
  |     BitmapHeapScan, TidScan, TidRangeScan, SubqueryScan,
  |     FunctionScan, ValuesScan, TableFuncScan, CteScan,
  |     NamedTuplestoreScan, WorkTableScan, ForeignScan, CustomScan
  |
  +-- Join Nodes
  |     NestLoop, MergeJoin, HashJoin
  |
  +-- Materialization Nodes
  |     Material, Sort, IncrementalSort, Memoize, Group, Agg,
  |     WindowAgg, Unique, Gather, GatherMerge, SetOp, LockRows,
  |     Limit, Hash
  |
  +-- DML Nodes
        ModifyTable (INSERT/UPDATE/DELETE/MERGE)
```
```
Client sends query
        |
        v
Parse --> Analyze --> Rewrite --> Plan
        |
        v
   PlannedStmt
        |
        +----------------------------+
        |                            |
        v                            v
 ExecutorStart()              QueryDesc created
        |                     (plan + snapshot +
        v                      dest receiver)
 ExecInitNode()
 (build PlanState tree)
        |
        v
 ExecutorRun()
        |
        v
 ExecutePlan() loop <-----------------+
        |                             |
        v                             |
 ExecProcNode(root)                   |
        |                             |
 tuple returned? --yes--> send to client
        |
        no
        |
        v
 ExecutorFinish()
 (after triggers)
        |
        v
 ExecutorEnd()
 (cleanup)
```
| Topic | Link |
|---|---|
| How plans are produced | Query Optimizer |
| Volcano iterator model details | Volcano Model |
| Scan node internals | Scan Nodes |
| Join algorithms | Join Nodes |
| Aggregation and window functions | Aggregation |
| Sorting and materialization | Sort and Materialize |
| Parallel query execution | Parallel Query |
| Expression compilation and JIT | Expression Evaluation |
| Buffer and storage layer | Storage Engine |
| Transaction visibility (snapshots) | Transactions |
| Memory allocation contexts | Memory Management |