Sorting and materialization are the key pipeline-breaking operations in the
executor. The Sort node uses the tuplesort infrastructure to perform
in-memory quicksort for small datasets and external balanced k-way merge sort
for large ones, with I/O managed by the logtape module. The Material node
buffers all input tuples for repeated access. IncrementalSort exploits
partially pre-sorted input. Memoize caches results of parameterized scans
using an LRU hash table, avoiding redundant rescans when the same parameter
values recur.
| Node | Purpose | Blocking? | Memory Behavior |
|---|---|---|---|
| Sort | Sort all input tuples by specified keys | Yes | In-memory up to work_mem, then external sort |
| IncrementalSort | Sort when leading keys are already ordered | Partially | Sorts groups of presorted rows |
| Material | Buffer all input for repeated reads | Yes | Spills to tuplestore on disk if needed |
| Memoize | Cache parameterized subplan results (LRU) | No (per lookup) | Bounded by work_mem, LRU eviction |
| File | Purpose |
|---|---|
| src/backend/executor/nodeSort.c | Sort executor node |
| src/backend/executor/nodeIncrementalSort.c | Incremental sort node |
| src/backend/executor/nodeMaterial.c | Materialize node |
| src/backend/executor/nodeMemoize.c | Memoize (result caching) node |
| src/backend/utils/sort/tuplesort.c | Core sorting engine |
| src/backend/utils/sort/tuplesortvariants.c | Sort specializations (heap, datum, etc.) |
| src/backend/utils/sort/logtape.c | Logical tape I/O for external sort |
| src/backend/utils/sort/tuplestore.c | General-purpose tuple buffer |
| src/include/utils/tuplesort.h | Tuplesortstate and public API |
| src/include/utils/tuplestore.h | Tuplestorestate and public API |
The Sort node operates in two phases:

Phase 1 (on first ExecProcNode call):

1. Read all tuples from the outer plan
2. Pass each to tuplesort_puttupleslot()
3. Call tuplesort_performsort()

Phase 2 (on subsequent calls):

- Call tuplesort_gettupleslot() to return sorted tuples one at a time
```c
/* Simplified from nodeSort.c; setup details and error paths elided */
static TupleTableSlot *
ExecSort(PlanState *pstate)
{
    SortState  *node = castNode(SortState, pstate);
    Tuplesortstate *tuplesortstate;
    TupleTableSlot *slot;
    bool        forward = ScanDirectionIsForward(node->ss.ps.state->es_direction);

    if (!node->sort_Done)
    {
        /* Phase 1: consume all input */
        tuplesortstate = tuplesort_begin_heap(..., work_mem, ...);
        node->tuplesortstate = (void *) tuplesortstate;
        for (;;)
        {
            slot = ExecProcNode(outerPlanState(node));
            if (TupIsNull(slot))
                break;
            tuplesort_puttupleslot(tuplesortstate, slot);
        }
        tuplesort_performsort(tuplesortstate);
        node->sort_Done = true;
    }
    else
        tuplesortstate = (Tuplesortstate *) node->tuplesortstate;

    /* Phase 2: return next sorted tuple */
    slot = node->ss.ps.ps_ResultTupleSlot;
    if (tuplesort_gettupleslot(tuplesortstate, forward, false, slot, NULL))
        return slot;
    return NULL;
}
```
Datum sort optimization. When the sort result has a single column, PostgreSQL uses a Datum-only sort path which avoids tuple overhead and is significantly faster for pass-by-value types.
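As a rough illustration of why the datum path is faster (this is a sketch, not the PostgreSQL code; the `Datum` typedef here is a stand-in): a single pass-by-value column can be sorted as one flat, cache-friendly array, with no per-tuple header and no pointer chasing into heap-allocated tuples.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t Datum;         /* stand-in for a pass-by-value Datum */

static int
datum_cmp(const void *a, const void *b)
{
    Datum da = *(const Datum *) a;
    Datum db = *(const Datum *) b;

    return (da > db) - (da < db);
}

/* The whole "tuple" is the datum itself: sorting touches one
 * contiguous array rather than an array of pointers to tuples. */
void
datum_sort(Datum *datums, size_t n)
{
    qsort(datums, n, sizeof(Datum), datum_cmp);
}
```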
tuplesort.c implements a sophisticated sorting engine that adapts to data
size:
```
           Input tuples
                |
        fits in work_mem?
          /           \
        yes            no
         |              |
in-memory quicksort    external sort
(or radix sort for     (multiple phases)
 pass-by-value types)
         |              |
single sorted array    sorted runs on "tapes"
         |              |
         +------+-------+
                |
    tuplesort_gettupleslot()
    returns tuples in order
```
When all tuples fit within work_mem:

1. Tuples are accumulated in the in-memory array (memtuples)
2. At tuplesort_performsort(), quicksort (or radix sort) is applied

PostgreSQL uses a comparison-based qsort_ssup (with SortSupport for
type-specific optimizations like abbreviated keys) or a radix sort for
integer/float types.
Abbreviated keys. The SortSupport mechanism can compute a fixed-size “abbreviated” representation of the first sort key that fits in a Datum. This allows most comparisons to be resolved without touching the full tuple, dramatically improving cache locality:
```c
typedef struct SortSupportData {
    Oid         ssup_collation;
    bool        ssup_reverse;
    bool        ssup_nulls_first;

    /* Comparator (works on abbreviated keys while abbreviation is active) */
    int         (*comparator) (Datum a, Datum b, SortSupport ssup);

    /* Abbreviated key support */
    bool        abbreviate;
    Datum       (*abbrev_converter) (Datum original, SortSupport ssup);
    int         (*abbrev_full_comparator) (Datum a, Datum b, SortSupport ssup);
    bool        (*abbrev_abort) (int memtupcount, SortSupport ssup);
    ...
} SortSupportData;
```
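The idea can be sketched outside PostgreSQL (names here are illustrative, not the SortSupport API): pack the first 8 bytes of a string into a big-endian integer so that most comparisons resolve with a single integer compare, falling back to the full comparator only on an abbreviated-key tie.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack up to 8 leading bytes big-endian, so integer order matches
 * lexicographic byte order on the packed prefix. */
static uint64_t
abbrev_string(const char *s)
{
    uint64_t key = 0;

    for (int i = 0; i < 8 && s[i] != '\0'; i++)
        key |= (uint64_t) (unsigned char) s[i] << (8 * (7 - i));
    return key;
}

/* Compare abbreviated keys first; ties need the authoritative
 * full comparison (here, plain strcmp). */
static int
cmp_with_abbrev(const char *a, const char *b)
{
    uint64_t ka = abbrev_string(a);
    uint64_t kb = abbrev_string(b);

    if (ka != kb)
        return ka < kb ? -1 : 1;
    return strcmp(a, b);
}
```

Strings that differ in their first 8 bytes never touch the full values, which is where the cache-locality win comes from.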
When tuples exceed work_mem:

```
Run generation:
  Fill memory --> sort --> write to tape 0
  Fill memory --> sort --> write to tape 1
  Fill memory --> sort --> write to tape 2
  ...

k-way merge:
  Tape 0: [1, 5, 9, ...]
  Tape 1: [2, 4, 8, ...]  --> merge heap --> output tape
  Tape 2: [3, 6, 7, ...]

If runs > tapes: multi-pass merge
  Pass 1: merge runs into fewer, longer runs
  Pass 2: merge again until single output
```
The number of tapes is determined by work_mem / TAPE_BUFFER_OVERHEAD, ensuring
each tape has enough read-ahead buffer to maintain sequential I/O. Since
PostgreSQL 15, a balanced merge is used instead of the older polyphase merge.
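The merge step above can be sketched with a binary min-heap of "tape heads" (a simplified stand-in for the real merge machinery, using int arrays as runs): the heap always exposes the smallest value across all runs, and popping it refills from the run it came from.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const int  *run;            /* one sorted input run */
    size_t      len, pos;       /* length and read position */
} TapeHead;

static void
sift_down(TapeHead *h, size_t n, size_t i)
{
    for (;;)
    {
        size_t l = 2 * i + 1, r = l + 1, min = i;

        if (l < n && h[l].run[h[l].pos] < h[min].run[h[min].pos]) min = l;
        if (r < n && h[r].run[h[r].pos] < h[min].run[h[min].pos]) min = r;
        if (min == i)
            return;
        TapeHead tmp = h[i]; h[i] = h[min]; h[min] = tmp;
        i = min;
    }
}

/* Merge k sorted runs into out[]; returns the number of values written. */
size_t
kway_merge(TapeHead *heads, size_t k, int *out)
{
    size_t n = 0, written = 0;

    for (size_t i = 0; i < k; i++)          /* drop empty runs */
        if (heads[i].pos < heads[i].len)
            heads[n++] = heads[i];
    for (size_t i = n; i-- > 0; )           /* heapify */
        sift_down(heads, n, i);

    while (n > 0)
    {
        out[written++] = heads[0].run[heads[0].pos++];
        if (heads[0].pos == heads[0].len)
            heads[0] = heads[--n];          /* run exhausted: drop it */
        sift_down(heads, n, 0);
    }
    return written;
}
```

Each output value costs O(log k) comparisons, which is why fewer, longer runs (and thus more merge fan-in per pass) reduce total work.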
The logtape module provides an abstraction of multiple sequential-access
“tapes” backed by a single temporary file. It handles:
- Multiplexing many logical tapes onto a single temporary file
- Tracking freed blocks so space from consumed runs can be reused
- Per-tape read buffers of roughly work_mem / num_tapes bytes
  to maintain sequential access patterns

```c
/* Logical tape abstraction */
typedef struct LogicalTapeSet {
    BufFile    *pfile;          /* underlying temp file */
    int         nBlocksWritten;

    /* Free block tracking */
    long       *freeBlocks;
    int         nFreeBlocks;
    ...
} LogicalTapeSet;

typedef struct LogicalTape {
    LogicalTapeSet *tapeSet;

    bool        writing;        /* currently writing or reading? */
    bool        frozen;         /* frozen for random access? */
    long        firstBlockNumber;

    /* Read buffer */
    char       *buffer;
    int         buffer_size;
    int         pos;            /* position in buffer */
    int         nbytes;         /* valid bytes in buffer */
    ...
} LogicalTape;
```
When input is already sorted by a prefix of the required sort keys, IncrementalSort avoids sorting the entire dataset. It groups consecutive tuples that share the same presorted key prefix and sorts only within each group:
```
Required sort:    (a, b, c)
Input sorted by:  (a)

Group 1 (a=1): sort by (b, c) --> emit sorted group
Group 2 (a=2): sort by (b, c) --> emit sorted group
Group 3 (a=3): sort by (b, c) --> emit sorted group
```
This is particularly effective with LIMIT queries, where only the first few groups may need to be sorted before enough result tuples are produced.
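The core loop is simple enough to sketch on pairs already ordered by the leading key `a` (a toy stand-in, not the executor code): find each run of equal `a`, then sort only that run by `b`.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int a, b; } Pair;

static int
cmp_b(const void *x, const void *y)
{
    const Pair *p = x, *q = y;

    return (p->b > q->b) - (p->b < q->b);
}

/* rows[] must already be sorted by the leading key `a`. */
void
incremental_sort(Pair *rows, size_t n)
{
    size_t start = 0;

    while (start < n)
    {
        size_t end = start + 1;

        while (end < n && rows[end].a == rows[start].a)
            end++;                  /* extend the presorted group */
        qsort(rows + start, end - start, sizeof(Pair), cmp_b);
        start = end;                /* emit this group, move to next */
    }
}
```

Because each group is emitted as soon as it is sorted, a LIMIT consumer can stop after the first group or two, leaving later groups unsorted.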
Buffers all input tuples in a Tuplestorestate for repeated access:
```c
/* Simplified from nodeMaterial.c; read-pointer handling elided */
static TupleTableSlot *
ExecMaterial(PlanState *pstate)
{
    MaterialState *node = castNode(MaterialState, pstate);
    bool        forward = ScanDirectionIsForward(pstate->state->es_direction);
    TupleTableSlot *slot;

    if (!node->eof_underlying)
    {
        /* Still reading from child */
        slot = ExecProcNode(outerPlanState(node));
        if (!TupIsNull(slot))
        {
            tuplestore_puttupleslot(node->tuplestorestate, slot);
            return slot;
        }
        node->eof_underlying = true;
    }

    /* Reading from materialized store */
    slot = node->ss.ps.ps_ResultTupleSlot;
    if (tuplestore_gettupleslot(node->tuplestorestate, forward, false, slot))
        return slot;
    return NULL;
}
```
The tuplestore starts in memory and transparently spills to a temp file when
it exceeds work_mem. It supports multiple read pointers for concurrent
access (used by window functions and CTEs).
Caches results of parameterized subplans using a hash table with LRU eviction:
```
ExecMemoize:
  |
  +-- hash(current_parameters)
  |
  +-- cache lookup
  |     |
  |     +-- HIT:  return cached tuples one at a time
  |     |
  |     +-- MISS:
  |           +-- evict LRU entry if cache full
  |           +-- execute subplan, cache results
  |           +-- return first tuple
  |
  +-- on rescan: try cache again with new parameters
```
The cache uses a doubly-linked list for LRU tracking. Accessed entries are moved
to the tail; eviction removes from the head. The singlerow optimization
marks cache entries as complete after one tuple, enabling early exit for
unique joins.
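The list manipulation can be sketched in isolation (a toy with a fixed capacity and a linear scan standing in for the hash table; none of these names are from the PostgreSQL code): hits move the entry to the tail, and inserts into a full cache evict from the head.

```c
#include <assert.h>
#include <string.h>

#define CAP 2

typedef struct Entry {
    int         key, value;
    struct Entry *prev, *next;
} Entry;

typedef struct {
    Entry       slots[CAP];
    int         nused;
    Entry      *head, *tail;    /* head = least recent, tail = most recent */
} LruCache;

static void
unlink_entry(LruCache *c, Entry *e)
{
    if (e->prev) e->prev->next = e->next; else c->head = e->next;
    if (e->next) e->next->prev = e->prev; else c->tail = e->prev;
}

static void
push_tail(LruCache *c, Entry *e)
{
    e->prev = c->tail;
    e->next = NULL;
    if (c->tail) c->tail->next = e; else c->head = e;
    c->tail = e;
}

/* Returns 1 on hit (and marks the entry most recently used), 0 on miss. */
int
lru_lookup(LruCache *c, int key, int *value)
{
    for (Entry *e = c->head; e; e = e->next)
        if (e->key == key)
        {
            unlink_entry(c, e);
            push_tail(c, e);
            *value = e->value;
            return 1;
        }
    return 0;
}

void
lru_insert(LruCache *c, int key, int value)
{
    Entry *e;

    if (c->nused < CAP)
        e = &c->slots[c->nused++];
    else
    {
        e = c->head;                /* cache full: evict least recent */
        unlink_entry(c, e);
    }
    e->key = key;
    e->value = value;
    push_tail(c, e);
}
```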
State machine:
```
MEMO_CACHE_LOOKUP            -- try cache
MEMO_CACHE_FETCH_NEXT_TUPLE  -- return next cached tuple
MEMO_FILLING_CACHE           -- populating from subplan
MEMO_CACHE_BYPASS_MODE       -- entry too large, pass through
MEMO_END                     -- done
```
```c
typedef struct Tuplesortstate {
    TupSortStatus status;       /* INITIAL, BOUNDED, BUILDRUNS, ... */
    bool        bounded;        /* using top-N heapsort? */
    int64       bound;          /* N for top-N */
    int64       availMem;       /* remaining memory budget */
    int64       allowedMem;     /* total work_mem */

    /* In-memory sort */
    SortTuple  *memtuples;      /* array of tuples */
    int         memtupcount;
    int         memtupsize;     /* allocated array length */

    /* External sort */
    LogicalTapeSet *tapeset;
    int         maxTapes;
    int         nRuns;          /* number of sorted runs */
    SortTuple  *mergetuples;    /* merge heap */
    ...
} Tuplesortstate;

typedef struct SortState {
    ScanState   ss;
    bool        sort_Done;      /* sort complete? */
    bool        bounded;        /* using top-N? */
    int64       bound;          /* LIMIT value */
    void       *tuplesortstate; /* opaque Tuplesortstate */
    ...
} SortState;
```
```
           INITIAL
              |
   +----------+----------+
   |                     |
fits in memory?    exceeds work_mem
   |                     |
SORTEDINMEM          BUILDRUNS
(qsort/radix)            |
   |              write sorted runs
   |              to logical tapes
   |                     |
   |                  MERGING
   |               (k-way merge)
   |                     |
   +----------+----------+
              |
      SORTEDONTAPE or
        SORTEDINMEM
              |
   tuplesort_gettupleslot()
```
When a Sort node sits below a Limit node, it uses a bounded heap to keep only the top N tuples in memory. This avoids sorting the entire input:
```
ExecSetTupleBound(n, sortNode)
  --> tuplesort enters BOUNDED mode
  --> maintains a max-heap of size N
  --> each new tuple: if smaller than heap max, replace and sift down
  --> after all input: extract heap in sorted order
```
This reduces memory from O(input) to O(N) and comparison cost from O(input * log input) to O(input * log N).
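A minimal sketch of the bounded heap on plain ints (illustrative only; the real code works on SortTuples and a comparator): a max-heap of size N holds the N smallest values seen, so a new value is either discarded or replaces the current maximum.

```c
#include <assert.h>
#include <stddef.h>

static void
max_sift_down(int *h, size_t n, size_t i)
{
    for (;;)
    {
        size_t l = 2 * i + 1, r = l + 1, max = i;

        if (l < n && h[l] > h[max]) max = l;
        if (r < n && h[r] > h[max]) max = r;
        if (max == i)
            return;
        int tmp = h[i]; h[i] = h[max]; h[max] = tmp;
        i = max;
    }
}

/* Feed one value into a heap holding at most `bound` of the smallest
 * values seen so far; returns the new heap size. */
size_t
topn_put(int *heap, size_t nheap, size_t bound, int value)
{
    if (nheap < bound)
    {
        /* still filling the heap: append and sift up */
        size_t i = nheap++;

        heap[i] = value;
        while (i > 0 && heap[(i - 1) / 2] < heap[i])
        {
            int tmp = heap[i];

            heap[i] = heap[(i - 1) / 2];
            heap[(i - 1) / 2] = tmp;
            i = (i - 1) / 2;
        }
    }
    else if (value < heap[0])
    {
        heap[0] = value;            /* replace current max, restore heap */
        max_sift_down(heap, nheap, 0);
    }
    return nheap;
}
```

After the input is exhausted, repeatedly extracting the heap maximum yields the top N in reverse order, matching the "extract heap in sorted order" step above.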
| Topic | Link |
|---|---|
| Executor overview | Query Executor |
| Volcano model and pipeline breakers | Volcano Model |
| Merge join requiring sorted input | Join Nodes |
| Sorted aggregation | Aggregation |
| Parallel sort via Gather Merge | Parallel Query |
| work_mem and memory accounting | Memory Management |
| Temp file I/O | Storage Engine |