PostgreSQL supports two primary aggregation strategies – hash aggregation
and sorted (group) aggregation – plus grouping sets (ROLLUP, CUBE)
and window functions. The Agg node handles standard aggregation with a
uniform transition-function pipeline: for each input tuple, call the transition
function to update the aggregate state; after the last tuple in a group, call
the optional final function to produce the result. The WindowAgg node extends
this model to compute window functions over partitions with frame-based access
to surrounding rows.
Every aggregate follows the same three-step lifecycle:
transvalue = initcond
foreach input_tuple:
transvalue = transfunc(transvalue, input_value(s))
result = finalfunc(transvalue)
If no finalfunc is defined, the final transvalue is the result. For
partial aggregation (parallel queries), the pipeline extends with
combinefunc, serializefunc, and deserializefunc.
| Strategy | AggStrategy |
When Used | Memory |
|---|---|---|---|
| Plain | AGG_PLAIN |
No GROUP BY (single group) | Minimal |
| Sorted | AGG_SORTED |
Input pre-sorted on GROUP BY keys | Minimal |
| Hashed | AGG_HASHED |
Unsorted input, fits in work_mem |
O(groups) |
| Mixed | AGG_MIXED |
Grouping sets with both sorted and hashed phases | Varies |
| File | Purpose |
|---|---|
src/backend/executor/nodeAgg.c |
Main aggregation node (~3000 lines) |
src/backend/executor/nodeWindowAgg.c |
Window function execution |
src/backend/executor/nodeGroup.c |
Simple GROUP BY without aggregates |
src/backend/executor/execGrouping.c |
Group key comparison utilities |
src/include/nodes/execnodes.h |
AggState, WindowAggState structs |
For each input tuple, compute the hash of the GROUP BY keys and look up (or create) the corresponding entry in a hash table. Each entry holds the transition values for all aggregates in that group.
ExecAgg (AGG_HASHED)
|
+-- Phase 1: Build hash table
| loop:
| slot = ExecProcNode(outer)
| if NULL, move to phase 2
| hash = hash_group_keys(slot)
| entry = lookup_or_create(hashtable, hash, group_keys)
| advance_aggregates(entry, slot) /* call transfunc */
|
+-- Phase 2: Iterate hash table, emit results
for each entry in hashtable:
finalize_aggregates(entry) /* call finalfunc */
project and return result tuple
The hash table uses the TupleHashTable abstraction:
typedef struct TupleHashTableData {
tuplehash_hash *hashtab; /* simplehash hash table */
int numCols; /* number of GROUP BY columns */
AttrNumber *keyColIdx; /* column numbers of group keys */
Oid *eqfuncoids; /* equality function OIDs */
...
} TupleHashTableData;
Memory management. Each grouping set gets its own aggcontext (an
ExprContext). When a group boundary is hit in sorted mode, the context is
rescanned (not just reset) so that transition functions can register shutdown
callbacks via AggRegisterCallback.
Spilling to disk. As of PostgreSQL 15+, hash aggregation can spill
partitions to disk when work_mem is exceeded, similar to hash join batching.
This prevents out-of-memory failures on high-cardinality GROUP BY.
Requires input pre-sorted on the GROUP BY keys. Detects group boundaries by comparing the current tuple’s keys to the previous tuple’s keys:
ExecAgg (AGG_SORTED)
|
loop:
slot = ExecProcNode(outer)
if NULL or group_keys_changed:
finalize current group
emit result tuple
reset transition values for new group
advance_aggregates(current_group, slot)
This has minimal memory overhead (only the current group’s transition states) and works well when the planner can leverage an existing sort order.
Grouping sets allow multiple levels of aggregation in a single pass. For example:
SELECT department, region, SUM(sales)
FROM orders
GROUP BY ROLLUP (department, region);
-- Produces groups: (dept, region), (dept), ()
PostgreSQL handles this using a chained aggregation approach:
Input sorted by (department, region):
(Sales, East, 100) --> update groups: (Sales,East), (Sales), ()
(Sales, East, 200) --> update groups: (Sales,East), (Sales), ()
(Sales, West, 150) --> boundary! finalize (Sales,East)
update groups: (Sales,West), (Sales), ()
(Eng, East, 300) --> boundary! finalize (Sales,West), (Sales)
update groups: (Eng,East), (Eng), ()
EOF --> finalize (Eng,East), (Eng), ()
For non-ROLLUP grouping sets that cannot share a sort order, the planner emits
multiple Agg nodes (one per required sort order) combined with Append.
In parallel aggregation, work is split into partial and final phases:
Gather
|
Finalize Aggregate (combinefunc)
|
Partial Aggregate (transfunc, skip finalfunc)
|
[parallel worker scans]
Each worker runs the transition functions on its partition of the data and
outputs serialized partial results. The leader’s Finalize Aggregate node
deserializes these and combines them using the combinefunc.
The aggsplit flag controls this:
| Split | transfunc/combinefunc | finalfunc | serialize/deserialize |
|---|---|---|---|
AGGSPLIT_SIMPLE |
transfunc | yes | no |
AGGSPLIT_INITIAL_SERIAL |
transfunc | skip | serialize output |
AGGSPLIT_FINAL_DESERIAL |
combinefunc | yes | deserialize input |
Window functions compute results across a “window” of related rows without
collapsing them into groups. The WindowAgg node:
WindowObject APIWindowAgg
|
+-- buffer partition into tuplestore
|
+-- for each row in partition:
| set frame boundaries (ROWS/RANGE/GROUPS)
| for each window function:
| compute result over frame
| project: original columns + window results
| return tuple
|
+-- on partition boundary: reset and start new partition
The WindowObject API provides functions like:
WinGetFuncArgInPartition() – access argument values at arbitrary positionsWinGetFuncArgInFrame() – access within the current frameWinSetMarkPosition() / WinGetCurrentPosition() – navigationFrame types:
| Type | Boundary | Description |
|---|---|---|
ROWS |
Physical row offset | Fixed number of rows before/after |
RANGE |
Value range | Rows within a value distance |
GROUPS |
Peer groups | Groups of rows with equal ORDER BY values |
Optimization for running aggregates: For ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW, PostgreSQL maintains a running transition value rather than
re-scanning the entire frame for each row. It also supports inverse transition
functions for sliding frames (e.g., subtracting a row leaving the frame).
typedef struct WindowObjectData {
NodeTag type;
WindowAggState *winstate; /* parent state */
List *argstates; /* argument expressions */
void *localmem; /* per-partition local memory */
int markptr; /* tuplestore mark pointer */
int readptr; /* tuplestore read pointer */
int64 markpos; /* row that markptr is on */
int64 seekpos; /* row that readptr is on */
...
} WindowObjectData;
typedef struct AggState {
ScanState ss;
List *aggs; /* list of Aggref nodes */
int numaggs; /* number of aggregates */
int numtrans; /* number of transition states */
AggStrategy aggstrategy; /* PLAIN, SORTED, HASHED, MIXED */
AggSplit aggsplit; /* SIMPLE, INITIAL, FINAL */
AggStatePerAgg peragg; /* per-aggregate info */
AggStatePerTrans pertrans; /* per-transition info */
ExprContext *tmpcontext; /* short-lived for transfunc calls */
ExprContext **aggcontexts; /* per-grouping-set contexts */
TupleHashTable *hashtable; /* for AGG_HASHED */
bool table_filled; /* hash table complete? */
int current_set; /* current grouping set */
int numphases; /* number of sort phases */
...
} AggState;
typedef struct AggStatePerTransData {
Aggref *aggref;
FmgrInfo transfn; /* transition function */
FmgrInfo serialfn; /* serialization function */
FmgrInfo deserialfn; /* deserialization function */
Datum initValue; /* initial transition value */
bool initValueIsNull;
bool transfn_strict; /* is transfunc strict? */
Datum transValue; /* current transition value */
bool transValueIsNull;
...
} AggStatePerTransData;
Input Tuples
|
+----------+----------+
| |
AGG_SORTED AGG_HASHED
| |
sorted by group key hash(group_key)
| |
detect boundary lookup/create entry
| |
transfunc(state, val) transfunc(state, val)
| |
on boundary: after all input:
finalfunc(state) iterate hash table
| finalfunc(state)
| |
+----------+----------+
|
Result Tuples
Worker 0 Worker 1 Worker 2
| | |
Partial Agg Partial Agg Partial Agg
(transfunc) (transfunc) (transfunc)
(serialize) (serialize) (serialize)
| | |
+--------+---------+---------+-------+
| |
Gather (tuple queue)
|
Finalize Agg
(deserialize)
(combinefunc)
(finalfunc)
|
Result Tuples
| Topic | Link |
|---|---|
| Executor overview | Query Executor |
| Sorted input via Sort nodes | Sort and Materialize |
| Parallel partial aggregation | Parallel Query |
| Expression evaluation in transfuncs | Expression Evaluation |
| Planner aggregate strategy selection | Query Optimizer |
| Memory contexts for aggregate state | Memory Management |
| Statistics for group cardinality estimates | Statistics |