PostgreSQL implements true serializable isolation using the SSI algorithm, based on the research of Michael Cahill, Uwe Rohm, and Alan Fekete (SIGMOD 2008). SSI runs snapshot isolation as the base mechanism but monitors for rw-conflict patterns that could cause serialization anomalies, aborting transactions when necessary.
The key insight: snapshot isolation anomalies always involve a cycle in the transaction dependency graph, and every such cycle contains a dangerous structure – two adjacent rw-conflict edges.
| File | Purpose |
|---|---|
src/backend/storage/lmgr/predicate.c |
SSI implementation (~6000 lines) |
src/include/storage/predicate.h |
Public API |
src/include/storage/predicate_internals.h |
SERIALIZABLEXACT, predicate lock structs |
src/backend/storage/lmgr/README-SSI |
Detailed design document |
Snapshot isolation prevents dirty reads, non-repeatable reads, and phantoms. But it permits write skew – an anomaly where two concurrent transactions each read overlapping data, make decisions based on it, and write non-overlapping data, producing a result impossible under any serial execution.
Classic example:
T1: reads A=10, B=10 (sum=20, constraint: sum >= 20)
T1: sets A = A - 15 = -5
T2: reads A=10, B=10 (same snapshot, sum=20)
T2: sets B = B - 15 = -5
Both commit. A + B = -10. Constraint violated.
Under serial execution, the second transaction would see the first’s write and refuse.
Cahill et al. proved that every serialization anomaly under snapshot isolation involves a cycle containing two adjacent rw-conflict edges (also called anti-dependencies):
flowchart LR
Tin["T_in"] -->|"rw-conflict"| Tpivot["T_pivot"]
Tpivot -->|"rw-conflict"| Tout["T_out"]
style Tin fill:#e8f4f8
style Tpivot fill:#fce4ec
style Tout fill:#e8f4f8
An rw-conflict (anti-dependency) from T1 to T2 means: T1 read something, and T2 wrote a newer version of it that T1 could not see (because T2 was concurrent).
SSI does not need to track wr-dependencies or ww-dependencies. It only tracks rw-conflicts between concurrent transactions and watches for the dangerous pattern of two adjacent ones.
Each serializable transaction gets a SERIALIZABLEXACT struct in shared memory:
typedef struct SERIALIZABLEXACT
{
VirtualTransactionId vxid;
SerCommitSeqNo prepareSeqNo;
SerCommitSeqNo commitSeqNo;
union
{
SerCommitSeqNo earliestOutConflictCommit;
SerCommitSeqNo lastCommitBeforeSnapshot;
} SeqNo;
dlist_head outConflicts; /* rw-conflicts where we are the reader */
dlist_head inConflicts; /* rw-conflicts where we are the writer */
dlist_head predicateLocks; /* our predicate locks */
dlist_head possibleUnsafeConflicts;
TransactionId topXid;
TransactionId finishedBefore;
TransactionId xmin;
uint32 flags;
int pid;
} SERIALIZABLEXACT;
The outConflicts list tracks “I read something that another transaction wrote” (T_in edges). The inConflicts list tracks “another transaction read something I wrote” (T_out edges).
SSI uses predicate locks (SIRead locks) to track what each serializable transaction has read. These are not blocking locks – they are bookkeeping markers. The granularity levels:
| Level | Meaning |
|---|---|
| Tuple | Specific row was read |
| Page | Entire page was read |
| Relation | Entire table was read (e.g., sequential scan) |
Lock promotion (escalation) happens automatically:
flowchart TD
TUPLE["Tuple-level SIRead locks"] -->|"Too many on one page<br/>(max_predicate_locks_per_page)"| PAGE["Page-level SIRead lock"]
PAGE -->|"Too many on one relation<br/>(max_predicate_locks_per_relation)"| REL["Relation-level SIRead lock"]
This is controlled by the GUCs:
max_predicate_locks_per_xact (default 64)max_predicate_locks_per_relation (default -2, meaning pages/16)max_predicate_locks_per_page (default 2)When a serializable transaction writes a tuple, PostgreSQL checks for predicate locks held by other transactions on that tuple (or its page/relation). If found, an rw-conflict edge is recorded:
flowchart TD
WRITE["T2 writes tuple"] --> CHECK{"Any SIRead lock<br/>on this tuple/page/relation<br/>held by T1?"}
CHECK -->|No| OK["No conflict"]
CHECK -->|Yes| RECORD["Record rw-conflict: T1 -> T2"]
RECORD --> DANGEROUS{"Does T2 also have<br/>an rw-conflict out<br/>to some T3?"}
DANGEROUS -->|No| WAIT["No action yet"]
DANGEROUS -->|Yes| STRUCTURE["Dangerous structure detected!"]
STRUCTURE --> ABORT["Abort one transaction<br/>(usually T_pivot)"]
Similarly, when a serializable transaction reads a tuple, it checks if the tuple was written by a concurrent transaction and records the conflict.
PostgreSQL applies two optimizations from the literature to reduce false positives:
T_out must commit first: An anomaly can only occur if T_out committed before the other transactions in the cycle. So PostgreSQL only acts on the dangerous structure if T_out has already committed.
Read-only optimization: If T_in is read-only, an anomaly is only possible if T_out committed before T_in took its snapshot. This allows many read-only transactions to avoid serialization failures entirely.
A read-only transaction running at SERIALIZABLE can be marked “safe” early if all concurrent read-write transactions that it might conflict with have finished. Once safe, it can release its predicate locks and SERIALIZABLEXACT resources, reducing memory pressure.
The final check happens at commit time in PreCommit_CheckForSerializationFailure(). This function examines whether the committing transaction is the pivot in a dangerous structure:
/*
* Check whether the current transaction has both an inConflict
* and an outConflict with committed transactions, forming a
* dangerous structure.
*/
If so, the transaction is aborted with ERRCODE_T_R_SERIALIZATION_FAILURE.
SSI structures can consume significant shared memory. The key limits:
max_pred_locks_per_transaction controls the shared memory pool for predicate locksSERIALIZABLEXACT entries are allocated from a fixed pool sized by max_connections + max_prepared_transactionsFinishedSerializableTransactions list is cleaned up based on SxactGlobalXminApplications using SERIALIZABLE must handle serialization failures:
-- Application pseudocode:
LOOP
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- do work
COMMIT;
EXIT LOOP; -- success
EXCEPTION WHEN serialization_failure THEN
-- retry
END LOOP;
| Structure | Location | Role |
|---|---|---|
SERIALIZABLEXACT |
predicate_internals.h |
Per-transaction SSI state, conflict lists |
PREDICATELOCK |
predicate_internals.h |
Links a SERIALIZABLEXACT to a lock target |
PREDICATELOCKTARGET |
predicate_internals.h |
Represents a locked tuple/page/relation |
RWConflict |
predicate_internals.h |
An rw-conflict edge between two transactions |
SerCommitSeqNo |
predicate_internals.h |
Monotonic counter for ordering commits |
PredicateLockPage() or PredicateLockTID() during scans to register SIRead locks, making index choice (sequential vs. index scan) affect SSI false positive rates.