Physical streaming replication ships raw WAL bytes from a primary to one or more standbys, producing byte-identical copies of the database cluster. The primary runs a walsender process for each connected standby, and each standby runs a single walreceiver process that writes incoming WAL to disk. Replication slots track consumer progress and prevent premature WAL removal. This section traces the full lifecycle from connection establishment through steady-state streaming, replication slots, cascading, and graceful shutdown.
Physical streaming replication was introduced in PostgreSQL 9.0 and has been
the foundation of PostgreSQL high availability ever since. It works at the
storage level: the walsender reads WAL segment files from pg_wal/ on the
primary and streams the raw bytes over a TCP connection. The walreceiver on
the standby writes these bytes to its own pg_wal/ directory, and the startup
process replays them to keep the standby’s data pages in sync.
flowchart LR
subgraph Primary
WAL["WAL (pg_wal/)"]
WS["walsender"]
Slot["Replication Slot<br/>restart_lsn<br/>confirmed_flush"]
WAL --> WS
Slot -.->|prevents WAL removal| WAL
end
WS -- "WAL byte stream<br/>(TCP COPY mode)" --> WR
subgraph Standby
WR["walreceiver"]
SWAL["pg_wal/"]
Startup["startup process"]
DataPages["Data Pages"]
WR -->|writes WAL| SWAL
SWAL -->|reads WAL| Startup
Startup -->|replays / applies| DataPages
end
WR -.->|status feedback<br/>write, flush, apply LSN| WS
WS -.->|updates| Slot
This design has several important consequences:
- The standby is a byte-for-byte copy of the primary, so both sides must run the same major version on a binary-compatible platform and architecture.
- Replication covers the entire cluster; individual databases or tables cannot be excluded.
- The standby is read-only: it can serve queries in hot standby mode but never accepts writes until promoted.

The main source files involved:
| File | Purpose |
|---|---|
| src/backend/replication/walsender.c | Walsender process main loop and protocol handling |
| src/backend/replication/walreceiver.c | Walreceiver process main loop |
| src/backend/replication/walreceiverfuncs.c | Shared memory management for walreceiver state |
| src/backend/replication/slot.c | Replication slot creation, persistence, and invalidation |
| src/backend/replication/slotfuncs.c | SQL-callable slot management functions |
| src/backend/replication/libpqwalreceiver/ | libpq-based transport for walreceiver |
| src/include/replication/walsender_private.h | WalSnd, WalSndCtlData structures |
| src/include/replication/walreceiver.h | WalRcvData structure and walreceiver states |
| src/include/replication/slot.h | ReplicationSlot, ReplicationSlotPersistentData |
| src/backend/replication/repl_gram.y | Grammar for replication protocol commands |
When a standby starts (or restarts streaming), the startup process fills in
WalRcvData->conninfo and WalRcvData->receiveStart in shared memory, then
signals the postmaster to launch a walreceiver. The walreceiver connects to
the primary using the configured primary_conninfo:
Standby Primary
| |
| startup process sets WalRcvData |
| signals postmaster |
| |
| walreceiver starts |
|-------- libpq connection (replication) --->|
| | postmaster forks walsender
| |
|<------- AuthenticationOk ------------------|
| |
|-------- IDENTIFY_SYSTEM ------------------>|
|<------- systemid, timeline, xlogpos -------|
| |
|-------- START_REPLICATION SLOT s PHYSICAL |
|         0/3000000 TIMELINE 1 ------------>|
| |
|<------- CopyBothResponse ------------------|
| |
|<======= WAL data stream ==================>|
| (with bidirectional feedback) |
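The same handshake can be reproduced by hand over a replication connection. The sketch below is illustrative: the host, user, slot name, and LSN are placeholders, and pg_hba.conf must allow a "replication" connection for the chosen role.

```sql
-- Open a replication connection, e.g.:
--   psql "host=primary user=replicator dbname=postgres replication=true"
-- Only replication commands are accepted on such a connection.

IDENTIFY_SYSTEM;
-- returns one row: systemid, timeline, xlogpos, dbname

-- Normally issued by the walreceiver (or tools such as pg_receivewal);
-- this switches the connection into COPY-both mode and starts the byte stream:
START_REPLICATION SLOT s PHYSICAL 0/3000000 TIMELINE 1;
```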
The walsender identifies itself by marking its slot in the PMSignalState
array, which allows the postmaster to treat it differently during shutdown
(walsenders are kept alive longer to stream the shutdown checkpoint).
Once START_REPLICATION is issued, the connection enters COPY mode with
bidirectional data flow:
Primary to Standby (WAL data messages):
Each message contains:
- a message type byte ('w' for WAL data)
- the starting LSN of the WAL carried in this message
- the current end-of-WAL position on the primary
- the primary's send timestamp
- raw WAL bytes, up to MAX_SEND_SIZE (typically 128 KB)

Standby to Primary (status updates):
The walreceiver sends StandbyStatusUpdate ('r') messages containing:
- the last WAL position written to the standby's pg_wal/
- the last WAL position flushed to disk
- the last WAL position applied by the startup process
- the standby's timestamp and a reply-requested flag
Primary (walsender) Standby (walreceiver)
| |
|--- WAL data (LSN 0/3000000-0/3010000)->|
|--- WAL data (LSN 0/3010000-0/3020000)->|
| | writes to pg_wal/
| | updates WalRcv->flushedUpto
|<-- Status: write=0/3020000 |
| flush=0/3020000 |
| apply=0/3015000 -------------|
| |
| updates WalSnd->write/flush/apply |
| (used by sync rep and monitoring) |
| |
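On the primary, the positions reported in these status messages are visible in pg_stat_replication, one row per walsender. A typical check, assuming recent PostgreSQL column names:

```sql
-- Run on the primary; one row per connected walsender
SELECT application_name,
       state,        -- catchup / streaming / ...
       sent_lsn,     -- WalSnd->sentPtr
       write_lsn,    -- standby-reported write position
       flush_lsn,    -- standby-reported flush position
       replay_lsn,   -- standby-reported apply position
       sync_state
FROM pg_stat_replication;
```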
The walsender main loop in WalSndLoop() alternates between sending WAL and
processing feedback. The core logic:
WalSndLoop():
while not stopping:
// Check for new WAL to send
SendDataUntilCaughtUp():
while sentPtr < GetFlushRecPtr():
XLogSendPhysical() // read WAL from disk, send via COPY
sentPtr += bytes_sent
// If caught up, wait for new WAL or timeout
if caught_up:
WaitForWALToBecomeAvailable()
// uses WalSndCtl->wal_flush_cv condition variable
// Process any feedback from standby
ProcessRepliesIfAny():
// updates WalSnd->write, flush, apply
// triggers SyncRepReleaseWaiters() if sync rep enabled
// Send keepalive if wal_sender_timeout approaching
WalSndKeepalive()
The walsender reads WAL using standard file I/O (not shared buffers), reading
directly from WAL segment files in pg_wal/. It tracks its position via
sentPtr and sends data in chunks up to MAX_SEND_SIZE (128 KB with default
8 KB block size).
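One way to see how far a walsender's send pointer trails WAL generation, mirroring the sentPtr-vs-GetFlushRecPtr() comparison in the loop above (functions and columns as in PostgreSQL 10 and later):

```sql
-- How far each walsender's send pointer lags behind the primary's flushed WAL
SELECT application_name,
       pg_current_wal_flush_lsn()                             AS primary_flush_lsn,
       sent_lsn,
       pg_wal_lsn_diff(pg_current_wal_flush_lsn(), sent_lsn)  AS send_lag_bytes
FROM pg_stat_replication;
```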
The walreceiver runs a complementary loop in WalReceiverMain():
WalReceiverMain():
// Connect to primary
wrconn = walrcv_connect(conninfo)
// Start streaming
walrcv_startstreaming(wrconn, startpoint, slotname)
while connected:
// Receive WAL data
len = walrcv_receive(wrconn, &buf)
if len > 0:
XLogWalRcvWrite(buf, len, recvPtr) // write to pg_wal/
recvPtr += len
// Periodically flush to disk
if should_flush:
XLogWalRcvFlush()
// updates WalRcv->flushedUpto in shared memory
// wakes startup process to replay more WAL
// Send status feedback
if interval_elapsed or flush_happened:
XLogWalRcvSendReply()
XLogWalRcvSendHSFeedback() // hot_standby_feedback
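The effect of this loop can be observed on the standby with the recovery information functions; the gap between the received/flushed position and the replayed position is WAL still waiting for the startup process:

```sql
-- Run on the standby
SELECT pg_last_wal_receive_lsn() AS received_and_flushed,  -- roughly WalRcv->flushedUpto
       pg_last_wal_replay_lsn()  AS replayed_by_startup,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_gap_bytes;
```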
The walreceiver uses the dynamically loaded libpqwalreceiver module for
actual network communication. This design allows the main server binary to
avoid linking against libpq. The module implements functions defined in the
WalReceiverFunctionsType function table.
Replication slots solve a critical problem: without them, the primary has no reliable way to know which WAL segments a standby still needs. Slots provide:
- restart_lsn prevents pg_wal/ cleanup past this point
- xmin / catalog_xmin hold back vacuum's horizon
- confirmed_flush records the consumer's position
- slot state survives restarts, persisted on disk in pg_replslot/<slotname>/state

| Type | Persistency | Use |
|---|---|---|
| Physical, persistent | RS_PERSISTENT | Standby that may disconnect and reconnect |
| Physical, temporary | RS_TEMPORARY | Session-scoped, dropped on disconnect |
| Logical, persistent | RS_PERSISTENT | Logical replication subscriptions |
| Logical, ephemeral | RS_EPHEMERAL | Transitional state during slot creation |
CREATE_REPLICATION_SLOT:
ReplicationSlotCreate():
acquire ReplicationSlotAllocationLock (exclusive)
find free slot in ReplicationSlotCtl->replication_slots[]
set slot->in_use = true
initialize ReplicationSlotPersistentData
release lock
CreateSlotOnDisk():
mkdir pg_replslot/<name>/
SaveSlotToPath() // write state file with CRC
START_REPLICATION (using slot):
ReplicationSlotAcquire():
set slot->active_proc = MyProcNumber
Periodic status updates:
PhysicalConfirmReceivedLocation():
update slot->data.restart_lsn
if dirty: SaveSlotToPath()
DROP_REPLICATION_SLOT:
ReplicationSlotDrop():
ReplicationSlotMarkDirty()
remove pg_replslot/<name>/ directory
set slot->in_use = false
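The SQL-callable wrappers in slotfuncs.c exercise the same code paths, which makes them convenient for experimenting with slot lifecycle; the slot names below are illustrative:

```sql
-- Create a persistent physical slot and reserve WAL immediately
SELECT pg_create_physical_replication_slot('standby_1', true);

-- Create a temporary slot (third argument), dropped automatically at session end
SELECT pg_create_physical_replication_slot('tmp_probe', true, true);

-- Inspect slot state (restart_lsn is what holds WAL in pg_wal/)
SELECT slot_name, slot_type, temporary, active, restart_lsn
FROM pg_replication_slots;

-- Drop a slot once its consumer is gone
SELECT pg_drop_replication_slot('standby_1');
```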
Slots can be invalidated when they block critical operations. The
ReplicationSlotInvalidationCause enum tracks the reason:
- RS_INVAL_WAL_REMOVED – required WAL was removed (the slot exceeded max_slot_wal_keep_size)
- RS_INVAL_HORIZON – required rows were removed (vacuum advanced past the slot's xmin)
- RS_INVAL_WAL_LEVEL – wal_level was reduced below what the slot requires
- RS_INVAL_IDLE_TIMEOUT – the slot exceeded the configured idle timeout

An invalidated slot will refuse START_REPLICATION, and the consumer must create a new slot and re-synchronize from scratch.
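Slot health relative to max_slot_wal_keep_size can be monitored from SQL; the wal_status and safe_wal_size columns assume PostgreSQL 13 or later (newer releases also expose the invalidation reason directly):

```sql
-- 'reserved'/'extended' are healthy, 'unreserved' is at risk, 'lost' means invalidated
SELECT slot_name, active, restart_lsn, wal_status, safe_wal_size
FROM pg_replication_slots;

SHOW max_slot_wal_keep_size;
```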
Each active walsender gets a WalSnd entry in the shared memory array
WalSndCtl->walsnds[]. The array has max_wal_senders entries:
typedef struct WalSnd
{
pid_t pid; /* 0 if slot is free */
WalSndState state; /* STARTUP -> CATCHUP -> STREAMING -> STOPPING */
XLogRecPtr sentPtr; /* WAL sent up to this point */
/* Reported positions from the standby */
XLogRecPtr write;
XLogRecPtr flush;
XLogRecPtr apply;
/* Measured replication lag */
TimeOffset writeLag;
TimeOffset flushLag;
TimeOffset applyLag;
int sync_standby_priority; /* 0 = not in sync list */
slock_t mutex;
TimestampTz replyTime;
ReplicationKind kind; /* physical or logical */
} WalSnd;
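Most WalSnd fields are exported one-to-one through pg_stat_replication, for instance the measured lag values and sync priority (lag columns since PostgreSQL 10, reply_time since 12):

```sql
SELECT application_name,
       write_lag, flush_lag, replay_lag,  -- WalSnd->writeLag / flushLag / applyLag
       sync_priority, sync_state,         -- WalSnd->sync_standby_priority
       reply_time                         -- WalSnd->replyTime
FROM pg_stat_replication;
```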
The global coordination structure for all walsenders:
typedef struct WalSndCtlData
{
/* Synchronous replication queues (one per wait mode) */
dlist_head SyncRepQueue[NUM_SYNC_REP_WAIT_MODE]; /* write, flush, apply */
XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];
bits8 sync_standbys_status;
/* Condition variables for waking walsenders */
ConditionVariable wal_flush_cv; /* physical walsenders */
ConditionVariable wal_replay_cv; /* logical walsenders */
ConditionVariable wal_confirm_rcv_cv; /* failover slot sync */
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
The walreceiver’s shared memory state, used for communication with the startup process:
typedef struct WalRcvData
{
ProcNumber procno;
pid_t pid;
WalRcvState walRcvState; /* STOPPED -> STARTING -> STREAMING -> ... */
XLogRecPtr receiveStart; /* where to start streaming (set by startup) */
TimeLineID receiveStartTLI;
XLogRecPtr flushedUpto; /* WAL flushed to disk (updated by walreceiver) */
TimeLineID receivedTLI;
XLogRecPtr latestChunkStart; /* start of current batch */
XLogRecPtr latestWalEnd; /* latest WAL end reported by sender */
TimestampTz latestWalEndTime;
char conninfo[MAXCONNINFO];
char slotname[NAMEDATALEN];
/* ... */
} WalRcvData;
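The standby exposes this shared state through the pg_stat_wal_receiver view; the written_lsn/flushed_lsn column names assume PostgreSQL 13 or later:

```sql
-- Run on the standby; at most one row, since there is a single walreceiver
SELECT pid, status,               -- WalRcvData->pid, walRcvState
       receive_start_lsn,         -- WalRcvData->receiveStart
       written_lsn, flushed_lsn,  -- write/flush progress (flushedUpto)
       latest_end_lsn,            -- WalRcvData->latestWalEnd
       slot_name, sender_host, sender_port
FROM pg_stat_wal_receiver;
```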
The on-disk format for slot persistence:
typedef struct ReplicationSlotOnDisk
{
/* Version-independent header */
uint32 magic;
pg_crc32c checksum;
/* Versioned payload */
uint32 version;
uint32 length;
ReplicationSlotPersistentData slotdata;
} ReplicationSlotOnDisk;
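The on-disk layout can be inspected from SQL with the generic file-access functions (requires superuser or the pg_read_server_files role; the slot name is illustrative):

```sql
-- One directory per slot under the data directory
SELECT pg_ls_dir('pg_replslot');

-- The 'state' file holds the ReplicationSlotOnDisk struct shown above
SELECT size, modification
FROM pg_stat_file('pg_replslot/standby_1/state');
```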
The walsender transitions through well-defined states:
Connection
established
|
v
+------------+
| STARTUP | Initializing, handshake
+------------+
|
START_REPLICATION
|
+-----------+-----------+
| |
v v
+------------+ +------------+
| CATCHUP | | BACKUP | (BASE_BACKUP command)
+------------+ +------------+
|
caught up with
primary WAL
|
v
+------------+
| STREAMING | Steady state
+------------+
|
shutdown signal
(PROCSIG_WALSND_INIT_STOPPING)
|
v
+------------+
| STOPPING | Sending final WAL including
+------------+ shutdown checkpoint
|
exit(0)
PostgreSQL supports cascading replication where a standby serves as a
replication source for downstream standbys. This is enabled when
hot_standby = on and max_wal_senders > 0 on the intermediate standby.
Primary -----> Standby A -----> Standby B
(walsender) (walreceiver (walreceiver)
+ walsender)
A cascading walsender (am_cascading_walsender = true) reads WAL from the
standby’s local pg_wal/ directory, which was written by the standby’s own
walreceiver. The standby uses GetStandbyFlushRecPtr() instead of
GetFlushRecPtr() to determine how much WAL is available for streaming.
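Whether an intermediate standby can serve downstream standbys can be verified directly on that node, using the same settings and views as on a primary:

```sql
-- Run on the intermediate standby (Standby A)
SELECT pg_is_in_recovery();  -- true: this node is itself a standby
SHOW max_wal_senders;        -- must be > 0 to accept downstream walreceivers
SHOW hot_standby;            -- must be on

-- Downstream standbys show up exactly as they would on a primary
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
```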
The hot_standby_feedback parameter enables the walreceiver to send its
oldest active transaction ID (xmin) back to the primary. The primary uses
this to hold back vacuum, preventing the removal of rows that active queries
on the standby might still need.
This feedback is sent via XLogWalRcvSendHSFeedback() as part of the regular
status update cycle. The primary’s walsender calls
PhysicalReplicationSlotNewXmin() to update the slot’s effective xmin.
The trade-off is clear: enabling hot standby feedback prevents “rows removed” query cancellations on the standby, but can cause table bloat on the primary if the standby runs long transactions.
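The effect of the feedback is visible on the primary: the standby's oldest xmin appears per walsender and, when a slot is used, on the slot itself (column names as in current releases):

```sql
-- On the standby
SHOW hot_standby_feedback;

-- On the primary: xmin reported by each standby via feedback messages
SELECT application_name, backend_xmin
FROM pg_stat_replication;

-- With a physical slot, PhysicalReplicationSlotNewXmin() also records it here
SELECT slot_name, xmin
FROM pg_replication_slots;
```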
Walsender shutdown follows a carefully orchestrated sequence to ensure the standby receives all WAL including the shutdown checkpoint:
1. The checkpointer sends PROCSIG_WALSND_INIT_STOPPING to all walsenders before writing the shutdown checkpoint.
2. Walsenders switch to WALSNDSTATE_STOPPING and refuse new commands.
3. Once the shutdown checkpoint has been written, the postmaster sends SIGUSR2 to the walsenders.
4. Each walsender streams all remaining WAL, including the shutdown checkpoint record, and then exits.

Related mechanisms:

- Logical Replication – Logical walsenders share the same WalSnd infrastructure but use the logical decoding pipeline instead of raw WAL streaming.
- Synchronous Replication – The WalSnd->write, flush, and apply positions reported by walreceivers are the inputs to the synchronous commit mechanism.
- Conflict Resolution – WAL replay on a hot standby can conflict with read queries; hot standby feedback is one mechanism to mitigate these conflicts.