This page covers the internal machinery of WAL: how LSNs address the infinite
byte stream, how WAL records are structured on disk, how the WAL buffer ring
works, and how concurrent backends insert records with minimal contention. The
critical path – from XLogBeginInsert() to the returned LSN – touches
surprisingly few locks thanks to a design that separates reservation from
copying.
The WAL subsystem can be understood as three layers:
Record construction (xloginsert.c) – backends build a description of
what changed using XLogRegisterBuffer() and XLogRegisterData(), then call
XLogInsert() to assemble the final record.
Space reservation and copying (xlog.c) – XLogInsertRecord() acquires
one of NUM_XLOGINSERT_LOCKS (default 8) insertion locks, atomically reserves
space in the WAL buffer by advancing CurrBytePos, copies the record, and
releases the lock.
Flush to disk (xlog.c) – XLogFlush() ensures all WAL up to a given
LSN is on stable storage. This involves write() and fsync() (or equivalent)
on the WAL segment files.
| File | Purpose |
|---|---|
src/include/access/xlogrecord.h |
On-disk record format: XLogRecord, block headers, data headers |
src/include/access/xloginsert.h |
Insertion API: XLogBeginInsert, XLogInsert, XLogRegisterBuffer |
src/include/access/xlog_internal.h |
Page headers (XLogPageHeaderData), segment math, XLogRecData chain |
src/include/access/xlogdefs.h |
XLogRecPtr (LSN), TimeLineID, XLogSegNo |
src/backend/access/transam/xloginsert.c |
Record construction: assembles block headers, FPIs, main data |
src/backend/access/transam/xlog.c |
WAL buffer management, insertion locks, write/flush, segment files |
An LSN (Log Sequence Number) is a 64-bit byte offset into the conceptual WAL stream. It is defined as:
/* src/include/access/xlogdefs.h:22 */
typedef uint64 XLogRecPtr;
LSNs are displayed in X/YYYYYYYY format, where X is the upper 32 bits and
YYYYYYYY is the lower 32 bits. For example, 0/1A3B5C00 means byte offset
0x1A3B5C00 in the WAL.
The WAL stream is divided into segment files of wal_segment_size bytes
(default 16 MB). Segment files live in $PGDATA/pg_wal/ with names like
000000010000000000000001 (timeline 1, segment 1). The mapping from LSN to
segment and offset is:
/* src/include/access/xlog_internal.h:117 */
#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
logSegNo = (xlrp) / (wal_segsz_bytes)
#define XLogSegmentOffset(xlogptr, wal_segsz_bytes) \
((xlogptr) & ((wal_segsz_bytes) - 1))
Each WAL segment is divided into pages of XLOG_BLCKSZ (default 8 KB). Every
page starts with a header:
/* src/include/access/xlog_internal.h:36 */
typedef struct XLogPageHeaderData
{
uint16 xlp_magic; /* magic value for correctness checks */
uint16 xlp_info; /* flag bits (XLP_LONG_HEADER, etc.) */
TimeLineID xlp_tli; /* TimeLineID of first record on page */
XLogRecPtr xlp_pageaddr; /* XLOG address of this page */
uint32 xlp_rem_len; /* remaining bytes from previous page's record */
} XLogPageHeaderData;
The first page of each segment file uses a long page header that adds the system identifier, segment size, and block size for cross-checking:
/* src/include/access/xlog_internal.h:61 */
typedef struct XLogLongPageHeaderData
{
XLogPageHeaderData std; /* standard header fields */
uint64 xlp_sysid; /* system identifier from pg_control */
uint32 xlp_seg_size; /* segment size cross-check */
uint32 xlp_xlog_blcksz; /* block size cross-check */
} XLogLongPageHeaderData;
When a record crosses a page boundary, the continuation page sets
XLP_FIRST_IS_CONTRECORD in xlp_info and stores the remaining byte count in
xlp_rem_len.
Every WAL record starts with a fixed-size XLogRecord header, followed by zero
or more block reference headers, an optional data header, block data (including
possible full-page images), and main data:
+------------------+
| XLogRecord | 24 bytes (fixed header)
+------------------+
| BlockHeader #0 | variable (block ref + optional FPI header)
| BlockHeader #1 | ...
| ... |
+------------------+
| DataHeader | 1-5 bytes (short or long form)
+------------------+
| Block #0 data | FPI and/or rmgr-specific per-block data
| Block #1 data | ...
+------------------+
| Main data | rmgr-specific payload not tied to a block
+------------------+
/* src/include/access/xlogrecord.h:41 */
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits: low 4 for WAL, high 4 for rmgr */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding */
pg_crc32c xl_crc; /* CRC for this record */
} XLogRecord;
The xl_info field packs two things: the low 4 bits (XLR_INFO_MASK = 0x0F)
carry WAL-level flags like XLR_SPECIAL_REL_UPDATE; the high 4 bits
(XLR_RMGR_INFO_MASK = 0xF0) are free for the resource manager to encode the
operation type (e.g., XLOG_HEAP_INSERT, XLOG_BTREE_VACUUM).
Each block reference describes a page that was modified:
/* src/include/access/xlogrecord.h:103 */
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID (0..XLR_MAX_BLOCK_ID) */
uint8 fork_flags; /* fork + flags (HAS_IMAGE, HAS_DATA, etc.) */
uint16 data_length; /* rmgr payload bytes for this block */
/* Optional: XLogRecordBlockImageHeader (if BKPBLOCK_HAS_IMAGE) */
/* Optional: RelFileLocator (unless BKPBLOCK_SAME_REL) */
/* Always: BlockNumber */
} XLogRecordBlockHeader;
When BKPBLOCK_HAS_IMAGE is set, a full-page image (FPI) follows. The image
can be compressed (pglz, LZ4, or zstd) and the “hole” of zeros in the middle of
standard pages is removed to save space:
/* src/include/access/xlogrecord.h:141 */
typedef struct XLogRecordBlockImageHeader
{
uint16 length; /* page image bytes (after hole removal) */
uint16 hole_offset; /* bytes before the "hole" */
uint8 bimg_info; /* BKPIMAGE_HAS_HOLE, compression flags */
} XLogRecordBlockImageHeader;
WAL records are not written directly to disk. They are first assembled in a
ring buffer of shared memory pages. The size is controlled by wal_buffers
(auto-tuned to 1/32 of shared_buffers, clamped between 64 KB and the WAL
segment size).
The ring buffer state lives in XLogCtlData (defined in src/backend/access/transam/xlog.c:453):
typedef struct XLogCtlData
{
XLogCtlInsert Insert; /* insertion state */
XLogwrtRqst LogwrtRqst; /* write/flush request pointers */
XLogRecPtr RedoRecPtr; /* cached redo pointer */
/* Atomic progress counters -- no lock needed to read */
pg_atomic_uint64 logInsertResult; /* last byte + 1 inserted */
pg_atomic_uint64 logWriteResult; /* last byte + 1 written to kernel */
pg_atomic_uint64 logFlushResult; /* last byte + 1 fsynced */
XLogRecPtr InitializedUpTo; /* latest initialized buffer page */
char *pages; /* buffer pages array */
pg_atomic_uint64 *xlblocks; /* end-LSN of each buffer page */
int XLogCacheBlck; /* highest buffer index */
/* ... timeline, recovery state, checkpoint info ... */
} XLogCtlData;
The three atomic counters – logInsertResult, logWriteResult, logFlushResult –
track how far the WAL has progressed through each stage. The invariant is:
FlushResult <= WriteResult <= InsertResult
The insertion path is optimized for high concurrency. Here is the sequence:
sequenceDiagram
participant B as Backend
participant XI as xloginsert.c
participant XL as xlog.c
participant BUF as WAL Buffer
participant DSK as pg_wal/ on disk
B->>XI: XLogBeginInsert()
Note over XI: Reset per-record state
B->>XI: XLogRegisterBuffer(block_id, buf, flags)
B->>XI: XLogRegisterData(data, len)
Note over XI: Build registered_buffers[]<br/>and mainrdata chain
B->>XI: XLogInsert(rmid, info)
XI->>XI: Assemble XLogRecData chain<br/>(headers + FPIs + data)
XI->>XL: XLogInsertRecord(rdata, fpw_lsn, flags)
XL->>XL: WALInsertLockAcquire()
Note over XL: Pick lock via MyLockNo<br/>(rotated per insertion)
XL->>XL: ReserveXLogInsertLocation()<br/>advance CurrBytePos
XL->>BUF: CopyXLogRecordToWAL()<br/>copy record bytes into buffer pages
XL->>XL: WALInsertLockRelease()
XL-->>B: Return LSN (end position)
Note over B: Later, at commit time:
B->>XL: XLogFlush(commit_lsn)
XL->>DSK: write() + fsync()
PostgreSQL uses NUM_XLOGINSERT_LOCKS (8) lightweight locks to allow concurrent
WAL insertion. Each lock tracks how far its holder has progressed:
/* src/backend/access/transam/xlog.c:370 */
typedef struct
{
LWLock lock;
pg_atomic_uint64 insertingAt; /* LSN being inserted at */
XLogRecPtr lastImportantAt; /* LSN of last important record */
} WALInsertLock;
A backend acquires one lock to insert. To flush WAL, the flusher must wait
for all insertion locks whose insertingAt falls within the flush range –
but it does not need to acquire them. This avoids deadlocks between
concurrent inserters that need to evict old WAL buffer pages.
The lock assignment rotates across calls to distribute contention:
/* src/backend/access/transam/xlog.c (WALInsertLockAcquire) */
MyLockNo++;
if (MyLockNo >= NUM_XLOGINSERT_LOCKS)
MyLockNo = 0;
The actual reservation is a spinlock-protected update of two 64-bit values:
/* src/backend/access/transam/xlog.c (XLogCtlInsert) */
typedef struct XLogCtlInsert
{
slock_t insertpos_lck; /* protects CurrBytePos and PrevBytePos */
uint64 CurrBytePos; /* end of reserved WAL (usable byte pos) */
uint64 PrevBytePos; /* start of previous record */
/* ... */
} XLogCtlInsert;
These are “usable byte positions” that exclude page headers. The function
XLogBytePosToRecPtr() converts them to real LSNs by adding back the page
header space. This makes the reservation arithmetic simpler – it is just a
plain addition of the record length.
The first time a page is modified after a checkpoint, the entire page image is included in the WAL record. This protects against torn pages – partial writes where only some of the 8 KB page makes it to disk during a crash.
The decision is made during XLogInsert():
RedoRecPtr (the current checkpoint’s
redo point, cached in XLogCtlInsert.RedoRecPtr).fullPageWrites is enabled, set BKPBLOCK_HAS_IMAGE and include
the page image.The REGBUF_STANDARD flag tells the insertion code that the page follows the
standard layout (pd_lower/pd_upper), so the “hole” between them can be
skipped to save space. With wal_compression enabled, the image is further
compressed using pglz, LZ4, or zstd.
The wal_level GUC controls how much information is written:
/* src/include/access/xlog.h:73 */
typedef enum WalLevel
{
WAL_LEVEL_MINIMAL = 0, /* crash recovery only */
WAL_LEVEL_REPLICA, /* + standby support (hot standby, streaming) */
WAL_LEVEL_LOGICAL, /* + logical decoding info */
} WalLevel;
At minimal, some bulk operations (e.g., CREATE TABLE AS, COPY into a
just-created table) skip WAL entirely and just fsync the data files. At
replica and above, all changes are logged to support replication.
Writing WAL to disk is a two-step process:
write() system call copies WAL from shared memory to the
kernel page cache. Updates logWriteResult.fsync() (or equivalent per wal_sync_method) forces the
kernel to push data to stable storage. Updates logFlushResult.The wal_sync_method GUC selects the flush method:
/* src/include/access/xlog.h:24 */
enum WalSyncMethod
{
WAL_SYNC_METHOD_FSYNC = 0,
WAL_SYNC_METHOD_FDATASYNC,
WAL_SYNC_METHOD_OPEN, /* O_SYNC */
WAL_SYNC_METHOD_FSYNC_WRITETHROUGH,
WAL_SYNC_METHOD_OPEN_DSYNC /* O_DSYNC */
};
With open_datasync, the file is opened with O_DSYNC so every write() is
implicitly synced, eliminating the separate fsync call.
When synchronous_commit = on (the default), each committing backend calls
XLogFlush(). If multiple backends commit concurrently, they naturally
group commit – the first one to call XLogFlush() will flush WAL that
covers all of them.
The CommitDelay and CommitSiblings GUCs add an intentional delay before
flushing, to increase the batch size when there are enough concurrent
committers.
/* src/include/access/xlog_internal.h:312 */
typedef struct XLogRecData
{
struct XLogRecData *next; /* next chunk in chain, or NULL */
const void *data; /* pointer to data */
uint32 len; /* length of data */
} XLogRecData;
This linked list is built by xloginsert.c during record construction. The
chain includes the XLogRecord header, block headers, FPI data, per-block
data, and main data, in order. XLogInsertRecord() walks this chain to copy
bytes into the WAL buffer.
/* src/include/access/xlog_internal.h:339 */
typedef struct RmgrData
{
const char *rm_name;
void (*rm_redo)(XLogReaderState *record); /* replay a record */
void (*rm_desc)(StringInfo buf, XLogReaderState *record); /* describe */
const char *(*rm_identify)(uint8 info); /* name an operation */
void (*rm_startup)(void); /* startup hook */
void (*rm_cleanup)(void); /* cleanup hook */
void (*rm_mask)(char *pagedata, BlockNumber blkno); /* mask for checking */
void (*rm_decode)(...); /* logical decoding */
} RmgrData;
Each resource manager (heap, btree, gin, etc.) registers these callbacks.
During recovery, the main loop calls rm_redo() for each record.
Checkpoints – checkpoints advance the RedoRecPtr, which
controls when full-page writes are needed and which WAL can be recycled.
Recovery – the recovery loop reads records from the WAL using
XLogReadRecord() and dispatches them to resource managers via rm_redo().
Storage Engine (Ch. 1) – every buffer page carries
pd_lsn (the LSN of the last WAL record that modified it). The buffer manager
ensures WAL is flushed to this LSN before writing the page to disk.
Locking (Ch. 5) – the WAL insertion locks are
lightweight locks. WALWriteLock serializes writes to disk.
Replication (Ch. 12) – the WAL sender reads from
WAL segment files (or directly from WAL buffers via WALReadFromBuffers())
and streams records to standbys.