The buffer manager is PostgreSQL’s page cache, sitting between access methods and the storage manager. It maintains a pool of shared-memory buffers (sized by shared_buffers) and uses a clock-sweep algorithm for eviction. Every page read or written by a backend flows through this layer.
The buffer pool is a fixed-size array of 8 KB slots in shared memory. Each slot has a corresponding BufferDesc descriptor that tracks the page identity, reference count, usage count, dirty status, and I/O state – all packed into a single 64-bit atomic variable for lock-free fast-path operations.
Backends interact with the buffer pool through ReadBuffer() (which returns a pinned buffer) and ReleaseBuffer() (which unpins it). The clock-sweep algorithm approximates LRU replacement by scanning circularly through buffers, decrementing usage counts until it finds a victim with usage count zero and reference count zero.
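The canonical access pattern for a read-only caller looks roughly like this. This is a minimal sketch with error handling omitted; count_tuples_on_block is a hypothetical helper, and rel/blkno are assumed to be an already-opened relation and an existing block:

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static OffsetNumber
count_tuples_on_block(Relation rel, BlockNumber blkno)
{
    Buffer       buf;
    Page         page;
    OffsetNumber maxoff;

    buf = ReadBuffer(rel, blkno);           /* pins the buffer */
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* shared content lock for reading */

    page = BufferGetPage(buf);
    maxoff = PageGetMaxOffsetNumber(page);  /* inspect the page while locked */

    UnlockReleaseBuffer(buf);               /* drop the content lock, then unpin */
    return maxoff;
}
```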
For bulk operations (sequential scans, VACUUM, bulk writes), PostgreSQL uses ring buffers – small, fixed-size buffer pools that prevent large scans from evicting the entire shared buffer cache.
| File | Purpose |
|---|---|
| src/include/storage/bufmgr.h | Public buffer manager API: ReadBuffer(), ReleaseBuffer(), MarkBufferDirty() |
| src/include/storage/buf_internals.h | BufferDesc, BufferTag, state bit definitions, internal functions |
| src/backend/storage/buffer/bufmgr.c | Core buffer manager: ReadBuffer_common(), BufferAlloc(), FlushBuffer() |
| src/backend/storage/buffer/freelist.c | Clock-sweep eviction (StrategyGetBuffer()), ring buffer strategies |
| src/backend/storage/buffer/buf_table.c | Shared hash table mapping BufferTag to buffer ID |
| src/backend/storage/buffer/buf_init.c | Shared memory initialization for buffer descriptors |
| src/backend/storage/buffer/localbuf.c | Local (unshared) buffers for temporary tables |
| src/backend/storage/buffer/README | Detailed explanation of locking rules and access protocols |
Every buffer slot is identified by a BufferTag that uniquely names the on-disk block:
```c
typedef struct buftag
{
    Oid           spcOid;      /* tablespace OID */
    Oid           dbOid;       /* database OID */
    RelFileNumber relNumber;   /* relation file number */
    ForkNumber    forkNum;     /* fork: MAIN_FORKNUM, FSM_FORKNUM, etc. */
    BlockNumber   blockNum;    /* block number within the fork */
} BufferTag;
```
The shared buffer hash table (buf_table.c) maps a BufferTag to a buffer ID. To reduce contention, the table is divided into NUM_BUFFER_PARTITIONS (128) partitions, each protected by its own LWLock.
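A simplified sketch of a mapping lookup, modeled on the logic in BufferAlloc(). The helper name lookup_buffer_id is hypothetical, and the real code goes on to pin the buffer or start a read on a miss:

```c
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/lwlock.h"

static int
lookup_buffer_id(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockNum)
{
    BufferTag   tag;
    uint32      hashcode;
    LWLock     *partitionLock;
    int         buf_id;

    InitBufferTag(&tag, &rlocator, forkNum, blockNum);

    hashcode = BufTableHashCode(&tag);              /* selects one of the 128 partitions */
    partitionLock = BufMappingPartitionLock(hashcode);

    LWLockAcquire(partitionLock, LW_SHARED);        /* shared mode: lookup only */
    buf_id = BufTableLookup(&tag, hashcode);        /* >= 0 if the page is cached */
    LWLockRelease(partitionLock);

    return buf_id;                                  /* -1 means "not in the pool" */
}
```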
Each of the NBuffers slots has a BufferDesc, kept under 64 bytes to fit a single cache line:
```c
typedef struct BufferDesc
{
    BufferTag     tag;                    /* which page is stored here */
    int           buf_id;                 /* index in BufferDescriptors (0-based) */
    pg_atomic_uint64 state;               /* atomic: refcount + usagecount + flags + locks */
    int           wait_backend_pgprocno;  /* backend waiting for sole pin */
    PgAioWaitRef  io_wref;                /* set while async I/O is in progress */
    proclist_head lock_waiters;           /* list of PGPROCs waiting for content lock */
} BufferDesc;
```
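Note that the public Buffer handle used by backends is 1-based, while buf_id is 0-based. A small sketch of the correspondence (pinned_buffer_block is a hypothetical helper; for shared buffers this mirrors what BufferGetBlockNumber() does):

```c
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

static BlockNumber
pinned_buffer_block(Buffer buffer)
{
    /* Buffer handles are 1-based; descriptor indexes are 0-based. */
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);

    Assert(BufferIsValid(buffer));
    return bufHdr->tag.blockNum;    /* the page identity comes from the BufferTag */
}
```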
The 64-bit state field packs multiple sub-fields to allow lock-free atomic operations:
- Bits 0-17 (18 bits): reference count (pin count)
- Bits 18-21 (4 bits): usage count (0..BM_MAX_USAGE_COUNT = 5)
- Bits 22-33 (12 bits): flag bits (BM_DIRTY, BM_VALID, BM_TAG_VALID, BM_IO_IN_PROGRESS, etc.)
- Bits 34-53 (20 bits): content lock state (shared, share-exclusive, and exclusive locks)
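To illustrate how the sub-fields are extracted, here is a standalone decoding sketch following the layout listed above. The mask and shift names are illustrative only; the real definitions live in buf_internals.h:

```c
#include <stdint.h>

/* Illustrative constants matching the bit layout described above. */
#define REFCOUNT_MASK      ((UINT64_C(1) << 18) - 1)            /* bits 0-17  */
#define USAGECOUNT_SHIFT   18                                   /* bits 18-21 */
#define USAGECOUNT_MASK    (UINT64_C(0xF) << USAGECOUNT_SHIFT)
#define FLAGS_SHIFT        22                                   /* bits 22-33 */
#define FLAGS_MASK         (UINT64_C(0xFFF) << FLAGS_SHIFT)

static inline uint32_t
state_get_refcount(uint64_t state)
{
    return (uint32_t) (state & REFCOUNT_MASK);
}

static inline uint32_t
state_get_usagecount(uint64_t state)
{
    return (uint32_t) ((state & USAGECOUNT_MASK) >> USAGECOUNT_SHIFT);
}
```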
Key flag bits:
| Flag | Bit | Meaning |
|---|---|---|
| BM_DIRTY | 1 | Page has been modified since the last write |
| BM_VALID | 2 | Buffer contains a valid page |
| BM_TAG_VALID | 3 | Tag is assigned (buffer is in the hash table) |
| BM_IO_IN_PROGRESS | 4 | Read or write I/O is currently happening |
| BM_CHECKPOINT_NEEDED | 8 | Must be written during the current checkpoint |
| BM_PERMANENT | 9 | Relation is permanent (not unlogged) |
The read path through the buffer manager:

```mermaid
flowchart TD
    A["ReadBuffer(relation, blocknum)"] --> B{Block in hash table?}
    B -->|Yes| C[Pin existing buffer<br/>increment refcount]
    C --> D{BM_VALID set?}
    D -->|Yes| E[Return Buffer handle]
    D -->|No, IO_IN_PROGRESS| F[Wait for I/O to complete]
    F --> E
    B -->|No| G["StrategyGetBuffer()<br/>clock-sweep for victim"]
    G --> H{Victim dirty?}
    H -->|Yes| I["FlushBuffer()<br/>write old page to disk"]
    I --> J[Remove old tag from hash table]
    H -->|No| J
    J --> K[Insert new tag into hash table]
    K --> L["smgrread()<br/>read page from disk"]
    L --> M[Set BM_VALID, set usage_count=1]
    M --> E
```
The clock-sweep algorithm in StrategyGetBuffer() (freelist.c) works as follows:
1. The nextVictimBuffer pointer rotates circularly through the buffer array.
2. If the buffer's refcount > 0: skip it (the buffer is pinned).
3. If its usage_count > 0: decrement the usage count and skip it.
4. If refcount == 0 && usage_count == 0: this buffer is the victim.

BM_MAX_USAGE_COUNT is 5, so a recently-used buffer can survive up to five complete passes of the clock hand before becoming a victim candidate on the sixth. This approximates LRU with O(1) amortized cost per eviction. The tradeoff is that with a large shared_buffers a full rotation of the clock hand can be expensive – but this is rare, because most buffers have low usage counts. A simplified sketch of the loop follows the diagram below.
```mermaid
graph LR
    subgraph "Clock-Sweep (circular scan)"
        direction LR
        B0["buf 0<br/>usage=3"]
        B1["buf 1<br/>usage=0<br/>pinned"]
        B2["buf 2<br/>usage=1"]
        B3["buf 3<br/>usage=0<br/>FREE"]
        B4["buf 4<br/>usage=5"]
        B5["buf 5<br/>usage=2"]
    end
    HAND["nextVictimBuffer"] -.->|"scan"| B2
    B2 -.->|"decrement to 0"| B3
    B3 -.->|"VICTIM"| EVICT["Evict & reuse"]
```
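The loop below is a highly simplified, self-contained sketch of that scan, using plain C and hypothetical types. The real StrategyGetBuffer() manipulates the packed state word with atomic compare-and-swap and also honors ring buffer strategies:

```c
/* Illustrative stand-in for the per-buffer state read out of BufferDesc. */
typedef struct
{
    int refcount;       /* pin count */
    int usage_count;    /* 0..5 */
} FakeBufferState;

static int
clock_sweep(FakeBufferState *buffers, int nbuffers, int *next_victim)
{
    for (;;)
    {
        FakeBufferState *buf = &buffers[*next_victim];
        int              this_buf = *next_victim;

        *next_victim = (*next_victim + 1) % nbuffers;   /* advance the clock hand */

        if (buf->refcount > 0)
            continue;                                   /* pinned: skip */

        if (buf->usage_count > 0)
        {
            buf->usage_count--;                         /* give it another chance */
            continue;
        }

        return this_buf;                                /* unpinned and unused: victim */
    }
}
```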
Large sequential operations can push the entire working set out of the buffer cache. To prevent this, PostgreSQL uses ring buffers – a private, fixed-size ring of buffer slots that the operation cycles through:
| Strategy | Ring Size | Used By |
|---|---|---|
| BAS_BULKREAD | 256 KB (32 buffers) | Sequential scans of large tables |
| BAS_BULKWRITE | 16 MB (2048 buffers) | COPY, CREATE TABLE AS, bulk inserts |
| BAS_VACUUM | 256 KB (32 buffers) | VACUUM |
When a ring buffer strategy is active, StrategyGetBuffer() first tries to reuse a buffer from the ring. If that buffer is pinned, or its usage count shows it has been touched by someone else since the ring last used it, the ring slot is skipped and the search falls back to the global clock-sweep. StrategyRejectBuffer() adds a further check: if reusing a ring buffer would force a WAL flush (a dirty buffer during a bulk read), the eviction is rejected and another buffer is chosen instead.
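From the caller's side, opting into a ring is a matter of passing a strategy object to ReadBufferExtended(). A minimal sketch (scan_with_bulkread_ring is a hypothetical helper; real sequential scans choose a BAS_BULKREAD strategy only for sufficiently large tables):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
scan_with_bulkread_ring(Relation rel, BlockNumber nblocks)
{
    BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL, strategy);

        /* ... inspect the page under a shared content lock ... */

        ReleaseBuffer(buf);
    }

    FreeAccessStrategy(strategy);
}
```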
Access to the data within a buffer is controlled by content locks, implemented directly in the BufferDesc.state field (not as separate LWLocks).
Content locks are separate from and orthogonal to buffer pins. A pin prevents the buffer from being evicted; a content lock controls concurrent access to the page data. The rule is: pin first, then lock.
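The non-blocking lock variant makes that ordering explicit: pin first, then try the content lock, and back out by unpinning if it cannot be taken. A minimal sketch (try_exclusive_access is a hypothetical helper):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static bool
try_exclusive_access(Relation rel, BlockNumber blkno)
{
    Buffer buf = ReadBuffer(rel, blkno);    /* pin: the page cannot be evicted */

    if (!ConditionalLockBuffer(buf))        /* try the exclusive content lock */
    {
        ReleaseBuffer(buf);                 /* contended: unpin and give up */
        return false;
    }

    /* ... the page can now be read or modified safely ... */

    UnlockReleaseBuffer(buf);               /* unlock, then unpin */
    return true;
}
```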
A pinned buffer cannot be evicted, but it can be written to disk by the background writer or checkpointer. Before flushing a dirty buffer, the system checks pd_lsn on the page and ensures WAL has been flushed at least that far. This enforces the fundamental WAL rule: log before data.
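A condensed sketch of that ordering check, loosely modeled on FlushBuffer(). flush_one_page is a hypothetical helper; the real function also handles checksums, I/O state flags, and statistics:

```c
#include "postgres.h"
#include "access/xlog.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

static void
flush_one_page(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
               Page page, bool is_permanent)
{
    /* Enforce WAL-before-data: WAL must be durable up to the page's pd_lsn. */
    if (is_permanent)
        XLogFlush(PageGetLSN(page));

    /* Only now is it safe to hand the page to the storage manager. */
    smgrwrite(reln, forkNum, blkno, page, false);
}
```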
The sequence for modifying a page:
1. ReadBuffer() – pin the buffer.
2. LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE) – take an exclusive content lock.
3. START_CRIT_SECTION() – no errors allowed from here.
4. MarkBufferDirty() – set BM_DIRTY.
5. XLogInsert() – write the WAL record, then stamp the returned LSN into the page header with PageSetLSN(), which updates pd_lsn.
6. END_CRIT_SECTION().
7. LockBuffer(buf, BUFFER_LOCK_UNLOCK) – drop the content lock – and eventually ReleaseBuffer() to unpin.

Temporary tables (created with CREATE TEMP TABLE) use local buffers that are private to the backend process. These bypass all shared-memory locking and are not WAL-logged. Local buffers are managed in localbuf.c and use a simpler allocation strategy since there is no concurrency.
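To make the modification sequence above concrete, here is a condensed sketch loosely following the pattern used by heap and index access methods. The page change itself and the WAL resource-manager ID and record type (rmid, info) are placeholders, and modify_page_wal_logged is a hypothetical helper:

```c
#include "postgres.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static void
modify_page_wal_logged(Relation rel, BlockNumber blkno, RmgrId rmid, uint8 info)
{
    Buffer      buf;
    Page        page;
    XLogRecPtr  recptr;

    buf = ReadBuffer(rel, blkno);               /* 1. pin */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);     /* 2. exclusive content lock */

    START_CRIT_SECTION();                       /* 3. errors beyond this point PANIC */

    page = BufferGetPage(buf);
    /* ... apply the actual change to the page here ... */

    MarkBufferDirty(buf);                       /* 4. set BM_DIRTY */

    XLogBeginInsert();                          /* 5. build and write the WAL record */
    XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
    recptr = XLogInsert(rmid, info);
    PageSetLSN(page, recptr);                   /*    stamp pd_lsn with the record's LSN */

    END_CRIT_SECTION();                         /* 6. */

    UnlockReleaseBuffer(buf);                   /* 7. drop the lock, then unpin */
}
```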
| Struct | Header | Key Fields |
|---|---|---|
| BufferTag | buf_internals.h | spcOid, dbOid, relNumber, forkNum, blockNum |
| BufferDesc | buf_internals.h | tag, buf_id, state (atomic: refcount + usagecount + flags + locks) |
| BufferDescPadded | buf_internals.h | Union padded to 64 bytes for cache-line alignment |
| WritebackContext | buf_internals.h | Coalesces pending OS writeback requests |
- Page layout: the buffer manager reads pd_lsn for WAL ordering but is otherwise page-format-agnostic.
- Storage manager: page reads go through smgrread() / smgrstartreadv(). On eviction of dirty buffers, it calls smgrwrite().
- WAL: MarkBufferDirty() + XLogInsert() cooperate to ensure crash safety; a dirty buffer is written to disk only after WAL up to its pd_lsn has been flushed, and the background writer and checkpointer do most of that writing.
- Asynchronous I/O: StartReadBuffers() provides batched, asynchronous page reads, used by the read stream layer.
- Checkpointing: the checkpointer finds buffers with BM_CHECKPOINT_NEEDED set and flushes them, advancing the redo point.