The buffer manager is PostgreSQL’s page cache, sitting between access methods and the storage manager. It maintains a pool of shared-memory buffers (sized by shared_buffers) and uses a clock-sweep algorithm for eviction. Every page read or written by a backend flows through this layer.
The buffer pool is a fixed-size array of 8 KB slots in shared memory. Each slot has a corresponding BufferDesc descriptor that tracks the page identity, reference count, usage count, dirty status, and I/O state – all packed into a single 64-bit atomic variable for lock-free fast-path operations.
Backends interact with the buffer pool through ReadBuffer() (which returns a pinned buffer) and ReleaseBuffer() (which unpins it). The clock-sweep algorithm approximates LRU replacement by scanning circularly through buffers, decrementing usage counts until it finds a victim with usage count zero and reference count zero.
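The canonical access pattern for a read-only caller looks roughly like this. This is a minimal sketch with error handling omitted; count_tuples_on_block is a hypothetical helper, and rel/blkno are assumed to be an already-opened relation and an existing block:

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static OffsetNumber
count_tuples_on_block(Relation rel, BlockNumber blkno)
{
    Buffer       buf;
    Page         page;
    OffsetNumber maxoff;

    buf = ReadBuffer(rel, blkno);           /* pins the buffer */
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* shared content lock for reading */

    page = BufferGetPage(buf);
    maxoff = PageGetMaxOffsetNumber(page);  /* inspect the page while locked */

    UnlockReleaseBuffer(buf);               /* drop the content lock, then unpin */
    return maxoff;
}
```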
For bulk operations (sequential scans, VACUUM, bulk writes), PostgreSQL uses ring buffers – small, fixed-size buffer pools that prevent large scans from evicting the entire shared buffer cache.
| File | Purpose |
|---|---|
| src/include/storage/bufmgr.h | Public buffer manager API: ReadBuffer(), ReleaseBuffer(), MarkBufferDirty() |
| src/include/storage/buf_internals.h | BufferDesc, BufferTag, state bit definitions, internal functions |
| src/backend/storage/buffer/bufmgr.c | Core buffer manager: ReadBuffer_common(), BufferAlloc(), FlushBuffer() |
| src/backend/storage/buffer/freelist.c | Clock-sweep eviction (StrategyGetBuffer()), ring buffer strategies |
| src/backend/storage/buffer/buf_table.c | Shared hash table mapping BufferTag to buffer ID |
| src/backend/storage/buffer/buf_init.c | Shared memory initialization for buffer descriptors |
| src/backend/storage/buffer/localbuf.c | Local (unshared) buffers for temporary tables |
| src/backend/storage/buffer/README | Detailed explanation of locking rules and access protocols |
Every buffer slot is identified by a BufferTag that uniquely names the on-disk block:
```c
typedef struct buftag
{
    Oid           spcOid;      /* tablespace OID */
    Oid           dbOid;       /* database OID */
    RelFileNumber relNumber;   /* relation file number */
    ForkNumber    forkNum;     /* fork: MAIN_FORKNUM, FSM_FORKNUM, etc. */
    BlockNumber   blockNum;    /* block number within the fork */
} BufferTag;
```
The shared buffer hash table (buf_table.c) maps a BufferTag to a buffer ID. To reduce contention, the table is divided into NUM_BUFFER_PARTITIONS (128) partitions, each protected by its own LWLock.
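A simplified sketch of a mapping lookup, modeled on the logic in BufferAlloc(). The helper name lookup_buffer_id is hypothetical, and the real code goes on to pin the buffer or start a read on a miss:

```c
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/lwlock.h"

static int
lookup_buffer_id(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockNum)
{
    BufferTag   tag;
    uint32      hashcode;
    LWLock     *partitionLock;
    int         buf_id;

    InitBufferTag(&tag, &rlocator, forkNum, blockNum);

    hashcode = BufTableHashCode(&tag);              /* selects one of the 128 partitions */
    partitionLock = BufMappingPartitionLock(hashcode);

    LWLockAcquire(partitionLock, LW_SHARED);        /* shared mode: lookup only */
    buf_id = BufTableLookup(&tag, hashcode);        /* >= 0 if the page is cached */
    LWLockRelease(partitionLock);

    return buf_id;                                  /* -1 means "not in the pool" */
}
```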
Each of the NBuffers slots has a BufferDesc, kept under 64 bytes to fit a single cache line:
```c
typedef struct BufferDesc
{
    BufferTag     tag;                    /* which page is stored here */
    int           buf_id;                 /* index in BufferDescriptors (0-based) */
    pg_atomic_uint64 state;               /* atomic: refcount + usagecount + flags + locks */
    int           wait_backend_pgprocno;  /* backend waiting for sole pin */
    PgAioWaitRef  io_wref;                /* set while async I/O is in progress */
    proclist_head lock_waiters;           /* list of PGPROCs waiting for content lock */
} BufferDesc;
```
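Note that the public Buffer handle used by backends is 1-based, while buf_id is 0-based. A small sketch of the correspondence (pinned_buffer_block is a hypothetical helper; for shared buffers this mirrors what BufferGetBlockNumber() does):

```c
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

static BlockNumber
pinned_buffer_block(Buffer buffer)
{
    /* Buffer handles are 1-based; descriptor indexes are 0-based. */
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);

    Assert(BufferIsValid(buffer));
    return bufHdr->tag.blockNum;    /* the page identity comes from the BufferTag */
}
```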
The 64-bit state field packs multiple sub-fields to allow lock-free atomic operations:
- Bits 0-17 (18 bits): reference count (pin count)
- Bits 18-21 (4 bits): usage count (0..BM_MAX_USAGE_COUNT = 5)
- Bits 22-33 (12 bits): flag bits (BM_DIRTY, BM_VALID, BM_TAG_VALID, BM_IO_IN_PROGRESS, etc.)
- Bits 34-53 (20 bits): content lock state (shared, share-exclusive, and exclusive locks)
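To illustrate how the sub-fields are extracted, here is a standalone decoding sketch following the layout listed above. The mask and shift names are illustrative only; the real definitions live in buf_internals.h:

```c
#include <stdint.h>

/* Illustrative constants matching the bit layout described above. */
#define REFCOUNT_MASK      ((UINT64_C(1) << 18) - 1)            /* bits 0-17  */
#define USAGECOUNT_SHIFT   18                                   /* bits 18-21 */
#define USAGECOUNT_MASK    (UINT64_C(0xF) << USAGECOUNT_SHIFT)
#define FLAGS_SHIFT        22                                   /* bits 22-33 */
#define FLAGS_MASK         (UINT64_C(0xFFF) << FLAGS_SHIFT)

static inline uint32_t
state_get_refcount(uint64_t state)
{
    return (uint32_t) (state & REFCOUNT_MASK);
}

static inline uint32_t
state_get_usagecount(uint64_t state)
{
    return (uint32_t) ((state & USAGECOUNT_MASK) >> USAGECOUNT_SHIFT);
}
```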
Key flag bits:
| Flag | Bit | Meaning |
|---|---|---|
| BM_DIRTY | 1 | Page has been modified since the last write |
| BM_VALID | 2 | Buffer contains a valid page |
| BM_TAG_VALID | 3 | Tag is assigned (buffer is in the hash table) |
| BM_IO_IN_PROGRESS | 4 | Read or write I/O is currently happening |
| BM_CHECKPOINT_NEEDED | 8 | Must be written during the current checkpoint |
| BM_PERMANENT | 9 | Relation is permanent (not unlogged) |
The read path through the buffer manager:

```mermaid
flowchart TD
    A["ReadBuffer(relation, blocknum)"] --> B{Block in hash table?}
    B -->|Yes| C[Pin existing buffer<br/>increment refcount]
    C --> D{BM_VALID set?}
    D -->|Yes| E[Return Buffer handle]
    D -->|No, IO_IN_PROGRESS| F[Wait for I/O to complete]
    F --> E
    B -->|No| G["StrategyGetBuffer()<br/>clock-sweep for victim"]
    G --> H{Victim dirty?}
    H -->|Yes| I["FlushBuffer()<br/>write old page to disk"]
    I --> J[Remove old tag from hash table]
    H -->|No| J
    J --> K[Insert new tag into hash table]
    K --> L["smgrread()<br/>read page from disk"]
    L --> M[Set BM_VALID, set usage_count=1]
    M --> E
```
The clock-sweep algorithm in StrategyGetBuffer() (freelist.c) works as follows:
1. The nextVictimBuffer pointer rotates circularly through the buffer array.
2. If the buffer's refcount > 0: skip it (the buffer is pinned).
3. If its usage_count > 0: decrement the usage count and skip it.
4. If refcount == 0 && usage_count == 0: this buffer is the victim.

BM_MAX_USAGE_COUNT is 5, so a recently-used buffer can survive up to five complete passes of the clock hand before becoming a victim candidate on the sixth. This approximates LRU with O(1) amortized cost per eviction. The tradeoff is that with a large shared_buffers a full rotation of the clock hand can be expensive – but this is rare, because most buffers have low usage counts. A simplified sketch of the loop follows the diagram below.
```mermaid
graph LR
    subgraph "Clock-Sweep (circular scan)"
        direction LR
        B0["buf 0<br/>usage=3"]
        B1["buf 1<br/>usage=0<br/>pinned"]
        B2["buf 2<br/>usage=1"]
        B3["buf 3<br/>usage=0<br/>FREE"]
        B4["buf 4<br/>usage=5"]
        B5["buf 5<br/>usage=2"]
    end
    HAND["nextVictimBuffer"] -.->|"scan"| B2
    B2 -.->|"decrement to 0"| B3
    B3 -.->|"VICTIM"| EVICT["Evict & reuse"]
```
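The loop below is a highly simplified, self-contained sketch of that scan, using plain C and hypothetical types. The real StrategyGetBuffer() manipulates the packed state word with atomic compare-and-swap and also honors ring buffer strategies:

```c
/* Illustrative stand-in for the per-buffer state read out of BufferDesc. */
typedef struct
{
    int refcount;       /* pin count */
    int usage_count;    /* 0..5 */
} FakeBufferState;

static int
clock_sweep(FakeBufferState *buffers, int nbuffers, int *next_victim)
{
    for (;;)
    {
        FakeBufferState *buf = &buffers[*next_victim];
        int              this_buf = *next_victim;

        *next_victim = (*next_victim + 1) % nbuffers;   /* advance the clock hand */

        if (buf->refcount > 0)
            continue;                                   /* pinned: skip */

        if (buf->usage_count > 0)
        {
            buf->usage_count--;                         /* give it another chance */
            continue;
        }

        return this_buf;                                /* unpinned and unused: victim */
    }
}
```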
Large sequential operations can push the entire working set out of the buffer cache. To prevent this, PostgreSQL uses ring buffers – a private, fixed-size ring of buffer slots that the operation cycles through:
| Strategy | Ring Size | Used By |
|---|---|---|
| BAS_BULKREAD | 256 KB (32 buffers) | Sequential scans of large tables |
| BAS_BULKWRITE | 16 MB (2048 buffers) | COPY, CREATE TABLE AS, bulk inserts |
| BAS_VACUUM | 256 KB (32 buffers) | VACUUM |
When a ring buffer strategy is active, StrategyGetBuffer() first tries to reuse a buffer from the ring. If that buffer is pinned, or its usage count shows it has been touched by someone else since the ring last used it, the ring slot is skipped and the search falls back to the global clock-sweep. StrategyRejectBuffer() adds a further check: if reusing a ring buffer would force a WAL flush (a dirty buffer during a bulk read), the eviction is rejected and another buffer is chosen instead.
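From the caller's side, opting into a ring is a matter of passing a strategy object to ReadBufferExtended(). A minimal sketch (scan_with_bulkread_ring is a hypothetical helper; real sequential scans choose a BAS_BULKREAD strategy only for sufficiently large tables):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
scan_with_bulkread_ring(Relation rel, BlockNumber nblocks)
{
    BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL, strategy);

        /* ... inspect the page under a shared content lock ... */

        ReleaseBuffer(buf);
    }

    FreeAccessStrategy(strategy);
}
```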
Access to the data within a buffer is controlled by content locks, implemented directly in the BufferDesc.state field (not as separate LWLocks).
Content locks are separate from and orthogonal to buffer pins. A pin prevents the buffer from being evicted; a content lock controls concurrent access to the page data. The rule is: pin first, then lock.
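The non-blocking lock variant makes that ordering explicit: pin first, then try the content lock, and back out by unpinning if it cannot be taken. A minimal sketch (try_exclusive_access is a hypothetical helper):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static bool
try_exclusive_access(Relation rel, BlockNumber blkno)
{
    Buffer buf = ReadBuffer(rel, blkno);    /* pin: the page cannot be evicted */

    if (!ConditionalLockBuffer(buf))        /* try the exclusive content lock */
    {
        ReleaseBuffer(buf);                 /* contended: unpin and give up */
        return false;
    }

    /* ... the page can now be read or modified safely ... */

    UnlockReleaseBuffer(buf);               /* unlock, then unpin */
    return true;
}
```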
A pinned buffer cannot be evicted, but it can be written to disk by the background writer or checkpointer. Before flushing a dirty buffer, the system checks pd_lsn on the page and ensures WAL has been flushed at least that far. This enforces the fundamental WAL rule: log before data.
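A condensed sketch of that ordering check, loosely modeled on FlushBuffer(). flush_one_page is a hypothetical helper; the real function also handles checksums, I/O state flags, and statistics:

```c
#include "postgres.h"
#include "access/xlog.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

static void
flush_one_page(SMgrRelation reln, ForkNumber forkNum, BlockNumber blkno,
               Page page, bool is_permanent)
{
    /* Enforce WAL-before-data: WAL must be durable up to the page's pd_lsn. */
    if (is_permanent)
        XLogFlush(PageGetLSN(page));

    /* Only now is it safe to hand the page to the storage manager. */
    smgrwrite(reln, forkNum, blkno, page, false);
}
```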
The sequence for modifying a page:
1. ReadBuffer() – pin the buffer.
2. LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE) – take an exclusive content lock.
3. START_CRIT_SECTION() – no errors allowed from here.
4. MarkBufferDirty() – set BM_DIRTY.
5. XLogInsert() – write the WAL record, then stamp the returned LSN into the page header with PageSetLSN(), which updates pd_lsn.
6. END_CRIT_SECTION().
7. LockBuffer(buf, BUFFER_LOCK_UNLOCK) – drop the content lock – and eventually ReleaseBuffer() to unpin.

Temporary tables (created with CREATE TEMP TABLE) use local buffers that are private to the backend process. These bypass all shared-memory locking and are not WAL-logged. Local buffers are managed in localbuf.c and use a simpler allocation strategy since there is no concurrency.
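To make the modification sequence above concrete, here is a condensed sketch loosely following the pattern used by heap and index access methods. The page change itself and the WAL resource-manager ID and record type (rmid, info) are placeholders, and modify_page_wal_logged is a hypothetical helper:

```c
#include "postgres.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

static void
modify_page_wal_logged(Relation rel, BlockNumber blkno, RmgrId rmid, uint8 info)
{
    Buffer      buf;
    Page        page;
    XLogRecPtr  recptr;

    buf = ReadBuffer(rel, blkno);               /* 1. pin */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);     /* 2. exclusive content lock */

    START_CRIT_SECTION();                       /* 3. errors beyond this point PANIC */

    page = BufferGetPage(buf);
    /* ... apply the actual change to the page here ... */

    MarkBufferDirty(buf);                       /* 4. set BM_DIRTY */

    XLogBeginInsert();                          /* 5. build and write the WAL record */
    XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
    recptr = XLogInsert(rmid, info);
    PageSetLSN(page, recptr);                   /*    stamp pd_lsn with the record's LSN */

    END_CRIT_SECTION();                         /* 6. */

    UnlockReleaseBuffer(buf);                   /* 7. drop the lock, then unpin */
}
```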
| Struct | Header | Key Fields |
|---|---|---|
| BufferTag | buf_internals.h | spcOid, dbOid, relNumber, forkNum, blockNum |
| BufferDesc | buf_internals.h | tag, buf_id, state (atomic: refcount + usagecount + flags + locks) |
| BufferDescPadded | buf_internals.h | Union padded to 64 bytes for cache-line alignment |
| WritebackContext | buf_internals.h | Coalesces pending OS writeback requests |
- Page layout: the buffer manager reads pd_lsn for WAL ordering but is otherwise page-format-agnostic.
- Storage manager: page reads go through smgrread() / smgrstartreadv(). On eviction of dirty buffers, it calls smgrwrite().
- WAL: MarkBufferDirty() + XLogInsert() cooperate to ensure crash safety; a dirty buffer is written to disk only after WAL up to its pd_lsn has been flushed, and the background writer and checkpointer do most of that writing.
- Asynchronous I/O: StartReadBuffers() provides batched, asynchronous page reads, used by the read stream layer.
- Checkpointing: the checkpointer finds buffers with BM_CHECKPOINT_NEEDED set and flushes them, advancing the redo point.