PostgreSQL’s storage engine translates logical tables and indexes into physical bytes on disk, managing the full lifecycle of data from write to read through a layered architecture of pages, buffers, and file abstractions.
Every relation in PostgreSQL – whether a heap table, B-tree index, or TOAST table – is physically stored as a collection of 8 KB pages (the default block size) in one or more forks (main data, free space map, visibility map, init). The storage manager (smgr) provides a uniform interface for reading and writing these pages, while the magnetic disk (md) layer beneath it splits large relations into 1 GB segment files (also a compile-time default).
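The following sketch shows how the md layer locates a block. It assumes the default 8 KB block size and 1 GB segments (131072 blocks per segment). The segment number is the block number divided by the blocks per segment, and the byte offset is the remainder times the block size. File names consist of the relfilenode, an optional fork suffix (_fsm, _vm, _init), and a .N suffix for every segment after the first. The helper names are hypothetical; md.c does the equivalent arithmetic internally.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Assumed defaults: 8 KB blocks and 1 GB segments (131072 blocks each). */
#define SKETCH_BLCKSZ      8192u
#define SKETCH_RELSEG_SIZE 131072u

/* Fork order matches PostgreSQL's ForkNumber enum; suffixes follow relpath naming. */
typedef enum { SKETCH_MAIN = 0, SKETCH_FSM, SKETCH_VM, SKETCH_INIT } SketchFork;
static const char *const sketch_fork_suffix[] = { "", "_fsm", "_vm", "_init" };

/* Which segment file holds blkno, and at what byte offset inside it. */
static void block_location(uint32_t blkno, uint32_t *segno, uint64_t *offset)
{
    *segno = blkno / SKETCH_RELSEG_SIZE;
    *offset = (uint64_t) SKETCH_BLCKSZ * (blkno % SKETCH_RELSEG_SIZE);
}

/* Write the on-disk file name for (relfilenode, fork, block) into buf.
   Returns snprintf's result: the length the full name needs. */
static int segment_filename(char *buf, size_t len, uint32_t relfilenode,
                            SketchFork fork, uint32_t blkno)
{
    uint32_t segno;
    uint64_t offset;

    assert((unsigned) fork <= SKETCH_INIT);
    block_location(blkno, &segno, &offset);
    (void) offset;                 /* only the segment number matters for the name */
    if (segno == 0)                /* the first segment carries no ".N" suffix */
        return snprintf(buf, len, "%" PRIu32 "%s", relfilenode,
                        sketch_fork_suffix[fork]);
    return snprintf(buf, len, "%" PRIu32 "%s.%" PRIu32, relfilenode,
                    sketch_fork_suffix[fork], segno);
}
```

For example, block 262149 of the FSM fork of relfilenode 16384 falls in segment 2, so it lives in the file 16384_fsm.2.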
Between the on-disk pages and the executor sits the shared buffer pool, a fixed-size cache in shared memory sized by shared_buffers. The buffer manager uses a clock-sweep eviction algorithm to decide which pages stay resident. Every page modification flows through the buffer pool, and the WAL (write-ahead log) protocol ensures durability: a dirty buffer cannot be written to disk until WAL has been flushed at least up to the LSN recorded in the page header.
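The following minimal sketch illustrates the clock-sweep idea. It assumes a simplified descriptor that holds only a pin count and a usage count. The real BufferDesc packs these into an atomic state word, and the sweep runs concurrently across backends. The cap of 5 mirrors PostgreSQL's BM_MAX_USAGE_COUNT. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_USAGE_COUNT 5u   /* mirrors BM_MAX_USAGE_COUNT */

typedef struct {
    uint32_t refcount;      /* pins held by backends; pinned buffers are never evicted */
    uint32_t usage_count;   /* raised on access, lowered each time the hand passes */
} SketchBufferDesc;

typedef struct {
    SketchBufferDesc *bufs;
    uint32_t nbuffers;
    uint32_t next_victim;   /* the clock hand */
} SketchClock;

/* Record an access, saturating at the cap. */
static void clock_touch(SketchClock *c, uint32_t id)
{
    if (c->bufs[id].usage_count < SKETCH_MAX_USAGE_COUNT)
        c->bufs[id].usage_count++;
}

/* Return the id of an unpinned buffer whose usage_count has reached 0,
   or -1 if none is found within the sweep budget. */
static int clock_sweep(SketchClock *c)
{
    /* (cap + 1) full laps are enough for any unpinned buffer to reach 0. */
    uint64_t budget = (uint64_t) c->nbuffers * (SKETCH_MAX_USAGE_COUNT + 1);

    assert(c->nbuffers > 0);
    for (uint64_t step = 0; step < budget; step++) {
        uint32_t id = c->next_victim;
        SketchBufferDesc *b = &c->bufs[id];

        c->next_victim = (id + 1) % c->nbuffers;
        if (b->refcount > 0)
            continue;              /* pinned: skip it */
        if (b->usage_count == 0)
            return (int) id;       /* cold and unpinned: victim found */
        b->usage_count--;          /* recently used: give it another lap */
    }
    return -1;                     /* everything stayed pinned */
}
```

Decrementing the count instead of clearing it lets a frequently touched page survive several passes of the hand. This approximates LRU without maintaining an ordered list.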
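Two auxiliary per-relation structures accelerate common operations:

- Free Space Map (FSM): records the approximate free space of each page in a binary tree, so inserts can find a page with room without scanning the relation.
- Visibility Map (VM): keeps two bits per heap page (all-visible and all-frozen), letting index-only scans skip heap fetches and VACUUM skip pages that need no work.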
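Visibility map addressing is simple arithmetic. This sketch assumes 8 KB pages and a 24-byte page header. Under those assumptions, each visibility map page covers (8192 - 24) × 4 = 32672 heap pages, because each byte holds four 2-bit groups. The code maps a heap block number to a VM page, byte, and bit offset. It loosely follows the addressing macros in visibilitymap.c, but the names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ            8192u
#define SKETCH_PAGE_HEADER_SIZE  24u   /* assumed MAXALIGN(SizeOfPageHeaderData) */
#define SKETCH_MAPSIZE           (SKETCH_BLCKSZ - SKETCH_PAGE_HEADER_SIZE)
#define SKETCH_BITS_PER_HEAPBLK  2u
#define SKETCH_HEAPBLKS_PER_BYTE 4u
#define SKETCH_HEAPBLKS_PER_PAGE (SKETCH_MAPSIZE * SKETCH_HEAPBLKS_PER_BYTE)

#define SKETCH_VM_ALL_VISIBLE 0x01u
#define SKETCH_VM_ALL_FROZEN  0x02u

typedef struct {
    uint32_t map_block;   /* which VM page */
    uint32_t map_byte;    /* byte within that page's data area */
    uint32_t bit_offset;  /* shift of the 2-bit group within the byte */
} VmAddr;

static VmAddr vm_address(uint32_t heap_blk)
{
    VmAddr a;

    a.map_block = heap_blk / SKETCH_HEAPBLKS_PER_PAGE;
    a.map_byte = (heap_blk % SKETCH_HEAPBLKS_PER_PAGE) / SKETCH_HEAPBLKS_PER_BYTE;
    a.bit_offset = (heap_blk % SKETCH_HEAPBLKS_PER_BYTE) * SKETCH_BITS_PER_HEAPBLK;
    return a;
}

/* map_data points at the data area of the VM page named by vm_address(). */
static void vm_set(uint8_t *map_data, uint32_t heap_blk, uint8_t flags)
{
    VmAddr a = vm_address(heap_blk);

    assert((flags & ~0x03u) == 0);   /* only the two defined bits */
    map_data[a.map_byte] |= (uint8_t) (flags << a.bit_offset);
}

static uint8_t vm_get(const uint8_t *map_data, uint32_t heap_blk)
{
    VmAddr a = vm_address(heap_blk);

    return (uint8_t) ((map_data[a.map_byte] >> a.bit_offset) & 0x03u);
}
```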
PostgreSQL 18 introduces an asynchronous I/O (AIO) framework with pluggable methods selected by io_method (synchronous fallback, worker processes, and io_uring on Linux). The read stream abstraction, first added in PostgreSQL 17, builds on AIO to provide look-ahead prefetching for scans that know in advance which blocks they will need.
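Read streams can also merge runs of consecutive block numbers into a single larger read, with the upper bound set by the io_combine_limit setting. The sketch below models only that combining step over a precomputed look-ahead queue. The real read_stream.c also adapts its look-ahead distance and issues the reads through AIO. SketchRead and combine_reads are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start;     /* first block of the combined read */
    uint32_t nblocks;   /* number of consecutive blocks */
} SketchRead;

/* Group a queue of block numbers into reads of consecutive blocks, each at
   most `limit` blocks long. `out` must have room for n entries.
   Returns the number of reads produced. */
static size_t combine_reads(const uint32_t *blocks, size_t n, uint32_t limit,
                            SketchRead *out)
{
    size_t nreads = 0;

    assert(limit > 0);
    for (size_t i = 0; i < n; i++) {
        if (nreads > 0) {
            SketchRead *cur = &out[nreads - 1];

            /* Extend the pending read if this block directly follows it. */
            if (blocks[i] == cur->start + cur->nblocks && cur->nblocks < limit) {
                cur->nblocks++;
                continue;
            }
        }
        out[nreads].start = blocks[i];   /* otherwise start a new read */
        out[nreads].nblocks = 1;
        nreads++;
    }
    return nreads;
}
```

With a limit of 3, the queue 10, 11, 12, 13, 20, 21, 40 becomes four reads: 10 to 12, 13 alone, 20 to 21, and 40 alone.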
```mermaid
graph TD
    subgraph "Backend Process"
        EX[Executor] --> AM[Access Method<br/>heapam / indexam]
        AM --> BM[Buffer Manager]
    end
    subgraph "Shared Memory"
        BM --> BP["Shared Buffer Pool<br/>(shared_buffers)"]
        BP --> BT[Buffer Hash Table]
        BP --> BD[BufferDesc Array]
        BD --> CS[Clock-Sweep Freelist]
    end
    subgraph "Storage Manager Layer"
        BM --> SMGR[smgr.c<br/>SMgrRelationData]
        SMGR --> MD["md.c<br/>Magnetic Disk"]
    end
    subgraph "Relation Forks on Disk"
        MD --> MAIN["Main Fork<br/>(heap/index pages)"]
        MD --> FSM["FSM Fork<br/>(free space map)"]
        MD --> VM["VM Fork<br/>(visibility map)"]
        MD --> INIT["Init Fork<br/>(unlogged rels)"]
    end
    subgraph "Async I/O Layer"
        BM --> AIO[AIO Subsystem]
        AIO --> RS[Read Stream]
        AIO --> SYNC[method_sync.c]
        AIO --> WORKER[method_worker.c]
        AIO --> URING[method_io_uring.c]
    end
    style BP fill:#e1f5fe
    style AIO fill:#fff3e0
```
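The Buffer Hash Table in the diagram maps a buffer tag, the (tablespace, database, relfilenode, fork, block) tuple, to an index in the BufferDesc array. The sketch below is a single-threaded stand-in that uses linear probing and an FNV-style hash. PostgreSQL itself uses a partitioned, lock-protected dynahash table. On a miss it selects a victim through clock-sweep, not through the hypothetical next_free_buf counter used here. The returned handle is 1-based, which mirrors the convention that a shared Buffer number is buf_id + 1 and 0 means InvalidBuffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t spc_oid;      /* tablespace */
    uint32_t db_oid;       /* database */
    uint32_t rel_number;   /* relfilenode */
    uint32_t fork;         /* fork number (0 = main) */
    uint32_t block;        /* block number within the fork */
} SketchBufferTag;

#define SKETCH_NSLOTS 64u  /* power of two so we can mask instead of mod */

typedef struct {
    SketchBufferTag tag;
    int buf_id;            /* index into the descriptor array; -1 = empty slot */
} SketchSlot;

typedef struct {
    SketchSlot slots[SKETCH_NSLOTS];
    int next_free_buf;     /* hypothetical stand-in for clock-sweep victim choice */
} SketchBufTable;

static void table_init(SketchBufTable *t)
{
    for (uint32_t i = 0; i < SKETCH_NSLOTS; i++)
        t->slots[i].buf_id = -1;
    t->next_free_buf = 0;
}

static bool tag_equal(const SketchBufferTag *a, const SketchBufferTag *b)
{
    return a->spc_oid == b->spc_oid && a->db_oid == b->db_oid &&
           a->rel_number == b->rel_number && a->fork == b->fork &&
           a->block == b->block;
}

/* FNV-1a style mixing over the five 32-bit fields. */
static uint32_t tag_hash(const SketchBufferTag *t)
{
    const uint32_t fields[5] = { t->spc_oid, t->db_oid, t->rel_number, t->fork, t->block };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 5; i++) {
        h ^= fields[i];
        h *= 16777619u;
    }
    return h;
}

/* Look up tag; on a miss, claim the next descriptor and record the mapping.
   Returns a 1-based buffer handle (0 = invalid) and reports hit or miss. */
static int lookup_or_insert(SketchBufTable *t, const SketchBufferTag *tag, bool *hit)
{
    uint32_t i = tag_hash(tag) & (SKETCH_NSLOTS - 1);

    for (uint32_t probes = 0; probes < SKETCH_NSLOTS; probes++) {
        SketchSlot *s = &t->slots[i];

        if (s->buf_id < 0) {
            /* Miss: the real code would evict a victim and call smgrread() here. */
            s->tag = *tag;
            s->buf_id = t->next_free_buf++;
            *hit = false;
            return s->buf_id + 1;
        }
        if (tag_equal(&s->tag, tag)) {
            *hit = true;
            return s->buf_id + 1;
        }
        i = (i + 1) & (SKETCH_NSLOTS - 1);   /* linear probing */
    }
    *hit = false;
    return 0;   /* table full */
}
```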

| Component | Header | Implementation | README |
|---|---|---|---|
| Page Layout | src/include/storage/bufpage.h | src/backend/storage/page/bufpage.c | src/backend/storage/page/README |
| Item Pointers | src/include/storage/itemid.h | – | – |
| Heap Tuple Header | src/include/access/htup_details.h | – | – |
| Buffer Manager | src/include/storage/bufmgr.h | src/backend/storage/buffer/bufmgr.c | src/backend/storage/buffer/README |
| Buffer Internals | src/include/storage/buf_internals.h | src/backend/storage/buffer/freelist.c | – |
| Storage Manager | src/include/storage/smgr.h | src/backend/storage/smgr/smgr.c | src/backend/storage/smgr/README |
| Magnetic Disk | src/include/storage/md.h | src/backend/storage/smgr/md.c | – |
| Fork Numbers | src/include/common/relpath.h | src/common/relpath.c | – |
| Free Space Map | src/include/storage/fsm_internals.h | src/backend/storage/freespace/freespace.c | src/backend/storage/freespace/README |
| Visibility Map | src/include/access/visibilitymap.h | src/backend/access/heap/visibilitymap.c | – |
| AIO Framework | src/include/storage/aio.h | src/backend/storage/aio/aio.c | src/backend/storage/aio/README.md |
| Read Streams | src/include/storage/read_stream.h | src/backend/storage/aio/read_stream.c | – |
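
Reading a page follows this path:

1. An access method (for example, heap_fetch) calls ReadBuffer().
2. The buffer manager builds a buffer tag from (tablespace, database, relfilenode, fork, block) and uses it to probe the shared buffer hash table.
3. On a miss, it claims a victim buffer through clock-sweep and calls smgrread() to load the page from disk.
4. md.c maps the block to a segment file and offset and issues a pread() (or an async read via the AIO layer).
5. The caller receives a Buffer handle (a small integer index).
6. The access method interprets the page through PageHeaderData, ItemIdData line pointers, and HeapTupleHeaderData.

| Section | Description |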
|---|---|
| Page Layout | The 8 KB page format, line pointers, and tuple headers |
| Buffer Manager | Shared buffer pool, clock-sweep, ring buffers |
| smgr and Forks | Storage manager abstraction and relation forks |
| Free Space Map | FSM binary tree structure for space tracking |
| Visibility Map | VM bits, index-only scans, and VACUUM optimization |
| Async I/O | The new AIO framework, io_uring, and read streams |
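
The FSM binary tree can be illustrated with a toy page. This sketch assumes one byte per heap page, holding a free-space category equal to free bytes divided by BLCKSZ / 256 (32-byte steps for 8 KB pages). It also assumes a complete binary tree in which each internal node stores the maximum of its children, so finding a page with enough room is a walk down from the root. The sketch uses 8 leaves instead of the thousands a real FSM page holds, caps categories with a plain min, and always prefers the left child. By contrast, freespace.c also spreads requests across pages using a next-slot hint. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FSM_CAT_STEP 32u              /* BLCKSZ / 256 for 8 KB pages */
#define SKETCH_LEAVES       8                /* a real FSM page holds thousands */
#define SKETCH_NODES        (2 * SKETCH_LEAVES - 1)

/* Heap-style array: node i has children 2i+1 and 2i+2; leaves come last. */
typedef struct { uint8_t node[SKETCH_NODES]; } SketchFsmPage;

static uint8_t space_to_cat(uint32_t avail)
{
    uint32_t cat = avail / SKETCH_FSM_CAT_STEP;
    return (uint8_t) (cat > 255 ? 255 : cat);
}

/* Set the category of one heap page and propagate the max to the root. */
static void fsm_set(SketchFsmPage *p, int heap_slot, uint8_t cat)
{
    int i = SKETCH_LEAVES - 1 + heap_slot;

    assert(heap_slot >= 0 && heap_slot < SKETCH_LEAVES);
    p->node[i] = cat;
    while (i > 0) {
        int parent = (i - 1) / 2;
        uint8_t l = p->node[2 * parent + 1];
        uint8_t r = p->node[2 * parent + 2];

        p->node[parent] = l > r ? l : r;
        i = parent;
    }
}

/* Return a heap slot whose category is at least min_cat, or -1 if none. */
static int fsm_search(const SketchFsmPage *p, uint8_t min_cat)
{
    int i = 0;

    if (p->node[0] < min_cat)
        return -1;                           /* root max too small: no page fits */
    while (i < SKETCH_LEAVES - 1) {          /* descend until we reach a leaf */
        int left = 2 * i + 1;
        i = (p->node[left] >= min_cat) ? left : left + 1;
    }
    return i - (SKETCH_LEAVES - 1);
}
```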

- Every page header stores pd_lsn, the LSN of the last WAL record that modified the page. Dirty pages cannot be flushed until WAL up to that LSN is on disk.
- HeapTupleHeaderData fields (t_xmin, t_xmax, t_infomask) drive visibility decisions.
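Both rules can be sketched in a few lines under heavily simplified assumptions. LSNs are plain 64-bit integers. Transaction IDs are compared with an ordinary <, whereas real PostgreSQL compares 32-bit XIDs modulo 2^32. Commit status comes from a hypothetical list of aborted XIDs instead of pg_xact. The sketch ignores hint bits in t_infomask, a transaction seeing its own changes, and row locks. In the real buffer manager, FlushBuffer() forces WAL out with XLogFlush() up to the page LSN before writing, rather than refusing the write. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLSN;
typedef uint32_t SketchXid;
#define SKETCH_INVALID_XID 0u   /* like InvalidTransactionId */

/* WAL-before-data: writing a dirty page is safe only once WAL is durable
   at least up to the LSN stamped in its header (pd_lsn). */
static bool wal_allows_write(SketchLSN page_lsn, SketchLSN wal_flushed_upto)
{
    return page_lsn <= wal_flushed_upto;
}

typedef struct {
    SketchXid xmax;                 /* xids >= xmax had not started at snapshot time */
    const SketchXid *in_progress;   /* xids running when the snapshot was taken */
    int n_in_progress;
} SketchSnapshot;

typedef struct {                    /* hypothetical stand-in for pg_xact */
    const SketchXid *aborted;
    int n_aborted;
} SketchClog;

static bool xid_in(SketchXid xid, const SketchXid *list, int n)
{
    for (int i = 0; i < n; i++)
        if (list[i] == xid)
            return true;
    return false;
}

/* In this model: did xid commit before the snapshot was taken? */
static bool xid_visible(SketchXid xid, const SketchSnapshot *s, const SketchClog *c)
{
    assert(xid != SKETCH_INVALID_XID);
    if (xid >= s->xmax || xid_in(xid, s->in_progress, s->n_in_progress))
        return false;
    return !xid_in(xid, c->aborted, c->n_aborted);
}

/* Visible if the inserter (t_xmin) is visible and the deleter (t_xmax), if any, is not. */
static bool tuple_visible(SketchXid t_xmin, SketchXid t_xmax,
                          const SketchSnapshot *s, const SketchClog *c)
{
    if (!xid_visible(t_xmin, s, c))
        return false;
    if (t_xmax == SKETCH_INVALID_XID)
        return true;
    return !xid_visible(t_xmax, s, c);
}
```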