PostgreSQL exploits hardware-specific instructions for three performance-critical operations: CRC-32C checksums (protecting WAL records and data pages), SIMD vector comparisons (accelerating text searches and visibility checks), and population count (used in bitmap operations). On multi-socket systems, the NUMA abstraction layer ensures shared buffers are queried and placed on the correct memory node. Each of these facilities uses a compile-time or runtime dispatch mechanism to select the fastest available implementation, falling back to portable C code when hardware acceleration is unavailable.
The common thread across these facilities is dispatch to the best available hardware path:
Compile time Runtime
+---------------+ +-------------+
CRC-32C: USE_SSE42_CRC32C -or-> function pointer
USE_ARMV8_CRC32C selects SSE4.2 vs
AVX-512 vs SB8
SIMD: USE_SSE2 / USE_NEON -or-> USE_NO_SIMD
(uint64 emulation)
POPCNT: __builtin_popcount -or-> lookup table
POPCNT instruction pg_number_of_ones[]
NUMA: USE_LIBNUMA -or-> stub returning -1
| File | Role |
|---|---|
src/include/port/pg_crc32c.h |
CRC-32C API macros and dispatch logic |
src/port/pg_crc32c_sse42.c |
SSE4.2 hardware CRC implementation |
src/port/pg_crc32c_sse42_choose.c |
Runtime chooser: AVX-512 vs SSE4.2 vs SB8 |
src/port/pg_crc32c_armv8.c |
ARMv8 CRC extension implementation |
src/port/pg_crc32c_sb8.c |
Slicing-by-8 software fallback |
src/include/port/simd.h |
SIMD vector types and operations |
src/include/port/pg_bitutils.h |
Popcount, CLZ, CTZ, bit manipulation |
src/port/pg_bitutils.c |
Lookup tables for non-builtin platforms |
src/port/pg_popcount_x86.c |
x86 AVX-512 VPOPCNT accelerated popcount |
src/port/pg_popcount_aarch64.c |
ARM NEON accelerated popcount |
src/include/port/pg_numa.h |
NUMA query API |
src/port/pg_numa.c |
libnuma wrapper and stub implementation |
Every WAL record and every data page (when data_checksums is enabled) is
protected by a CRC-32C checksum. The speed of this computation directly affects
write throughput and recovery time.
PostgreSQL defines four macros as the public interface:
INIT_CRC32C(crc) /* crc = 0xFFFFFFFF */
COMP_CRC32C(crc, data, len) /* accumulate bytes */
FIN_CRC32C(crc) /* crc ^= 0xFFFFFFFF */
EQ_CRC32C(c1, c2) /* equality check */
The COMP_CRC32C macro is where dispatch happens. The implementation hierarchy:
+----------------------------------------------+
| COMP_CRC32C(crc, data, len) |
+----------------------------------------------+
| |
Compile-time known Runtime dispatch
USE_SSE42_CRC32C via function pointer
| |
v v
pg_comp_crc32c_dispatch() pg_comp_crc32c --> [one of:]
| |
+--------+--------+ +---------------+---------------+
| | | | |
Small constant Large AVX-512 SSE 4.2 SB8
len < 32: input: (if CPUID (if CPUID (software
inline call via confirms) confirms) fallback)
_mm_crc32_u64 fn pointer
_mm_crc32_u32
_mm_crc32_u8
The inline fast path is a compile-time optimization for small, constant-size
inputs. When GCC or Clang can prove that len is a compile-time constant less
than 32, the computation is inlined directly using SSE4.2 intrinsics, avoiding a
function-call overhead:
/* From pg_crc32c.h -- the dispatch function */
pg_attribute_target("sse4.2")
static inline pg_crc32c
pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
if (__builtin_constant_p(len) && len < 32)
{
const unsigned char *p = data;
for (; len >= 8; p += 8, len -= 8)
crc = _mm_crc32_u64(crc, *(const uint64 *) p);
for (; len >= 4; p += 4, len -= 4)
crc = _mm_crc32_u32(crc, *(const uint32 *) p);
for (; len > 0; --len)
crc = _mm_crc32_u8(crc, *p++);
return crc;
}
else
return pg_comp_crc32c(crc, data, len);
}
The runtime chooser (pg_crc32c_sse42_choose.c) runs once at startup. It
uses CPUID to check for AVX-512 support (specifically the vpclmulqdq
instruction for parallel CRC) and sets the pg_comp_crc32c function pointer
accordingly.
The software fallback (slicing-by-8) processes 8 bytes per loop iteration using eight 256-entry lookup tables. This is the fastest portable implementation, roughly 10x slower than hardware CRC on modern CPUs.
On AArch64, the CRC32C instruction is part of the optional CRC extension (mandatory in ARMv8.1+). The dispatch mechanism mirrors x86:
/* Compile-time: known to have CRC extension */
#define USE_ARMV8_CRC32C
/* Or: runtime check via reading auxiliary vector */
#define USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
The runtime check reads /proc/self/auxv on Linux or uses
sysctlbyname("hw.optional.armv8_crc32", ...) on macOS to detect the
extension at process startup.
The simd.h header provides a platform-independent vector API for bulk
byte/word comparisons. The primary use cases are:
The type definitions adapt to the available instruction set:
#if defined(USE_SSE2)
typedef __m128i Vector8; /* 16 bytes, 16 x uint8 */
typedef __m128i Vector32; /* 16 bytes, 4 x uint32 */
#elif defined(USE_NEON)
typedef uint8x16_t Vector8; /* 16 bytes, ARM NEON */
typedef uint32x4_t Vector32;
#else
#define USE_NO_SIMD
typedef uint64 Vector8; /* 8 bytes, bitwise emulation */
#endif
Key operations and their hardware mapping:
| Function | SSE2 | NEON | No-SIMD |
|---|---|---|---|
vector8_broadcast(c) |
_mm_set1_epi8(c) |
vdupq_n_u8(c) |
SWAR bit trick |
vector8_has(v, c) |
PCMPEQB + PMOVMSKB |
VCEQ + reduce |
byte loop |
vector8_has_zero(v) |
PCMPEQB zero + mask |
VMIN + extract |
SWAR null check |
vector8_has_le(v, c) |
PMINUB + compare |
VCLE + reduce |
byte loop |
vector8_highbit_mask(v) |
PMOVMSKB |
shift + narrow + extract | not available |
The SWAR (SIMD Within A Register) fallback for USE_NO_SIMD packs 8 bytes
into a uint64 and uses the classic null-byte detection trick:
To detect a zero byte in word w:
((w - 0x0101010101010101) & ~w & 0x8080808080808080) != 0
Bitmap operations (visibility map, free-space map, relcache invalidation masks) need to count set bits. PostgreSQL provides:
/* Fast path: compiler builtin (often maps to POPCNT instruction) */
pg_popcount32(uint32 word) /* __builtin_popcount */
pg_popcount64(uint64 word) /* __builtin_popcountll */
/* Bulk: count bits in a byte array */
pg_popcount(const char *buf, int bytes)
On x86-64 with AVX-512 VPOPCNT, the bulk pg_popcount() processes 64 bytes
per iteration using _mm512_popcnt_epi64, delivering roughly 16x the
throughput of scalar POPCNT. On AArch64, the NEON path uses VCNT (count bits
per byte lane) with horizontal adds.
When no hardware popcount is available, a 256-entry lookup table
(pg_number_of_ones[]) provides the fallback.
On multi-socket servers, memory access latency depends on which NUMA node owns the physical memory. PostgreSQL’s NUMA layer (introduced in v18) provides primitives for querying page placement:
/* Initialize libnuma; returns -1 if unavailable */
int pg_numa_init(void);
/* Query which NUMA node each page resides on */
int pg_numa_query_pages(int pid, unsigned long count,
void **pages, int *status);
/* Get the highest NUMA node ID */
int pg_numa_get_max_node(void);
The implementation wraps Linux’s move_pages(2) syscall (via
numa_move_pages() from libnuma), chunking queries into groups of 1024 pages
(or 16 on 32-bit systems, to work around a kernel bug in do_pages_stat()).
Before querying a page’s NUMA node, it must be faulted into physical memory:
static inline void
pg_numa_touch_mem_if_required(void *ptr)
{
volatile uint64 touch pg_attribute_unused();
touch = *(volatile uint64 *) ptr;
}
This volatile read forces the kernel to allocate a physical page, after which
move_pages(2) can report its NUMA node accurately.
On non-Linux platforms (or when libnuma is not available), all three functions
are stubs: pg_numa_init() returns -1, and the query functions are no-ops.
pg_comp_crc32c (function pointer, set once at startup)
|
+---> pg_comp_crc32c_avx512() (if AVX-512 + VPCLMULQDQ)
+---> pg_comp_crc32c_sse42() (if SSE 4.2)
+---> pg_comp_crc32c_armv8() (if ARMv8 CRC extension)
+---> pg_comp_crc32c_loongarch() (if LoongArch CRCC)
+---> pg_comp_crc32c_sb8() (software fallback)
USE_SSE2 USE_NEON USE_NO_SIMD
+-------------------+ +-------------------+ +------------------+
| Vector8 = __m128i | | Vector8 = | | Vector8 = uint64 |
| (16 x uint8) | | uint8x16_t | | (8 x uint8 SWAR) |
| | | (16 x uint8) | | |
| Vector32 = __m128i| | Vector32 = | | Vector32: N/A |
| (4 x uint32) | | uint32x4_t | | (not implemented)|
+-------------------+ +-------------------+ +------------------+
pg_numa_init()
|
v
numa_available() ----[fail]----> return -1 (NUMA disabled)
|
[ok]
v
pg_numa_query_pages(pid, count, pages[], status[])
|
+---> loop in chunks of NUMA_QUERY_CHUNK_SIZE (1024 or 16)
| |
| +---> CHECK_FOR_INTERRUPTS()
| +---> numa_move_pages(pid, chunk, &pages[i], NULL, &status[i], 0)
| | (queries placement, does NOT migrate)
| +---> on error: return immediately
|
v
status[] now contains NUMA node ID for each page
| Operation | Hardware path | Software fallback | Ratio |
|---|---|---|---|
| CRC-32C of 8 KB page | ~30 ns (SSE4.2) | ~1200 ns (SB8) | ~40x |
| CRC-32C of 8 KB page | ~15 ns (AVX-512) | ~1200 ns (SB8) | ~80x |
vector8_has() on 16 bytes |
2 instructions | 16-byte loop | ~8x |
pg_popcount() on 8 KB |
~50 ns (AVX-512 VPOPCNT) | ~800 ns (table) | ~16x |
| NUMA query, 1024 pages | ~50 us (move_pages) | N/A (returns 0) | – |
XLogRecord header includes a CRC-32C computed over the entire record. The choice between SSE4.2 and software CRC directly impacts WAL write throughput and crash-recovery speed.data_checksums is enabled, each 8 KB page carries a CRC that is verified on every read from disk. NUMA-aware buffer allocation ensures shared_buffers pages are local to the accessing CPU’s memory node.