Lightweight locks protect shared-memory data structures such as buffer mapping tables, WAL insertion slots, and the lock manager’s own hash tables. They support shared (read) and exclusive (write) modes, use atomic operations for the fast path, and put waiters to sleep on OS semaphores rather than busy-waiting.
LWLocks sit between spinlocks and heavyweight locks in the locking hierarchy. They are fast enough for per-page or per-hash-partition protection (a few dozen instructions in the uncontended case) yet rich enough to support read/write semantics and ordered wakeup of waiters.
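Backends interact with these locks through a deliberately small API: acquire in the desired mode, touch the protected structure for a few instructions, release. A minimal sketch of that calling pattern (the function and the work inside the critical section are hypothetical; ProcArrayLock is one of the individually named LWLocks):

```c
#include "postgres.h"
#include "storage/lwlock.h"

/*
 * Hypothetical helper: read some shared state under an LWLock.
 * The lock is held only for the few instructions that touch shared memory.
 */
static int
read_shared_counter(void)
{
    int     value = 0;

    LWLockAcquire(ProcArrayLock, LW_SHARED);    /* readers can share the lock */
    /* ... read the LWLock-protected structure here ... */
    LWLockRelease(ProcArrayLock);

    return value;
}
```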
Key properties:

- Two modes: LW_SHARED (multiple readers) and LW_EXCLUSIVE (single writer, blocks readers).
- The elog() error-recovery path calls LWLockReleaseAll(), so it is safe to ereport(ERROR) while holding LWLocks.
- Cancel/die interrupts (e.g., die()) are held off while any LWLock is held, preventing partial updates to shared structures.

| File | Purpose |
|---|---|
| src/backend/storage/lmgr/lwlock.c | Acquire, release, wait queue management |
| src/include/storage/lwlock.h | LWLock struct, LWLockMode enum, tranche IDs |
| src/include/storage/lwlocklist.h | Enumeration of all individually-named LWLocks |
| src/include/storage/lwlocknames.h | Auto-generated names for individual LWLocks |
Each LWLock contains a single pg_atomic_uint32 state that encodes both the
lock mode and holder count:
Bit layout of LWLock.state (32 bits):

  +--------+----+----+----+---------+----+--------------------------+
  |   31   | 30 | 29 | 28 | 27..25  | 24 | 23 ................... 0 |
  +--------+----+----+----+---------+----+--------------------------+
              |    |    |              |    |
              |    |    |              |    +-- Number of shared lockers
              |    |    |              |        (bounded by MAX_BACKENDS, well below 2^24)
              |    |    |              +------- LW_VAL_EXCLUSIVE: lock is held exclusively
              |    |    +---------------------- LW_FLAG_LOCKED: wait list is being modified
              |    +--------------------------- LW_FLAG_RELEASE_OK: waiters may be released
              +-------------------------------- LW_FLAG_HAS_WAITERS: wait queue is non-empty

  LW_VAL_EXCLUSIVE = (1 << 24)        -- sentinel for an exclusive hold
  LW_SHARED_MASK   = ((1 << 24) - 1)  -- mask for the shared holder count
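To make the encoding concrete, here is a small standalone program (not PostgreSQL source; it simply mirrors the constants above) that decodes a state word:

```c
#include <stdint.h>
#include <stdio.h>

#define LW_FLAG_HAS_WAITERS ((uint32_t) 1 << 30)
#define LW_FLAG_RELEASE_OK  ((uint32_t) 1 << 29)
#define LW_VAL_EXCLUSIVE    ((uint32_t) 1 << 24)
#define LW_SHARED_MASK      (((uint32_t) 1 << 24) - 1)

/* Print a human-readable interpretation of one state word. */
static void
describe_state(uint32_t state)
{
    if (state & LW_VAL_EXCLUSIVE)
        printf("held exclusively");
    else if (state & LW_SHARED_MASK)
        printf("held shared by %u backends", (unsigned) (state & LW_SHARED_MASK));
    else
        printf("free");

    printf("%s%s\n",
           (state & LW_FLAG_HAS_WAITERS) ? ", waiters queued" : "",
           (state & LW_FLAG_RELEASE_OK) ? ", release-ok" : "");
}

int
main(void)
{
    describe_state(0);                                       /* free                */
    describe_state(3);                                       /* 3 shared holders    */
    describe_state(LW_VAL_EXCLUSIVE | LW_FLAG_HAS_WAITERS);  /* exclusive + waiters */
    return 0;
}
```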
The acquisition protocol addresses a subtle race: between the moment a backend sees the lock is held and the moment it enqueues itself, the holder might release. The four-phase protocol prevents missed wakeups:
LWLockAcquire(lock, mode)
|
+-- Phase 1: Atomic attempt
| If mode == LW_SHARED:
| atomic_fetch_add(&state, 1) -- increment shared count
| If no exclusive holder: SUCCESS
| Else: undo the add, proceed to Phase 2
| If mode == LW_EXCLUSIVE:
| atomic_compare_exchange(&state, 0, LW_VAL_EXCLUSIVE)
| If succeeded: SUCCESS
| Else: proceed to Phase 2
|
+-- Phase 2: Enqueue on wait list
| Acquire lock->waiters spinlock (via atomic ops on state)
| Add PGPROC to lock->waiters list
| Set LW_FLAG_HAS_WAITERS
| Set my lwWaitMode, lwWaiting = LW_WS_WAITING
|
+-- Phase 3: Retry atomic attempt
| Try Phase 1 again (lock may have been released during enqueue)
| If succeeded: dequeue self, SUCCESS
|
+-- Phase 4: Sleep
Call PGSemaphoreLock(proc->sem) -- block on OS semaphore
When woken: goto Phase 1
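The Phase 1 fast path can be modeled in a few lines of portable C11 atomics. This is a sketch of the simplified pseudocode above, using hypothetical lw_try_* helpers; the real LWLockAttemptLock() is structured as a compare-and-swap loop that also preserves the flag bits.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LW_VAL_EXCLUSIVE ((uint32_t) 1 << 24)

/* Phase 1, shared mode: optimistically add a shared holder, back out on conflict. */
static bool
lw_try_shared(_Atomic uint32_t *state)
{
    uint32_t old = atomic_fetch_add(state, 1);

    if (old & LW_VAL_EXCLUSIVE)
    {
        atomic_fetch_sub(state, 1);     /* undo the add */
        return false;                   /* caller proceeds to Phase 2 */
    }
    return true;
}

/* Phase 1, exclusive mode: only an entirely free state word is claimed here. */
static bool
lw_try_exclusive(_Atomic uint32_t *state)
{
    uint32_t expected = 0;

    /* The real code compares only the lock bits and keeps the flag bits intact. */
    return atomic_compare_exchange_strong(state, &expected, LW_VAL_EXCLUSIVE);
}
```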
LWLockRelease(lock)
|
+-- If held exclusively:
| atomic_sub(&state, LW_VAL_EXCLUSIVE)
| Else (held shared):
| atomic_sub(&state, 1)
|
+-- If no holders remain (the lock bits of the new state are 0)
    and LW_FLAG_HAS_WAITERS is set:
       Acquire the wait list (set LW_FLAG_LOCKED via atomic ops on state)
Walk the list:
- Wake all waiting shared lockers (if no exclusive waiter ahead)
- Or wake the first exclusive waiter
PGSemaphoreUnlock(each woken proc->sem)
Waiters are woken in FIFO order. If the first waiter wants exclusive access, only that waiter is woken. If the first waiter wants shared access, all consecutive shared waiters are woken together.
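A standalone sketch of that wakeup policy, reduced to the decision of how many queued waiters to wake (hypothetical types; the real LWLockWakeup() additionally manages LW_FLAG_RELEASE_OK and the LW_WS_PENDING_WAKEUP state):

```c
#include <stdio.h>

typedef enum { WANT_SHARED, WANT_EXCLUSIVE } WaitMode;

/* Returns how many waiters, counted from the front of the FIFO, get woken. */
static int
waiters_to_wake(const WaitMode *queue, int nwaiters)
{
    int n = 0;

    if (nwaiters == 0)
        return 0;

    if (queue[0] == WANT_EXCLUSIVE)
        return 1;               /* wake only the exclusive waiter at the head */

    /* Head wants shared access: wake the whole leading run of shared waiters. */
    while (n < nwaiters && queue[n] == WANT_SHARED)
        n++;
    return n;
}

int
main(void)
{
    WaitMode q1[] = {WANT_SHARED, WANT_SHARED, WANT_EXCLUSIVE, WANT_SHARED};
    WaitMode q2[] = {WANT_EXCLUSIVE, WANT_SHARED};

    printf("%d\n", waiters_to_wake(q1, 4));   /* 2: the leading shared run  */
    printf("%d\n", waiters_to_wake(q2, 2));   /* 1: just the exclusive head */
    return 0;
}
```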
typedef struct LWLock
{
uint16 tranche; /* identifies which group this lock belongs to */
pg_atomic_uint32 state; /* encodes exclusive/shared holders + flags */
proclist_head waiters; /* list of waiting PGPROCs */
#ifdef LOCK_DEBUG
pg_atomic_uint32 nwaiters;
struct PGPROC *owner; /* last exclusive owner (debug only) */
#endif
} LWLock;
/* Padded to a full cache line to prevent false sharing */
typedef union LWLockPadded
{
LWLock lock;
char pad[PG_CACHE_LINE_SIZE]; /* typically 64 or 128 bytes */
} LWLockPadded;
/* The main array of pre-allocated LWLocks in shared memory */
extern LWLockPadded *MainLWLockArray;
Each backend’s PGPROC structure contains:
/* In PGPROC: */
LWLockWaitState lwWaiting; /* LW_WS_NOT_WAITING, LW_WS_WAITING, LW_WS_PENDING_WAKEUP */
uint8 lwWaitMode; /* LW_EXCLUSIVE or LW_SHARED */
proclist_node lwWaitLink; /* link in LWLock's waiters list */
PGSemaphore sem; /* semaphore for sleeping */
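These fields drive Phase 4. The sleep side looks roughly like this condensed sketch of the wait loop in lwlock.c (function name hypothetical; wait-event reporting and error handling omitted). Because the same per-backend semaphore is shared with other wait primitives, posts absorbed here by mistake are counted and re-issued afterwards:

```c
#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/pg_sema.h"
#include "storage/proc.h"

/* Hypothetical wrapper around the Phase 4 sleep pattern. */
static void
sleep_until_woken(void)
{
    PGPROC *proc = MyProc;
    int     extraWaits = 0;

    for (;;)
    {
        PGSemaphoreLock(proc->sem);
        /* The waker resets lwWaiting to LW_WS_NOT_WAITING before posting. */
        if (proc->lwWaiting == LW_WS_NOT_WAITING)
            break;
        extraWaits++;           /* post meant for a different wait; absorb it */
    }

    /* Re-issue any absorbed posts so their intended waits still complete. */
    while (extraWaits-- > 0)
        PGSemaphoreUnlock(proc->sem);
}
```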
LWLocks are grouped into tranches for monitoring and debugging. Each
tranche has a name that appears in pg_stat_activity.wait_event:
MainLWLockArray layout:

+-------------------------------+----------------+------------------+
| Individual named locks        | Buffer mapping | Lock manager     |
| (ProcArrayLock,               | partitions     | partitions       |
|  CheckpointerCommLock, ...)   | (128 locks)    | (16 locks)       |
| NUM_INDIVIDUAL_LWLOCKS        |                |                  |
+-------------------------------+----------------+------------------+
| Predicate lock manager partitions (16 locks)                      |
+-------------------------------------------------------------------+
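As an example of how the partitioned ranges are used, the buffer mapping lock for a given buffer tag is found by hashing the tag and indexing into the buffer-mapping slice of this array. A sketch modeled on the BufMappingPartitionLock() macro (the helper name here is hypothetical, and the offset macro names may differ slightly between versions):

```c
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

/* Hypothetical helper: which LWLock guards the buffer-mapping hash entry
 * for this buffer tag? */
static LWLock *
buffer_partition_lock_for(BufferTag *tag)
{
    uint32  hashcode = BufTableHashCode(tag);               /* hash the tag */
    uint32  partition = hashcode % NUM_BUFFER_PARTITIONS;   /* one of 128   */

    /* Partition locks follow the individually named locks in the array. */
    return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + partition].lock;
}
```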
Additional tranches are allocated dynamically, for example for the named tranches
that extensions request via RequestNamedLWLockTranche() (see the sketch below).
The partition counts above are fixed at compile time:

#define NUM_BUFFER_PARTITIONS        128
#define NUM_LOCK_PARTITIONS          16   /* 2^4 */
#define NUM_PREDICATELOCK_PARTITIONS 16   /* 2^4 */
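For example, an extension can reserve LWLocks in its own named tranche. A minimal sketch assuming PostgreSQL 15 or later (where shared-memory sizing requests go through shmem_request_hook); the extension name, lock variable, and function names are hypothetical:

```c
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"

PG_MODULE_MAGIC;

static shmem_request_hook_type prev_shmem_request_hook = NULL;
static LWLock *my_ext_lock = NULL;          /* hypothetical extension lock */

/* Runs while the postmaster sizes shared memory: reserve one LWLock. */
static void
my_ext_shmem_request(void)
{
    if (prev_shmem_request_hook)
        prev_shmem_request_hook();
    RequestNamedLWLockTranche("my_ext", 1);
}

void
_PG_init(void)
{
    prev_shmem_request_hook = shmem_request_hook;
    shmem_request_hook = my_ext_shmem_request;
}

/* Later (e.g. from a shmem_startup_hook): look the lock up and use it. */
void
my_ext_attach(void)
{
    my_ext_lock = &(GetNamedLWLockTranche("my_ext")[0].lock);

    LWLockAcquire(my_ext_lock, LW_EXCLUSIVE);
    /* ... initialize or update the extension's shared state ... */
    LWLockRelease(my_ext_lock);
}
```

Waits on locks from such a tranche then appear in pg_stat_activity with the tranche's name as the wait event.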
Putting the pieces together, a single LWLock moves through these state transitions:

        atomic_fetch_add(1)          CAS(0 -> LW_VAL_EXCLUSIVE)
          +------ UNLOCKED (state=0) ------+
          |                                |
          v                                v
   SHARED (state=N)              EXCLUSIVE (state=LW_VAL_EXCLUSIVE)
   N = number of shared holders  Only one holder
          |                                |
          | atomic_sub(1)                  | atomic_sub(LW_VAL_EXCLUSIVE)
          | if N-1 > 0: still shared       |
          | if N-1 == 0: unlocked          |
          |                                |
          +--------> UNLOCKED <------------+
                        |
           If HAS_WAITERS flag is set:
             wake the appropriate waiters
LWLocks support a special pattern for waiting until a protected variable changes value, without holding the lock:
bool LWLockWaitForVar(LWLock *lock,
pg_atomic_uint64 *valptr,
uint64 oldval,
uint64 *newval);
This is used by the WAL subsystem: before WAL can be flushed up to some LSN,
all in-progress insertions before that point must finish. The flushing backend
calls LWLockWaitForVar() on each WAL insertion lock, waiting until that
holder's insert position advances past the desired LSN (or the lock is
released). The lock holder periodically calls LWLockUpdateVar() to publish
progress and wake waiters whose condition is now satisfied.
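Schematically, the two sides of this protocol look as follows. The Progress struct, field names, and helpers are placeholders rather than the WAL subsystem's real data structures, and lock/tranche initialization is omitted:

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/lwlock.h"

/* Hypothetical shared struct: a lock plus the 64-bit position it protects. */
typedef struct Progress
{
    LWLock             lock;    /* LWLockInitialize() + tranche setup omitted */
    pg_atomic_uint64   pos;
} Progress;

/* Holder side: must hold p->lock exclusively; publishes progress and wakes
 * any LWLockWaitForVar() waiters whose target has now been passed. */
static void
publish_progress(Progress *p, uint64 new_pos)
{
    LWLockUpdateVar(&p->lock, &p->pos, new_pos);
}

/* Waiter side: called without holding the lock; sleeps until the variable
 * moves past 'target' or the lock is released entirely. */
static void
wait_for_progress(Progress *p, uint64 target)
{
    uint64  seen = 0;

    while (seen < target)
    {
        /* Returns true if the lock is free; otherwise updates 'seen'. */
        if (LWLockWaitForVar(&p->lock, &p->pos, seen, &seen))
            break;
    }
}
```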
Cache-line padding. Each LWLock is padded to a full cache line (64 or 128 bytes) to prevent false sharing. Without padding, two unrelated locks on the same cache line would cause cross-CPU cache invalidation traffic on every acquisition.
Wait-free shared path. The common case of acquiring a shared lock on an
uncontended LWLock is a single atomic_fetch_add – no spinlock, no kernel
call, no cache-line bouncing beyond the lock’s own line.
Tranche-level monitoring. Because each tranche is named, pg_stat_activity's
wait_event_type and wait_event columns report exactly which kind of LWLock a
backend is waiting on (e.g., LWLock:BufferMapping, LWLock:WALInsert), making
contention analysis straightforward.
Some notable uses of specific tranches and named locks:

- The WALInsert tranche allows multiple backends to insert WAL records concurrently into different WAL insertion slots.
- The heavyweight lock manager's shared hash tables are split across 16 partitions, each guarded by its own LWLock (the LockManager tranche). Fast-path lock promotion also requires briefly holding a per-backend LWLock.
- ReplicationSlotControlLock and ReplicationOriginLock protect replication-slot and replication-origin state in shared memory.