Three things that don't move /read past the NIC

Thread count is flat across 50× load. 32 workers tie 128. Pipelining does nothing. The NIC is the wall.

June 11, 2026 — DONE
threads netpoller GOMAXPROCS pipelining NIC bandwidth c8i c6in

the setup for all three

/read returns a pre-serialized 4.5KB product. On c8i.32xlarge (128 cores, 50 Gbps) it peaks at 1.32M RPS. Check the bandwidth: 1.32M × 4.5KB = 5.94 GB/s of payload alone. The NIC ceiling is 50 Gbps = 6.25 GB/s. Payload by itself is 95% of line rate at the peak operating point; the measured throughput in section 3 (6.13 GB/s) puts it at 98%.

That number reframes everything. If the NIC is the wall, then anything that doesn’t add bandwidth can’t add RPS. Three plausible-sounding levers, measured against that wall.
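The bandwidth check above can be sketched in a few lines of Go. This is only an illustration of the arithmetic: the function names are made up for this example, and 4.5KB is taken as 4,500 bytes (decimal), which is the reading that reproduces the 5.94 GB/s figure.

```go
package main

import "fmt"

// wireGBps converts a request rate and bytes per response into GB/s
// (decimal: 1 GB = 1e9 bytes). Payload only, no headers or framing.
func wireGBps(rps, bytesPerResp float64) float64 {
	return rps * bytesPerResp / 1e9
}

// lineRateGBps converts a NIC rating in Gbps into GB/s.
func lineRateGBps(nicGbps float64) float64 {
	return nicGbps / 8
}

// utilization is the fraction of line rate the payload consumes.
func utilization(rps, bytesPerResp, nicGbps float64) float64 {
	return wireGBps(rps, bytesPerResp) / lineRateGBps(nicGbps)
}

// maxRPS is the request rate at which payload alone fills the NIC.
func maxRPS(bytesPerResp, nicGbps float64) float64 {
	return lineRateGBps(nicGbps) * 1e9 / bytesPerResp
}

func main() {
	// Assumption: 4.5KB = 4500 bytes, 1.32M RPS, 50 Gbps NIC (figures from the text).
	fmt.Printf("payload: %.2f GB/s\n", wireGBps(1.32e6, 4500))
	fmt.Printf("utilization: %.0f%%\n", 100*utilization(1.32e6, 4500, 50))
	fmt.Printf("RPS ceiling at line rate: %.0f\n", maxRPS(4500, 50))
}
```

The last line is the useful one: with a 4,500-byte payload the NIC cannot carry more than roughly 1.39M responses per second, whatever the server does.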

1. thread count is independent of connection load

Measured per-worker OS thread count (nlwp) during live benchmarks at increasing connection counts, c6in.8xlarge, 32 prefork workers:

# during each benchmark point, on the server:
ps -eo nlwp,comm | awk '/fiber_server/{sum+=$1;n++} END{printf "avg %.1f threads/worker\n",sum/n}'

connections   conn/worker   workers   threads/worker (avg)
100           ~3            32        10.1
500           ~16           32        10.1
1,000         ~31           32        10.1
5,000         ~156          32        10.1

3 connections per worker or 156 — the thread count does not move. A 50× change in load, flat at 10.1.

Connections are goroutines, not threads. A goroutine waiting on network I/O is parked in the netpoller (epoll) and consumes zero threads — the runtime hands its OS thread back to run other work. One epoll_wait watches all 156 fds at once. A thread is only consumed by a goroutine actually executing or stuck in a blocking syscall right now.
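The "parked goroutines cost no threads" claim can be checked on a laptop. The sketch below is not the benchmarked fiber server: it is a toy loopback listener, with a made-up helper name (holdIdleConns), that parks one goroutine per connection in a blocking Read and then asks the runtime how many OS threads it has created, using the standard threadcreate profile from runtime/pprof.

```go
package main

import (
	"fmt"
	"net"
	"runtime/pprof"
)

// threadsCreated reports how many OS threads the Go runtime has created so far.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

// holdIdleConns opens n loopback TCP connections and parks one server-side
// goroutine per connection in a blocking Read. It returns a cleanup func.
func holdIdleConns(n int) (func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	parked := make(chan struct{}, n)
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				parked <- struct{}{}
				c.Read(make([]byte, 1)) // waits in the netpoller, not on a thread
			}()
		}
	}()

	var clients []net.Conn
	cleanup := func() {
		ln.Close()
		for _, c := range clients {
			c.Close()
		}
	}
	for i := 0; i < n; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			cleanup()
			return nil, err
		}
		clients = append(clients, c)
	}
	for i := 0; i < n; i++ {
		<-parked // wait until every server-side goroutine has reached its Read
	}
	return cleanup, nil
}

func main() {
	before := threadsCreated()
	cleanup, err := holdIdleConns(100)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	fmt.Printf("threads created: before=%d, with 100 idle connections=%d\n",
		before, threadsCreated())
}
```

The exact numbers depend on the machine and Go version, so none are quoted here; the point is only that the thread count stays far below the connection count.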

With --pipelining 1, each connection is idle almost the whole time — one request, wait for the round trip, repeat. Little’s Law on a worker at 5000c:

~686k RPS / 32 workers           = 21,400 RPS/worker
service time per request         ≈ 34µs  (85% CPU × 32 cores / 800k RPS)
in-flight, actually on a thread  = 21,400 × 34µs = 0.73 requests

Less than one request per worker is on a thread at any instant. The other ~155 connections sit in epoll for free. Open ≠ active. That is why 3 and 156 connections give the identical thread count — the parked majority costs nothing.
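The Little's Law estimate above is small enough to write out. This sketch just replays the arithmetic with the figures quoted in the text (686k RPS at 5000c, 32 workers, 85% CPU on 32 cores at 800k RPS); the function names are illustrative, not part of any tool used in the benchmark.

```go
package main

import "fmt"

// serviceTime estimates CPU seconds per request as (utilization × cores) / RPS.
func serviceTime(cpuUtil, cores, rps float64) float64 {
	return cpuUtil * cores / rps
}

// inFlight applies Little's Law: L = λ × W, the average number of requests
// actually occupying a thread at any instant.
func inFlight(arrivalsPerSec, serviceSeconds float64) float64 {
	return arrivalsPerSec * serviceSeconds
}

func main() {
	perWorker := 686000.0 / 32         // arrival rate per worker
	w := serviceTime(0.85, 32, 800000) // seconds of CPU per request
	fmt.Printf("RPS/worker: %.0f\n", perWorker)
	fmt.Printf("service time: %.0fµs\n", w*1e6)
	fmt.Printf("in flight per worker: %.2f\n", inFlight(perWorker, w))
}
```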

The 10 threads break down as ~1 running the handler, plus sysmon, GC workers, finalizer, and a few M’s lingering from bursts. Not connection-driven, not a tuning surface.

2. 32 workers tie 128 workers

Entry 10 established that fiber’s worker count equals GOMAXPROCS. So GOMAXPROCS=32 on the 128-core c8i spawns 32 workers; the default spawns 128. Same hardware, same /read, 1000 connections:

config           workers   threads/worker   RPS (1000c)
GOMAXPROCS=128   128       ~15              1,322,291
GOMAXPROCS=32    32        ~11              1,322,240

Identical. 51 RPS apart on 1.32M — noise. 96 fewer workers, same throughput, because /read is NIC-bound. 32 workers already push 6.1 GB/s; the NIC has nothing left for the other 96 to do. They are idle capacity.

This also kills the “thread count tracks GOMAXPROCS” idea from a hasty earlier read. Both configs run children at GOMAXPROCS(1) (fiber forces it — entry 10). The ~15 vs ~11 difference is sampling noise on syscall-blocked M’s, not a GOMAXPROCS effect. The c6in value (~10) lands in the same band. Per-worker threads are the GOMAXPROCS=1 runtime floor on all three.

3. pipelining does nothing for /read, everything for /simple

--pipelining N sends N requests per connection without waiting between them, amortizing per-request packet and syscall overhead. c8i, 128 workers, 1000 connections:

/read (4.5KB):

pipelining   RPS         throughput   NIC %
1            1,322,291   6.13 GB/s    98%
10           1,349,120   6.25 GB/s    100%
20           1,339,289   6.21 GB/s    99%
50           1,340,654   6.21 GB/s    99%

Flat. The best point (pipelining 10) is +2.0% over pipelining 1, and pipelining 50 is +1.4%. Pipelining reduces per-request packet overhead, but the wall here is bytes, not packets. At pipelining=1 the NIC is already at 98% of line rate, so about 2% is all the headroom there is. There is no bandwidth to amortize into. You cannot pipeline past the physical bit rate of the wire.

/simple (16B):

pipelining   RPS          throughput   NIC %
1            4,401,493    0.55 GB/s    9%
10           14,898,653   1.85 GB/s    30%
20           16,133,324   2.00 GB/s    32%
50           16,947,473   2.10 GB/s    34%

3.8×. Same machine, same pipelining sweep, opposite result. /simple returns 16 bytes, so even at 17M RPS the NIC is only 34% used. Bandwidth is irrelevant; the ceiling is packet rate and syscall count. At pipelining=1 every request is its own packet and its own read/write pair. At pipelining=50, fifty requests arrive batched, in as little as one TCP segment, and the responses batch out together, so the per-request syscall cost is amortized up to ~50×. RPS climbs until CPU or the client becomes the wall around 17M.
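What "N requests per connection without waiting" looks like on the wire can be sketched with the standard library. This is a toy, not the load generator or the fiber server from the benchmark: it starts a throwaway net/http test server with a 16-byte body standing in for /simple, writes n GET requests in a single Write, and then reads the n responses back in order. The helper names (pipelinedGet, runDemo) are invented for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pipelinedGet sends n GET requests in one Write (no waiting between them),
// then reads the n responses in order. It returns how many had status 200.
func pipelinedGet(addr, path string, n int) (int, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	req := "GET " + path + " HTTP/1.1\r\nHost: " + addr + "\r\n\r\n"
	// One write carries all n requests: this is the syscall amortization.
	if _, err := conn.Write([]byte(strings.Repeat(req, n))); err != nil {
		return 0, err
	}

	br := bufio.NewReader(conn)
	ok := 0
	for i := 0; i < n; i++ {
		resp, err := http.ReadResponse(br, nil) // nil request: GET is assumed
		if err != nil {
			return ok, err
		}
		io.Copy(io.Discard, resp.Body) // drain so the next response can be parsed
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		}
	}
	return ok, nil
}

// runDemo starts a local stand-in server and pipelines n requests at it.
func runDemo(n int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "0123456789abcdef") // 16-byte body, like /simple
	}))
	defer srv.Close()
	return pipelinedGet(srv.Listener.Addr().String(), "/simple", n)
}

func main() {
	ok, err := runDemo(50)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d of 50 pipelined responses OK\n", ok)
}
```

The client issues one write for fifty requests instead of fifty write/read round trips. That saving is per request, not per byte, which is why it matters for a 16-byte response and not for one that already fills the link.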

the rule

Pipelining helps when you are packet-bound. It does nothing when you are bandwidth-bound. The crossover is payload size:

/read   : 4.5KB × 1.32M RPS = 5.94 GB/s payload (6.13 GB/s measured) ≈ NIC ceiling → bandwidth-bound → pipelining flat
/simple : 16B   × 17M  RPS  = 0.27 GB/s payload (2.1 GB/s measured)  = 34% of NIC  → packet-bound    → pipelining 3.8×

Finding: /read on c8i.32xlarge is NIC-bandwidth-bound at 1.32M RPS (98% of 50 Gbps). Three independent levers confirm it: thread count is flat across 50× load (epoll multiplexes idle connections), 32 workers match 128 (no bandwidth left for extra workers), and pipelining 1→50 is flat (no bytes left to amortize). The same pipelining sweep on the 16-byte /simple endpoint goes 4.4M → 17M, which shows that pipelining works, just not against a bandwidth wall.

Why the early "18M RPS" numbers mean nothing for a real API

Entries 01-06 hit millions of RPS on /simple with pipelining=100. That is a 16-byte payload, packet-bound, with 50× syscall amortization. A real 4.5KB API response is bandwidth-bound and tops out at ~1.3M on this exact hardware no matter what you do with pipelining, workers, or GOMAXPROCS. The headline number is a function of payload size, not server capability.
