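/read returns a pre-serialized 4.5KB product. On c8i.32xlarge (128 cores, 50 Gbps) it peaks at 1.32M RPS. Check the bandwidth: 1.32M × 4.5KB = 5.94 GB/s of response bodies alone. The NIC ceiling is 50 Gbps = 6.25 GB/s. We are at 95% of line rate at the peak operating point, before counting headers; the throughput column in the pipelining tables below reads 6.13 GB/s (98%) at the same RPS, presumably because it includes them.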
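The bandwidth check is easy to redo for any endpoint. A minimal sketch, assuming 4.5KB means 4,500 bytes and counting response bodies only; `nicUtilization` is an illustrative helper, not part of the benchmark code:

```go
package main

import "fmt"

// nicUtilization returns the fraction of the link consumed by rps responses
// of payloadBytes each. linkGbps is the NIC line rate in gigabits per second.
// Assumption: headers and TCP/IP framing are ignored.
func nicUtilization(rps, payloadBytes, linkGbps float64) float64 {
	linkBytesPerSec := linkGbps * 1e9 / 8 // 50 Gbps -> 6.25e9 bytes/s
	return rps * payloadBytes / linkBytesPerSec
}

func main() {
	rps, payload := 1.32e6, 4500.0
	fmt.Printf("payload traffic: %.2f GB/s, %.0f%% of line rate\n",
		rps*payload/1e9, 100*nicUtilization(rps, payload, 50))
	// prints: payload traffic: 5.94 GB/s, 95% of line rate
}
```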
That number reframes everything. If the NIC is the wall, then anything that doesn’t add bandwidth can’t add RPS. Three plausible-sounding levers, measured against that wall.
Measured per-worker OS thread count (nlwp) during live benchmarks at increasing connection counts, c6in.8xlarge, 32 prefork workers:
# during each benchmark point, on the server:
ps -eo nlwp,comm | awk '/fiber_server/{sum+=$1;n++} END{printf "avg %.1f threads/worker\n",sum/n}'
| connections | conn/worker | workers | threads/worker (avg) |
|---|---|---|---|
| 100 | ~3 | 32 | 10.1 |
| 500 | ~16 | 32 | 10.1 |
| 1,000 | ~31 | 32 | 10.1 |
| 5,000 | ~156 | 32 | 10.1 |
At 3 connections per worker or at 156, the thread count does not move: a 50× change in connection count, flat at 10.1.
Connections are goroutines, not threads. A goroutine waiting on network I/O is parked in the netpoller (epoll) and consumes zero threads — the runtime hands its OS thread back to run other work. One epoll_wait watches all 156 fds at once. A thread is only consumed by a goroutine actually executing or stuck in a blocking syscall right now.
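The parking behaviour can be reproduced outside the benchmark. The sketch below is not the fiber server: it is a hypothetical loopback demo that opens n idle connections, blocks one goroutine per connection in `Read`, and reports the runtime's `threadcreate` profile count. That count is the number of OS threads created since process start (cumulative, not currently alive), so treat it as an upper bound. `threadsWithIdleConns` is an illustrative name.

```go
package main

import (
	"fmt"
	"net"
	"runtime"
	"runtime/pprof"
	"time"
)

// threadsWithIdleConns accepts n loopback connections, parks one goroutine per
// connection in a blocking Read, and returns how many OS threads the runtime
// has created since the process started.
func threadsWithIdleConns(n int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 1)
				c.Read(buf) // parks in the netpoller until data arrives or the peer closes
			}(c)
		}
	}()

	conns := make([]net.Conn, 0, n)
	defer func() {
		for _, c := range conns {
			c.Close()
		}
	}()
	for i := 0; i < n; i++ {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return 0, err
		}
		conns = append(conns, c)
	}
	time.Sleep(100 * time.Millisecond) // give the server goroutines time to reach Read
	return pprof.Lookup("threadcreate").Count(), nil
}

func main() {
	runtime.GOMAXPROCS(1) // mirror the prefork children, which run at GOMAXPROCS(1)
	for _, n := range []int{3, 156} {
		t, err := threadsWithIdleConns(n)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%d idle connections -> %d OS threads created so far\n", n, t)
	}
}
```

The exact numbers depend on the Go version and machine, but the thread count should stay far below the connection count in both runs.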
With --pipelining 1, each connection is idle almost the whole time — one request, wait for the round trip, repeat. Little’s Law on a worker at 5000c:
~686k RPS / 32 workers = 21,400 RPS/worker
service time per request ≈ 34µs (85% CPU × 32 cores / 800k RPS)
in-flight, actually on a thread = 21,400 × 34µs = 0.73 requests
Less than one request per worker is on a thread at any instant. The other ~155 connections sit in epoll for free. Open ≠ active. That is why 3 and 156 connections give the identical thread count — the parked majority costs nothing.
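The same Little's Law arithmetic as a small sketch, using only the figures quoted above (85% CPU on 32 cores at 800k RPS for the service time, 686k RPS across 32 workers for the arrival rate). The function names are illustrative:

```go
package main

import "fmt"

// serviceTime estimates CPU seconds spent per request from aggregate
// utilization: busy core-seconds per second divided by requests per second.
func serviceTime(cpuUtil, cores, rps float64) float64 {
	return cpuUtil * cores / rps
}

// inFlight applies Little's Law: L = arrival rate × time in system.
func inFlight(rpsPerWorker, serviceSeconds float64) float64 {
	return rpsPerWorker * serviceSeconds
}

func main() {
	service := serviceTime(0.85, 32, 800e3) // seconds per request
	perWorker := 686e3 / 32                 // requests per second per worker
	fmt.Printf("service time: %.0fµs\n", service*1e6)
	fmt.Printf("in flight per worker: %.2f\n", inFlight(perWorker, service))
	// prints:
	// service time: 34µs
	// in flight per worker: 0.73
}
```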
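The ~10 threads break down as roughly one running the handler, plus sysmon and a handful of extra M’s: threads the runtime started for background work such as GC and finalizers, or left over from goroutines that blocked in a syscall during a burst. Not connection-driven, not a tuning surface.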
Entry 10 established that fiber’s worker count equals GOMAXPROCS. So GOMAXPROCS=32 on the 128-core c8i spawns 32 workers; the default spawns 128. Same hardware, same /read, 1000 connections:
| config | workers | threads/worker | RPS (1000c) |
|---|---|---|---|
| GOMAXPROCS=128 | 128 | ~15 | 1,322,291 |
| GOMAXPROCS=32 | 32 | ~11 | 1,322,240 |
Identical. 51 RPS apart on 1.32M — noise. 96 fewer workers, same throughput, because /read is NIC-bound. 32 workers already push 6.1 GB/s; the NIC has nothing left for the other 96 to do. They are idle capacity.
This also kills the “thread count tracks GOMAXPROCS” idea from a hasty earlier read. Both configs run children at GOMAXPROCS(1) (fiber forces it — entry 10). The ~15 vs ~11 difference is sampling noise on syscall-blocked M’s, not a GOMAXPROCS effect. The c6in value (~10) lands in the same band. Per-worker threads are the GOMAXPROCS=1 runtime floor on all three.
--pipelining N sends N requests per connection without waiting between them, amortizing per-request packet and syscall overhead. c8i, 128 workers, 1000 connections:
/read (4.5KB):
| pipelining | RPS | throughput | NIC % |
|---|---|---|---|
| 1 | 1,322,291 | 6.13 GB/s | 98% |
| 10 | 1,349,120 | 6.25 GB/s | 100% |
| 20 | 1,339,289 | 6.21 GB/s | 99% |
| 50 | 1,340,654 | 6.21 GB/s | 99% |
Flat. Pipelining 1 to 50 gains +1.4%, and even the best point (pipelining 10) gains only +2%, which is noise or the last sliver of NIC headroom. Pipelining reduces per-request packet overhead, but the wall here is bytes, not packets. At pipelining=1 the NIC is already at 98% of line rate. There is no bandwidth to amortize into. You cannot pipeline past the physical bit rate of the wire.
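How little room is left follows directly from the table. A rough sketch, assuming the throughput column is the relevant on-wire figure; `maxGain` is an illustrative helper:

```go
package main

import "fmt"

// maxGain returns the largest fractional RPS increase still possible when a
// link of linkGBps already carries currentGBps, assuming bytes per response
// stay constant.
func maxGain(currentGBps, linkGBps float64) float64 {
	return linkGBps/currentGBps - 1
}

func main() {
	// /read at pipelining=1: 6.13 GB/s on a 6.25 GB/s link.
	fmt.Printf("/read at pipelining=1: at most +%.1f%% RPS left\n", 100*maxGain(6.13, 6.25))
	// prints: /read at pipelining=1: at most +2.0% RPS left
}
```

That bound matches the best row in the sweep: no setting could have delivered more than about 2%, whatever it did to packet overhead.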
/simple (16B):
| pipelining | RPS | throughput | NIC % |
|---|---|---|---|
| 1 | 4,401,493 | 0.55 GB/s | 9% |
| 10 | 14,898,653 | 1.85 GB/s | 30% |
| 20 | 16,133,324 | 2.00 GB/s | 32% |
| 50 | 16,947,473 | 2.10 GB/s | 34% |
3.8×. Same machine, same pipelining sweep, opposite result. /simple returns 16 bytes, so even at 17M RPS the NIC is only 34% used. Bandwidth is irrelevant; the ceiling is packet rate and syscall count. At pipelining=1 every request is its own packet and its own read/write pair. At pipelining=50, fifty small requests arrive back to back, often in a single read, and the responses batch out together, so the per-request syscall cost is amortized by up to ~50×. RPS climbs until CPU or the client becomes the wall around 17M.
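To make the batching concrete, here is a hedged sketch of what a pipelined batch looks like on the wire. It is not the load generator's code: `pipelinedBatch` and `countRequests` are illustrative helpers, and the host name `bench` is a placeholder. The first builds N requests into one buffer (one write on the client); the second parses them back out of that single buffer with `http.ReadRequest`, the way a server can drain a whole batch from one read.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// pipelinedBatch concatenates n identical GET requests into one buffer,
// which a client can hand to the kernel with a single write.
func pipelinedBatch(host, path string, n int) []byte {
	var buf bytes.Buffer
	for i := 0; i < n; i++ {
		fmt.Fprintf(&buf, "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n", path, host)
	}
	return buf.Bytes()
}

// countRequests parses back-to-back requests out of one buffer and returns
// how many complete requests it contained.
func countRequests(batch []byte) (int, error) {
	r := bufio.NewReader(bytes.NewReader(batch))
	n := 0
	for {
		req, err := http.ReadRequest(r)
		if err == io.EOF {
			return n, nil // buffer fully drained
		}
		if err != nil {
			return n, err
		}
		req.Body.Close()
		n++
	}
}

func main() {
	batch := pipelinedBatch("bench", "/simple", 50)
	n, err := countRequests(batch)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%d requests in one %d-byte buffer\n", n, len(batch))
}
```

Whether the batch also fits one TCP segment depends on the MTU and the real request headers, which this sketch does not model.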
Pipelining helps when you are packet-bound. It does nothing when you are bandwidth-bound. The crossover is payload size:
/read : 4.5KB × 1.32M RPS = 5.94 GB/s ≈ NIC ceiling → bandwidth-bound → pipelining flat
/simple : 16B body (~124B on the wire) × 17M RPS = 2.1 GB/s = 34% of NIC → packet-bound → pipelining 3.8×
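A back-of-envelope estimate of where that crossover sits, as a sketch under one loud assumption: that the ~17M RPS packet-bound ceiling seen on /simple would hold for larger responses too, which these benchmarks do not measure. `crossoverBytes` is an illustrative helper.

```go
package main

import "fmt"

// crossoverBytes returns the on-wire response size at which a link of
// linkGbps saturates exactly when the server reaches packetBoundRPS.
// Assumption: the packet-bound ceiling does not change with payload size.
func crossoverBytes(linkGbps, packetBoundRPS float64) float64 {
	return linkGbps * 1e9 / 8 / packetBoundRPS
}

func main() {
	fmt.Printf("crossover at roughly %.0f bytes per response\n", crossoverBytes(50, 17e6))
	// prints: crossover at roughly 368 bytes per response
}
```

Under that assumption, responses well below a few hundred bytes on the wire behave like /simple, and anything in the kilobyte range behaves like /read.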