18M RPS on /simple, 1.7M on /compute: matched c8i hardware

c8i autocannon workers compute simple client-ceiling Little's-Law

setup

Role	Instance	vCPU	RAM	Network
Server	c8i.32xlarge	128	256 GB	50 Gbps
Client	c8i.32xlarge	128	256 GB	50 Gbps

Server: fiber v3, prefork, 128 workers. Kernel RPS/RFS (Receive Packet Steering / Receive Flow Steering, not requests per second) off, irqbalance inactive. No tuning: raw baseline.
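
One way to confirm the "RPS/RFS off" part of that baseline is to read the per-queue steering masks in sysfs. The sketch below assumes the standard Linux layout (`/sys/class/net/IFACE/queues/rx-N/rps_cpus`); the function name and the optional SYSROOT argument are illustrative, not part of the original setup.

```shell
# Count RX queues that have Receive Packet Steering enabled.
# A queue counts as enabled when its rps_cpus mask has any nonzero hex digit.
# Usage: rps_enabled_queues IFACE [SYSROOT]   (SYSROOT is a hypothetical override, default /sys)
rps_enabled_queues() {
  local iface=$1 sysroot=${2:-/sys} count=0 f mask
  for f in "$sysroot/class/net/$iface/queues"/rx-*/rps_cpus; do
    [ -r "$f" ] || continue
    # Drop separators, zeros and newlines; anything left means a CPU bit is set.
    mask=$(tr -d ',0\n' < "$f")
    if [ -n "$mask" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

On an untuned box, `rps_enabled_queues enp95s0` should print 0, and `systemctl is-active irqbalance` should print `inactive`.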

Client: autocannon with --workers 120: one worker per core, leaving 8 of the 128 cores for system overhead.

benchmark commands

# server — build and start
export PATH=$PATH:/usr/local/go/bin
cd /opt/millionrps/src/http
go build -o fiber_server fiber_server.go
nohup ./fiber_server > /tmp/fiber.log 2>&1 &

# client: single point at 1000 connections, pipelining 100, 30s
autocannon \
  --connections 1000 \
  --pipelining 100 \
  --workers 120 \
  --duration 30 \
  "http://SERVER_INTERNAL_IP:8083/simple"

# full ramp script (1000 → 2000 → 5000 connections)
./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 120 simple
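
The ramp script itself is not shown here. As a rough idea of what such a wrapper could look like, the sketch below is a hypothetical reconstruction, not the author's script: the argument order (host, pipelining, duration, workers, endpoint) is inferred from the invocation above, and the `ramp` function and `DRY_RUN` switch are invented for illustration.

```shell
# Hypothetical connection-ramp wrapper around autocannon.
# Args: HOST PIPELINING DURATION WORKERS ENDPOINT
# Set DRY_RUN=1 to print the commands instead of running them.
ramp() {
  local host=$1 pipelining=$2 duration=$3 workers=$4 endpoint=$5
  local conns
  for conns in 1000 2000 5000; do
    # Build the command as positional parameters so it can be printed or executed.
    set -- autocannon \
      --connections "$conns" \
      --pipelining "$pipelining" \
      --workers "$workers" \
      --duration "$duration" \
      "http://$host:8083/$endpoint"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$*"
    else
      "$@"
    fi
  done
}
```

Called as `ramp SERVER_INTERNAL_IP 100 30 120 simple`, this runs the three connection counts back to back with the same pipelining depth, duration and worker count.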

/simple results

connections	in-flight	RPS	p50 ms	p99 ms	throughput MB/s
1000	100,000	18,164,736	4	13	2252
2000	200,000	17,426,227	9	30	2160
5000	500,000	6,659,743	69	211	825

Live metrics during 1000c point:

SERVER (mpstat -P ALL 1 1):
  AVG  usr: 26%   sys: 12%   idle: 56%

SERVER NIC (enp95s0):
  TX: 2252 MB/s   (36% of 6250 MB/s ceiling)

CLIENT:
  AVG  usr: 82%   sys: 9%    idle: 4%
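
The NIC percentage and the response size implied by the throughput column can be reproduced with a little arithmetic. This is a sketch using decimal units (1 Gbps = 125 MB/s); the function names are illustrative, and the byte count is whatever the client reports as throughput, not a packet-level measurement.

```shell
# nic_util TX_MBPS LINK_GBPS -> percent of line rate
nic_util() {
  awk -v tx="$1" -v gbps="$2" 'BEGIN { printf "%.0f\n", 100 * tx / (gbps * 125) }'
}

# bytes_per_response TX_MBPS RPS -> average bytes per response
bytes_per_response() {
  awk -v tx="$1" -v rps="$2" 'BEGIN { printf "%.0f\n", tx * 1e6 / rps }'
}

nic_util 2252 50                   # prints 36
bytes_per_response 2252 18164736   # prints 124
```

At roughly 124 bytes per response, the 1000c point is nowhere near the 50 Gbps link.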

the 5000c drop — Little’s Law

RPS drops from 17.4M at 2000c to 6.7M at 5000c. The server didn’t slow down; latency increased faster than the number of in-flight requests.

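Little’s Law: RPS = in-flight ÷ mean latency. The p50 column stands in for the mean here, so the theoretical figures are approximate.

1000c:  100k ÷  4ms = 25M theoretical    (actual 18.2M)
2000c:  200k ÷  9ms = 22.2M theoretical  (actual 17.4M)
5000c:  500k ÷ 69ms = 7.2M theoretical   (actual 6.7M)  ✓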

At 5000c, 500k requests are in flight across 128 workers, about 3,900 pipelined requests per worker. The Go runtime scheduler churns, goroutine wake latency grows, and TCP buffers fill. p50 jumps from 9ms to 69ms, roughly 8×, while in-flight grows only 2.5×, so Little’s Law predicts throughput falling to about a third. The measured drop is to 38% of the 2000c figure.
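
The three Little's Law rows can be recomputed from the results table. This is a sketch that uses p50 as a stand-in for mean latency, as the text does, so the bound is only approximate; `little_check` is an illustrative name.

```shell
# little_check IN_FLIGHT P50_MS ACTUAL_RPS
# Prints the Little's Law bound, the measured RPS, and measured/bound as a percentage.
little_check() {
  awk -v n="$1" -v ms="$2" -v rps="$3" 'BEGIN {
    bound = n / (ms / 1000)
    printf "bound %.1fM  actual %.1fM  ratio %.0f%%\n", bound / 1e6, rps / 1e6, 100 * rps / bound
  }'
}

little_check 100000 4 18164736   # bound 25.0M  actual 18.2M  ratio 73%
little_check 200000 9 17426227   # bound 22.2M  actual 17.4M  ratio 78%
little_check 500000 69 6659743   # bound 7.2M  actual 6.7M  ratio 92%
```

The ratio moves closer to 100% as the connection count rises.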

/compute results

/compute builds 100 products from the pool and JSON-serialises them per request — pure CPU work.

./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 120 compute

connections	in-flight	RPS	p50 ms	p99 ms	throughput MB/s
1000	100,000	1,662,225	7	248	870
2000	200,000	1,686,562	6	439	883
5000	500,000	1,744,862	6	1000	914

Live metrics during 2000c point:

SERVER (mpstat -P ALL 1 1):
  AVG  usr: 86%   sys: 1%    idle: 12%
  All 128 cores at 85-100% usr

CLIENT:
  AVG  usr: 7%    sys: 1%    idle: 90%

Server saturated, client barely loaded. Opposite of /simple.
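
Little's Law can also be turned around to give the mean latency implied by a given in-flight count and throughput. The sketch below assumes every configured pipeline slot is actually occupied, which may not hold; `implied_latency_ms` is an illustrative name.

```shell
# implied_latency_ms IN_FLIGHT RPS -> mean latency in ms implied by Little's Law
implied_latency_ms() {
  awk -v n="$1" -v rps="$2" 'BEGIN { printf "%.0f\n", 1000 * n / rps }'
}

implied_latency_ms 100000 1662225   # prints 60
implied_latency_ms 200000 1686562   # prints 119
implied_latency_ms 500000 1744862   # prints 287
```

These values sit well above the reported p50 of 6-7 ms. Under the stated assumption that points to a long latency tail, which would be consistent with the p99 column, though it could equally mean the pipelines are not kept full.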

the client ceiling problem

For /simple, the server processes a request in ~1-2µs. The client, autocannon running on Node.js, must build the request, send it, receive and parse the response, measure latency, and record stats: ~10-20µs total. The client does 5-10× more work per request than the server.

Server ceiling:  128 cores ÷ ~1µs/req = theoretical ~128M req/s
Client ceiling:  120 workers ÷ ~15µs/req = ~8M req/s (estimate; the measured 18.2M RPS implies ~6.6µs/req)

For cheap endpoints the client ceiling is always the lower of the two. No amount of client hardware tuning fixes this: you need multiple client machines, or client and server on the same box (the HAProxy approach).

For /compute, the balance flips: the server spends ~66µs of CPU per request (128 cores at 87% busy ÷ 1.69M RPS) while the client just waits. The server saturates first.
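
The per-request costs can be cross-checked against the mpstat numbers: CPU time per request is busy fraction × cores ÷ RPS. A sketch, assuming usr + sys is all request handling and that the averages hold across the whole run; `cpu_us_per_req` is an illustrative name.

```shell
# cpu_us_per_req BUSY_PERCENT CORES RPS -> CPU-microseconds spent per request
cpu_us_per_req() {
  awk -v busy="$1" -v cores="$2" -v rps="$3" \
    'BEGIN { printf "%.1f\n", busy / 100 * cores * 1e6 / rps }'
}

cpu_us_per_req 38 128 18164736   # /simple server at 1000c (usr 26 + sys 12): prints 2.7
cpu_us_per_req 91 128 18164736   # /simple client at 1000c (usr 82 + sys 9):  prints 6.4
cpu_us_per_req 87 128 1686562    # /compute server at 2000c (usr 86 + sys 1): prints 66.0
```

On these figures the client spends a bit over twice the server's CPU per /simple request, and a /compute request costs the server roughly 25× what a /simple request does.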

Finding: For cheap endpoints like /simple, the benchmark client is the bottleneck, not the server. At 18M RPS the c8i.32xlarge server is using 26% user CPU (56% idle) and 36% of its NIC; it has not been loaded. For CPU-bound endpoints like /compute, the server saturates first and client metrics become irrelevant.

Next: IRQ pinning experiment. Pin 16 NIC queues to dedicated cores, run fiber on clean cores, and measure whether interrupt isolation changes anything at these load levels.
