18M RPS on /simple, 1.7M on /compute: matched c8i hardware

c8i.32xlarge × 2, 120 autocannon workers — server at 26% on simple, 94% on compute

June 05, 2026 — DONE
c8i autocannon workers compute simple client-ceiling Little's-Law

setup

Role    Instance      vCPU  RAM     Network
Server  c8i.32xlarge  128   256 GB  50 Gbps
Client  c8i.32xlarge  128   256 GB  50 Gbps

Server: Fiber v3, prefork, 128 workers. Kernel RPS/RFS (Receive Packet Steering / Receive Flow Steering) off, irqbalance inactive. No tuning, raw baseline.

Client: autocannon with --workers 120. One worker per available core minus a few for system overhead.

benchmark commands

# server — build and start
export PATH=$PATH:/usr/local/go/bin
cd /opt/millionrps/src/http
go build -o fiber_server fiber_server.go
nohup ./fiber_server > /tmp/fiber.log 2>&1 &

# client: one ramp point (1000 connections), pipelining 100, 30s
autocannon \
  --connections 1000 \
  --pipelining 100 \
  --workers 120 \
  --duration 30 \
  "http://SERVER_INTERNAL_IP:8083/simple"

# full ramp script (1000 → 2000 → 5000 connections)
./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 120 simple
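
The ramp can also be expressed as a small command builder. This is only a sketch in Python, not the contents of autocannon_bench.sh: the function names are hypothetical, and the defaults (pipelining 100, 30s, 120 workers, port 8083) are copied from the single-point command above.

```python
import shlex

# Connection counts of the ramp described above.
RAMP = (1000, 2000, 5000)


def autocannon_cmd(host, endpoint, connections,
                   pipelining=100, duration=30, workers=120, port=8083):
    """Build the autocannon argv for one point of the ramp (hypothetical helper)."""
    return [
        "autocannon",
        "--connections", str(connections),
        "--pipelining", str(pipelining),
        "--workers", str(workers),
        "--duration", str(duration),
        f"http://{host}:{port}/{endpoint}",
    ]


def ramp_cmds(host, endpoint):
    """One command per connection count, in ramp order."""
    return [autocannon_cmd(host, endpoint, c) for c in RAMP]


# Print the commands instead of running them, so the sketch has no side effects.
for cmd in ramp_cmds("SERVER_INTERNAL_IP", "simple"):
    print(shlex.join(cmd))
```

Each list could be passed to subprocess.run on the client machine, one point at a time.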

/simple results

connections  in-flight  RPS         p50 ms  p99 ms  throughput MB/s
1000         100,000    18,164,736  4       13      2252
2000         200,000    17,426,227  9       30      2160
5000         500,000    6,659,743   69      211     825

Live metrics during 1000c point:

SERVER (mpstat -P ALL 1 1):
  AVG  usr: 26%   sys: 12%   idle: 56%

SERVER NIC (enp95s0):
  TX: 2252 MB/s   (36% of 6250 MB/s ceiling)

CLIENT:
  AVG  usr: 82%   sys: 9%    idle: 4%
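
The NIC percentage above can be checked with a few lines of arithmetic. A minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and that the table's throughput column is the same quantity as the NIC TX rate; the function names are illustrative.

```python
def nic_ceiling_mb_s(gbps):
    """Line rate in MB/s: 1 Gbps = 1000 Mbit/s = 125 MB/s (decimal units)."""
    return gbps * 1000 / 8


def utilisation(observed_mb_s, gbps):
    """Fraction of the NIC line rate in use."""
    return observed_mb_s / nic_ceiling_mb_s(gbps)


def bytes_per_response(throughput_mb_s, rps):
    """Average bytes on the wire per response, assuming 1 MB = 10^6 bytes."""
    return throughput_mb_s * 1_000_000 / rps


print(nic_ceiling_mb_s(50))                          # ceiling for a 50 Gbps NIC
print(round(utilisation(2252, 50) * 100))            # percent of ceiling at 1000c
print(round(bytes_per_response(2252, 18_164_736)))   # bytes per /simple response
```

Under these assumptions a /simple response is about 124 bytes on the wire, so the NIC would only become the limit at well over twice the measured request rate.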

the 5000c drop — Little’s Law

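RPS drops from 17M at 2000c to 6.6M at 5000c. The server didn’t slow down per request; queueing latency grew faster than the number of in-flight requests (2.5× more in flight, p50 about 7.7× higher).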

Little’s Law: RPS = in-flight ÷ latency

1000c:  100k ÷   4ms = 25M theoretical   (actual 18M)
2000c:  200k ÷   9ms = 22M theoretical   (actual 17M)
5000c:  500k ÷  69ms =  7.2M theoretical  (actual 6.6M)  ✓

At 5000c, 500k requests are in flight at once across 128 workers, about 3,900 pipelined requests per worker (500,000 ÷ 128). The Go runtime scheduler churns, goroutine wake latency grows, TCP buffers fill. p50 jumps from 9ms to 69ms, about 7.7×, which accounts for the RPS drop.
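
The Little’s Law figures can be reproduced with a short calculation. A sketch in Python, with two assumptions worth flagging: in-flight is taken as connections × pipelining, and p50 stands in for the mean latency that Little’s Law strictly calls for, which may be part of why the 1000c and 2000c predictions overshoot. The function names are illustrative.

```python
def in_flight(connections, pipelining):
    """Requests outstanding at once if every connection keeps its pipeline full."""
    return connections * pipelining


def littles_law_rps(in_flight_requests, latency_ms):
    """Little's Law rearranged: throughput = in-flight / latency."""
    return in_flight_requests / (latency_ms / 1000)


# (connections, p50 in ms, measured RPS) taken from the /simple results table.
points = [
    (1000, 4, 18_164_736),
    (2000, 9, 17_426_227),
    (5000, 69, 6_659_743),
]

for conns, p50_ms, actual in points:
    n = in_flight(conns, 100)
    predicted = littles_law_rps(n, p50_ms)
    print(f"{conns}c: {n} in flight / {p50_ms} ms "
          f"= {predicted / 1e6:.1f}M predicted, {actual / 1e6:.1f}M measured")
```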

/compute results

/compute builds 100 products from the pool and JSON-serialises them per request — pure CPU work.

./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 120 compute

connections  in-flight  RPS        p50 ms  p99 ms  throughput MB/s
1000         100,000    1,662,225  7       248     870
2000         200,000    1,686,562  6       439     883
5000         500,000    1,744,862  6       1000    914

Live metrics during 2000c point:

SERVER (mpstat -P ALL 1 1):
  AVG  usr: 86%   sys: 1%    idle: 12%
  All 128 cores at 85-100% usr

CLIENT:
  AVG  usr: 7%    sys: 1%    idle: 90%

Server saturated, client barely loaded. Opposite of /simple.
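
A rough way to compare the two endpoints is CPU time per request: busy cores divided by RPS. The sketch below assumes busy fraction = usr + sys from the mpstat averages above (softirq and iowait are ignored), 128 vCPUs on both machines, and the RPS of the point where the metrics were sampled (1000c for /simple, 2000c for /compute). The function name is illustrative.

```python
def cpu_us_per_request(cores, busy_fraction, rps):
    """Core-seconds burned per second, divided by requests per second, in µs."""
    return cores * busy_fraction / rps * 1e6


# /simple at 1000c: server usr 26% + sys 12%, client usr 82% + sys 9%.
simple_server = cpu_us_per_request(128, 0.26 + 0.12, 18_164_736)
simple_client = cpu_us_per_request(128, 0.82 + 0.09, 18_164_736)

# /compute at 2000c: server usr 86% + sys 1%, client usr 7% + sys 1%.
compute_server = cpu_us_per_request(128, 0.86 + 0.01, 1_686_562)
compute_client = cpu_us_per_request(128, 0.07 + 0.01, 1_686_562)

print(f"/simple   server {simple_server:.1f} µs/req, client {simple_client:.1f} µs/req")
print(f"/compute  server {compute_server:.1f} µs/req, client {compute_client:.1f} µs/req")
```

Under these assumptions the server spends roughly 2.7µs of CPU per /simple request and about 66µs per /compute request, while the client spends about 6µs per request in both cases.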

the client ceiling problem

For /simple, the server processes a request in ~1-2µs. The client (autocannon, a Node.js tool, so no goroutines) must schedule work on its event loop, format the request, send it, receive the response, measure latency, and record stats: ~10-20µs total as a rough estimate. The client does roughly 10× more work per request than the server.

Server ceiling:  128 cores   ÷ ~1µs/req  = ~128M req/s (theoretical)
Client ceiling:  120 workers ÷ ~15µs/req = ~8M req/s (estimate; the measured 18M RPS implies ~6.6µs/req per worker)

For cheap endpoints the client ceiling is always the lower one. No amount of client hardware tuning fixes this; you need multiple client machines, or to run client and server on the same box (the HAProxy approach).
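
The ceiling arithmetic amounts to dividing the number of parallel workers by the per-request time. A minimal Python sketch, using the rough per-request estimates from this section (1µs server, 15µs client) as inputs; the function names are illustrative, and the machine count assumes client capacity scales linearly with machines, which is an assumption rather than something measured here.

```python
import math


def ceiling_rps(parallel_units, us_per_request):
    """Upper bound on req/s if each unit handles one request every us_per_request µs."""
    return parallel_units * 1_000_000 / us_per_request


def implied_us_per_request(parallel_units, rps):
    """Inverse view: per-unit time per request implied by a measured rate."""
    return parallel_units * 1_000_000 / rps


def client_machines_needed(server_rps, client_rps):
    """Client machines required to offer server_rps, assuming linear scaling."""
    return math.ceil(server_rps / client_rps)


server = ceiling_rps(128, 1)    # 128 cores at ~1 µs/req
client = ceiling_rps(120, 15)   # 120 workers at ~15 µs/req

print(f"server ceiling ~{server / 1e6:.0f}M req/s, client ceiling ~{client / 1e6:.0f}M req/s")
print("client machines to reach the server ceiling:", client_machines_needed(server, client))
print(f"implied client cost at 18,164,736 RPS: "
      f"{implied_us_per_request(120, 18_164_736):.1f} µs/req per worker")
```

With the 15µs estimate it would take on the order of 16 client machines to reach the theoretical server ceiling. The measured rate implies a lower per-request client cost, so the real number is likely smaller.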

For /compute, the balance flips: the server spends ~66µs of CPU per request (128 cores × 87% busy ÷ 1.69M RPS) while the client just waits. Server saturates first.

Finding: For cheap, I/O-dominated endpoints like /simple, the benchmark client is the bottleneck, not the server. The c8i.32xlarge server at 18M RPS is using 26% user CPU and 36% of its NIC; it is nowhere near saturated. For CPU-bound endpoints like /compute, the server saturates first and client metrics become irrelevant.
