| Role | Instance | vCPU | RAM | Network |
|---|---|---|---|---|---|
| Server | c8i.32xlarge | 128 | 256 GB | 50 Gbps |
| Client | c8i.32xlarge | 128 | 256 GB | 50 Gbps |
Server: fiber v3, prefork, 128 workers. Receive Packet Steering and Receive Flow Steering (kernel RPS/RFS, not requests per second) off, irqbalance inactive. No tuning: this is the raw baseline.
Client: autocannon with --workers 120, which leaves 8 of the 128 cores free for system overhead.
# server — build and start
export PATH=$PATH:/usr/local/go/bin
cd /opt/millionrps/src/http
go build -o fiber_server fiber_server.go
nohup ./fiber_server > /tmp/fiber.log 2>&1 &
# client — connection ramp, pipelining 100, 30s per point
autocannon \
--connections 1000 \
--pipelining 100 \
--workers 120 \
--duration 30 \
"http://SERVER_INTERNAL_IP:8083/simple"
# full ramp script (1000 → 2000 → 5000 connections)
./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 120 simple
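The contents of autocannon_bench.sh are not shown here, so the following is only a sketch of what such a ramp amounts to: the same autocannon invocation repeated at 1000, 2000 and 5000 connections. The function names, the default port and the argument order are assumptions for illustration, not the author's script; only the four autocannon flags are taken from the command above.

```python
import shlex

# Ramp points quoted in the text: 1000 -> 2000 -> 5000 connections.
RAMP = (1000, 2000, 5000)


def autocannon_cmd(host, connections, endpoint,
                   pipelining=100, duration=30, workers=120, port=8083):
    """Build the argv list for one ramp point (hypothetical helper)."""
    return [
        "autocannon",
        "--connections", str(connections),
        "--pipelining", str(pipelining),
        "--workers", str(workers),
        "--duration", str(duration),
        f"http://{host}:{port}/{endpoint}",
    ]


def ramp_commands(host, endpoint):
    """One command per ramp point, in ascending connection count."""
    return [autocannon_cmd(host, c, endpoint) for c in RAMP]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is side-effect free.
    for argv in ramp_commands("SERVER_INTERNAL_IP", "simple"):
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; passing each list to `subprocess.run` would turn it into an actual driver.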
| connections | in-flight | RPS | p50 ms | p99 ms | throughput MB/s |
|---|---|---|---|---|---|
| 1000 | 100,000 | 18,164,736 | 4 | 13 | 2252 |
| 2000 | 200,000 | 17,426,227 | 9 | 30 | 2160 |
| 5000 | 500,000 | 6,659,743 | 69 | 211 | 825 |
Live metrics during 1000c point:
SERVER (mpstat -P ALL 1 1):
AVG usr: 26% sys: 12% idle: 56%
SERVER NIC (enp95s0):
TX: 2252 MB/s (36% of 6250 MB/s ceiling)
CLIENT:
AVG usr: 82% sys: 9% idle: 4%
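The NIC figures can be reproduced from the link speed and the results table. The sketch below assumes decimal units (1 Gbps = 125 MB/s), which is what the 6250 MB/s ceiling implies; the function names are illustrative.

```python
def gbps_to_mb_per_s(gbps):
    """Convert link speed to MB/s using decimal units (1 Gbps = 125 MB/s)."""
    return gbps * 1000 / 8


def nic_utilisation(tx_mb_per_s, link_gbps):
    """Fraction of the link ceiling consumed by the observed TX rate."""
    return tx_mb_per_s / gbps_to_mb_per_s(link_gbps)


def bytes_per_response(tx_mb_per_s, rps):
    """Average bytes sent per response, from throughput and request rate."""
    return tx_mb_per_s * 1_000_000 / rps


ceiling = gbps_to_mb_per_s(50)                 # 50 Gbps NIC
util = nic_utilisation(2252, 50)               # 1000c point of /simple
size = bytes_per_response(2252, 18_164_736)    # same point

print(f"ceiling {ceiling:.0f} MB/s, utilisation {util:.0%}, ~{size:.0f} B/response")
```

At roughly 124 bytes per response, the link would only fill at several times the measured request rate, which is consistent with the NIC not being the limit here.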
RPS drops from 17M at 2000c to 6.6M at 5000c. The server didn’t slow down — latency increased.
Little’s Law: RPS = in-flight ÷ latency
1000c: 100k ÷ 4ms = 25M theoretical (actual 18M)
2000c: 200k ÷ 9ms = 22M theoretical (actual 17M)
5000c: 500k ÷ 69ms = 7.2M theoretical (actual 6.6M) ✓
At 5000c, 500k requests are in flight at once across 128 workers, so each worker holds ~3,900 pipelined requests (500,000 ÷ 128). The Go runtime scheduler churns, goroutine wake latency grows, TCP buffers fill. p50 jumps from 9ms to 69ms, about 7.7×, while in-flight only grows 2.5×. By Little’s Law throughput has to fall, which directly explains the RPS drop.
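The Little's Law check above can be recomputed from the results table. This is a sketch under one loud assumption: p50 stands in for mean latency, which Little's Law strictly requires, so the result is a rough bound rather than an exact prediction.

```python
# (connections, pipelining, p50 latency in ms, measured RPS) from the /simple table.
POINTS = [
    (1000, 100, 4, 18_164_736),
    (2000, 100, 9, 17_426_227),
    (5000, 100, 69, 6_659_743),
]


def littles_law_rps(in_flight, latency_ms):
    """Throughput implied by Little's Law: in-flight requests / latency."""
    return in_flight * 1000 / latency_ms


for conns, pipelining, p50_ms, measured in POINTS:
    in_flight = conns * pipelining          # connections x pipelining depth
    bound = littles_law_rps(in_flight, p50_ms)
    print(f"{conns}c: in-flight {in_flight:,}  "
          f"bound {bound / 1e6:.1f}M  measured {measured / 1e6:.1f}M  "
          f"ratio {measured / bound:.2f}")
```

The ratio moves closer to 1 at 5000c, where latency is dominated by queueing and p50 becomes a better proxy for the mean.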
/compute builds 100 products from the pool and JSON-serialises them per request — pure CPU work.
./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 120 compute
| connections | in-flight | RPS | p50 ms | p99 ms | throughput MB/s |
|---|---|---|---|---|---|
| 1000 | 100,000 | 1,662,225 | 7 | 248 | 870 |
| 2000 | 200,000 | 1,686,562 | 6 | 439 | 883 |
| 5000 | 500,000 | 1,744,862 | 6 | 1000 | 914 |
Live metrics during 2000c point:
SERVER (mpstat -P ALL 1 1):
AVG usr: 86% sys: 1% idle: 12%
All 128 cores at 85-100% usr
CLIENT:
AVG usr: 7% sys: 1% idle: 90%
Server saturated, client barely loaded. Opposite of /simple.
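The comparison being made here is simply which machine has less idle CPU. A minimal sketch, using the average idle percentages from the two live-metrics snapshots; the function name is illustrative, and comparing averages alone is an assumption that ignores per-core imbalance.

```python
def saturated_side(server_idle_pct, client_idle_pct):
    """Return the side with less idle CPU, i.e. the likelier bottleneck."""
    return "server" if server_idle_pct < client_idle_pct else "client"


# Average idle figures from the mpstat snapshots for each endpoint.
print("/simple :", saturated_side(server_idle_pct=56, client_idle_pct=4))
print("/compute:", saturated_side(server_idle_pct=12, client_idle_pct=90))
```

An average can hide a single pinned core, so the per-core mpstat output is worth checking before trusting the verdict.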
For /simple, the server processes a request in ~1-2µs. The client (autocannon, a Node.js tool) must schedule the request on its event loop, format it, send it, receive the response, measure latency, and record stats: ~10-20µs total. The client does 5-10× more work per request than the server.
Server ceiling: 128 cores ÷ ~1µs/req = ~128M req/s (theoretical)
Client ceiling: 120 workers ÷ ~15µs/req = ~8M req/s
These are order-of-magnitude estimates: the measured 18M RPS across 120 workers works out to about 6.6µs of client worker time per request.
For cheap endpoints the client ceiling is always lower than the server's. No amount of client hardware tuning fixes this: you need multiple client machines, or client and server on the same box (the HAProxy approach).
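Both ceilings above are a count of parallel units divided by a per-request cost. The sketch below reproduces them and derives how many client machines the estimates would imply; the per-request costs are the rough figures quoted in the text, not measurements, and the function name is illustrative.

```python
import math


def ceiling_rps(parallel_units, cost_us_per_req):
    """Max requests/s if every unit spends cost_us_per_req on each request."""
    return parallel_units * 1_000_000 / cost_us_per_req


# Rough per-request costs quoted in the text (assumed, not measured).
server_ceiling = ceiling_rps(parallel_units=128, cost_us_per_req=1.0)
client_ceiling = ceiling_rps(parallel_units=120, cost_us_per_req=15.0)

bottleneck = "client" if client_ceiling < server_ceiling else "server"
# Client machines needed to reach the theoretical server ceiling.
clients_needed = math.ceil(server_ceiling / client_ceiling)

print(f"server {server_ceiling / 1e6:.0f}M req/s, "
      f"client {client_ceiling / 1e6:.0f}M req/s, "
      f"bottleneck: {bottleneck}, clients needed: {clients_needed}")
```

Under these estimates a single client machine can never load the server, which is the case for either fanning out to several clients or changing the measurement approach.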
For /compute, the balance flips: the server spends ~65µs of CPU time per request (128 cores at 86% usr ÷ 1.69M RPS), while the client just waits. The server saturates first.
For /simple, the benchmark client is the bottleneck, not the server. At 18M RPS the c8i.32xlarge server shows 26% user CPU (56% idle) and uses 36% of its NIC, so it is nowhere near saturated. For CPU-bound endpoints like /compute, the server saturates first and the client is no longer the limiting factor.