Last run used autocannon without --workers. The command was:
autocannon --connections 1000 --pipelining 100 --duration 30 \
"http://SERVER_INTERNAL_IP:8083/simple"
One Node.js event loop, one thread. At 1000 connections it managed 200k RPS, then reported 4.9M at 2000 connections, but with a p50 of 1225ms. A median latency of 1225ms alongside “4.9M RPS” means requests were sitting in the pipeline for over a second before being counted. The event loop was batching and counting a backlog flush, not measuring steady-state throughput.
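One way to sanity-check a reported figure is Little's law: sustained throughput is roughly the number of requests in flight divided by the time each one spends in flight. The sketch below is an illustration only; `implied_rps` is a hypothetical helper, and using p50 in place of the mean latency is an approximation.

```sh
# Little's law: throughput ~= requests in flight / latency.
# Assumption: p50 stands in for the mean latency, so treat the result as a rough bound.
implied_rps() (
  connections=$1 pipelining=$2 p50_ms=$3
  awk -v c="$connections" -v p="$pipelining" -v ms="$p50_ms" \
    'BEGIN { printf "%d\n", (c * p) / (ms / 1000) }'
)

implied_rps 2000 100 1225   # single event loop: prints 163265
implied_rps 1000 100 38     # 30 workers:        prints 2631578
```

With 200,000 requests in flight and a 1225ms median, the pipeline could sustain only about 163k RPS, roughly 30 times less than the reported 4.9M. The 30-worker run lands in the same range as its measured 2.3M.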
The correct command — visible in reference benchmarks — uses --workers:
autocannon --connections 1000 --pipelining 100 --workers 120 --duration 30 \
"http://SERVER_INTERNAL_IP:8083/simple"
--workers N spawns N Node.js worker threads, each with its own event loop. Each thread manages roughly connections/N TCP connections independently. The result is less batching overhead, proper per-request accounting, and honest latency numbers.
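To make the split concrete, here is the arithmetic for the per-worker share. `per_worker` is a hypothetical helper, and how autocannon assigns any leftover connections is not covered here.

```sh
# Prints: <connections per worker> <connections left over> <max in-flight requests per event loop>
# Plain integer arithmetic; this does not model autocannon's internal scheduling.
per_worker() (
  connections=$1 workers=$2 pipelining=$3
  share=$(( connections / workers ))
  echo "$share $(( connections % workers )) $(( share * pipelining ))"
)

per_worker 2000 1 100    # single event loop: 2000 0 200000
per_worker 1000 30 100   # 30 workers:        33 10 3300
```

A single event loop has to track up to 200,000 pipelined requests, while each of 30 workers tracks about 3,300.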
The difference:
| approach | connections | RPS | p50 latency |
|---|---|---|---|
| no workers | 2000 | 4,913,152 | 1225ms |
| 30 workers | 1000 | 2,299,162 | 38ms |
The 4.9M number was a measurement artifact. The 2.3M at p50 38ms is an honest measurement, although the metrics further down show it reflects the client's ceiling, not the server's.
| Role | Instance | vCPU | RAM | Network | Interface |
|---|---|---|---|---|---|
| Server | c8i.32xlarge | 128 | 256 GB | 50 Gbps | enp95s0 |
| Client | c6i.8xlarge | 32 | 64 GB | 25 Gbps | ens5 |
Server: fiber v3, prefork, 128 workers.
Server prep — applied once at boot via server_setup.sh:
# stop irqbalance — it continuously reassigns NIC IRQs and will undo any manual affinity
sudo systemctl stop irqbalance
sudo systemctl disable irqbalance
# RPS/RFS — distribute softirq processing across all 128 cores
NIC=$(ip route show default | awk '/default/{print $5}') # enp95s0 on c8i
for f in /sys/class/net/$NIC/queues/rx-*/rps_cpus; do
echo "ffffffff,ffffffff,ffffffff,ffffffff" | sudo tee $f > /dev/null
done
echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > /dev/null
for f in /sys/class/net/$NIC/queues/rx-*/rps_flow_cnt; do
echo 2048 | sudo tee $f > /dev/null
done
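The hard-coded values above can be derived instead of typed by hand. The sketch below builds the rps_cpus mask for a given core count and splits rps_sock_flow_entries evenly across the RX queues, which is the sizing the kernel's scaling documentation suggests. `rps_mask` and `flow_cnt` are hypothetical helper names.

```sh
# Build an rps_cpus mask with the low N bits set, written as comma-separated
# 32-bit hex groups with the most significant group first.
rps_mask() (
  ncpu=$1
  mask=""
  while [ "$ncpu" -gt 0 ]; do
    if [ "$ncpu" -ge 32 ]; then bits=32; else bits=$ncpu; fi
    group=$(printf '%08x' $(( (1 << bits) - 1 )))
    mask="$group${mask:+,$mask}"   # prepend: later groups cover higher CPU numbers
    ncpu=$(( ncpu - bits ))
  done
  echo "$mask"
)

# Per-queue flow table size: total socket flow entries divided by RX queue count.
flow_cnt() (
  echo $(( $1 / $2 ))
)

rps_mask 128        # ffffffff,ffffffff,ffffffff,ffffffff
flow_cnt 32768 16   # 2048
```

On the 128-core server with 16 ENA queues this reproduces the values written in the setup script.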
Build and start the server:
export PATH=$PATH:/usr/local/go/bin
cd /opt/millionrps/src/http
go build -o fiber_server fiber_server.go
nohup ./fiber_server > /tmp/fiber.log 2>&1 &
Run benchmark from client (autocannon_bench.sh):
# ./autocannon_bench.sh <server-ip> [pipelining] [duration_sec] [workers] [route]
./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 30 simple
The script runs three connection points (1000 → 2000 → 5000) in sequence. Workers default to $(nproc) - 2; on the c6i.8xlarge client that is 30.
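The script itself is not reproduced here, so the following is only a hypothetical sketch of how the worker default and the three-point ramp could be wired together. The function names, the fallback values for pipelining and duration, and the floor of one worker are assumptions; the autocannon flags are the ones used in the commands above. It prints the commands without running them.

```sh
# Hypothetical sketch, not the actual autocannon_bench.sh.
# Leave two cores for the OS and the autocannon main thread, never fewer than one worker.
default_workers() (
  cores=${1:-$(nproc)}
  workers=$(( cores - 2 ))
  if [ "$workers" -lt 1 ]; then workers=1; fi
  echo "$workers"
)

# Print one autocannon command per connection point (dry run).
bench_commands() (
  server=$1 pipelining=${2:-100} duration=${3:-30}
  workers=${4:-$(default_workers)} route=${5:-simple}
  for connections in 1000 2000 5000; do
    echo "autocannon --connections $connections --pipelining $pipelining" \
         "--workers $workers --duration $duration \"http://$server:8083/$route\""
  done
)

bench_commands SERVER_INTERNAL_IP 100 30 30 simple
```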
Connection ramp — 30 workers, pipelining 100, 30s per point, /simple endpoint:
| connections | in-flight | workers | RPS | p50 ms | p99 ms | throughput MB/s |
|---|---|---|---|---|---|---|
| 1000 | 100,000 | 30 | 2,299,162 | 38 | 89 | 285 |
| 2000 | 200,000 | 30 | 2,073,538 | 94 | 193 | 257 |
| 5000 | 500,000 | 30 | 2,229,002 | 240 | 527 | 276 |
Wide ramp to confirm ceiling (20s per point):
| connections | RPS | p50 ms | p99 ms |
|---|---|---|---|
| 100 | 2,502,131 | 3 | 8 |
| 500 | 2,480,602 | 17 | 43 |
| 1000 | 2,324,877 | 38 | 97 |
| 2000 | 2,026,159 | 95 | 229 |
| 5000 | 2,187,676 | 241 | 697 |
| 10000 | 2,490,096 | 485 | 1532 |
RPS is roughly flat across a 100× range of connections: 2.0–2.5M whether the client opens 100 or 10,000 connections, while latency climbs with the connection count. The server scales fine; the client has hit its own ceiling.
Metrics captured live during the 2000c benchmark point:
Server CPU (mpstat -P ALL 1 1):
AVG usr: 2.25% sys: 2.27% idle: 93.75%
~80 of 128 cores active at 1-8% each
Server NIC:
TX: 236 MB/s (3.8% of 6250 MB/s ceiling)
RX: 138 MB/s
TCP established on :8083:
~2000 connections
Softirq drops: 0
The server is using about 6% of its CPU and 4% of its NIC bandwidth. It is barely loaded.
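The percentages follow from simple arithmetic on the captured metrics. A small sketch, with hypothetical helper names, assuming decimal units (1 Gbps = 125 MB/s):

```sh
# Link utilization: observed MB/s as a percentage of the line rate in Gbps.
nic_util_pct() (
  awk -v mbs="$1" -v gbps="$2" 'BEGIN { printf "%.1f\n", 100 * mbs / (gbps * 1000 / 8) }'
)

# CPU busy is everything mpstat did not report as idle.
cpu_busy_pct() (
  awk -v idle="$1" 'BEGIN { printf "%.2f\n", 100 - idle }'
)

nic_util_pct 236 50   # server TX on a 50 Gbps link: 3.8
cpu_busy_pct 93.75    # server CPU: 6.25
```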
After irqbalance was stopped, the ENA driver’s default IRQ affinity placed the 16 queues on two CPU clusters:
enp95s0-Tx-Rx-0 → CPU91 (3,113,306 interrupts)
enp95s0-Tx-Rx-1 → CPU92 (3,092,195 interrupts)
enp95s0-Tx-Rx-2 → CPU93 (3,070,023 interrupts)
enp95s0-Tx-Rx-3 → CPU94 (3,058,050 interrupts)
enp95s0-Tx-Rx-4 → CPU95 (3,100,314 interrupts)
enp95s0-Tx-Rx-5 → CPU96 (3,102,083 interrupts)
enp95s0-Tx-Rx-6 → CPU1 (3,113,160 interrupts)
enp95s0-Tx-Rx-7 → CPU2 (3,080,034 interrupts)
enp95s0-Tx-Rx-8 → CPU3 (3,133,334 interrupts)
enp95s0-Tx-Rx-9 → CPU4 (3,157,158 interrupts)
enp95s0-Tx-Rx-10 → CPU5 (3,115,349 interrupts)
enp95s0-Tx-Rx-11 → CPU6 (3,180,217 interrupts)
enp95s0-Tx-Rx-12 → CPU7 (3,134,267 interrupts)
enp95s0-Tx-Rx-13 → CPU8 (3,085,940 interrupts)
enp95s0-Tx-Rx-14 → CPU9 (3,072,699 interrupts)
enp95s0-Tx-Rx-15 → CPU10 (3,125,071 interrupts)
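Interrupt counts are nearly equal across all queues (3.06M to 3.18M each), which means incoming flows are being hashed evenly across the NIC's receive queues; SO_REUSEPORT then spreads the accepted connections across the prefork workers. Hardware interrupts are confined to 16 cores (1-10 and 91-96). RPS/RFS hands packet processing off in software to the cores in the rps_cpus mask, which the setup script sets to all 128.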
This is the state IRQ pinning will improve: we will explicitly assign these 16 queues to cores 112-127, leaving 0-111 clean for fiber workers.
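A minimal sketch of what that pinning could look like, assuming queue N goes to core 112 + N and that the queue names in /proc/interrupts match the `enp95s0-Tx-Rx-N` pattern listed above. `irq_plan` and `apply_irq_plan` are hypothetical helpers; the first only prints the plan, and the second writes it to `/proc/irq/<irq>/smp_affinity_list` and needs root.

```sh
# Print "<irq> <core>" for every Tx-Rx queue of the given NIC.
# Assumption: queue N is pinned to first_core + N.
irq_plan() (
  nic=$1 first_core=$2 interrupts=${3:-/proc/interrupts}
  awk -v nic="$nic" -v first="$first_core" '
    index($NF, nic "-Tx-Rx-") == 1 {
      irq = $1; sub(/:$/, "", irq)      # "123:" -> "123"
      n = split($NF, parts, "-")        # queue index is the last name component
      print irq, first + parts[n]
    }' "$interrupts"
)

# Apply the plan. irqbalance must stay stopped or it will overwrite these values.
apply_irq_plan() {
  irq_plan "$1" "$2" | while read -r irq core; do
    echo "$core" | sudo tee "/proc/irq/$irq/smp_affinity_list" > /dev/null
  done
}
```

Running `irq_plan enp95s0 112` first shows the mapping as a dry run; `apply_irq_plan enp95s0 112` then commits it.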
Metrics during the same 2000c point:
Client CPU (mpstat -P ALL 1 1):
AVG usr: 67.57% sys: 8.63% idle: 16.50%
All 32 cores: 48–79% usr each
Client processes:
node: SUM_CPU 2823% (~28.2 cores consumed)
Client NIC:
RX: 233 MB/s (7.5% of 3125 MB/s ceiling)
Average user CPU is 67% (83.5% non-idle), and the node processes consume the equivalent of about 28 of the 32 cores. The NIC has 92% headroom. The bottleneck is CPU: the 30 Node.js worker threads are saturating the client's cores.
A test with 60 workers confirmed this: oversubscribing the same 32 cores added context-switching overhead, and RPS dropped to 1.7M at 2000 connections.