2.3M RPS: the --workers flag, connection ramp, and finding the real ceiling

c6i.8xlarge client, 30 workers, 1000→5000 connections — server at 2%, client at 67%

June 04, 2026 — DONE
autocannon workers pipelining c8i c6i ramp client-ceiling

what changed from last run

Last run used autocannon without --workers. The command was:

autocannon --connections 1000 --pipelining 100 --duration 30 \
  "http://SERVER_INTERNAL_IP:8083/simple"

One Node.js event loop, one thread. At 1000 connections it managed 200k RPS, then hit 4.9M at 2000 connections — but at p50 1225ms. A response with 1225ms latency at “4.9M RPS” means requests were sitting in the pipeline for over a second before being counted. The event loop was batching and counting a backlog flush, not measuring steady-state throughput.

The correct command — visible in reference benchmarks — uses --workers:

autocannon --connections 1000 --pipelining 100 --workers 120 --duration 30 \
  "http://SERVER_INTERNAL_IP:8083/simple"

--workers N spawns N Node.js worker threads, each with its own event loop. Each thread manages connections/N TCP connections independently. Less batching overhead, proper per-request accounting, honest latency numbers.

The difference:

approach connections RPS p50 latency
no workers 2000 4,913,152 1225ms
30 workers 1000 2,299,162 38ms

The 4.9M number was a measurement artifact. The 2.3M with p50 38ms is the real server throughput.

setup

Role Instance vCPU RAM Network Interface
Server c8i.32xlarge 128 256 GB 50 Gbps enp95s0
Client c6i.8xlarge 32 64 GB 25 Gbps ens5

Server: fiber v3, prefork, 128 workers.

Server prep — applied once at boot via server_setup.sh:

# stop irqbalance — it continuously reassigns NIC IRQs and will undo any manual affinity
sudo systemctl stop irqbalance
sudo systemctl disable irqbalance

# RPS/RFS — distribute softirq processing across all 128 cores
NIC=$(ip route show default | awk '/default/{print $5}')  # enp95s0 on c8i

for f in /sys/class/net/$NIC/queues/rx-*/rps_cpus; do
    echo "ffffffff,ffffffff,ffffffff,ffffffff" | sudo tee $f > /dev/null
done

echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > /dev/null

for f in /sys/class/net/$NIC/queues/rx-*/rps_flow_cnt; do
    echo 2048 | sudo tee $f > /dev/null
done

Build and start the server:

export PATH=$PATH:/usr/local/go/bin
cd /opt/millionrps/src/http
go build -o fiber_server fiber_server.go
nohup ./fiber_server > /tmp/fiber.log 2>&1 &

Run benchmark from client (autocannon_bench.sh):

# ./autocannon_bench.sh <server-ip> [pipelining] [duration_sec] [workers] [route]
./autocannon_bench.sh SERVER_INTERNAL_IP 100 30 30 simple

The script runs three connection points (1000 → 2000 → 5000) in sequence. Workers default to $(nproc) - 2 — on c6i.8xlarge that’s 30.

Workers auto-detected on client: nproc - 2 = 30.

results

Connection ramp — 30 workers, pipelining 100, 30s per point, /simple endpoint:

connectionsin-flightworkersRPSp50 msp99 msthroughput MB/s
1000100,000302,299,1623889285
2000200,000302,073,53894193257
5000500,000302,229,002240527276

Wide ramp to confirm ceiling (20s per point):

connectionsRPSp50 msp99 ms
1002,502,13138
5002,480,6021743
10002,324,8773897
20002,026,15995229
50002,187,676241697
100002,490,0964851532

RPS is flat across a 100× range of connections — 2.3–2.5M regardless of whether the client opens 100 or 10,000 connections. The server scales fine; the client has hit its own ceiling.

where the server actually is

Metrics captured live during the 2000c benchmark point:

Server CPU (mpstat -P ALL 1 1):
  AVG  usr: 2.25%  sys: 2.27%  idle: 93.75%
  ~80 of 128 cores active at 1-8% each

Server NIC:
  TX: 236 MB/s  (3.8% of 6250 MB/s ceiling)
  RX: 138 MB/s

TCP established on :8083:
  ~2000 connections

Softirq drops: 0

The server is using 6% of its CPU and 4% of its NIC. It has not been loaded at all.

NIC interrupt distribution

After irqbalance was stopped, the ENA driver’s default IRQ affinity placed the 16 queues on two CPU clusters:

enp95s0-Tx-Rx-0   → CPU91   (3,113,306 interrupts)
enp95s0-Tx-Rx-1   → CPU92   (3,092,195 interrupts)
enp95s0-Tx-Rx-2   → CPU93   (3,070,023 interrupts)
enp95s0-Tx-Rx-3   → CPU94   (3,058,050 interrupts)
enp95s0-Tx-Rx-4   → CPU95   (3,100,314 interrupts)
enp95s0-Tx-Rx-5   → CPU96   (3,102,083 interrupts)
enp95s0-Tx-Rx-6   → CPU1    (3,113,160 interrupts)
enp95s0-Tx-Rx-7   → CPU2    (3,080,034 interrupts)
enp95s0-Tx-Rx-8   → CPU3    (3,133,334 interrupts)
enp95s0-Tx-Rx-9   → CPU4    (3,157,158 interrupts)
enp95s0-Tx-Rx-10  → CPU5    (3,115,349 interrupts)
enp95s0-Tx-Rx-11  → CPU6    (3,180,217 interrupts)
enp95s0-Tx-Rx-12  → CPU7    (3,134,267 interrupts)
enp95s0-Tx-Rx-13  → CPU8    (3,085,940 interrupts)
enp95s0-Tx-Rx-14  → CPU9    (3,072,699 interrupts)
enp95s0-Tx-Rx-15  → CPU10   (3,125,071 interrupts)

Interrupt counts are equal across all queues — the kernel is distributing connections evenly via SO_REUSEPORT. Hardware interrupts are confined to 16 cores (1-10 and 91-96). RPS/RFS distributes packet processing to the remaining 112 cores in software.

This is the state IRQ pinning will improve: we will explicitly assign these 16 queues to cores 112-127, leaving 0-111 clean for fiber workers.

where the client actually is

Metrics during the same 2000c point:

Client CPU (mpstat -P ALL 1 1):
  AVG  usr: 67.57%  sys: 8.63%  idle: 16.50%
  All 32 cores: 48–79% usr each

Client processes:
  node: SUM_CPU 2823%  (~28.2 cores consumed)

Client NIC:
  RX: 233 MB/s  (7.5% of 3125 MB/s ceiling)

67% avg CPU, 28 out of 32 cores fully consumed. The NIC has 92% headroom. The bottleneck is pure CPU — the 30 Node.js worker threads are burning cores.

Testing 60 workers confirmed this: more workers on the same 32 cores caused context switching overhead and RPS dropped to 1.7M at 2000c.

Finding With 30 workers on 32 cores the client hits its CPU ceiling at ~2.3M RPS. The server at this point is at 2.25% CPU and 3.8% NIC. The c6i.8xlarge client cannot measure the server's actual capacity. Every number in this entry is a client measurement, not a server measurement.
Note on worker count vs core count autocannon workers are I/O-bound, not CPU-bound — they spend most time waiting for responses. But at high connection counts the event loop overhead and epoll syscalls consume enough CPU that 30 workers across 32 cores saturates the machine. Adding more workers beyond core count caused degradation. The right ratio depends on the workload; for this benchmark on c6i.8xlarge, 30 workers was the effective ceiling.

↑ Back to Journal