chasing 1 million rps

a journal of hardware, kernel tuning, and diminishing returns

yes, this is clickbait. kind of.

what “1 million RPS” actually means

“1 million RPS” is not a single number. It depends entirely on what the server is doing per request.

A handler that returns {"status":"ok"} with no I/O can hit 10-20M RPS on a single machine. A handler that does a primary-key SELECT on a warm Postgres index might do 50k-200k RPS before the database becomes the ceiling. A write with fsync — maybe 5k-20k. An aggregation query with a table scan: 500. These are not the same problem and should not share a headline.
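To make the first case concrete, here is a minimal sketch of a handler that does no I/O. It uses the standard net/http package as a stand-in, not the Fiber v3 code from the repo, and the names okBody, simpleHandler, and probe are invented for illustration. The /simple path is borrowed from the journal entries further down.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// okBody is serialized once at startup so the hot path does no JSON encoding.
var okBody = []byte(`{"status":"ok"}`)

// simpleHandler touches no database, cache, or disk. Its cost is HTTP
// parsing, a header write, and copying a fixed byte slice to the socket.
func simpleHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write(okBody)
}

// probe calls the handler in-process (no socket involved) and returns the
// status code and body, which is enough to check the handler's behavior.
func probe() (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/simple", nil)
	simpleHandler(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe()
	fmt.Println(code, body)
}
```

Everything that costs time in this handler lives in the network stack and the HTTP layer, which is why this kind of endpoint measures the server framework and the kernel rather than the application.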

The breakdown matters:

Read-heavy (cache or memory): latency is microseconds, bottleneck is the network stack and CPU cycles in the request path — TCP, kernel syscalls, HTTP parsing, serialization. This is where IRQ pinning, prefork, and buffer tuning show up. This is the regime we’re in for /simple.

Compute-intensive (CPU-bound, no I/O): bottleneck is CPU cores. Throughput scales linearly with cores until you hit scheduling overhead. /compute in this journal is this case — 128 cores saturated, 1.7M RPS.

Read with DB (SELECT): you now have a latency floor set by the database round-trip (0.2ms local, 1-5ms over network). At 1ms average, Little’s Law (in-flight = rate × latency) says that sustaining 1M RPS takes about 1,000 queries in flight at all times, which means roughly 1,000 database connections busy at once. Connection limits, query planning, index hits, and buffer pool size all cap you before the HTTP layer does.

Write with DB (INSERT/UPDATE): add write amplification, WAL flushes, lock contention, replication lag. If every write waits for its own 10ms fsync, a single serialized write path tops out at 100 writes per second, not 1M. Batching and async writes push this up but introduce consistency trade-offs.

Failure handling: a server doing retries, circuit breaking, or fallback logic on each request burns CPU per failure. With one retry per failure, a 0.1% error rate adds only 0.1% load, but a degraded downstream that fails most of its requests sees its load nearly double at exactly the moment it can least absorb it. Failure handling changes the per-request cost model entirely.

Consistency: a linearizable key-value write (single-node, synchronous) is fast. A distributed write requiring quorum adds network round-trips per request. Eventual consistency lets you write to a single replica and replicate async — throughput goes up, guarantees go down.
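The arithmetic behind three of these regimes fits in a few lines. The sketch below is a back-of-envelope calculator, not something from the repo: the function names are invented, latencies are passed as integer microseconds to keep the math exact, and the retry model assumes each attempt fails independently with the same probability, which real outages rarely respect.

```go
package main

import "fmt"

const microsPerSecond = 1_000_000

// inFlight applies Little's Law (L = rate x latency): the number of requests
// that must be in flight to sustain rps at the given average latency.
func inFlight(rps, latencyMicros int64) int64 {
	return (rps*latencyMicros + microsPerSecond - 1) / microsPerSecond // round up
}

// serialWritesPerSecond is the ceiling for one write path that waits for each
// fsync before starting the next, with batchSize writes sharing one fsync.
// fsyncMicros must be greater than zero.
func serialWritesPerSecond(fsyncMicros, batchSize int64) int64 {
	return batchSize * microsPerSecond / fsyncMicros
}

// attemptsPerRequest is the expected number of downstream calls per request
// when every attempt is retried on failure, up to maxRetries times:
// 1 + p + p^2 + ... + p^maxRetries. Assumes independent failures.
func attemptsPerRequest(failRate float64, maxRetries int) float64 {
	total, term := 1.0, 1.0
	for i := 0; i < maxRetries; i++ {
		term *= failRate
		total += term
	}
	return total
}

func main() {
	fmt.Println(inFlight(1_000_000, 1000))          // 1ms round-trip: 1000
	fmt.Println(inFlight(1_000_000, 200))           // 0.2ms local round-trip: 200
	fmt.Println(serialWritesPerSecond(10_000, 1))   // 10ms fsync, no batching: 100
	fmt.Println(serialWritesPerSecond(10_000, 100)) // 100 writes per fsync: 10000
	fmt.Println(attemptsPerRequest(0.001, 1))       // healthy downstream, one retry
	fmt.Println(attemptsPerRequest(0.9, 1))         // degraded downstream, one retry
}
```

The last two lines are the interesting ones: the retry multiplier is negligible while the downstream is healthy and approaches 2x with a single retry once most calls fail.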

The 1M RPS target is a lens, not a finish line. It forces you to be precise about what work the server is actually doing, because vague claims collapse immediately when you try to reproduce them. What workload? What latency? What hardware? What was the client doing?

This journal documents those questions, not the headline number.

what we’re testing

  • HTTP servers in Go — net/http, fasthttp, fiber — with different workloads (tiny JSON, large payloads, CPU-bound computation)
  • Kernel network stack tuning — IRQ affinity, RPS, RFS, SO_REUSEPORT, socket options
  • Infrastructure — single machine vs cross-machine, AWS instance families, NIC limitations
  • Load generation — vegeta, autocannon, wrk — and why the tool matters as much as the server
  • Tradeoffs — prefork vs single process, pipelining vs independent requests, throughput vs latency

the setup

All benchmarks run on AWS (ap-south-2, Hyderabad) unless otherwise noted. Server code is Go + Fiber v3. Load generation uses autocannon (for pipelining tests) or vegeta (for rate-controlled tests).

Source code: ikouchiha47/millionrps

journal

Entries below are in chronological order. Green = done, orange = in progress, grey = planned.


baseline: mac mini, localhost, three servers

Establish a baseline on a Mac mini M2 (10 cores). Compare net/http, fasthttp, and fiber with and without prefork. Measure /read (pool lookup), /list (large payload), and /compute (CPU-bound aggregation).

Result: fiber+prefork: 201k RPS, P50 0.49ms, P99 0.68ms. /compute: 33k RPS, P50 2.88ms. Payload size — not the server — is the real ceiling.

linux confirms it: SO_REUSEPORT actually works

Two c6i.2xlarge on AWS. Server: fiber+prefork 8 workers. Client: vegeta. Confirm SO_REUSEPORT distributes connections across all 8 workers on Linux. Compare P99 against macOS.

Result: P99: 22ms (macOS) → 1.4ms (Linux). All 8 workers active. Still bandwidth-bound at 8-13k RPS with 48KB payloads. /read-only hits 49k RPS — server at 3% CPU, client NIC is the ceiling.
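For readers who have not used SO_REUSEPORT directly, the sketch below shows the mechanism in plain Go: several sockets bind the same address and the Linux kernel spreads incoming connections across them. This is not Fiber’s prefork implementation. It is Linux-only, and the option value 0xf is hard-coded as an assumption that holds on x86-64 and arm64; the helper names listenReusePort and bindTwice are invented.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"
)

// soReusePort is the Linux value of SO_REUSEPORT on x86-64 and arm64.
// Assumption: other platforms and architectures use different values.
const soReusePort = 0xf

// listenReusePort opens a TCP listener with SO_REUSEPORT set before bind,
// so that other sockets with the same option can bind the same address.
func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, soReusePort, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

// bindTwice opens two listeners on the same port. The second bind fails with
// "address already in use" unless SO_REUSEPORT is set on both sockets.
func bindTwice() error {
	first, err := listenReusePort("127.0.0.1:0")
	if err != nil {
		return err
	}
	defer first.Close()
	second, err := listenReusePort(first.Addr().String())
	if err != nil {
		return err
	}
	return second.Close()
}

func main() {
	fmt.Println("second bind error:", bindTwice())
}
```

In a prefork setup each worker process would hold one such listener, which is what lets all 8 workers accept connections on the same port.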

128 cores, NIC queues, RPS and RFS

Upgrade server to c8i.32xlarge (128 vCPU). Switch to autocannon with --pipelining 100. Apply RPS and RFS. Capture live metrics during benchmark.

Result: 634k avg RPS, peaked 1.2M. Server: 1.14% avg CPU, 96% idle. Client (c6i.4xlarge, 16 cores): 82.93% usr CPU, 0% idle. We also used --connections 1000 — the correct value is 5000. Next run will fix this.
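Pipelining explains a lot about these numbers, so it is worth seeing what it does on the wire. The sketch below is an illustration of the idea, not autocannon’s implementation: it writes n requests on one TCP connection before reading any response, then reads the responses in order. The function names and the throwaway in-process server are assumptions made so the example runs on its own.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
)

// pipelined sends n GET requests back to back on a single connection, then
// reads n responses in order. It returns how many came back with status 200.
func pipelined(addr, path string, n int) (int, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	var batch bytes.Buffer
	for i := 0; i < n; i++ {
		fmt.Fprintf(&batch, "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n", path, addr)
	}
	// One write: many small requests can share a handful of TCP segments.
	if _, err := conn.Write(batch.Bytes()); err != nil {
		return 0, err
	}

	ok := 0
	br := bufio.NewReader(conn)
	for i := 0; i < n; i++ {
		resp, err := http.ReadResponse(br, nil)
		if err != nil {
			return ok, err
		}
		io.Copy(io.Discard, resp.Body) // drain so the next response can be parsed
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		}
	}
	return ok, nil
}

// demo runs pipelined against a local test server standing in for /simple.
func demo(n int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"status":"ok"}`))
	}))
	defer srv.Close()
	return pipelined(srv.Listener.Addr().String(), "/simple", n)
}

func main() {
	ok, err := demo(100)
	fmt.Println(ok, err)
}
```

The point to take away is that request count and packet count decouple under pipelining: a hundred requests cost the kernel far fewer than a hundred packets, which inflates RPS relative to what independent requests would achieve.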

2.3M RPS: the --workers flag, connection ramp, and finding the real ceiling

Discovered --workers flag in autocannon. Upgraded client to c6i.8xlarge. Ran connection ramp 1000→2000→5000. Hit 2.3M RPS at p50 38ms. Proved server has 98% CPU headroom — client is the ceiling at 67% avg CPU across 32 cores.

Result: 2,299,162 RPS avg. Server: 2.25% CPU, 236 MB/s TX (3.8% of 6250 MB/s ceiling). Client: 67% avg CPU, 28.2/32 cores consumed. Client is the wall.
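The bandwidth figures in these results are simple unit conversions, sketched below so they can be rechecked. The function names are invented, units are decimal (1 MB = 1,000,000 bytes), and the 50 Gbps line rate is the c8i.32xlarge figure quoted in the next entry.

```go
package main

import "fmt"

// linkMBps converts a NIC line rate in Gbit/s to MB/s (decimal units).
func linkMBps(gbps float64) float64 {
	return gbps * 1000 / 8
}

// utilization returns the fraction of the link consumed by txMBps.
func utilization(txMBps, gbps float64) float64 {
	return txMBps / linkMBps(gbps)
}

// payloadMBps is the bandwidth needed to send rps responses of payloadKB each.
func payloadMBps(rps, payloadKB float64) float64 {
	return rps * payloadKB / 1000
}

func main() {
	fmt.Printf("50 Gbps = %.0f MB/s\n", linkMBps(50))
	fmt.Printf("236 MB/s uses %.1f%% of that link\n", 100*utilization(236, 50))
	fmt.Printf("13k RPS x 48KB = %.0f MB/s\n", payloadMBps(13000, 48))
}
```

The third line revisits the earlier 48KB result: at 13k RPS the payload alone needs 624 MB/s, roughly 5 Gbit/s, which is why that run was bound by bandwidth and not by the server.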

18M RPS on /simple, 1.7M on /compute: matched c8i hardware

Upgrade both server and client to c8i.32xlarge (128 vCPU, 50 Gbps). Run autocannon with 120 workers. Hit 18M RPS on /simple and 1.7M on /compute. Discover the client is the ceiling for /simple, server is the ceiling for /compute.

Result: 18,164,736 RPS on /simple (server 26% CPU). 1,744,862 RPS on /compute (server 94% CPU). For /simple the client was still the wall, even on matched hardware.

IRQ pinning: when it matters and when it doesn't

Full IRQ pinning experiment on c6i.2xlarge (8 vCPU) as server. Three steps: baseline, IRQ pinning only, IRQ + RPS/RFS. Run on /simple and /compute. Read the HAProxy blog. Understand why packet rate — not RPS — is what makes IRQ pinning matter.

Result: IRQ pinning had no meaningful effect on /simple (+1%) and hurt /compute (-9%), because taskset left fewer cores for compute. Root cause: packet rate at our load level is too low to saturate IRQ cores. The HAProxy regime requires millions of packets/sec, not millions of RPS.
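Both IRQ pinning and RPS are configured with hexadecimal CPU bitmasks: /proc/irq/<n>/smp_affinity for an IRQ and /sys/class/net/<dev>/queues/rx-<n>/rps_cpus for a receive queue. On machines with more than 32 CPUs the mask is written as comma-separated 32-bit groups, most significant first. The sketch below only computes the mask string. It does not write to /proc or /sys, and the function name affinityMask and its totalCPUs parameter are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// affinityMask builds the hex bitmask for the given CPU numbers on a machine
// with totalCPUs logical CPUs. CPUs outside [0, totalCPUs) are ignored.
func affinityMask(cpus []int, totalCPUs int) string {
	words := make([]uint32, (totalCPUs+31)/32)
	for _, c := range cpus {
		if c < 0 || c >= totalCPUs {
			continue
		}
		words[c/32] |= uint32(1) << uint(c%32)
	}
	// The kernel expects the most significant 32-bit group first.
	parts := make([]string, len(words))
	for i, w := range words {
		parts[len(words)-1-i] = fmt.Sprintf("%08x", w)
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(affinityMask([]int{0, 1, 2, 3}, 8)) // 0000000f
	fmt.Println(affinityMask([]int{32}, 64))        // 00000001,00000000
}
```

Pinning NIC IRQs to one set of cores and running the server workers on the complement is then a matter of generating two non-overlapping masks, which is also where the lost compute cores in the /compute result come from.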

IRQ on c8i + vegeta without pipelining: still the client

Ran the IRQ experiment on matched c8i hardware. Then switched to vegeta (no pipelining) to generate realistic packet rates. Both hit client ceiling regardless of tuning.

Result: c8i IRQ+RPS/RFS: 18M RPS, flat vs baseline. Vegeta 8 parallel processes: 580k actual RPS ceiling. Server at 5% CPU in both cases.

planned: IRQ pinning + matched hardware

Apply proper IRQ isolation as described in the HAProxy blog. Stop irqbalance and crond. Pin 16 NIC IRQs to dedicated cores. Run fiber workers on clean cores only. Upgrade client to c8i.32xlarge. Waiting for AWS quota increase to 320 vCPU.

Result: pending