Chasing 1 Million RPS

a journal of hardware, kernel tuning, and diminishing returns

Posted by ikouchiha47 on June 02, 2026 · 4 mins read

yes, this is clickbait. kind of.

What “1 Million RPS” Actually Means

“1 million RPS” is not a single number. It depends entirely on what the server is doing per request.

A handler that returns {"status":"ok"} with no I/O can hit 10-20M RPS on a single machine. A handler that does a primary-key SELECT on a warm Postgres index might do 50k-200k RPS before the database becomes the ceiling. A write with fsync: maybe 5k-20k RPS. An aggregation query with a table scan: maybe 500 RPS. These are not the same problem and should not share a headline.

The breakdown matters:

Read-heavy (cache or memory): latency is microseconds, bottleneck is the network stack and CPU cycles in the request path — TCP, kernel syscalls, HTTP parsing, serialization. This is where IRQ pinning, prefork, and buffer tuning show up. This is the regime we’re in for /simple.

Compute-intensive (CPU-bound, no I/O): bottleneck is CPU cores. Throughput scales linearly with cores until you hit scheduling overhead. /compute in this journal is this case — 128 cores saturated, 1.7M RPS.

Read with DB (SELECT): you now have a latency floor set by the database round-trip (0.2ms local, 1-5ms over network). At a 1ms average, Little’s Law (requests in flight = throughput × latency) says 1M RPS needs about 1,000 queries in flight at once, which means roughly 1,000 busy connections if each connection carries one query at a time. Connection limits, query planning, index hits, and buffer pool size all cap you before the HTTP layer does.

Write with DB (INSERT/UPDATE): add write amplification, WAL flushes, lock contention, replication lag. An fsync that takes 10ms caps a serialized write path at 100 writes per second, not 1M. Batching and async writes push this up but introduce consistency trade-offs.

Failure handling: a server doing retries, circuit breaking, or fallback logic on each request burns CPU per failure. At high RPS even a 0.1% error rate is a lot of failures (1,000 per second at 1M RPS), and each retry doubles the cost of a failed request on a downstream that is already degraded. Failure handling changes the per-request cost model entirely.

Consistency: a linearizable key-value write (single-node, synchronous) is fast. A distributed write requiring quorum adds network round-trips per request. Eventual consistency lets you write to a single replica and replicate async — throughput goes up, guarantees go down.

The 1M RPS target is a lens, not a finish line. It forces you to be precise about what work the server is actually doing, because vague claims collapse immediately when you try to reproduce them. What workload? What latency? What hardware? What was the client doing?

This journal documents those questions, not the headline number.

What We’re Testing

  • HTTP servers in Go — net/http, fasthttp, fiber — with different workloads (tiny JSON, large payloads, CPU-bound computation)
  • Kernel network stack tuning — IRQ affinity, RPS, RFS, SO_REUSEPORT, socket options
  • Infrastructure — single machine vs cross-machine, AWS instance families, NIC limitations
  • Load generation — vegeta, autocannon, wrk — and why the tool matters as much as the server
  • Tradeoffs — prefork vs single process, pipelining vs independent requests, throughput vs latency

The Setup

All benchmarks run on AWS (ap-south-2, Hyderabad) unless otherwise noted. Server code is Go + Fiber v3. Load generation uses autocannon (for pipelining tests) or vegeta (for rate-controlled tests).

Source code: ikouchiha47/millionrps

Journal

Entries below are in chronological order. Green = done, orange = in progress, grey = planned.


baseline: mac mini, localhost, three servers

Establish a baseline on a Mac mini M2 (10 cores). Compare net/http, fasthttp, and fiber with and without prefork. Measure /read (pool lookup), /list (large payload), and /compute (CPU-bound aggregation).

Result: fiber+prefork: 201k RPS, P50 0.49ms, P99 0.68ms. /compute: 33k RPS, P50 2.88ms. Payload size — not the server — is the real ceiling.

linux confirms it: SO_REUSEPORT actually works

Two c6i.2xlarge on AWS. Server: fiber+prefork 8 workers. Client: vegeta. Confirm SO_REUSEPORT distributes connections across all 8 workers on Linux. Compare P99 against macOS.

Result: P99: 22ms (macOS) → 1.4ms (Linux). All 8 workers active. Still bandwidth-bound at 8-13k RPS with 48KB payloads. /read-only hits 49k RPS — server at 3% CPU, client NIC is the ceiling.

128 cores, NIC queues, RPS and RFS

Upgrade server to c8i.32xlarge (128 vCPU). Switch to autocannon with --pipelining 100. Apply RPS and RFS. Capture live metrics during benchmark.

Result: 634k avg RPS, peaked 1.2M. Server: 1.14% avg CPU, 96% idle. Client (c6i.4xlarge, 16 cores): 82.93% usr CPU, 0% idle. We also used --connections 1000 — the correct value is 5000. Next run will fix this.
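Pipelining is what lets a small client push these numbers: several requests are written to one connection before any response is read, so fewer packets and syscalls carry more requests. A minimal sketch of the idea, using only the Go standard library against a throwaway test server (the handler, the `Host` value, and the `pipelined` helper are all made up for illustration, and this is not how autocannon is implemented):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pipelined writes depth GET requests on one TCP connection in a single
// write, then reads the responses back in order. It returns how many
// responses came back with status 200.
func pipelined(depth int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, `{"status":"ok"}`)
		}))
	defer srv.Close()

	conn, err := net.Dial("tcp", srv.Listener.Addr().String())
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	req := "GET /simple HTTP/1.1\r\nHost: bench\r\n\r\n"
	if _, err := io.WriteString(conn, strings.Repeat(req, depth)); err != nil {
		return 0, err
	}

	br := bufio.NewReader(conn)
	ok := 0
	for i := 0; i < depth; i++ {
		resp, err := http.ReadResponse(br, nil)
		if err != nil {
			return ok, err
		}
		io.Copy(io.Discard, resp.Body) // drain so the next response can be parsed
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		}
	}
	return ok, nil
}

func main() {
	n, err := pipelined(100)
	if err != nil {
		panic(err)
	}
	fmt.Println("responses received:", n)
}
```

Real browsers and most production clients do not pipeline, which is why the journal later repeats the tests without it.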

2.3M RPS: the --workers flag, connection ramp, and finding the real ceiling

Discovered --workers flag in autocannon. Upgraded client to c6i.8xlarge. Ran connection ramp 1000→2000→5000. Hit 2.3M RPS at p50 38ms. Proved server has 98% CPU headroom — client is the ceiling at 67% avg CPU across 32 cores.

Result: 2,299,162 RPS avg. Server: 2.25% CPU, 236 MB/s TX (3.8% of 6250 MB/s ceiling). Client: 67% avg CPU, 28.2/32 cores consumed. Client is the wall.

18M RPS on /simple, 1.7M on /compute: matched c8i hardware

Upgrade both server and client to c8i.32xlarge (128 vCPU, 50 Gbps). Run autocannon with 120 workers. Hit 18M RPS on /simple and 1.7M on /compute. Discover the client is the ceiling for /simple, server is the ceiling for /compute.

Result: 18,164,736 RPS on /simple (server 26% CPU). 1,744,862 RPS on /compute (server 94% CPU). For /simple the client has been the wall on every hardware pairing tested so far.

IRQ pinning: when it matters and when it doesn't

Full IRQ pinning experiment on c6i.2xlarge (8 vCPU) as server. Three steps: baseline, IRQ pinning only, IRQ + RPS/RFS. Run on /simple and /compute. Read the HAProxy blog. Understand why packet rate — not RPS — is what makes IRQ pinning matter.

Result: IRQ pinning had no meaningful effect on /simple (+1%). Hurt /compute (-9%) due to taskset reducing compute cores. Root cause: packet rate at our load level is too low to saturate IRQ cores. The HAProxy regime requires millions of packets/sec, not millions of RPS.
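On Linux, pinning an IRQ comes down to writing a CPU bitmask to /proc/irq/<n>/smp_affinity: hex, in comma-separated 32-bit words, most significant word first. The sketch below only builds that mask string, since the write itself needs root. `affinityMask` is a hypothetical helper, IRQ number 42 is a placeholder, and CPU numbers are assumed to be non-negative.

```go
package main

import (
	"fmt"
	"strings"
)

// affinityMask returns the smp_affinity bitmask for the given CPUs as
// comma-separated 32-bit hex words, most significant word first.
func affinityMask(cpus []int) string {
	maxCPU := 0
	for _, c := range cpus {
		if c > maxCPU {
			maxCPU = c
		}
	}
	words := make([]uint32, maxCPU/32+1)
	for _, c := range cpus {
		words[c/32] |= uint32(1) << (uint(c) % 32)
	}
	parts := make([]string, 0, len(words))
	for i := len(words) - 1; i >= 0; i-- {
		parts = append(parts, fmt.Sprintf("%08x", words[i]))
	}
	return strings.Join(parts, ",")
}

func main() {
	// Placeholder IRQ number; real ones come from /proc/interrupts.
	irq := 42
	fmt.Printf("echo %s > /proc/irq/%d/smp_affinity\n",
		affinityMask([]int{0, 1}), irq)
}
```

The multi-word form only matters on boxes with more than 32 CPUs, which is exactly the c8i.32xlarge case in this journal.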

IRQ on c8i + vegeta without pipelining: still the client

Ran the IRQ experiment on matched c8i hardware. Then switched to vegeta (no pipelining) to generate realistic packet rates. Both hit client ceiling regardless of tuning.

Result: c8i IRQ+RPS/RFS: 18M RPS, flat vs baseline. Vegeta 8 parallel processes: 580k actual RPS ceiling. Server at 5% CPU in both cases.

switching to /read: c6in.8xlarge, 50 Gbps, and IRQ pinning on a loaded NIC

Switched from /simple (16 bytes) to /read (4.5KB) on c6in.8xlarge (50 Gbps dedicated NIC). Used autocannon without pipelining. Hit 790k RPS at 500 connections with NIC at 56% utilization. Applied IRQ pinning. Flat.

Result: 790k RPS avg at 500c, server TX 3.49 GB/s (56% of 50 Gbps ceiling), 85% CPU busy. IRQ pinning: within noise across all connection counts. 352k interrupts/sec — IRQ cores not saturated.

403 seconds of waste, per 60 seconds of work

Ran Go pprof against the /read handler at 800k RPS. CPU profile showed gofakeit at 2.25% — looked minor. Block profile showed 403.90s of goroutine wait time in 60 real seconds, all on one mutex, all from one line of code. Fixed it. RPS didn't move on no-prefork. Re-enabled prefork on c8i.32xlarge (128 workers): 1.33M RPS vs 1.01M no-prefork. +31% at 10k connections.

Result: block.prof: gofakeit mutex = 403.90s blocked / 60s real = 6.7 cores wasted. Fix: 807k → 809k (flat — write syscall is the ceiling, not the mutex). Prefork 128w on c8i: 1,329,306 RPS at 10k connections vs 1,011,994 no-prefork (+31%). Prefork wins at every connection count on 128 cores.

GOMAXPROCS sets fiber's worker count, not the thread count

Hypothesis: setting GOMAXPROCS=1 per prefork worker would cut its ~15 OS threads and improve cache locality. Result: only 2 processes started. I assumed the fork machinery broke. It didn't — the fiber v3 source spawns runtime.GOMAXPROCS(0) children, so GOMAXPROCS=1 means one worker. Each child already runs GOMAXPROCS(1) regardless.

Result: GOMAXPROCS=128: 129 processes (1 master + 128 workers), 1,333,453 RPS at 10k connections. GOMAXPROCS=1: 2 processes (1 master + 1 worker), 110,627 RPS at 100c declining to 87,856 at 10k. The env var controls worker count, not per-worker threading — each child overrides to GOMAXPROCS(1) in the source.

Three things that don't move /read past the NIC

/read at 4.5KB hits 1.32M RPS = 6.1 GB/s = 98% of the 50 Gbps NIC. Three things that should plausibly raise it do not: per-worker thread count is flat across 100c-5000c (epoll), 32 workers match 128 workers (NIC-bound), and pipelining 1→50 stays flat. The same pipelining sweep on /simple (16B) goes 4.4M → 17M, because that endpoint is packet-bound, not bandwidth-bound.

Result: Per-worker threads: ~10 (c6in) / ~15 (c8i), flat across 50× connection range. 32 vs 128 workers on c8i /read: 1,322,240 vs 1,322,291 RPS — identical. Pipelining 1→50 on /read: flat ~1.33M. On /simple: 4.4M → 16.9M. The /read ceiling is NIC bandwidth at every angle.
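The NIC-bound claim can be sanity-checked with a few lines of arithmetic. The sketch below uses the 4520B payload size quoted later in the journal and counts payload bytes only; on that basis 1.32M RPS is about 5.97 GB/s, a little under the 6.1 GB/s quoted above, which presumably also counts HTTP headers and framing (an assumption, the journal does not say). `payloadRate` and `linkFraction` are made-up names.

```go
package main

import "fmt"

// payloadRate is response payload bytes per second at a given RPS.
func payloadRate(rps, payloadBytes float64) float64 {
	return rps * payloadBytes
}

// linkFraction is the share of a link (in Gbps) used by bytesPerSec.
func linkFraction(bytesPerSec, linkGbps float64) float64 {
	return bytesPerSec * 8 / (linkGbps * 1e9)
}

func main() {
	const linkGbps = 50.0

	read := payloadRate(1.32e6, 4520)
	fmt.Printf("/read payload only: %.2f GB/s, %.1f%% of link\n",
		read/1e9, 100*linkFraction(read, linkGbps))

	// The journal's own on-wire figure.
	fmt.Printf("/read at 6.1 GB/s: %.1f%% of link\n",
		100*linkFraction(6.1e9, linkGbps))
}
```

Either way the link is more than 95% full, so adding workers or pipelining has nothing left to unlock on /read.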

gzip moved the wall, and exposed a lie

Pre-gzipped the /read pool (4520B → 2289B, 1.97x) to cut bytes on a CPU-bound box. With a c6in client gzip HALVED RPS (794k → 418k) — the client was the bottleneck and decompression overloaded it. Swapped to a c8i.32xlarge client and the real /read ceiling appeared: 1.31M RPS, NIC-bound at 90%, not the 790k 'CPU-bound' from entry 08. At that real ceiling gzip raised RPS 32% (1.31M → 1.73M) by halving NIC load and shifting the wall to CPU. MTU was already 9001, so both payloads are a single packet — the gzip win is bytes copied, not syscalls saved.

Result: Compression: 4520B → 2289B (1.97x). c6in client: /read 794,310 vs /read-gz 417,619 — gzip halved it. c8i client: /read 1,311,710 @ 90% NIC (NIC-bound), /read-gz 1,731,000 @ 80% CPU (CPU-bound), +32%. Per-request server CPU: raw 20.1µs, gz 14.8µs. The 'c6in /read = 790k, CPU-bound' from entry 08 was a client-bound artifact.

planned: IRQ pinning + matched hardware

Apply proper IRQ isolation as described in the HAProxy blog. Stop irqbalance and crond. Pin 16 NIC IRQs to dedicated cores. Run fiber workers on clean cores only. Upgrade client to c8i.32xlarge. Waiting for AWS quota increase to 320 vCPU.

Result: pending