Chasing 1 Million RPS

yes, this is clickbait. kind of.

What “1 Million RPS” Actually Means

“1 million RPS” is not a single number. It depends entirely on what the server is doing per request.

A handler that returns {"status":"ok"} with no I/O can hit 10-20M RPS on a single machine. A handler that does a primary-key SELECT on a warm Postgres index might do 50k-200k RPS before the database becomes the ceiling. A write with fsync — maybe 5k-20k. An aggregation query with a table scan: 500. These are not the same problem and should not share a headline.
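
For reference, this is roughly what "no I/O" means in code. A minimal sketch using net/http from the standard library rather than the Fiber server used in the benchmarks; the /ok route and the okHandler and probe names are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// okBody is built once, so the handler does no per-request encoding.
var okBody = []byte(`{"status":"ok"}`)

// okHandler touches no database, cache, or disk. Its cost is the HTTP
// machinery around it: parsing, header handling, and the write syscall.
func okHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_, _ = w.Write(okBody)
}

// probe calls the handler in-process and returns the status code and body.
func probe() (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/ok", nil)
	okHandler(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe()
	fmt.Println(code, body)
}
```

Everything that limits this handler lives outside the function body, which is why the rest of the journal spends so much time on the kernel, the NIC, and the load generator.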

The breakdown matters:

Read-heavy (cache or memory): latency is microseconds, bottleneck is the network stack and CPU cycles in the request path — TCP, kernel syscalls, HTTP parsing, serialization. This is where IRQ pinning, prefork, and buffer tuning show up. This is the regime we’re in for /simple.

Compute-intensive (CPU-bound, no I/O): bottleneck is CPU cores. Throughput scales linearly with cores until you hit scheduling overhead. /compute in this journal is this case — 128 cores saturated, 1.7M RPS.

Read with DB (SELECT): you now have a latency floor set by the database round-trip (0.2ms local, 1-5ms over network). At a 1ms average, Little’s Law (requests in flight = throughput × latency) says 1M RPS needs 1,000 queries in flight at once, which means roughly 1,000 busy database connections if each carries one query at a time. Connection limits, query planning, index hits, and buffer pool size all cap you before the HTTP layer does.
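
The Little's Law arithmetic is easy to check. A small sketch, using integer microseconds to avoid float rounding; the 100-connection pool in the last line is a hypothetical size, not something measured in this journal.

```go
package main

import "fmt"

// inFlight applies Little's Law (L = lambda * W): the average number of
// requests in flight equals throughput times average latency.
func inFlight(rps, latencyMicros int64) int64 {
	return rps * latencyMicros / 1_000_000
}

// maxRPS inverts it: the throughput ceiling for a fixed number of
// connections when each connection carries one query at a time.
func maxRPS(connections, latencyMicros int64) int64 {
	if latencyMicros <= 0 {
		return 0
	}
	return connections * 1_000_000 / latencyMicros
}

func main() {
	fmt.Println(inFlight(1_000_000, 200))   // 0.2ms local round-trip
	fmt.Println(inFlight(1_000_000, 1_000)) // 1ms average
	fmt.Println(inFlight(1_000_000, 5_000)) // 5ms over the network
	fmt.Println(maxRPS(100, 1_000))         // hypothetical 100-connection pool at 1ms
}
```

Read the other way round, a pool of 100 connections at 1ms per query cannot exceed 100k queries per second no matter how fast the HTTP layer is.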

Write with DB (INSERT/UPDATE): add write amplification, WAL flushes, lock contention, replication lag. If every write waits on its own 10ms fsync, a serialized write path tops out at 100 writes per second, not 1M. Batching and async writes push this up but introduce consistency trade-offs.
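
The fsync ceiling and the effect of batching can be put into numbers. A back-of-envelope sketch, assuming one serialized fsync per batch and nothing else on the write path; real databases overlap flushes, so treat these as rough bounds rather than predictions.

```go
package main

import "fmt"

// writesPerSecond is the ceiling for a single write path that issues one
// fsync per batch and waits for it before starting the next batch.
func writesPerSecond(fsyncMillis, batchSize int64) int64 {
	if fsyncMillis <= 0 || batchSize <= 0 {
		return 0
	}
	return batchSize * 1000 / fsyncMillis
}

// batchSizeFor returns the smallest batch that reaches the target rate
// under the same assumption (ceiling division).
func batchSizeFor(targetPerSecond, fsyncMillis int64) int64 {
	return (targetPerSecond*fsyncMillis + 999) / 1000
}

func main() {
	fmt.Println(writesPerSecond(10, 1))      // one write per 10ms fsync
	fmt.Println(writesPerSecond(10, 100))    // 100 writes share each fsync
	fmt.Println(batchSizeFor(1_000_000, 10)) // batch needed for 1M writes/s
}
```

The last number is the consistency trade-off in concrete form: at 10ms per fsync, 1M writes per second means about 10,000 writes ride on every flush, and all of them are at risk together if it fails.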

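Failure handling: a server doing retries, circuit breaking, or fallback logic on each request burns CPU per failure. Retries also add load exactly where it hurts: at 1M RPS a 0.1% error rate is already 1,000 failed requests per second, and a single retry per failure doubles the traffic hitting a downstream that is failing everything it receives. Failure handling changes the per-request cost model entirely.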
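
How much a retry policy amplifies load depends on the fraction of attempts that fail against the downstream being hit. A minimal sketch, assuming every attempt fails independently with probability p and retries are immediate (no backoff, no retry budget); both are simplifications introduced here.

```go
package main

import "fmt"

// attemptsPerRequest returns the expected number of attempts per original
// request when a failed attempt is retried up to maxRetries times and each
// attempt fails with probability p: 1 + p + p^2 + ... + p^maxRetries.
func attemptsPerRequest(p float64, maxRetries int) float64 {
	total, term := 1.0, 1.0
	for i := 0; i < maxRetries; i++ {
		term *= p
		total += term
	}
	return total
}

// near compares floats with a small tolerance.
func near(a, b float64) bool {
	d := a - b
	return d < 1e-9 && d > -1e-9
}

func main() {
	fmt.Printf("%.3f\n", attemptsPerRequest(0.001, 1)) // healthy downstream
	fmt.Printf("%.3f\n", attemptsPerRequest(0.5, 3))   // half of attempts failing
	fmt.Printf("%.3f\n", attemptsPerRequest(1.0, 1))   // fully failing, one retry
	fmt.Printf("%.3f\n", attemptsPerRequest(1.0, 3))   // fully failing, three retries
}
```

The multiplier is negligible while the downstream is healthy and reaches 1 + maxRetries when it is failing everything, which is the moment it can least afford extra traffic.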

Consistency: a linearizable key-value write (single-node, synchronous) is fast. A distributed write requiring quorum adds network round-trips per request. Eventual consistency lets you write to a single replica and replicate async — throughput goes up, guarantees go down.

The 1M RPS target is a lens, not a finish line. It forces you to be precise about what work the server is actually doing, because vague claims collapse immediately when you try to reproduce them. What workload? What latency? What hardware? What was the client doing?

This journal documents those questions, not the headline number.

What We’re Testing

HTTP servers in Go — net/http, fasthttp, fiber — with different workloads (tiny JSON, large payloads, CPU-bound computation)
Kernel network stack tuning — IRQ affinity, RPS, RFS, SO_REUSEPORT, socket options
Infrastructure — single machine vs cross-machine, AWS instance families, NIC limitations
Load generation — vegeta, autocannon, wrk — and why the tool matters as much as the server
Tradeoffs — prefork vs single process, pipelining vs independent requests, throughput vs latency

The Setup

All benchmarks run on AWS (ap-south-2, Hyderabad) unless otherwise noted. Server code is Go + Fiber v3. Load generation uses autocannon (for pipelining tests) or vegeta (for rate-controlled tests).

Source code: ikouchiha47/millionrps

Chapters

Each chapter is a different class of high-throughput problem. Same hardware, same Go, different bottlenecks.

Chapter 1 — Simple HTTP: a handler returning a large JSON payload. No I/O, no computation. Ceiling is the network stack: IRQ affinity, prefork, NIC bandwidth, compression.

Chapter 2 — Fan-out (SSE likes): one write → N subscribers. The metric shifts from RPS to events delivered per second. Bottlenecks are goroutine contention, IPC overhead, and the cost of fanning a single event to 64K open connections.

Chapter 3 — Metrics ingestion (planned): write-heavy, time-series, aggregation at ingestion. Different shape again.

Journal

Entries below are in chronological order. Green = done, orange = in progress, grey = planned.

Chapter 1 Simple HTTP

May 15, 2026 DONE go fiber fasthttp net/http macos localhost wrk

baseline: mac mini, localhost, three servers

Establish a baseline on a Mac mini M2 (10 cores). Compare net/http, fasthttp, and fiber with and without prefork. Measure /read (pool lookup), /list (large payload), and /compute (CPU-bound aggregation).

Result: fiber+prefork: 201k RPS, P50 0.49ms, P99 0.68ms. /compute: 33k RPS, P50 2.88ms. Payload size — not the server — is the real ceiling.

June 01, 2026 DONE linux aws c6i SO_REUSEPORT prefork vegeta terraform

linux confirms it: SO_REUSEPORT actually works

Two c6i.2xlarge on AWS. Server: fiber+prefork 8 workers. Client: vegeta. Confirm SO_REUSEPORT distributes connections across all 8 workers on Linux. Compare P99 against macOS.

Result: P99: 22ms (macOS) → 1.4ms (Linux). All 8 workers active. Still bandwidth-bound at 8-13k RPS with 48KB payloads. /read-only hits 49k RPS — server at 3% CPU, client NIC is the ceiling.

June 03, 2026 DONE c8i aws RPS RFS IRQ autocannon pipelining kernel

128 cores, NIC queues, RPS and RFS

Upgrade server to c8i.32xlarge (128 vCPU). Switch to autocannon with --pipelining 100. Apply RPS and RFS. Capture live metrics during benchmark.

Result: 634k avg RPS, peaked 1.2M. Server: 1.14% avg CPU, 96% idle. Client (c6i.4xlarge, 16 cores): 82.93% usr CPU, 0% idle. We also used --connections 1000 — the correct value is 5000. Next run will fix this.
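
RPS is configured by writing a hex CPU bitmask to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus, and IRQ affinity uses the same mask format in /proc/irq/<n>/smp_affinity. Getting the mask right for 128 CPUs by hand is error-prone, so here is a small sketch that builds it. The cpuMask helper is a made-up name, not part of the benchmark repo, and the sketch only formats the string; it does not write to sysfs.

```go
package main

import (
	"fmt"
	"strings"
)

// cpuMask renders a set of CPU indexes as the hex bitmask the kernel
// expects: 32-bit groups separated by commas, most significant group first.
func cpuMask(cpus []int) string {
	highest := -1
	for _, c := range cpus {
		if c > highest {
			highest = c
		}
	}
	if highest < 0 {
		return "0"
	}
	words := make([]uint32, highest/32+1)
	for _, c := range cpus {
		if c >= 0 {
			words[c/32] |= 1 << uint(c%32)
		}
	}
	parts := make([]string, 0, len(words))
	for i := len(words) - 1; i >= 0; i-- {
		if i == len(words)-1 {
			parts = append(parts, fmt.Sprintf("%x", words[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%08x", words[i])) // zero-padded
		}
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(cpuMask([]int{0, 1, 2, 3})) // CPUs 0-3
	fmt.Println(cpuMask([]int{0, 32}))      // one CPU in each of two groups
}
```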

June 04, 2026 DONE autocannon workers pipelining c8i c6i ramp client-ceiling

2.3M RPS: the --workers flag, connection ramp, and finding the real ceiling

Discovered --workers flag in autocannon. Upgraded client to c6i.8xlarge. Ran connection ramp 1000→2000→5000. Hit 2.3M RPS at p50 38ms. Proved server has 98% CPU headroom — client is the ceiling at 67% avg CPU across 32 cores.

Result: 2,299,162 RPS avg. Server: 2.25% CPU, 236 MB/s TX (3.8% of 6250 MB/s ceiling). Client: 67% avg CPU, 28.2/32 cores consumed. Client is the wall.

June 05, 2026 DONE c8i autocannon workers compute simple client-ceiling Little's-Law

18M RPS on /simple, 1.7M on /compute: matched c8i hardware

Upgrade both server and client to c8i.32xlarge (128 vCPU, 50 Gbps). Run autocannon with 120 workers. Hit 18M RPS on /simple and 1.7M on /compute. Discover the client is the ceiling for /simple, server is the ceiling for /compute.

Result: 18,164,736 RPS on /simple (server 26% CPU). 1,744,862 RPS on /compute (server 94% CPU). For /simple the client is always the wall regardless of hardware.

June 06, 2026 DONE IRQ irqbalance taskset RPS RFS c6i pipelining packet-rate haproxy

IRQ pinning: when it matters and when it doesn't

Full IRQ pinning experiment on c6i.2xlarge (8 vCPU) as server. Three steps: baseline, IRQ pinning only, IRQ + RPS/RFS. Run on /simple and /compute. Read the HAProxy blog. Understand why packet rate — not RPS — is what makes IRQ pinning matter.

Result: IRQ pinning had no meaningful effect on /simple (+1%). Hurt /compute (-9%) due to taskset reducing compute cores. Root cause: packet rate at our load level is too low to saturate IRQ cores. The HAProxy regime requires millions of packets/sec, not millions of RPS.

June 07, 2026 DONE c8i IRQ vegeta pipelining packet-rate client-ceiling

IRQ on c8i + vegeta without pipelining: still the client

Ran the IRQ experiment on matched c8i hardware. Then switched to vegeta (no pipelining) to generate realistic packet rates. Both hit client ceiling regardless of tuning.

Result: c8i IRQ+RPS/RFS: 18M RPS, flat vs baseline. Vegeta 8 parallel processes: 580k actual RPS ceiling. Server at 5% CPU in both cases.

June 08, 2026 DONE c6in read NIC IRQ pipelining autocannon profiling

switching to /read: c6in.8xlarge, 50 Gbps, and IRQ pinning on a loaded NIC

Switched from /simple (16 bytes) to /read (4.5KB) on c6in.8xlarge (50 Gbps dedicated NIC). Used autocannon without pipelining. Hit 790k RPS at 500 connections with the NIC at 56% utilization (3.49 GB/s TX). Applied IRQ pinning. Flat.

Result: 790k RPS avg at 500c, server TX 3.49 GB/s (56% of 50 Gbps ceiling), 85% CPU busy. IRQ pinning: within noise across all connection counts. 352k interrupts/sec — IRQ cores not saturated.

June 09, 2026 DONE profiling pprof gofakeit mutex block-profile c6in prefork

403 seconds of waste, per 60 seconds of work

Ran Go pprof against the /read handler at 800k RPS. CPU profile showed gofakeit at 2.25% — looked minor. Block profile showed 403.90s of goroutine wait time in 60 real seconds, all on one mutex, all from one line of code. Fixed it. RPS didn't move on no-prefork. Re-enabled prefork on c8i.32xlarge (128 workers): 1.33M RPS vs 1.01M no-prefork. +31% at 10k connections.

Result: block.prof: gofakeit mutex = 403.90s blocked / 60s real = 6.7 cores wasted. Fix: 807k → 809k (flat — write syscall is the ceiling, not the mutex). Prefork 128w on c8i: 1,329,306 RPS at 10k connections vs 1,011,994 no-prefork (+31%). Prefork wins at every connection count on 128 cores.
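
The CPU profile and the block profile answer different questions, which is why the mutex looked minor in one and huge in the other. A minimal sketch of collecting a block profile around a deliberately contended mutex, plus the "cores wasted" arithmetic. This is not the /read handler: the contended section is a stand-in sleep, and the function names are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contendedBlockProfile runs workers that all serialize on one mutex and
// returns the resulting block profile in pprof's binary format.
func contendedBlockProfile(workers, iterations int) ([]byte, error) {
	runtime.SetBlockProfileRate(1) // record every blocking event
	defer runtime.SetBlockProfileRate(0)

	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				time.Sleep(10 * time.Microsecond) // stand-in for work under the lock
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	var buf bytes.Buffer
	if err := pprof.Lookup("block").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// coresBlocked converts total goroutine wait time into the average number
// of goroutines that were stuck waiting at any moment.
func coresBlocked(blockedSeconds, wallSeconds float64) float64 {
	return blockedSeconds / wallSeconds
}

func main() {
	prof, err := contendedBlockProfile(8, 100)
	if err != nil {
		panic(err)
	}
	fmt.Printf("block profile: %d bytes\n", len(prof))
	fmt.Printf("blocked goroutines on average: %.1f\n", coresBlocked(403.90, 60))
}
```

Written to a file, the profile bytes can be opened with go tool pprof. Blocked goroutines burn no CPU while they wait, which is why the same mutex barely registered in the CPU profile.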

June 10, 2026 DONE GOMAXPROCS prefork threads fiber source

GOMAXPROCS sets fiber's worker count, not the thread count

Hypothesis: setting GOMAXPROCS=1 per prefork worker would cut its ~15 OS threads and improve cache locality. Result: only 2 processes started. I assumed the fork machinery broke. It didn't — the fiber v3 source spawns runtime.GOMAXPROCS(0) children, so GOMAXPROCS=1 means one worker. Each child already runs GOMAXPROCS(1) regardless.

Result: GOMAXPROCS=128: 129 processes (1 master + 128 workers), 1,333,453 RPS at 10k connections. GOMAXPROCS=1: 2 processes (1 master + 1 worker), 110,627 RPS at 100c declining to 87,856 at 10k. The env var controls worker count, not per-worker threading — each child overrides to GOMAXPROCS(1) in the source.
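
The mechanism is easier to see in isolation. runtime.GOMAXPROCS(0) only reads the current setting, which the Go runtime initializes from the GOMAXPROCS environment variable when it is set, so any code that sizes a worker pool from it inherits the env var. A minimal sketch of that pattern; workersFromRuntime and setProcs are made-up names and this is not Fiber's code.

```go
package main

import (
	"fmt"
	"runtime"
)

// workersFromRuntime sizes a worker pool the way the entry above describes:
// one worker per GOMAXPROCS. Passing 0 queries the value without changing it.
func workersFromRuntime() int {
	return runtime.GOMAXPROCS(0)
}

// setProcs changes the setting and returns the previous value.
func setProcs(n int) int {
	return runtime.GOMAXPROCS(n)
}

func main() {
	fmt.Println("workers by default:", workersFromRuntime())

	prev := setProcs(1) // same effect as starting the process with GOMAXPROCS=1
	fmt.Println("workers with GOMAXPROCS=1:", workersFromRuntime())
	setProcs(prev)
}
```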

June 11, 2026 DONE threads netpoller GOMAXPROCS pipelining NIC bandwidth c8i c6in

Three things that don't move /read past the NIC

/read at 4.5KB hits 1.32M RPS = 6.1 GB/s = 98% of the 50 Gbps NIC. Three things that should plausibly raise it, don't: per-worker thread count is flat across 100c-5000c (epoll), 32 workers match 128 workers (NIC-bound), and pipelining 1→50 stays flat. The same pipelining sweep on /simple (16B) goes 4.4M → 17M, because that endpoint is packet-bound, not bandwidth-bound.

Result: Per-worker threads: ~10 (c6in) / ~15 (c8i), flat across 50× connection range. 32 vs 128 workers on c8i /read: 1,322,240 vs 1,322,291 RPS — identical. Pipelining 1→50 on /read: flat ~1.33M. On /simple: 4.4M → 16.9M. The /read ceiling is NIC bandwidth at every angle.
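
The bandwidth-bound versus packet-bound split falls out of simple arithmetic. A sketch that counts response payload bytes only, using the 4520-byte /read body and the 16-byte /simple body from the entries in this journal; headers and framing are not counted, which is presumably why the measured 6.1 GB/s is a little higher than the payload-only figure.

```go
package main

import "fmt"

// payloadGBps is the response payload rate in gigabytes per second.
func payloadGBps(rps, bytesPerResponse float64) float64 {
	return rps * bytesPerResponse / 1e9
}

// utilization is the payload rate as a fraction of a link quoted in Gbps
// (8 bits per byte, so 50 Gbps is 6.25 GB/s).
func utilization(rps, bytesPerResponse, linkGbps float64) float64 {
	return payloadGBps(rps, bytesPerResponse) / (linkGbps / 8)
}

func main() {
	// /read: 4520-byte body at 1.32M RPS on a 50 Gbps NIC.
	fmt.Printf("/read   %.2f GB/s, %.1f%% of link\n",
		payloadGBps(1_320_000, 4520), 100*utilization(1_320_000, 4520, 50))

	// /simple: 16-byte body at 17M RPS on the same NIC.
	fmt.Printf("/simple %.2f GB/s, %.1f%% of link\n",
		payloadGBps(17_000_000, 16), 100*utilization(17_000_000, 16, 50))
}
```

With /read the payload alone nearly fills the link, so workers and pipelining have nothing left to unlock. With /simple the link is almost empty by bytes, so the limit has to be per-packet and per-request work.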

June 12, 2026 DONE gzip compression bandwidth NIC MTU jumbo-frames c6in c8i client-ceiling

gzip moved the wall, and exposed a lie

Pre-gzipped the /read pool (4520B → 2289B, 1.97x) to cut bytes on a CPU-bound box. With a c6in client gzip HALVED RPS (794k → 418k) — the client was the bottleneck and decompression overloaded it. Swapped to a c8i.32xlarge client and the real /read ceiling appeared: 1.31M RPS, NIC-bound at 90%, not the 790k 'CPU-bound' from entry 08. At that real ceiling gzip raised RPS 32% (1.31M → 1.73M) by halving NIC load and shifting the wall to CPU. MTU was already 9001, so both payloads are a single packet — the gzip win is bytes copied, not syscalls saved.

Result: Compression: 4520B → 2289B (1.97x). c6in client: /read 794,310 vs /read-gz 417,619 — gzip halved it. c8i client: /read 1,311,710 @ 90% NIC (NIC-bound), /read-gz 1,731,000 @ 80% CPU (CPU-bound), +32%. Per-request server CPU: raw 20.1µs, gz 14.8µs. The 'c6in /read = 790k, CPU-bound' from entry 08 was a client-bound artifact.
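
Pre-gzipping means compressing each pool entry once at startup and serving the stored bytes, so the server pays no compression cost per request. A minimal sketch with the standard library; the sample payload is invented and compresses far better than the real 4520-byte /read entries did.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipOnce compresses a payload a single time, e.g. while building the pool.
func gzipOnce(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzip is what the client has to do for every response it receives.
func gunzip(compressed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// samplePayload is a made-up, highly repetitive JSON-like body.
func samplePayload() []byte {
	return bytes.Repeat([]byte(`{"id":1,"name":"item","tags":["a","b"]},`), 100)
}

func main() {
	raw := samplePayload()
	gz, err := gzipOnce(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw %d bytes, gzipped %d bytes\n", len(raw), len(gz))
}
```

The handler then writes the stored bytes with a Content-Encoding: gzip header. The decompression cost moves entirely to the client, which matches what the c6in client run showed.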

June 15, 2026 PENDING IRQ irqbalance isolcpus c8i kernel planned

planned: IRQ pinning + matched hardware

Apply proper IRQ isolation as described in the HAProxy blog. Stop irqbalance and crond. Pin 16 NIC IRQs to dedicated cores. Run fiber workers on clean cores only. Upgrade client to c8i.32xlarge. Waiting for AWS quota increase to 320 vCPU.

Result: pending

Chapter 2 Fan-out

July 12, 2026 DONE SSE fan-out hub goroutine channel in-process likes fiber

Go SSE fan-out baseline: 7.5M events/s, then it collapses

Switched problem domains: instead of maximising RPS on a read endpoint, we're now fanning out like-count updates to N SSE subscribers per post. One write triggers a broadcast to everyone watching. First baseline: single goroutine per topic, buffered channels per subscriber, fixed 500 writers + ramping SSE readers.

Result: Events/s peaks at ~7.5M at 20K SSE connections then collapses at 64K. Write throughput drops from ~80K/s (no SSE) to ~8K/s at 64K SSE — the write path and fan-out goroutines compete for CPU on the same non-prefork process. The in-process ceiling is CPU contention, not fan-out iteration speed.
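
The shape of that baseline, reduced to its core: one goroutine per topic, one buffered channel per subscriber, and a non-blocking send so a slow subscriber loses events instead of stalling everyone else. This is a sketch of the pattern under those assumptions, not the code from the repo; the type and method names are made up.

```go
package main

import (
	"fmt"
	"sync"
)

// Topic fans one stream of like-count updates out to many subscribers.
type Topic struct {
	mu      sync.Mutex
	subs    []chan int64
	dropped int64
}

// Subscribe registers a subscriber with its own buffered channel.
func (tp *Topic) Subscribe(buffer int) <-chan int64 {
	ch := make(chan int64, buffer)
	tp.mu.Lock()
	tp.subs = append(tp.subs, ch)
	tp.mu.Unlock()
	return ch
}

// broadcast tries a non-blocking send to every subscriber and returns how
// many accepted the update. Full buffers count as drops.
func (tp *Topic) broadcast(count int64) int {
	tp.mu.Lock()
	defer tp.mu.Unlock()
	delivered := 0
	for _, ch := range tp.subs {
		select {
		case ch <- count:
			delivered++
		default:
			tp.dropped++
		}
	}
	return delivered
}

// Dropped reports how many sends were skipped because a buffer was full.
func (tp *Topic) Dropped() int64 {
	tp.mu.Lock()
	defer tp.mu.Unlock()
	return tp.dropped
}

// run is the single fan-out goroutine for this topic.
func (tp *Topic) run(updates <-chan int64, done chan<- struct{}) {
	for count := range updates {
		tp.broadcast(count)
	}
	close(done)
}

func main() {
	topic := &Topic{}
	a := topic.Subscribe(1)
	b := topic.Subscribe(1)

	updates := make(chan int64)
	done := make(chan struct{})
	go topic.run(updates, done)

	updates <- 41
	updates <- 42 // nobody has drained yet, so both sends are dropped
	close(updates)
	<-done

	fmt.Println(<-a, <-b, topic.Dropped())
}
```

Every write costs one pass over the subscriber slice on the same process that is also accepting writes, which is where the CPU contention described above comes from.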

July 15, 2026 DONE SSE fan-out sharding goroutine channel coordinator unsafe.Pointer

Sharding the SSE fan-out goroutine: same ceiling, different drop point

Replaced the single run() goroutine per topic with a coordinator + 8 shard goroutines, each owning 1/8 of the subscriber slice and draining its own buffered channel. Hypothesis: parallel iteration would raise the events/s ceiling and improve delivery at high connection counts.

Result: The ceiling is unchanged at ~7.5M events/s. Delivery% is statistically identical to the baseline. The bottleneck is not iteration speed — it is CPU contention between the write path and the fan-out goroutines on a single non-prefork process. The coordinator→shard channel (buffered 128) becomes the new drop point under load, not the subscriber iteration.

July 16, 2026 DONE SSE fan-out dispatcher registry process-separation IPC HTTP-push aggregation

Direct-routed SSE fan-out: process separation regressed 60%, aggregation is the fix

Checkpoint 6: split the monolith into three binaries (likes-server, registry, fanout-node) with consistent-hash routing. likes-server resolves owning node via a local ring cache and enqueues events to a per-node buffered channel. Background goroutine drains via HTTP POST. Fire-and-forget — write path never blocks on fan-out.

Result: Events/s regressed ~60% vs CP5 at all connection counts except 64K (−5%). Root cause: HTTP POST per event costs ~50µs vs ~100ns for an in-process channel send. Research into Ably, Discord, and YouTube revealed nobody fans out individual increments — they aggregate over a 50–200ms window. At 100ms batching, 23K events/s collapses to ~10 fan-out publishes/s per post. The architecture is correct. The event shape is wrong.
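
For context on the routing step, a consistent-hash ring maps each post ID to an owning node so that removing a node only moves the keys that node owned. A minimal sketch with FNV hashing and virtual nodes; the node names, the vnode count, and the postKey helper are assumptions for illustration, and the repo's ring cache may look different.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring is a consistent-hash ring with virtual nodes.
type Ring struct {
	points []uint32          // sorted hash positions
	owner  map[uint32]string // position -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes positions on the ring for every node.
func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(n + "#" + strconv.Itoa(v))
			if _, taken := r.owner[p]; taken {
				continue // hash collision: the earlier node keeps the position
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Owner returns the node at the first position clockwise from the key's hash.
func (r *Ring) Owner(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

// postKey builds a hypothetical post ID for the demo.
func postKey(i int) string { return "post-" + strconv.Itoa(i) }

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"}, 64)
	for i := 0; i < 5; i++ {
		fmt.Println(postKey(i), "->", ring.Owner(postKey(i)))
	}
}
```

The lookup is a hash and a binary search, so routing is not the expensive part. The cost the entry measured is what happens after the owner is known: one HTTP POST per event.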

July 22, 2026 WIP SSE fan-out aggregation ticker batch dispatcher likes-server

100ms aggregation tier: collapsing 23K dispatches/s into 100

Checkpoint 7: add a 100ms aggregation tier inside likes-server. Instead of one HTTP POST per increment, accumulate postID → latestCount in a map and flush once per tick per fanout-node. Benchmarks not run — AWS shut down due to cost. Implementation is complete and the math holds.

Result: Implemented. At 23K inc/s across 10 posts, fan-out dispatches collapse from ~23K/s to ~100/s (one per post per 100ms tick). push_dropped expected to reach zero. Benchmark pending.
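
The aggregation tier is small enough to sketch in full: record only the latest count per post, swap the map out on each tick, and publish one batch. This is an illustration of the idea described above, not the likes-server code; the names are made up, and grouping the batch per fanout-node is left out to keep it short.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Aggregator keeps only the most recent count per post between flushes.
type Aggregator struct {
	mu     sync.Mutex
	latest map[string]int64
}

func NewAggregator() *Aggregator {
	return &Aggregator{latest: make(map[string]int64)}
}

// Record overwrites the stored count, so N increments inside one window
// cost one map entry instead of N dispatches.
func (a *Aggregator) Record(postID string, count int64) {
	a.mu.Lock()
	a.latest[postID] = count
	a.mu.Unlock()
}

// Flush hands back the current window and starts an empty one.
func (a *Aggregator) Flush() map[string]int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	batch := a.latest
	a.latest = make(map[string]int64)
	return batch
}

// Run flushes once per tick and publishes non-empty batches until stopped.
func (a *Aggregator) Run(tick time.Duration, publish func(map[string]int64), stop <-chan struct{}) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if batch := a.Flush(); len(batch) > 0 {
				publish(batch)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	agg := NewAggregator()
	// 2,300 increments on one post, roughly one 100ms window at 23K inc/s.
	for count := int64(1); count <= 2300; count++ {
		agg.Record("post-1", count)
	}
	batch := agg.Flush()
	fmt.Println(len(batch), batch["post-1"])
}
```

Subscribers see the count jump in steps of up to a window's worth of likes, which is the trade being made: fewer messages, slightly staler numbers.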

July 23, 2026 PUBLISHED SSE fan-out aggregation popserver CP6 CP7 benchmark local

CP6 vs CP6+agg vs CP7: local benchmark on Colima (2000 SSE connections)

Three-way comparison: CP6 (consistent-hash, per-event) vs CP6+agg (consistent-hash, 100ms window) vs CP7 (popserver K-V, 100ms window). Run locally on Colima 4 vCPU, 16 GB, 1M open-file limit. Aggregation alone eliminates drops and more than doubles write/s (about 2.2x). CP7 popserver adds a further 21%.

Result: CP6: 17,200 write/s, 96.8% drops. CP6+agg: 38,600 write/s, 0% drops. CP7: 46,860 write/s, 0% drops.