yes, this is clickbait. kind of.
“1 million RPS” is not a single number. It depends entirely on what the server is doing per request.
A handler that returns {"status":"ok"} with no I/O can hit 10-20M RPS on a single machine. A handler that does a primary-key SELECT on a warm Postgres index might do 50k-200k RPS before the database becomes the ceiling. A write with fsync — maybe 5k-20k. An aggregation query with a table scan: 500. These are not the same problem and should not share a headline.
The breakdown matters:
Read-heavy (cache or memory): latency is microseconds, bottleneck is the network stack and CPU cycles in the request path — TCP, kernel syscalls, HTTP parsing, serialization. This is where IRQ pinning, prefork, and buffer tuning show up. This is the regime we’re in for /simple.
Compute-intensive (CPU-bound, no I/O): the bottleneck is CPU cores. Throughput scales roughly linearly with cores until scheduling overhead takes over. /compute in this journal is this case: 128 vCPUs saturated at 1.7M RPS.
Read with DB (SELECT): you now have a latency floor set by the database round-trip (0.2ms local, 1-5ms over network). At 1ms average latency, Little’s Law (requests in flight = throughput × latency) says 1M RPS needs 1000 queries in flight at once, so a single DB connection pool would need about 1000 concurrent connections. Connection limits, query planning, index hits, and buffer pool size all cap you before the HTTP layer does.
Write with DB (INSERT/UPDATE): add write amplification, WAL flushes, lock contention, replication lag. A serialized write path that waits 10ms on each fsync tops out at 100 writes per second, not 1M. Batching and async writes push this up but introduce consistency trade-offs.
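Failure handling: a server doing retries, circuit breaking, or fallback logic on each request burns CPU per failure. Retries also amplify load exactly when the downstream can least absorb it: one retry per failure adds only 0.1% extra traffic at a 0.1% error rate, but approaches double the traffic once a degraded downstream is failing most requests. Failure handling changes the per-request cost model entirely.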
Consistency: a linearizable key-value write (single-node, synchronous) is fast. A distributed write requiring quorum adds network round-trips per request. Eventual consistency lets you write to a single replica and replicate async — throughput goes up, guarantees go down.
The 1M RPS target is a lens, not a finish line. It forces you to be precise about what work the server is actually doing, because vague claims collapse immediately when you try to reproduce them. What workload? What latency? What hardware? What was the client doing?
This journal documents those questions, not the headline number.
All benchmarks run on AWS (ap-south-2, Hyderabad) unless otherwise noted. Server code is Go + Fiber v3. Load generation uses autocannon (for pipelining tests) or vegeta (for rate-controlled tests).
Source code: ikouchiha47/millionrps
Entries below are in chronological order. Green = done, orange = in progress, grey = planned.
Establish a baseline on a Mac mini M2 (10 cores). Compare net/http, fasthttp, and fiber with and without prefork. Measure /read (pool lookup), /list (large payload), and /compute (CPU-bound aggregation).
Two c6i.2xlarge on AWS. Server: fiber+prefork 8 workers. Client: vegeta. Confirm SO_REUSEPORT distributes connections across all 8 workers on Linux. Compare P99 against macOS.
Upgrade server to c8i.32xlarge (128 vCPU). Switch to autocannon with --pipelining 100. Apply Receive Packet Steering and Receive Flow Steering (the kernel's RPS and RFS, not requests per second). Capture live metrics during benchmark.
Discovered --workers flag in autocannon. Upgraded client to c6i.8xlarge. Ran connection ramp 1000→2000→5000. Hit 2.3M RPS at p50 38ms. Proved server has 98% CPU headroom — client is the ceiling at 67% avg CPU across 32 cores.
Upgrade both server and client to c8i.32xlarge (128 vCPU, 50 Gbps). Run autocannon with 120 workers. Hit 18M RPS on /simple and 1.7M on /compute. Discover the client is the ceiling for /simple, server is the ceiling for /compute.
Full IRQ pinning experiment on c6i.2xlarge (8 vCPU) as server. Three steps: baseline, IRQ pinning only, IRQ + RPS/RFS. Run on /simple and /compute. Read the HAProxy blog. Understand why packet rate — not RPS — is what makes IRQ pinning matter.
Ran the IRQ experiment on matched c8i hardware. Then switched to vegeta (no pipelining) to generate realistic packet rates. Both hit client ceiling regardless of tuning.
Switched from /simple (16 bytes) to /read (4.5KB) on c6in.8xlarge (50 Gbps dedicated NIC). Used autocannon without pipelining. Hit 790k RPS at 500 connections with NIC at 72% utilization. Applied IRQ pinning. Flat.
Ran Go pprof against the /read handler at 800k RPS. CPU profile showed gofakeit at 2.25% — looked minor. Block profile showed 403.90s of goroutine wait time in 60 real seconds, all on one mutex, all from one line of code. Fixed it. RPS didn't move on no-prefork. Re-enabled prefork on c8i.32xlarge (128 workers): 1.33M RPS vs 1.01M no-prefork. +31% at 10k connections.
Hypothesis: setting GOMAXPROCS=1 per prefork worker would cut its ~15 OS threads and improve cache locality. Result: only 2 processes started, the parent and a single child. I assumed the fork machinery broke. It didn't: the fiber v3 source spawns runtime.GOMAXPROCS(0) children, so GOMAXPROCS=1 means one worker. Each child already runs GOMAXPROCS(1) regardless.
/read at 4.5KB hits 1.32M RPS = 6.1 GB/s = 98% of the 50 Gbps NIC. Three things that should plausibly raise it don't: per-worker thread count is flat from 100 to 5000 connections (epoll), 32 workers match 128 workers (NIC-bound), and throughput stays flat as pipelining goes from 1 to 50. The same pipelining sweep on /simple (16B) goes 4.4M → 17M, because that endpoint is packet-bound, not bandwidth-bound.
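A quick way to sanity-check the bandwidth arithmetic in this entry is to convert RPS and bytes per response into link utilization. This is a sketch under stated assumptions: the 4520 B body size is the figure quoted in the gzip entry, and the 100 B per response for HTTP headers and TCP/IP framing is my guess, picked because it lands near the 6.1 GB/s reported here. The function names are illustrative.

```go
package main

import "fmt"

// gbps converts a request rate and per-response wire size (bytes) into
// gigabits per second of egress.
func gbps(rps, bytesPerResponse float64) float64 {
	return rps * bytesPerResponse * 8 / 1e9
}

// nicCeilingRPS is the request rate at which responses of the given wire size
// would fill a link of linkGbps.
func nicCeilingRPS(linkGbps, bytesPerResponse float64) float64 {
	return linkGbps * 1e9 / 8 / bytesPerResponse
}

func main() {
	const link = 50.0      // Gbps, from the instance spec in this journal
	const overhead = 100.0 // bytes per response; an assumption, not measured

	read := 4520 + overhead // /read body plus assumed overhead
	simple := 16 + overhead // /simple body plus assumed overhead

	fmt.Printf("/read   at 1.32M RPS: %.1f Gbps (%.0f%% of link)\n",
		gbps(1.32e6, read), 100*gbps(1.32e6, read)/link)
	fmt.Printf("/simple at 17M RPS:   %.1f Gbps (%.0f%% of link)\n",
		gbps(17e6, simple), 100*gbps(17e6, simple)/link)
	fmt.Printf("/read RPS that fills the link: %.0f\n", nicCeilingRPS(link, read))
}
```

Under these assumptions /read sits within a few percent of the link while /simple uses well under half of it, which is consistent with one being bandwidth-bound and the other packet-bound.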
Pre-gzipped the /read pool (4520B → 2289B, 1.97x) to cut bytes on a CPU-bound box. With a c6in client gzip HALVED RPS (794k → 418k) — the client was the bottleneck and decompression overloaded it. Swapped to a c8i.32xlarge client and the real /read ceiling appeared: 1.31M RPS, NIC-bound at 90%, not the 790k 'CPU-bound' from entry 08. At that real ceiling gzip raised RPS 32% (1.31M → 1.73M) by halving NIC load and shifting the wall to CPU. MTU was already 9001, so both payloads are a single packet — the gzip win is bytes copied, not syscalls saved.
Apply proper IRQ isolation as described in the HAProxy blog. Stop irqbalance and crond. Pin 16 NIC IRQs to dedicated cores. Run fiber workers on clean cores only. Upgrade client to c8i.32xlarge. Waiting for AWS quota increase to 320 vCPU.
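Pinning an IRQ on Linux means writing a CPU bitmask to /proc/irq/<n>/smp_affinity. The sketch below only builds that mask string (comma-separated 32-bit hex words, most significant first) so the pinning plan can be printed and checked before touching /proc. Which cores get the 16 NIC queues is my assumption (cores 0 through 15), and the real IRQ numbers have to come from /proc/interrupts on the box; none are invented here.

```go
package main

import (
	"fmt"
	"strings"
)

// affinityMask renders a set of non-negative CPU indexes in the hex bitmask
// format accepted by /proc/irq/<n>/smp_affinity: 32-bit words, comma-separated,
// most significant word first.
func affinityMask(cpus ...int) string {
	maxCPU := 0
	for _, c := range cpus {
		if c > maxCPU {
			maxCPU = c
		}
	}
	words := make([]uint32, maxCPU/32+1)
	for _, c := range cpus {
		words[c/32] |= 1 << (uint(c) % 32)
	}
	parts := make([]string, 0, len(words))
	for i := len(words) - 1; i >= 0; i-- {
		parts = append(parts, fmt.Sprintf("%08x", words[i]))
	}
	return strings.Join(parts, ",")
}

func main() {
	// Hypothetical plan: NIC queue i is served by core i, for 16 queues.
	for queue := 0; queue < 16; queue++ {
		fmt.Printf("queue %2d -> cpu %2d mask %s\n", queue, queue, affinityMask(queue))
	}
	// A core beyond the first 32 needs a second word in the mask.
	fmt.Println("cpu 32 mask:", affinityMask(32))
}
```

The kernel also exposes /proc/irq/<n>/smp_affinity_list, which takes plain CPU numbers and ranges, so the mask form is mainly useful when a tool or script expects hex.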