baseline: mac mini, localhost, three servers
Establish a baseline on a Mac mini M2 (10 cores). Compare net/http, fasthttp, and fiber with and without prefork. Measure /read (pool lookup), /list (large payload), and /compute (CPU-bound aggregation).
yes, this is clickbait. kind of.
“1 million RPS” is not a single number. It depends entirely on what the server is doing per request.
A handler that returns {"status":"ok"} with no I/O can hit 10-20M RPS on a single machine. A handler that does a primary-key SELECT on a warm Postgres index might do 50k-200k RPS before the database becomes the ceiling. A write with fsync — maybe 5k-20k. An aggregation query with a table scan: 500. These are not the same problem and should not share a headline.
The breakdown matters:
Read-heavy (cache or memory): latency is microseconds, bottleneck is the network stack and CPU cycles in the request path — TCP, kernel syscalls, HTTP parsing, serialization. This is where IRQ pinning, prefork, and buffer tuning show up. This is the regime we’re in for /simple.
Compute-intensive (CPU-bound, no I/O): bottleneck is CPU cores. Throughput scales linearly with cores until you hit scheduling overhead. /compute in this journal is this case — 128 cores saturated, 1.7M RPS.
Read with DB (SELECT): you now have a latency floor set by the database round-trip (0.2ms local, 1-5ms over network). At 1ms average, Little’s Law says you need 1000 concurrent connections to drive 1M RPS from a single DB connection pool. Connection limits, query planning, index hits, buffer pool size — all of these cap you before the HTTP layer does.
Write with DB (INSERT/UPDATE): add write amplification, WAL flushes, lock contention, replication lag. fsync at 10ms latency means 100 RPS per write path, not 1M. Batching and async writes push this up but introduce consistency trade-offs.
Failure handling: a server doing retries, circuit breaking, or fallback logic on each request burns CPU per failure. At high RPS, even 0.1% error rate with a retry doubles load on a degraded downstream. Failure handling changes the per-request cost model entirely.
Consistency: a linearizable key-value write (single-node, synchronous) is fast. A distributed write requiring quorum adds network round-trips per request. Eventual consistency lets you write to a single replica and replicate async — throughput goes up, guarantees go down.
The 1M RPS target is a lens, not a finish line. It forces you to be precise about what work the server is actually doing, because vague claims collapse immediately when you try to reproduce them. What workload? What latency? What hardware? What was the client doing?
This journal documents those questions, not the headline number.
All benchmarks run on AWS (ap-south-2, Hyderabad) unless otherwise noted. Server code is Go + Fiber v3. Load generation uses autocannon (for pipelining tests) or vegeta (for rate-controlled tests).
Source code: ikouchiha47/millionrps
Entries below are in chronological order. Green = done, orange = in progress, grey = planned.
Establish a baseline on a Mac mini M2 (10 cores). Compare net/http, fasthttp, and fiber with and without prefork. Measure /read (pool lookup), /list (large payload), and /compute (CPU-bound aggregation).
Two c6i.2xlarge on AWS. Server: fiber+prefork 8 workers. Client: vegeta. Confirm SO_REUSEPORT distributes connections across all 8 workers on Linux. Compare P99 against macOS.
Upgrade server to c8i.32xlarge (128 vCPU). Switch to autocannon with --pipelining 100. Apply RPS and RFS. Capture live metrics during benchmark.
Discovered --workers flag in autocannon. Upgraded client to c6i.8xlarge. Ran connection ramp 1000→2000→5000. Hit 2.3M RPS at p50 38ms. Proved server has 98% CPU headroom — client is the ceiling at 67% avg CPU across 32 cores.
Upgrade both server and client to c8i.32xlarge (128 vCPU, 50 Gbps). Run autocannon with 120 workers. Hit 18M RPS on /simple and 1.7M on /compute. Discover the client is the ceiling for /simple, server is the ceiling for /compute.
Full IRQ pinning experiment on c6i.2xlarge (8 vCPU) as server. Three steps: baseline, IRQ pinning only, IRQ + RPS/RFS. Run on /simple and /compute. Read the HAProxy blog. Understand why packet rate — not RPS — is what makes IRQ pinning matter.
Ran the IRQ experiment on matched c8i hardware. Then switched to vegeta (no pipelining) to generate realistic packet rates. Both hit client ceiling regardless of tuning.
Apply proper IRQ isolation as described in the HAProxy blog. Stop irqbalance and crond. Pin 16 NIC IRQs to dedicated cores. Run fiber workers on clean cores only. Upgrade client to c8i.32xlarge. Waiting for AWS quota increase to 320 vCPU.