Mac mini M2, 10 cores (8P + 2E), 16GB RAM, macOS. Benchmark and server on the same machine via loopback.
Three servers serving the same routes. All responses come from an in-memory pool generated at startup — no database, no disk, no allocations on the hot path.
Routes:
/read — returns one random pre-serialized Product JSON from a pool of 500. ~4KB./list — returns a pre-serialized batch of 20-30 products. ~100KB./compute — takes 100 random products, sorts by price, computes avg/stddev/histogram/top brands, returns aggregated JSON. CPU-bound per request./simple — returns {"message":"hi"} from a package-level []byte. Zero allocation.Product struct is realistic: UUID, name, brand, category, price, rating, review count, a 5-paragraph description, 5-15 tags, 7 attributes, 3-8 image URLs. One product serializes to ~3-5KB.
# wrk with a Lua script mixing /read and /list
wrk -t4 -c100 -d90s --latency --script script_rw.lua http://localhost:8083
# script_rw.lua — 100% /read (used for server comparison)
wrk.method = "GET"
request = function()
return wrk.format(nil, "/read")
end
-t4 = 4 threads, -c100 = 100 connections, -d90s = 90 second duration.
/readAll numbers from run8.txt:
wrk -t4 -c100 -d90s --latency --script script_rw.lua http://localhost:8083
| Server | RPS | P50 | P75 | P90 | P99 |
|---|---|---|---|---|---|
| net/http | 93k | 1.02ms | — | — | 1.64ms |
| fasthttp | 174k | 0.38ms | — | — | 1.37ms |
| fiber (no prefork) | 175k | 0.40ms | — | — | 1.28ms |
| fiber + prefork (10 workers) | 201,777 | 490µs | 567µs | 617µs | 683µs |
Actual wrk output for fiber+prefork:
Thread Stats Avg Stdev Max +/- Stdev
Latency 493.28us 113.56us 8.46ms 72.51%
Req/Sec 50.70k 2.36k 53.99k 94.53%
18180192 requests in 1.50m, 78.39GB read
Requests/sec: 201777.96
Transfer/sec: 0.87GB
prefork’s P99 (683µs) is lower than single-process fiber’s P99 (1.28ms) despite a slightly higher P50. Load distributed across 10 workers reduces tail spikes from GC pauses.
/compute vs /readwrk -t4 -c100 -d90s --latency --script script_compute.lua http://localhost:8083
Actual wrk output (run10.txt):
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.95ms 785.40us 8.25ms 63.39%
Req/Sec 8.52k 183.60 9.56k 71.34%
3054303 requests in 1.50m, 1.49GB read
Requests/sec: 33897.44
| Route | RPS | P50 | P99 | P99/P50 ratio |
|---|---|---|---|---|
| /read (pool lookup) | 175k | 0.40ms | 1.28ms | 3.2× |
| /compute (sort + aggregate 100 items) | 33,897 | 2.88ms | 4.68ms | 1.6× |
/compute is 5× slower in RPS but has a tighter latency distribution — P99 is only 1.6× P50 vs 3.2× for /read. CPU-bound work is predictable: every request does the same sort + aggregation. Near-zero work like /read exposes GC pauses and goroutine scheduler jitter that push P99 disproportionately high.
When we added /list (50-100 products, ~350KB per response) to the benchmark mix:
# 60% /read, 40% /list
vegeta attack -rate=100000 -duration=30s -targets=targets.txt | vegeta report
# vegeta result at 100k target RPS
Requests/sec: 2,529
Mean response: 48,266 bytes
Throughput: ~350 MB/s
2,529 actual RPS against a 100k target. Not a server problem — the average 48KB response × 2,529 = ~350 MB/s through the loopback. Reducing /list to 20-30 products (~48KB avg weighted):
# after reducing list batch size
Requests/sec: 8,601
Mean response: 48,266 bytes
P50: 3.6ms P99: 20.3ms
3.4× more RPS from a config change with no server code changes.
htop during the vegeta benchmark showed all 10 child workers at 0.0% CPU. Only the parent process was active (14.1% CPU). SO_REUSEPORT on macOS does not distribute TCP connections across prefork workers — all connections go to the parent process, which uses Go’s goroutine scheduler to spread across all 10 cores. The children are decorative.
This means the 201k RPS number above is a single Go process result, not 10 workers. Linux tests this correctly.
SO_REUSEPORT on macOS was designed for UDP multicast (BSD origin), not TCP load balancing. Any prefork benchmark on macOS is a single-process benchmark in disguise. Confirmed via htop showing 0% CPU on all 10 child processes during load.