IRQ pinning on /read was flat (entry 08). Server at 85% CPU busy, NIC at 56%. The bottleneck is somewhere in the request path. Looking at CPU% by itself tells you nothing about where goroutines are waiting. You need a profiler.
Go’s pprof captures four distinct things: where goroutines spend CPU time (cpu), which mutexes are contended (mutex), how long goroutines block waiting (block), and what’s live in the heap (heap). They answer different questions. The CPU profile is not enough on its own — a goroutine blocked waiting for a lock is off-CPU and completely invisible there.
Prefork had to be disabled for profiling. In prefork mode, the pprof HTTP server starts in the parent process. The 32 worker children — where all requests are actually handled — are unreachable. Disabling prefork puts everything into one process: all goroutines visible, one pprof endpoint.
# server — no prefork when PPROF_ENABLE=1
PPROF_ENABLE=1 nohup ./fiber_server > /tmp/fiber.log 2>&1 &
// fiber_server.go — pprof setup
func startPprof() {
runtime.SetMutexProfileFraction(5) // sample 1-in-5 mutex contentions
runtime.SetBlockProfileRate(1_000_000) // sample blocking events ≥ 1ms
go func() { http.ListenAndServe(":6060", nil) }()
}
func main() {
if os.Getenv("PPROF_ENABLE") == "1" {
startPprof()
}
// prefork disabled when profiling
prefork := os.Getenv("PPROF_ENABLE") != "1"
app.Listen(":8083", fiber.ListenConfig{EnablePrefork: prefork})
}
Profile capture — all four run in parallel against a live 500-connection autocannon load:
# 150s benchmark on client
autocannon -c 500 --pipelining 1 -w 30 -d 150 "http://SERVER_INTERNAL_IP:8083/read" &
# SSH tunnel to pprof (port 6060 not in security group)
ssh -L 6060:localhost:6060 -N -f ec2-user@SERVER_PUBLIC_IP
# capture all four simultaneously — 60s window each
curl -sf "http://localhost:6060/debug/pprof/profile?seconds=60" -o cpu.prof &
curl -sf "http://localhost:6060/debug/pprof/block?seconds=60" -o block.prof &
curl -sf "http://localhost:6060/debug/pprof/mutex?seconds=60" -o mutex.prof &
curl -sf "http://localhost:6060/debug/pprof/heap" -o heap.prof &
wait
Load during capture: 807k avg RPS, 3.74 GB/s, p50 < 1ms.
The dominant cost is the write path: bufio.Flush → net.Write → poll.Write → syscall.Syscall6. The kernel copying 4.5KB × 807k RPS = 3.63 GB/s into socket send buffers accounts for 42.5% of on-CPU time. That is the physical cost of the workload. It is not tunable without changing the payload.
Scheduler overhead (runtime.schedule → findRunnable → stealWork) at ~20%. At 807k RPS × 2 syscalls = 1.6M goroutine park/unpark cycles per second, this is expected.
gofakeit.Number → sync.Mutex.Lock at 2.25%. Looks minor. This is the problem with only reading the CPU profile: a goroutine blocked waiting for a lock is off-CPU. The CPU profiler cannot see it. 2.25% is the time goroutines spend acquiring the lock. It says nothing about how long they wait for it.
runtime.unlock → findRunnable → schedule → park_m at 93.42%. That is the Go scheduler’s internal run queue lock — every goroutine sleep/wake touches it. Normal noise at 800k RPS. Not a bug.
gofakeit (*lockedSource).Int63 at 6.35%. One application mutex. The call chain: gofakeit.Number → randIntRange → (*lockedSource).Int63 → sync.Mutex.Lock. 6.35% of all mutex contention in the system comes from one line: fake.Number(0, fiberPoolSize-1) in the /read handler.
The mutex profile still understates the damage. It measures contention time. The block profile measures wait time. They are not the same number.
Total samples: 523.90s blocked during 60s real time. That is 873% — on average 8.7 goroutines were parked simultaneously at any moment.
The left branch: fasthttp.serveConn → fiber.requestHandler → main.main.func2 → gofakeit.Number → (*lockedSource).Int63 → sync.Mutex.Lock = 403.90s (77.09%).
403 seconds of goroutine wait time, in 60 real seconds. 403 / 60 = 6.7 — six and a half CPU cores worth of goroutine capacity, permanently blocked, doing nothing, waiting for one global rand.Rand to be released.
Every /read request calls fake.Number(0, 499) once to pick a random product from the pre-built pool. gofakeit.lockedSource wraps math/rand.Rand in a sync.Mutex. All goroutines share it. At 807k requests per second, every goroutine is fighting for the same lock on every request.
The right branch: runtime.selectgo at 120s (22.91%). Fasthttp worker goroutines sitting in select{} waiting for a new connection. Normal idle time.
Total live heap: 8.78 MB. At 807k RPS. Zero per-request allocations — the product pool was pre-serialized at startup (json.Marshal = 2.62MB held permanently), and every /read response is a pointer into that pool. GC has nothing to collect during request handling.
runtime.allocm at 3.08MB (35%): OS thread stacks for GOMAXPROCS=32. Fixed cost.
The heap is not a problem. The design is correct.
The handler was:
return c.Send(fiberProductPoolJSON[fake.Number(0, fiberPoolSize-1)])
fake.Number at runtime means a global mutex on every request. The product pool is already pre-built at startup. The only thing needed per request is an index into it.
Fix: pre-generate 1M random indices at startup, walk them with an atomic counter:
const idxPoolSize = 1 << 20 // 1M entries = 4MB
var (
idxPool [idxPoolSize]int32
idxCursor atomic.Uint64
)
func fiberInitPool() {
// gofakeit runs here at startup only — never per-request
rng := rand.New(rand.NewSource(time.Now().UnixNano()))
for i := range idxPool {
idxPool[i] = int32(rng.Intn(fiberPoolSize))
}
// ... rest of pool init
}
func nextIdx() int32 {
return idxPool[idxCursor.Add(1)&(idxPoolSize-1)]
}
// handler:
return c.Send(fiberProductPoolJSON[nextIdx()])
One atomic.Add per request. No mutex. No RNG. No lock contention regardless of goroutine count.
| metric | before (gofakeit) | after (pre-generated pool) |
|---|---|---|
| RPS avg (500c, no-prefork) | 807,312 | 809,063 |
| block.prof gofakeit entry | 403.90s / 60s (77%) | gone |
| block.prof total blocked time | 523.90s / 60s | 120s / 60s |
| gofakeit in mutex.prof | 6.35% | 0% |
The gofakeit entry is gone from the block profile. The 403s/60s blocking drain is gone. Throughput: 807k → 809k. Flat.
The only blocking that remains in the after profile is the net/http pprof server’s own goroutines in select{} — the monitoring infrastructure, not the application. Application blocking is zero.
The 403s/60s blocking was goroutines parked. Not burning CPU — sleeping, unable to proceed. Freeing them didn’t add CPU capacity. It freed goroutine throughput, which immediately collided with the next constraint: the write syscall.
At 807k RPS × 4.5KB = 3.63 GB/s of TCP writes, the server was already at 85% CPU busy before the fix. The write path (bufio.Flush → net.Write → syscall.Syscall6) at 75% of on-CPU time does not shrink because gofakeit is removed. Those cycles are unavoidable at this payload size.
The goroutines that were blocked on gofakeit reached the write queue faster after the fix. The write queue processed them at the same rate. Same throughput.
With PPROF_ENABLE=1 and no prefork, the single-process server uses one Go scheduler for all goroutines. Profiling ran on c6in.8xlarge (32 vCPU). The prefork comparison ran on both c6in.8xlarge (32 workers) and c8i.32xlarge (128 workers) to measure the scheduler effect at scale.
# no-prefork run (PPROF_ENABLE=1 disables prefork in fiber_server.go)
PPROF_ENABLE=1 nohup ./fiber_server > /tmp/fiber.log 2>&1 &
# prefork run (default — no env var)
nohup ./fiber_server > /tmp/fiber.log 2>&1 &
# client ramp — same command for both runs
for CONN in 100 500 1000 2000 5000 10000; do
autocannon -c $CONN --pipelining 1 -w 120 -d 20 \
--json "http://SERVER_INTERNAL_IP:8083/read"
sleep 3
done
# verify thread count per worker during run
ps -eo pid,ppid,pcpu,nlwp,comm --sort=-pcpu | grep fiber
# nlwp = number of OS threads per process
# verify core affinity
taskset -cp $(pgrep fiber_server | head -2 | tail -1)
# c8i result: pid X's current affinity list: 0-127
c6in.8xlarge — 32 workers:
| connections | no-prefork RPS | prefork RPS | delta | p50 | p95 no-pf | p95 pf | p99 no-pf | p99 pf |
|---|---|---|---|---|---|---|---|---|
| 100 | 205,535 | 446,042 | +117% | <1ms | <1ms | <1ms | <1ms | <1ms |
| 500 | 793,289 | 779,002 | -2% | <1ms | 1ms | 1ms | 1ms | 1ms |
| 1,000 | 811,883 | 776,800 | -4% | 1ms | 2ms | 3ms | 3ms | 3ms |
| 2,000 | 787,674 | 757,267 | -4% | 1ms | 5ms | 6ms | 6ms | 7ms |
| 5,000 | 685,717 | 692,960 | +1% | 5ms | 13ms | 13ms | 17ms | 17ms |
On 32 cores prefork is the choice: +117% at 100 connections, and a statistical tie (within 2-4%, run-to-run noise) from 500-2000. It never meaningfully loses. The low-concurrency win is the headline; the mid-range is a wash. On 128 cores (below) prefork wins at every count by a wide margin. There is no operating point in any run where no-prefork is the better pick.
c8i.32xlarge — 128 workers:
Fiber v3 prefork spawns runtime.GOMAXPROCS(0) child processes (128 here), and each child resets itself to runtime.GOMAXPROCS(1) — one logical processor per worker, SO_REUSEPORT-style. Each child measured ~15 OS threads during 1000c load (taskset -cp confirms affinity 0-127). Those 15 threads are not 15 active cores — with GOMAXPROCS=1 only one runs Go code at a time; the rest are M’s detached into read/write syscalls plus sysmon/GC. The thread count and the worker-count mechanism are measured in detail in entry 11.
| connections | no-prefork RPS | prefork RPS | delta | p50 no-pf | p50 pf | p99 no-pf | p99 pf |
|---|---|---|---|---|---|---|---|
| 100 | 245,414 | 556,653 | +127% | <1ms | <1ms | 13ms | <1ms |
| 500 | 657,382 | 1,308,672 | +99% | <1ms | <1ms | 16ms | <1ms |
| 1,000 | 881,024 | 1,321,267 | +50% | <1ms | <1ms | 15ms | 1ms |
| 2,000 | 939,187 | 1,327,258 | +41% | 1ms | 1ms | 16ms | 3ms |
| 5,000 | 1,002,163 | 1,325,210 | +32% | 4ms | 3ms | 20ms | 6ms |
| 10,000 | 1,011,994 | 1,329,306 | +31% | 9ms | 6ms | 25ms | 17ms |
On 128 cores prefork wins at every connection count and the gap stays wide even at 10k connections (+31%). No-prefork peaks at ~1M RPS and flattens regardless of more connections — one Go scheduler managing all goroutines has hit its run queue ceiling. Prefork at 1.33M RPS is still flat — it has not found its ceiling yet on this hardware.
The no-prefork server on c6in (32 cores) peaked at 812k. On c8i (128 cores) it peaks at 1.01M. More cores help the single scheduler, but the prefork advantage grows with core count — 128 independent schedulers scale better than one scheduler on 128 cores.
The gofakeit fix has zero effect with prefork. Each worker is a separate process with its own address space — atomic.AddInt64 on fiberCounter touches only that process’s memory. No cross-process cache line contention at all. The fix was only relevant in the single-process profiling configuration.