Each prefork worker on c8i.32xlarge runs ~15 OS threads under load (entry 09). 128 workers × 15 threads = ~1,900 OS threads on 128 cores. The hypothesis: that’s overhead, and setting GOMAXPROCS=1 per worker would cut each worker to ~1 thread, reduce kernel scheduling, improve L1/L2 locality. This is how PM2-cluster and Node.js workers run — one event loop per process.
The hypothesis was built on a wrong mental model. Two of them, actually.
# run 1 — prefork, GOMAXPROCS inherits the machine (128 on c8i)
nohup ./fiber_server > /tmp/fiber.log 2>&1 &
sleep 4
echo "processes: $(pgrep -c fiber_server)"
# run 2 — prefork, GOMAXPROCS=1
GOMAXPROCS=1 nohup ./fiber_server > /tmp/fiber_gmp1.log 2>&1 &
sleep 4
echo "processes: $(pgrep -c fiber_server)"
# benchmark — same for both, c8i client
for CONN in 100 500 1000 2000 5000 10000; do
  autocannon -c "$CONN" --pipelining 1 -w 120 -d 20 \
    --json "http://SERVER_INTERNAL_IP:8083/read"
  sleep 3
done
| connections | GOMAXPROCS=128 RPS | GOMAXPROCS=1 RPS | p50 (128 / 1) | p99 (128 / 1) |
|---|---|---|---|---|
| 100 | 555,642 | 110,627 | <1ms / 1ms | <1ms / 2ms |
| 500 | 1,304,986 | 105,005 | <1ms / 4ms | <1ms / 5ms |
| 1,000 | 1,319,936 | 104,205 | <1ms / 9ms | 1ms / 10ms |
| 2,000 | 1,326,029 | 97,088 | 1ms / 20ms | 3ms / 22ms |
| 5,000 | 1,320,653 | 90,659 | 3ms / 55ms | 6ms / 63ms |
| 10,000 | 1,333,453 | 87,856 | 6ms / 115ms | 17ms / 138ms |
Process count:
GOMAXPROCS=128 : 129 processes (1 master + 128 workers)
GOMAXPROCS=1 : 2 processes (1 master + 1 worker)
GOMAXPROCS=1 produced one worker, not 128 broken ones. My first instinct was that the fork loop crashed. It didn’t. I had to read the source to see I asked for exactly this.
fiber/v3.1.0/prefork.go:
// 👶 child process 👶
if IsChild() {
// use 1 cpu core per child process
runtime.GOMAXPROCS(1) // every child resets itself to 1
...
return app.server.Serve(ln) // SO_REUSEPORT listener
}
// 👮 master process 👮
maxProcs := runtime.GOMAXPROCS(0) // master reads the env value
for range maxProcs { // spawns exactly that many children
cmd := exec.Command(os.Args[0], os.Args[1:]...)
...
}
Two facts, both the opposite of what I assumed:
1. The master spawns runtime.GOMAXPROCS(0) children. Not runtime.NumCPU(). The env GOMAXPROCS directly sets the worker count. 128 → 128 workers, 32 → 32 workers, 1 → 1 worker. The “2 processes” was 1 master + 1 worker. Working as written.
2. Every child overrides to runtime.GOMAXPROCS(1). The comment says it outright: use 1 cpu core per child process. So my entire premise — “cut each worker from GOMAXPROCS=128 down to 1” — was nonsense. The workers were already GOMAXPROCS=1. The env var never touched their internal scheduling. It only ever controlled how many of them exist.
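The two roles of the setting can be reproduced outside Fiber. The sketch below is not Fiber's code; `plannedWorkers` and `childSetting` are hypothetical names that only mirror the two `runtime.GOMAXPROCS` calls quoted from prefork.go, under the assumption that nothing else changes the value in between.

```go
package main

import (
	"fmt"
	"runtime"
)

// plannedWorkers mirrors the master-side read: GOMAXPROCS(0) only queries
// the current value (env var or runtime default), it does not change it.
// Hypothetical helper, not part of Fiber.
func plannedWorkers() int {
	return runtime.GOMAXPROCS(0)
}

// childSetting mirrors the child-side override: whatever the process
// inherited, it pins itself to a single P and reports the new value.
// Hypothetical helper, not part of Fiber.
func childSetting() int {
	runtime.GOMAXPROCS(1)
	return runtime.GOMAXPROCS(0)
}

func main() {
	fmt.Printf("NumCPU=%d, GOMAXPROCS=%d, so a prefork master would spawn %d workers\n",
		runtime.NumCPU(), runtime.GOMAXPROCS(0), plannedWorkers())
	fmt.Printf("each child then runs with GOMAXPROCS=%d\n", childSetting())
}
```

Running it once with the environment untouched and once with `GOMAXPROCS=1` set should change only the first line of output: the planned worker count follows the env var, while the child setting stays at 1 either way.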
So why ~15 threads per worker when every worker already runs GOMAXPROCS=1? Because GOMAXPROCS caps the number of goroutines running Go code simultaneously (the P count), not the number of OS threads (the M count). A GOMAXPROCS=1 process still spawns threads, for example for goroutines blocked in a read/write syscall: these detach from the P and sit in the kernel.
A network server constantly has goroutines mid-syscall, so the M pool sits around 10-15 even though only one runs Go at any instant. The thread count is the syscall-concurrency floor, not a GOMAXPROCS knob. You can’t tune it down with the env var, and the GOMAXPROCS=1 experiment never could have: the workers were already running with GOMAXPROCS=1.
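The P-versus-M distinction is easy to observe in a toy program. The following is a minimal sketch for Linux or macOS, not a reproduction of Fiber's workload: it pins GOMAXPROCS to 1, parks a few goroutines in a blocking `read(2)` on a raw pipe (a stand-in for any blocking syscall, chosen because raw file descriptors bypass Go's netpoller), and reads the runtime's thread count from the `threadcreate` profile. The helper names and the count of 8 readers are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

// threadsCreated reports how many OS threads (Ms) the runtime has created so far.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

// threadsWhileBlocked parks n goroutines in a blocking read(2) on a raw pipe,
// waits for the runtime to hand the P to fresh threads, and returns the
// thread count observed while the readers are still blocked.
func threadsWhileBlocked(n int) (int, error) {
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return 0, err
	}
	done := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		go func() {
			buf := make([]byte, 1)
			for {
				// Blocking fd: the calling thread sits in the kernel.
				_, err := syscall.Read(fds[0], buf)
				if err != syscall.EINTR {
					break
				}
			}
			done <- struct{}{}
		}()
	}

	// Poll until every reader holds its own thread, or give up after 2s.
	deadline := time.Now().Add(2 * time.Second)
	for threadsCreated() < n+1 && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	seen := threadsCreated()

	// Release the readers (one byte each) and clean up.
	if _, err := syscall.Write(fds[1], make([]byte, n)); err != nil {
		return seen, err
	}
	for i := 0; i < n; i++ {
		<-done
	}
	syscall.Close(fds[0])
	syscall.Close(fds[1])
	return seen, nil
}

func main() {
	runtime.GOMAXPROCS(1)
	before := threadsCreated()
	during, err := threadsWhileBlocked(8)
	if err != nil {
		panic(err)
	}
	fmt.Printf("GOMAXPROCS=%d, threads before=%d, with 8 blocked syscalls=%d\n",
		runtime.GOMAXPROCS(0), before, during)
}
```

The exact numbers vary by machine and Go version, but the thread count with the readers blocked should exceed the number of readers even though GOMAXPROCS is 1.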
One worker, GOMAXPROCS=1, one P. One goroutine runs Go code at a time. At 100 connections it manages 110k RPS; at 10k connections it degrades to 88k as the single P thrashes between more goroutines. This is the same single-threaded ceiling autocannon hits as a client. The 128-worker run is this same single-P worker multiplied by the workers that actually exist, though not at the same per-worker rate: 1.33M RPS across 128 workers is about 10.4k RPS each.
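As a sanity check on the single-worker column, the latencies in the table can be compared with a closed-loop estimate. This sketch assumes that autocannon with `--pipelining 1` keeps exactly one request in flight per connection, so by Little's law the mean latency is roughly connections divided by throughput. The function name is hypothetical; the RPS values are copied from the GOMAXPROCS=1 column above.

```go
package main

import "fmt"

// closedLoopLatencyMs estimates mean latency in milliseconds for a
// closed-loop client with one in-flight request per connection:
// latency ≈ connections / throughput (Little's law).
func closedLoopLatencyMs(conns int, rps float64) float64 {
	return float64(conns) / rps * 1000
}

func main() {
	// GOMAXPROCS=1 column from the results table.
	rows := []struct {
		conns int
		rps   float64
	}{
		{100, 110627},
		{500, 105005},
		{1000, 104205},
		{2000, 97088},
		{5000, 90659},
		{10000, 87856},
	}
	for _, r := range rows {
		fmt.Printf("%6d conns: ~%.1f ms estimated\n", r.conns, closedLoopLatencyMs(r.conns, r.rps))
	}
}
```

At 10,000 connections the estimate comes out near 114 ms, against a measured p50 of 115 ms, which suggests the single worker's latency is almost entirely queueing behind one P.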
Fiber's prefork master spawns runtime.GOMAXPROCS(0) workers, and each worker runs with runtime.GOMAXPROCS(1). The env GOMAXPROCS is a worker-count dial. Its default (NumCPU) gives one worker per core, which is correct: SO_REUSEPORT distributes connections across them. Don't set it to 1.