GOMAXPROCS sets fiber's worker count, not the thread count

GOMAXPROCS prefork threads fiber source

why we tried this

Each prefork worker on c8i.32xlarge runs ~15 OS threads under load (entry 09). 128 workers × 15 threads = ~1,900 OS threads on 128 cores. The hypothesis: that’s overhead, and setting GOMAXPROCS=1 per worker would cut each worker to ~1 thread, reduce kernel scheduling, improve L1/L2 locality. This is how PM2-cluster and Node.js workers run — one event loop per process.

The hypothesis was built on a wrong mental model. Two of them, actually.

setup

# run 1 — prefork, GOMAXPROCS inherits the machine (128 on c8i)
nohup ./fiber_server > /tmp/fiber.log 2>&1 &
sleep 4
echo "processes: $(pgrep -c fiber_server)"

# run 2 — prefork, GOMAXPROCS=1
GOMAXPROCS=1 nohup ./fiber_server > /tmp/fiber_gmp1.log 2>&1 &
sleep 4
echo "processes: $(pgrep -c fiber_server)"

# benchmark — same for both, c8i client
for CONN in 100 500 1000 2000 5000 10000; do
  autocannon -c $CONN --pipelining 1 -w 120 -d 20 \
    --json "http://SERVER_INTERNAL_IP:8083/read"
  sleep 3
done

results

connections	GOMAXPROCS=128 RPS	GOMAXPROCS=1 RPS	p50 (128 / 1)	p99 (128 / 1)
100	555,642	110,627	<1ms / 1ms	<1ms / 2ms
500	1,304,986	105,005	<1ms / 4ms	<1ms / 5ms
1,000	1,319,936	104,205	<1ms / 9ms	1ms / 10ms
2,000	1,326,029	97,088	1ms / 20ms	3ms / 22ms
5,000	1,320,653	90,659	3ms / 55ms	6ms / 63ms
10,000	1,333,453	87,856	6ms / 115ms	17ms / 138ms

Process count:

GOMAXPROCS=128 : 129 processes  (1 master + 128 workers)
GOMAXPROCS=1   :   2 processes  (1 master +   1 worker)

GOMAXPROCS=1 produced one worker, not 128 broken ones. My first instinct was that the fork loop crashed. It didn’t. I had to read the source to see I asked for exactly this.

what the source actually says

fiber/v3.1.0/prefork.go:

// 👶 child process 👶
if IsChild() {
    // use 1 cpu core per child process
    runtime.GOMAXPROCS(1)        // every child resets itself to 1
    ...
    return app.server.Serve(ln)  // SO_REUSEPORT listener
}

// 👮 master process 👮
maxProcs := runtime.GOMAXPROCS(0)   // master reads the env value
for range maxProcs {                 // spawns exactly that many children
    cmd := exec.Command(os.Args[0], os.Args[1:]...)
    ...
}

Two facts, both the opposite of what I assumed:

1. The master spawns runtime.GOMAXPROCS(0) children. Not runtime.NumCPU(). The env GOMAXPROCS directly sets the worker count. 128 → 128 workers, 32 → 32 workers, 1 → 1 worker. The “2 processes” was 1 master + 1 worker. Working as written.

2. Every child overrides to runtime.GOMAXPROCS(1). The comment says it outright: use 1 cpu core per child process. So my entire premise — “cut each worker from GOMAXPROCS=128 down to 1” — was nonsense. The workers were already GOMAXPROCS=1. The env var never touched their internal scheduling. It only ever controlled how many of them exist.

then why ~15 threads per worker, if GOMAXPROCS=1?

Because GOMAXPROCS caps the number of goroutines running Go code simultaneously (the P count), not the number of OS threads (M count). A GOMAXPROCS=1 process still spawns threads:

- 1 thread running Go code at a time (the single P)
- one M per goroutine currently blocked inside a read/write syscall; these detach from the P and sit in the kernel
- fixed runtime threads: sysmon, GC workers, finalizer

A network server constantly has goroutines mid-syscall, so the M pool sits around 10-15 even though only one runs Go at any instant. The thread count is the syscall-concurrency floor, not a GOMAXPROCS knob. You can’t tune it down with the env var, and the GOMAXPROCS=1 experiment never could have: the workers were already running with GOMAXPROCS=1.

why 1 worker = 110k RPS

One worker, GOMAXPROCS=1, one P. One goroutine runs Go code at a time. At 100 connections it manages 110k RPS; at 10k connections it degrades to 88k as the single P thrashes between more goroutines. This is the same single-threaded ceiling autocannon hits as a client. It does not simply multiply up, though: 128 workers at 1.33M works out to about 10.4k RPS per worker, far below 128 × 110k ≈ 14M, so the 128-worker run is capped by something other than per-worker CPU (entry 11 measures it against the NIC).
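A quick check of the per-worker arithmetic, using only figures from the results table. The constant and function names are mine, and integer division drops the fraction.

```go
package main

import "fmt"

// Figures copied from the results table.
const (
	workers         = 128
	peakRPS128      = 1_333_453 // GOMAXPROCS=128 at 10,000 connections
	peakRPS1Worker  = 110_627   // GOMAXPROCS=1 at 100 connections
)

// perWorkerRPS is the average throughput each worker contributes in the
// 128-worker run (integer division, fraction dropped).
func perWorkerRPS() int {
	return peakRPS128 / workers
}

// linearScalingRPS is what 128 workers would deliver if every one of them
// matched the lone worker's best figure.
func linearScalingRPS() int {
	return peakRPS1Worker * workers
}

func main() {
	fmt.Println("per-worker RPS in 128-worker run:", perWorkerRPS())
	fmt.Println("RPS if 128 workers scaled linearly:", linearScalingRPS())
}
```

The two numbers are roughly a factor of ten apart, so the measured 128-worker figure sits well under what 128 independent copies of the single worker would produce.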

The mistake: I reasoned about fiber's threading model from the outside. I counted threads, assumed GOMAXPROCS controlled them, and predicted a tuning win. Every step was wrong, and the benchmark "confirming" a problem (2 processes, low RPS) reinforced the wrong story. The source settled it in four lines. Read the source before theorising about what a library does internally.

Finding: in Fiber v3 prefork, the master spawns runtime.GOMAXPROCS(0) workers and each worker runs runtime.GOMAXPROCS(1). The env GOMAXPROCS is a worker-count dial. Its default (NumCPU) gives one worker per core, which is correct, since SO_REUSEPORT distributes connections across them. Don't set it to 1.

Next: if GOMAXPROCS only changes worker count, does 32 workers vs 128 workers change /read throughput? And does the per-worker thread count actually depend on anything? Measured against the real ceiling, the NIC, in entry 11.
