After the c6i.2xlarge showed no effect, we ran the same three-step experiment on matched c8i.32xlarge hardware (128 cores each) to see if scale changed the result.
The c8i.32xlarge has 16 NIC queues (IRQs 143-158) on interface enp95s0. Pin those IRQs to cores 112-127:
```bash
# cores 112-127 in a 128-core mask
# 4 × 32-bit groups: [bits 127-96] [bits 95-64] [bits 63-32] [bits 31-0]
# cores 112-127 = bits 16-31 of the leftmost group = ffff0000
MASK="ffff0000,00000000,00000000,00000000"
for irq in $(seq 143 158); do
  echo "$MASK" | sudo tee "/proc/irq/$irq/smp_affinity" > /dev/null
done

# verify
cat /proc/irq/143/smp_affinity_list
# 112-127

# restart fiber on cores 0-111
pkill fiber_server; sleep 1
nohup taskset -c 0-111 ./fiber_server > /tmp/fiber.log 2>&1 &
```
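The hand-built mask above is easy to get wrong on machines with more than 32 cores. As a minimal sketch, the same value can be derived from a core range; `cpu_mask` is a hypothetical helper name, not part of any tool used here, and it assumes the comma-separated 32-bit group format that `smp_affinity` and `rps_cpus` accept.

```bash
# cpu_mask FIRST LAST NCPUS (hypothetical helper)
# Prints a mask as comma-separated 32-bit hex groups, most significant
# group first, with the bits for cores FIRST..LAST set.
cpu_mask() {
  local first="$1" last="$2" ncpus="$3"
  local group=$(( (ncpus + 31) / 32 - 1 ))
  local out="" word bit cpu
  while [ "$group" -ge 0 ]; do
    word=0
    bit=0
    while [ "$bit" -lt 32 ]; do
      cpu=$(( group * 32 + bit ))
      if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
        word=$(( word | (1 << bit) ))
      fi
      bit=$(( bit + 1 ))
    done
    out="${out}$(printf '%08x' "$word")"
    if [ "$group" -gt 0 ]; then out="${out},"; fi
    group=$(( group - 1 ))
  done
  printf '%s\n' "$out"
}

cpu_mask 112 127 128   # ffff0000,00000000,00000000,00000000
cpu_mask 0 127 128     # ffffffff,ffffffff,ffffffff,ffffffff
```

The first call reproduces the IRQ mask used above; the second gives the all-cores mask for a 128-core machine.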
```bash
NIC=enp95s0  # c8i uses enp95s0, not ens5

# 128-core bitmask
for f in /sys/class/net/$NIC/queues/rx-*/rps_cpus; do
  echo "ffffffff,ffffffff,ffffffff,ffffffff" | sudo tee "$f" > /dev/null
done

echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > /dev/null

# 32768 total / 16 queues = 2048 per queue
for f in /sys/class/net/$NIC/queues/rx-*/rps_flow_cnt; do
  echo 2048 | sudo tee "$f" > /dev/null
done
```
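The per-queue value is just the global flow table divided evenly across the receive queues. A small sketch of that arithmetic, so the numbers stay consistent if the queue count differs on another instance type; `count_rx_queues` and `rfs_per_queue` are hypothetical helper names, and the optional sysfs root argument exists only to make the function testable.

```bash
# count_rx_queues NIC [SYSFS_ROOT] (hypothetical helper)
# Counts the rx-* queue directories for NIC; SYSFS_ROOT defaults to
# /sys/class/net.
count_rx_queues() {
  local nic="$1" root="${2:-/sys/class/net}"
  local n=0 d
  for d in "$root/$nic"/queues/rx-*; do
    if [ -d "$d" ]; then n=$(( n + 1 )); fi
  done
  echo "$n"
}

# rfs_per_queue TOTAL_ENTRIES NUM_QUEUES (hypothetical helper)
# Prints the rps_flow_cnt to write to each queue.
rfs_per_queue() {
  local total="$1" queues="$2"
  if [ "$queues" -le 0 ]; then
    echo "queue count must be positive" >&2
    return 1
  fi
  echo $(( total / queues ))
}

rfs_per_queue 32768 16   # 2048
# On the server itself: rfs_per_queue 32768 "$(count_rx_queues enp95s0)"
```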
| config | 1000c RPS | p50 | 2000c RPS | p50 | 5000c RPS | p50 |
|---|---|---|---|---|---|---|
| baseline | 18,164,736 | 4ms | 17,426,227 | 9ms | 6,659,743 | 69ms |
| IRQ only | 17,898,223 | 4ms | 17,434,283 | 9ms | 6,624,715 | 70ms |
| IRQ + RPS/RFS | 18,097,698 | 4ms | 17,437,150 | 9ms | 6,294,493 | 76ms |
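Essentially flat: the largest difference is a 5.5% drop at 5000 connections with IRQ + RPS/RFS. The server stays at 26% CPU across all three configurations, so no tuning moves the needle while it has that much headroom.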
| config | cores | 1000c RPS | p99 ms | 5000c RPS | p99 ms |
|---|---|---|---|---|---|
| baseline | 128 | 1,662,225 | 248 | 1,744,862 | 1000 |
| RPS/RFS only | 128 | 1,663,727 | 251 | 1,735,578 | 976 |
| IRQ+RPS/RFS (taskset 0-111) | 112 | 1,515,554 | 304 | 1,595,767 | 1161 |
| IRQ+RPS/RFS (taskset 0-123) | 124 | 1,625,600 | 259 | 1,715,439 | 1102 |
The taskset penalty grows with the number of cores removed, though by less than the share of cores lost (5000c RPS):
128 cores → 1,744,862 RPS (baseline)
124 cores → 1,715,439 RPS (-1.7%) with 4/128 = 3.1% fewer cores
112 cores → 1,595,767 RPS (-8.5%) with 16/128 = 12.5% fewer cores
RPS/RFS alone (no taskset) matched baseline to within about 0.5%: zero cost, zero gain at this load.
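The percentage drops can be recomputed directly from the 5000c column of the table. This is a minimal sketch using awk for the floating-point division; `pct_drop` is a hypothetical helper name.

```bash
# pct_drop BASE VALUE (hypothetical helper)
# Prints how far VALUE falls below BASE, as a percentage with one decimal.
pct_drop() {
  awk -v b="$1" -v v="$2" 'BEGIN { printf "%.1f\n", (b - v) / b * 100 }'
}

BASE_RPS=1744862                 # baseline, 128 cores, 5000c

pct_drop "$BASE_RPS" 1715439     # throughput drop with 124 cores
pct_drop 128 124                 # share of cores removed

pct_drop "$BASE_RPS" 1595767     # throughput drop with 112 cores
pct_drop 128 112                 # share of cores removed
```

Comparing each throughput drop with the matching share of cores removed shows how closely the penalty tracks core count.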
Autocannon pipelining batches many requests into few TCP segments, so it produces high RPS at a low packet rate. To generate realistic packet rates we need vegeta: one request per connection slot, no pipelining.
A single vegeta process caps at ~55-80k RPS (goroutine scheduler limit). Run 8 in parallel, each at rate/8, and merge the results via vegeta encode:
```bash
PARALLEL=8
TARGET=1000000
PER_PROC=$(( TARGET / PARALLEL ))  # 125000 each
WORKERS=$(( PER_PROC / 100 ))      # 1250 workers per process

for i in $(seq 1 $PARALLEL); do
  echo "GET http://SERVER_INTERNAL_IP:8083/simple" | vegeta attack \
    -rate=$PER_PROC \
    -duration=20s \
    -workers=$WORKERS \
    -keepalive=true \
    | vegeta encode > run_${i}.jsonl &
done
wait

# merge JSON lines (NOT binary: gob streams can't be naively cat'd)
cat run_*.jsonl | vegeta report -type=json > results.json
cat run_*.jsonl | vegeta report
```
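To sweep several target rates with the same split, the per-process rate and worker count can be derived for each target before launching anything. The sketch below only prints the plan and does not invoke vegeta; `plan_attack` is a hypothetical helper name, and the rate/100 worker ratio is simply carried over from the script above.

```bash
# plan_attack TARGET_RPS PARALLEL (hypothetical helper)
# Prints the per-process rate and worker count for one sweep step,
# using the same arithmetic as the script above.
plan_attack() {
  local target="$1" parallel="$2"
  local per_proc=$(( target / parallel ))
  local workers=$(( per_proc / 100 ))
  echo "target=$target per_proc=$per_proc workers=$workers"
}

for target in 100000 200000 500000 1000000 1500000 2000000 3000000; do
  plan_attack "$target" 8
done
```

Each printed line gives the values to pass as `-rate` and `-workers` to the 8 vegeta processes for that step.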
| target RPS | actual RPS | p50 ms | p95 ms | p99 ms | success |
|---|---|---|---|---|---|
| 100,000 | 99,982 | 0.17 | 4.1 | 11.4 | 100% |
| 200,000 | 199,914 | 0.22 | 10.4 | 19.6 | 100% |
| 500,000 | 409,269 | 6.3 | 29.5 | 42.3 | 100% |
| 1,000,000 | 522,820 | 8.5 | 44.4 | 59.7 | 100% |
| 1,500,000 | 588,883 | 12.6 | 59.0 | 78.6 | 100% |
| 2,000,000 | 580,209 | 17.2 | 75.8 | 105 | 100% |
| 3,000,000 | 558,839 | 26.4 | 110 | 160 | 100% |
The client plateaus at ~580k actual RPS (peak 588,883) while the server sits at about 5% user CPU. No errors: 100% success rate throughout.
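Dividing the plateau across the 8 vegeta processes gives a per-process rate that can be compared with the ~55-80k single-process cap mentioned earlier. This is a minimal sketch over values from the table; `per_proc_rps` and `achieved_pct` are hypothetical helper names.

```bash
# per_proc_rps ACTUAL_RPS PROCESSES (hypothetical helper)
# Integer average of achieved requests per second per client process.
per_proc_rps() {
  echo $(( $1 / $2 ))
}

# achieved_pct TARGET_RPS ACTUAL_RPS (hypothetical helper)
# Share of the target rate actually delivered, one decimal.
achieved_pct() {
  awk -v t="$1" -v a="$2" 'BEGIN { printf "%.1f\n", a / t * 100 }'
}

per_proc_rps 588883 8        # peak row (1.5M target)
per_proc_rps 522820 8        # 1M target row
achieved_pct 500000 409269   # first row that misses its target
achieved_pct 1000000 522820
```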
Metrics during the 2M-target run:
CLIENT: usr 67%, sys 18%, idle 12% (approaching client ceiling)
SERVER: usr 5%, sys 7%, idle 86% (barely loaded)
Without pipelining, even 8 parallel vegeta processes on a 128-core machine can only drive ~580k RPS to this server. The client overhead per request (goroutine scheduling, TCP stack, stat recording) costs more than the server’s response.
The only thing that consistently made the server the bottleneck was /compute:
For /simple in any configuration, the client hits its ceiling first. To remove this limitation, test against /compute.