IRQ on c8i + vegeta without pipelining: still the client

18M RPS stays flat with tuning. 580k vegeta ceiling. The tool is always the wall.

June 07, 2026 — DONE
c8i IRQ vegeta pipelining packet-rate client-ceiling

c8i IRQ pinning — same experiment, bigger hardware

After the c6i.2xlarge showed no effect, we ran the same three-step experiment on matched c8i.32xlarge hardware (128 cores each) to see if scale changed the result.

IRQ pinning on c8i

c8i.32xlarge has 16 NIC queues (IRQs 143-158) on interface enp95s0. Pin to cores 112-127:

# cores 112-127 in 128-core mask
# 4 × 32-bit groups: [bits127-96] [bits95-64] [bits63-32] [bits31-0]
# cores 112-127 = bits 16-31 of the leftmost group = ffff0000
MASK="ffff0000,00000000,00000000,00000000"

for irq in $(seq 143 158); do
    echo $MASK | sudo tee /proc/irq/$irq/smp_affinity > /dev/null
done

# verify
cat /proc/irq/143/smp_affinity_list
# 112-127

# restart fiber on cores 0-111
pkill fiber_server; sleep 1
nohup taskset -c 0-111 ./fiber_server > /tmp/fiber.log 2>&1 &
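
The mask above was worked out by hand. As a minimal sketch, the same comma-separated format can be derived from a core range; the function name cpu_range_mask is hypothetical and not part of the original setup, and it assumes the kernel's usual layout of 32-bit groups with the highest-numbered CPUs on the left.

```shell
# Build an smp_affinity-style hex mask for an inclusive CPU range.
# Usage: cpu_range_mask FIRST LAST TOTAL_CPUS
# (hypothetical helper, assumes 32-bit groups, most significant first)
cpu_range_mask() {
    first=$1; last=$2; total=$3
    groups=$(( (total + 31) / 32 ))
    out=""
    g=$(( groups - 1 ))
    # Walk groups from the highest-numbered CPUs down to CPU 0.
    while [ "$g" -ge 0 ]; do
        word=0
        bit=0
        while [ "$bit" -lt 32 ]; do
            cpu=$(( g * 32 + bit ))
            if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
                word=$(( word | (1 << bit) ))
            fi
            bit=$(( bit + 1 ))
        done
        out="$out$(printf '%08x' "$word")"
        if [ "$g" -gt 0 ]; then out="$out,"; fi
        g=$(( g - 1 ))
    done
    printf '%s\n' "$out"
}

cpu_range_mask 112 127 128   # ffff0000,00000000,00000000,00000000
```

The result could be assigned to MASK in place of the hard-coded string, which avoids recounting bits when the core split changes.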

RPS/RFS on c8i

NIC=enp95s0   # c8i uses enp95s0, not ens5

# 128-core bitmask
for f in /sys/class/net/$NIC/queues/rx-*/rps_cpus; do
    echo "ffffffff,ffffffff,ffffffff,ffffffff" | sudo tee $f > /dev/null
done

echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > /dev/null

# 32768 total / 16 queues = 2048 per queue
for f in /sys/class/net/$NIC/queues/rx-*/rps_flow_cnt; do
    echo 2048 | sudo tee $f > /dev/null
done
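
The 2048 per queue is the global flow table size divided by the number of RX queues. A small sketch that derives it from the queue directories instead of hard-coding it; rps_flow_cnt_per_queue is a hypothetical helper name, and the directory argument is assumed to be /sys/class/net/$NIC/queues.

```shell
# Per-queue flow count = global flow entries / number of RX queues.
# Usage: rps_flow_cnt_per_queue QUEUES_DIR TOTAL_ENTRIES
# (hypothetical helper; QUEUES_DIR is normally /sys/class/net/$NIC/queues)
rps_flow_cnt_per_queue() {
    queues_dir=$1; total_entries=$2
    n=0
    # Count rx-* entries; the -e check skips an unmatched glob.
    for q in "$queues_dir"/rx-*; do
        if [ -e "$q" ]; then n=$(( n + 1 )); fi
    done
    if [ "$n" -eq 0 ]; then
        echo "no rx queues under $queues_dir" >&2
        return 1
    fi
    echo $(( total_entries / n ))
}

# Example: rps_flow_cnt_per_queue /sys/class/net/enp95s0/queues 32768
```

With 16 rx-* directories and 32768 entries this prints 2048, matching the value written above.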

c8i /simple results — all three steps

config          1000c RPS    p50   2000c RPS    p50   5000c RPS   p50
baseline        18,164,736   4ms   17,426,227   9ms   6,659,743   69ms
IRQ only        17,898,223   4ms   17,434,283   9ms   6,624,715   70ms
IRQ + RPS/RFS   18,097,698   4ms   17,437,150   9ms   6,294,493   76ms

Flat. Server still at 26% CPU across all three configurations. No tuning moves the needle when the server is mostly idle.

c8i /compute results

config                        cores   1000c RPS   p99 ms   5000c RPS   p99 ms
baseline                      128     1,662,225   248      1,744,862   1000
RPS/RFS only                  128     1,663,727   251      1,735,578   976
IRQ+RPS/RFS (taskset 0-111)   112     1,515,554   304      1,595,767   1161
IRQ+RPS/RFS (taskset 0-123)   124     1,625,600   259      1,715,439   1102

The taskset penalty grows with the number of cores taken away from the server, though by less than the share of cores lost (5000c RPS):

128 cores → 1,744,862 RPS  (baseline)
124 cores → 1,715,439 RPS  (-1.7%)   with 4/128 = 3.1% fewer cores
112 cores → 1,595,767 RPS  (-8.5%)   with 16/128 = 12.5% fewer cores
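
The percentages can be recomputed from the 5000c column of the /compute table. A minimal sketch using awk; taskset_penalty is a hypothetical helper name, and the ratio it prints is throughput loss divided by core loss (1.00 would mean strictly proportional).

```shell
# Compare throughput lost against the share of cores removed.
# Usage: taskset_penalty BASE_RPS BASE_CORES RPS CORES
# (hypothetical helper; inputs are the 5000c RPS figures from the table)
taskset_penalty() {
    awk -v b="$1" -v bc="$2" -v r="$3" -v c="$4" 'BEGIN {
        rps_loss  = (b - r) / b * 100
        core_loss = (bc - c) / bc * 100
        printf "%d cores: -%.1f%% RPS for %.1f%% fewer cores (ratio %.2f)\n",
               c, rps_loss, core_loss, rps_loss / core_loss
    }'
}

taskset_penalty 1744862 128 1715439 124
taskset_penalty 1744862 128 1595767 112
```

A ratio below 1.00 in both rows suggests the server loses less throughput than the raw share of cores removed.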

RPS/RFS alone (no taskset) stayed within 1% of baseline: zero cost, zero gain at this load.

switching to vegeta: removing pipelining

Autocannon pipelining batches many requests into few TCP segments. High RPS, low packet rate. To generate realistic packet rates we need vegeta — one request per connection slot, no pipelining.
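
Packet rate can be checked on either host while a tool runs. A minimal sketch, assuming the kernel's cumulative per-interface counters under /sys/class/net/<iface>/statistics/; the function names pps and sample_pps are hypothetical and not part of the original runs.

```shell
# Packets per second between two readings of a cumulative counter.
# Usage: pps BEFORE AFTER SECONDS
pps() {
    echo $(( ($2 - $1) / $3 ))
}

# Sample a counter file twice, SECONDS apart (default 1), and print the rate.
# COUNTER_FILE is normally /sys/class/net/$NIC/statistics/rx_packets.
sample_pps() {
    counter_file=$1; seconds=${2:-1}
    before=$(cat "$counter_file")
    sleep "$seconds"
    after=$(cat "$counter_file")
    pps "$before" "$after" "$seconds"
}

# Example: sample_pps /sys/class/net/enp95s0/statistics/rx_packets 5
```

Dividing the reported RPS by this number gives requests per received packet, which is one way to see how much batching a client is doing.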

vegeta parallel script

Single vegeta process caps at ~55-80k RPS (goroutine scheduler limit). Run 8 in parallel, each at rate/8, merge results via vegeta encode:

PARALLEL=8
TARGET=1000000
PER_PROC=$(( TARGET / PARALLEL ))   # 125000 each
WORKERS=$(( PER_PROC / 100 ))       # 1250 workers per process

for i in $(seq 1 $PARALLEL); do
    echo "GET http://SERVER_INTERNAL_IP:8083/simple" | vegeta attack \
        -rate=$PER_PROC \
        -duration=20s \
        -workers=$WORKERS \
        -keepalive=true \
        | vegeta encode > run_${i}.jsonl &
done
wait

# merge JSON lines (NOT binary — gob streams can't be naively cat'd)
cat run_*.jsonl | vegeta report -type=json > results.json
cat run_*.jsonl | vegeta report

vegeta results — /simple, 8 parallel processes

target RPS   actual RPS   p50 ms   p95 ms   p99 ms   success
100,000      99,982       0.17     4.1      11.4     100%
200,000      199,914      0.22     10.4     19.6     100%
500,000      409,269      6.3      29.5     42.3     100%
1,000,000    522,820      8.5      44.4     59.7     100%
1,500,000    588,883      12.6     59.0     78.6     100%
2,000,000    580,209      17.2     75.8     105      100%
3,000,000    558,839      26.4     110      160      100%

Client hits ~580k actual RPS and plateaus. Server at 5% user CPU, 86% idle. No errors, 100% success rate throughout.
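
To see where the plateau sits relative to the single-process cap, the table rows can be reduced to achieved share and per-process rate. A small sketch using awk; attack_summary is a hypothetical helper name, and the process count of 8 comes from the parallel script.

```shell
# Achieved share of the target rate and the per-process rate.
# Usage: attack_summary TARGET_RPS ACTUAL_RPS PROCESSES
# (hypothetical helper; inputs are rows from the vegeta results table)
attack_summary() {
    awk -v t="$1" -v a="$2" -v p="$3" 'BEGIN {
        printf "target %d: achieved %.1f%%, %d RPS per process\n",
               t, a / t * 100, a / p
    }'
}

attack_summary 1000000 522820 8
attack_summary 1500000 588883 8
attack_summary 2000000 580209 8
```

At the peak row each process delivers roughly 73k RPS, which falls inside the ~55-80k single-process range quoted for vegeta above.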

Metrics during the 2M target run:

CLIENT: usr: 67%   sys: 18%   idle: 12%   — approaching client ceiling
SERVER: usr:  5%   sys:  7%   idle: 86%   — barely loaded

Without pipelining, even 8 parallel vegeta processes on a 128-core machine can only drive ~580k RPS to this server. The client overhead per request (goroutine scheduling, TCP stack, stat recording) costs more than the server’s response.

what would actually stress the server

The only thing that consistently made the server the bottleneck was /compute:

  • Server at 94% CPU, client at 7%
  • 1.7M RPS, p50 6ms

For /simple, under every tuning configuration tested, the client hits its ceiling first. To remove this limitation:

  1. Multiple client machines simultaneously
  2. Client and server on same machine (HAProxy approach — client overhead is free on loopback)
  3. c6n instances with 100 Gbps NIC — at that bandwidth, packet rates approach the HAProxy regime

Finding: Removing pipelining did not help. It lowered client throughput and also lowered packet rates significantly. The client ceiling exists regardless of benchmark tool. The only reproducible way to stress this server is via CPU-bound routes like /compute.
