IRQ on c8i + vegeta without pipelining: still the client


c8i IRQ pinning — same experiment, bigger hardware

After the c6i.2xlarge showed no effect, we ran the same three-step experiment on matched c8i.32xlarge hardware (128 cores each) to see if scale changed the result.

IRQ pinning on c8i

c8i.32xlarge has 16 NIC queues (IRQs 143-158) on interface enp95s0. Pin to cores 112-127:

# cores 112-127 in 128-core mask
# 4 × 32-bit groups: [bits127-96] [bits95-64] [bits63-32] [bits31-0]
# cores 112-127 = bits 16-31 of the leftmost group = ffff0000
MASK="ffff0000,00000000,00000000,00000000"

for irq in $(seq 143 158); do
    echo $MASK | sudo tee /proc/irq/$irq/smp_affinity > /dev/null
done

# verify
cat /proc/irq/143/smp_affinity_list
# 112-127

# restart fiber on cores 0-111
pkill fiber_server; sleep 1
nohup taskset -c 0-111 ./fiber_server > /tmp/fiber.log 2>&1 &
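
The mask above was worked out by hand. As a cross-check, the same comma-separated format (32-bit hex groups, most significant group first) can be generated for any contiguous core range. This is a minimal sketch in plain POSIX shell; cpu_mask is a hypothetical helper name introduced here, not part of any tool used in this post:

```shell
# cpu_mask FIRST LAST NCPUS  (hypothetical helper)
# Prints an smp_affinity-style mask: 32-bit hex groups, highest CPUs first.
# The body runs in a subshell, so its variables do not leak into the caller.
cpu_mask() (
    first=$1; last=$2; ncpus=$3
    g=$(( (ncpus + 31) / 32 - 1 ))      # index of the most significant group
    out=""
    while [ "$g" -ge 0 ]; do
        word=0
        bit=0
        while [ "$bit" -lt 32 ]; do
            cpu=$(( g * 32 + bit ))
            if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
                word=$(( word | (1 << bit) ))
            fi
            bit=$(( bit + 1 ))
        done
        out="$out$(printf '%08x' "$word")"
        if [ "$g" -gt 0 ]; then out="$out,"; fi
        g=$(( g - 1 ))
    done
    printf '%s\n' "$out"
)

cpu_mask 112 127 128   # IRQ cores:    ffff0000,00000000,00000000,00000000
cpu_mask 0 111 128     # server cores: 0000ffff,ffffffff,ffffffff,ffffffff
```

The second call is the complement of the first: the cores left for fiber_server under taskset -c 0-111.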

RPS/RFS on c8i

NIC=enp95s0   # c8i uses enp95s0, not ens5

# 128-core bitmask
for f in /sys/class/net/$NIC/queues/rx-*/rps_cpus; do
    echo "ffffffff,ffffffff,ffffffff,ffffffff" | sudo tee $f > /dev/null
done

echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries > /dev/null

# 32768 total / 16 queues = 2048 per queue
for f in /sys/class/net/$NIC/queues/rx-*/rps_flow_cnt; do
    echo 2048 | sudo tee $f > /dev/null
done
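
The per-queue value is just the global flow table size divided by the number of rx queues. A dry-run sketch that counts the queues instead of hard-coding 16 and prints what would be written, without touching sysfs; rfs_plan is a hypothetical name, and the queue directory is passed in so it can be pointed at any path:

```shell
# rfs_plan QUEUE_DIR TOTAL_ENTRIES  (hypothetical helper)
# Prints "<file> <value>" for every rx queue; it does not write anything.
rfs_plan() (
    dir=$1; total=$2
    n=0
    for q in "$dir"/rx-*; do
        if [ -d "$q" ]; then n=$(( n + 1 )); fi
    done
    if [ "$n" -eq 0 ]; then
        echo "no rx queues under $dir" >&2
        exit 1                      # leaves the subshell; the function returns 1
    fi
    per_queue=$(( total / n ))
    for q in "$dir"/rx-*; do
        if [ -d "$q" ]; then echo "$q/rps_flow_cnt $per_queue"; fi
    done
)
```

Run it as rfs_plan /sys/class/net/enp95s0/queues 32768; with 16 rx queues every line should end in 2048.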

c8i /simple results — all three steps

config	1000c RPS	1000c p50	2000c RPS	2000c p50	5000c RPS	5000c p50
baseline	18,164,736	4ms	17,426,227	9ms	6,659,743	69ms
IRQ only	17,898,223	4ms	17,434,283	9ms	6,624,715	70ms
IRQ + RPS/RFS	18,097,698	4ms	17,437,150	9ms	6,294,493	76ms

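Essentially flat: throughput is within 1.5% of baseline at 1000 and 2000 connections, and the 5000-connection IRQ + RPS/RFS run is 5.5% lower. The server stays at 26% CPU across all three configurations, so tuning has nothing to recover while the server is mostly idle.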

c8i /compute results

config	cores	1000c RPS	1000c p99 (ms)	5000c RPS	5000c p99 (ms)
baseline	128	1,662,225	248	1,744,862	1000
RPS/RFS only	128	1,663,727	251	1,735,578	976
IRQ+RPS/RFS (taskset 0-111)	112	1,515,554	304	1,595,767	1161
IRQ+RPS/RFS (taskset 0-123)	124	1,625,600	259	1,715,439	1102

The taskset penalty grows with the number of cores taken from the server, though less than proportionally (5000c column):

128 cores → 1,744,862 RPS  (baseline)
124 cores → 1,715,439 RPS  (-1.7%)   with 4/128 = 3.1% fewer cores
112 cores → 1,595,767 RPS  (-8.5%)   with 16/128 = 12.5% fewer cores
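
The percentages can be recomputed from the /compute table. A small sketch using awk for the floating-point division; pct_drop is a name introduced here for illustration:

```shell
# pct_drop BASE VALUE: percent by which VALUE falls below BASE, one decimal
pct_drop() {
    awk -v b="$1" -v v="$2" 'BEGIN { printf "%.1f\n", (b - v) * 100 / b }'
}

# 5000c /compute throughput from the table, next to the matching core counts
echo "124 cores: RPS -$(pct_drop 1744862 1715439)%, cores -$(pct_drop 128 124)%"
echo "112 cores: RPS -$(pct_drop 1744862 1595767)%, cores -$(pct_drop 128 112)%"
```

Putting the two figures side by side on each line shows how closely the throughput loss follows the core loss.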

RPS/RFS alone (no taskset) stayed within 1% of baseline throughput: no measurable cost and no gain at this load.

switching to vegeta: removing pipelining

Autocannon pipelining batches many requests into few TCP segments. High RPS, low packet rate. To generate realistic packet rates we need vegeta — one request per connection slot, no pipelining.

vegeta parallel script

Single vegeta process caps at ~55-80k RPS (goroutine scheduler limit). Run 8 in parallel, each at rate/8, merge results via vegeta encode:

PARALLEL=8
TARGET=1000000
PER_PROC=$(( TARGET / PARALLEL ))   # 125000 each
WORKERS=$(( PER_PROC / 100 ))       # 1250 workers per process

for i in $(seq 1 $PARALLEL); do
    echo "GET http://SERVER_INTERNAL_IP:8083/simple" | vegeta attack \
        -rate=$PER_PROC \
        -duration=20s \
        -workers=$WORKERS \
        -keepalive=true \
        | vegeta encode > run_${i}.jsonl &
done
wait

# merge JSON lines (NOT binary — gob streams can't be naively cat'd)
cat run_*.jsonl | vegeta report -type=json > results.json
cat run_*.jsonl | vegeta report

vegeta results — /simple, 8 parallel processes

target RPS	actual RPS	p50 ms	p95 ms	p99 ms	success
100,000	99,982	0.17	4.1	11.4	100%
200,000	199,914	0.22	10.4	19.6	100%
500,000	409,269	6.3	29.5	42.3	100%
1,000,000	522,820	8.5	44.4	59.7	100%
1,500,000	588,883	12.6	59.0	78.6	100%
2,000,000	580,209	17.2	75.8	105	100%
3,000,000	558,839	26.4	110	160	100%

Client hits ~580k actual RPS and plateaus. Server at 5% CPU. No errors, 100% success rate throughout.
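
One way to see the plateau is to express each run as a percentage of its target and as a per-process rate. A sketch over the table values (the numbers are copied from the table; summarize is a hypothetical name):

```shell
PARALLEL=8   # vegeta processes used for these runs

# target and actual RPS, copied from the results table
results="100000 99982
200000 199914
500000 409269
1000000 522820
1500000 588883
2000000 580209
3000000 558839"

# columns: target, actual, percent of target achieved, actual RPS per process
summarize() {
    printf '%s\n' "$results" | awk -v p="$PARALLEL" \
        '{ printf "%d %d %.0f%% %d\n", $1, $2, $2 * 100 / $1, $2 / p }'
}
summarize
```

From the 1M target upward, the per-process column settles at roughly 65k-74k, inside the 55-80k range quoted for a single vegeta process.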

CPU usage during the 2M target run:

CLIENT: usr: 67%   sys: 18%   idle: 12%   — approaching client ceiling
SERVER: usr:  5%   sys:  7%   idle: 86%   — barely loaded

Without pipelining, even 8 parallel vegeta processes on a 128-core machine can only drive ~580k RPS to this server. The client overhead per request (goroutine scheduling, TCP stack, stat recording) costs more than the server’s response.

what would actually stress the server

The only thing that consistently made the server the bottleneck was /compute:

Server at 94% CPU, client at 7%
1.7M RPS, p50 6ms

For /simple, the client hits its ceiling first under every configuration tested. To remove this limitation:

Multiple client machines simultaneously
Client and server on same machine (HAProxy approach — client overhead is free on loopback)
c6n instances with 100 Gbps NIC — at that bandwidth, packet rates approach the HAProxy regime

Finding: Removing pipelining did not help; it made things worse for client throughput while also lowering packet rates significantly. The client ceiling exists regardless of benchmark tool. The only reproducible way to stress this server is via CPU-bound routes like /compute.

Next: c6n two-box setup with a 100 Gbps NIC. Without pipelining at 100 Gbps, packet rates approach 10M pps, the regime where IRQ core dedication becomes necessary.
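
For scale, 10M pps is what a 100 Gbps link carries when filled with packets of about 1250 bytes. The packet sizes below are assumptions for illustration, not measurements from these runs, and the sketch ignores Ethernet framing overhead; pps is a hypothetical helper name:

```shell
# pps GBPS BYTES_PER_PACKET: packets per second at line rate (integer math)
pps() {
    echo $(( $1 * 1000000000 / (8 * $2) ))
}

pps 100 1250   # 10000000  (assumed 1250-byte packets)
pps 100 1500   # 8333333   (assumed 1500-byte packets)
```

Smaller packets push the line-rate figure higher, provided the client can actually generate them.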
