gzip moved the wall, and exposed a lie

gzip compression bandwidth NIC MTU jumbo-frames c6in c8i client-ceiling

why this

Entry 11 closed /read on c8i as NIC-bandwidth-bound: 4.5KB × 1.32M = 6.1 GB/s ≈ line rate. To go faster you change the bytes on the wire. Three levers on the cost-efficient c6in.8xlarge: bigger frames (fewer header bytes per packet), huge pages (less CPU per request, not fewer bytes), and compression (fewer bytes per response).

One of them was already done for us. ip link show ens5: MTU 9001. AWS ENA defaults to jumbo frames inside a VPC. Nothing to tune. Remember that number — it decides how much the third lever is worth.

That leaves compression. Pre-gzip the static pool once at startup, serve the bytes with Content-Encoding: gzip, zero per-request compression cost.

setup

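var fiberProductPoolGz [fiberPoolSize][]byte

// At startup, alongside the raw JSON pool: gzip each entry once.
for i := range fiberProductPoolGz {
    var buf bytes.Buffer
    gw, _ := gzip.NewWriterLevel(&buf, gzip.BestCompression) // errors only on an invalid level
    gw.Write(fiberProductPoolJSON[i]) // writes into a bytes.Buffer do not fail
    if err := gw.Close(); err != nil { // Close flushes the gzip trailer
        log.Fatal(err)
    }
    fiberProductPoolGz[i] = append([]byte(nil), buf.Bytes()...)
}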

app.Get("/read-gz", func(c fiber.Ctx) error {
    c.Set("Content-Type", "application/json")
    c.Set("Content-Encoding", "gzip")
    return c.Send(fiberProductPoolGz[nextIdx()])
})

/read (raw) and /read-gz (pre-gzipped) live on the same server, same data. A/B at a fixed client config. The startup log prints the ratio:

pool: raw avg 4520B  gz avg 2289B  ratio 1.97x

1.97x, not the 3x you’d hope. The descriptions are random word salad and the UUIDs and image URLs are high-entropy. gzip halves it. That’s all there is.
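The ratio line is just average size before and after compression. A minimal sketch of that measurement, assuming the pool is a slice of JSON byte slices; `gzipBytes`, `poolStats`, and the sample data are hypothetical stand-ins, not the benchmark's code:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
)

// gzipBytes compresses b once at the given level and returns the result.
func gzipBytes(b []byte, level int) ([]byte, error) {
	var buf bytes.Buffer
	gw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return nil, err
	}
	if _, err := gw.Write(b); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil { // Close writes the gzip trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

// poolStats (hypothetical helper) returns the average raw and gzipped sizes.
func poolStats(pool [][]byte) (float64, float64, error) {
	if len(pool) == 0 {
		return 0, 0, errors.New("empty pool")
	}
	var rawSum, gzSum int
	for _, p := range pool {
		gz, err := gzipBytes(p, gzip.BestCompression)
		if err != nil {
			return 0, 0, err
		}
		rawSum += len(p)
		gzSum += len(gz)
	}
	n := float64(len(pool))
	return float64(rawSum) / n, float64(gzSum) / n, nil
}

func main() {
	// Stand-in data: repetitive JSON compresses far better than the real pool.
	pool := [][]byte{
		bytes.Repeat([]byte(`{"name":"widget","price":10}`), 100),
		bytes.Repeat([]byte(`{"name":"gadget","price":20}`), 100),
	}
	raw, gz, err := poolStats(pool)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pool: raw avg %.0fB  gz avg %.0fB  ratio %.2fx\n", raw, gz, raw/gz)
}
```

On repetitive input like this the ratio comes out far above 2x, which is the point: the ratio is a property of the data, and high-entropy fields are what hold the real pool down.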

first run: gzip made it slower

c6in.8xlarge client, autocannon -w 32 --pipelining 1, 500 connections, 20s.

endpoint	RPS	server CPU	NIC
/read	794,310	75.4%	55%
/read-gz	417,619	39.7%	15%

Half the throughput. The bytes got smaller and RPS fell off a cliff.

The server CPU also fell — 75% to 40%. That looks like gzip worked: less server CPU. It didn’t. The server CPU fell because fewer requests reached it. Look at where the work went: Content-Encoding: gzip means the client decompresses every response. autocannon was already the bottleneck at 794k — the server sat at 75% CPU with the NIC half-empty, waiting on the client. Add decompression to the client’s job and it chokes sooner. 418k.

gzip didn’t make the server slower. It made the client slower, and the client was the wall.
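autocannon is a Node.js tool, so the Go below is only an illustration of the kind of extra work a decompressing client takes on for each response: one inflate pass over the body. `compress` and `inflate` are hypothetical helpers, and the zero-filled body is a placeholder, not the real payload:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress stands in for the server's one-time pre-compression at startup.
func compress(b []byte) ([]byte, error) {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	if _, err := gw.Write(b); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// inflate is the per-response cost on the client side: it decompresses the
// body, discards the output, and returns the decompressed byte count.
func inflate(gz []byte) (int64, error) {
	zr, err := gzip.NewReader(bytes.NewReader(gz))
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	return io.Copy(io.Discard, zr)
}

func main() {
	body := make([]byte, 4520) // placeholder body, same size as the raw average
	gz, err := compress(body)
	if err != nil {
		panic(err)
	}
	n, err := inflate(gz)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wire bytes %d, bytes after inflate %d\n", len(gz), n)
}
```

The server pays for `compress` once per pool entry. The client pays for `inflate` on every response, which is why the cost lands on whichever side has the least headroom.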

the lie that surfaced

If the c6in client is the wall at 794k, then 794k was never the server’s ceiling. And entry 08 called c6in /read “790k, CPU-bound at 85%.” That was the same weak client. We measured the load generator, not the server.

Swap the client for a c8i.32xlarge — 128 vCPU, four times the cores. Ramp /read (raw), -w 64:

connections	RPS	server CPU	NIC
200	817,852	69.7%	53%
350	1,128,260	80.3%	78%
500	1,309,321	83.1%	89%
1000	1,326,524	83.4%	90%

c6in /read does 1.33M, not 790k. And the wall is the NIC at 90%: 1.33M × 4520B = 6.0 GB/s ≈ 48 Gbps of response bodies alone, against a 50 Gbps line rate. Same bandwidth wall c8i hit in entry 11. The “CPU-bound” story was a client-bound artifact. The 32-core box is bandwidth-bound on /read, same as the 128-core box, because the payload and the NIC are the same on both.
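The bandwidth figure is request rate times payload size. A minimal sketch of that arithmetic; `wireGbps` is a hypothetical helper, and it counts only the bytes passed in, so TCP/IP and HTTP header overhead comes on top:

```go
package main

import "fmt"

// wireGbps converts a request rate and a per-response byte count into Gbit/s.
// It uses decimal units (1 Gbit = 1e9 bits), the same units as NIC line rates.
func wireGbps(rps, bytesPerResp float64) float64 {
	return rps * bytesPerResp * 8 / 1e9
}

func main() {
	// Inputs are the measured ceiling and the average raw body size from the pool log.
	fmt.Printf("raw body at 1.33M RPS: %.1f Gbps\n", wireGbps(1.33e6, 4520))
}
```

For the raw pool at 1.33M RPS this works out to about 48 Gbps of body bytes.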

then gzip does the thing it is for

Now the wall is bandwidth, and gzip halves the bandwidth. c8i client, -w 96, 1000 connections, reproduced:

endpoint	RPS	server CPU	NIC	bound by
/read	1,311,710	82.5%	90.1%	NIC
/read-gz	1,721,105	79.9%	60.8%	CPU
/read-gz	1,741,517	80.5%	61.7%	CPU

+32%. 1.31M → 1.73M. gzip halved the bytes, the NIC load dropped from 90% to 61%, the bandwidth wall is gone, and RPS climbs until it hits the next wall: CPU at 80%. gzip didn’t make the server faster. It moved the bottleneck from the NIC to the CPU, and the CPU wall sits at a higher RPS.

(Same gzip, opposite of the first run. There the client was the wall and gzip overloaded it. Here the client has 128 cores to spare and the wall is bandwidth, which gzip relieves. Whether gzip helps or hurts is decided entirely by what the bottleneck is and whether the client can afford to decompress.)
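A quick sanity check on the NIC column: scale the raw run's utilization by the change in request rate and divide by the compression ratio. This is a sketch under the assumption that NIC load is proportional to body bytes sent; it ignores headers, which gzip does not shrink. `nicUtilAfter` is a hypothetical helper:

```go
package main

import "fmt"

// nicUtilAfter predicts NIC utilization after compression from the
// utilization before it, the RPS before and after, and the compression ratio.
func nicUtilAfter(utilBefore, rpsBefore, rpsAfter, ratio float64) float64 {
	return utilBefore * (rpsAfter / rpsBefore) / ratio
}

func main() {
	// Inputs come from the /read row and the first /read-gz row of the table.
	got := nicUtilAfter(0.901, 1311710, 1721105, 1.97)
	fmt.Printf("predicted NIC after gzip: %.1f%%\n", got*100)
}
```

The prediction lands at about 60%, close to the measured 60.8%, so the NIC numbers are consistent with a 1.97x smaller body and 31% more requests.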

why 32% and not 2x

The bytes halved. RPS rose 32%. The gap is two things.

The NIC was not the only thing near its limit. At the raw ceiling the NIC was at 90% but CPU was already at 82%. gzip removed the NIC wall; CPU was right behind it. You only get back the headroom between the old wall and the next one.

MTU 9001 means both payloads are one packet. The raw response is ~4.7KB, the gz response ~2.4KB, and the MSS is ~8949. Both fit in a single TCP segment. gzip removed zero packets and zero write syscalls. The per-request server saving is only the bytes copied through the write path:

per-request server CPU at the ceiling:
  /read     0.825 × 32 cores / 1.31M = 20.1µs
  /read-gz  0.80  × 32 cores / 1.73M = 14.8µs   → 26% cheaper
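The same arithmetic as a small Go sketch, in case you want to rerun it on other rows. Utilization times core count gives busy core-seconds per second; dividing by RPS gives core time per request. `cpuMicrosPerReq` is a hypothetical helper:

```go
package main

import "fmt"

// cpuMicrosPerReq returns server CPU time per request in microseconds.
// util is a fraction (0..1) of total CPU across all cores.
func cpuMicrosPerReq(util float64, cores int, rps float64) float64 {
	return util * float64(cores) / rps * 1e6
}

func main() {
	fmt.Printf("/read     %.1fµs\n", cpuMicrosPerReq(0.825, 32, 1.31e6))
	fmt.Printf("/read-gz  %.1fµs\n", cpuMicrosPerReq(0.80, 32, 1.73e6))
}
```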

26% less CPU per request — copying 2KB fewer bytes, not skipping a syscall. If the MTU were 1500, the raw response would be 4 packets and gz would be 2, and gzip would have cut packet rate and syscall count too. At 9001 it can’t. The jumbo frames that came for free capped the size of this win.
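The packet counts follow from a ceiling division of response size by MSS. A minimal sketch; `segments` is a hypothetical helper, and the MSS for MTU 1500 is an assumption derived by applying the same 52-byte overhead implied by this entry's numbers (9001 − 8949):

```go
package main

import "fmt"

// segments returns how many TCP segments an n-byte response needs at the given MSS.
func segments(n, mss int) int {
	return (n + mss - 1) / mss // ceiling division
}

func main() {
	const overhead = 9001 - 8949 // per-packet overhead implied by MTU 9001 and MSS ~8949
	mssJumbo := 9001 - overhead
	mssStd := 1500 - overhead // assumed: same overhead at a standard MTU

	for _, r := range []struct {
		name string
		size int
	}{{"raw", 4700}, {"gz", 2400}} {
		fmt.Printf("%-3s  MTU 9001: %d segment(s)  MTU 1500: %d segment(s)\n",
			r.name, segments(r.size, mssJumbo), segments(r.size, mssStd))
	}
}
```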

Finding

c6in.8xlarge /read is NIC-bandwidth-bound at 1.31M RPS (90% of 50 Gbps), not CPU-bound; the earlier 790k was the load generator's ceiling. Pre-gzip (1.97x) raises the ceiling to 1.73M (+32%) by halving the bytes per response, which drops NIC load to 61% and shifts the wall to CPU at 80%. Per-request server CPU drops 26% (20.1µs → 14.8µs), all from fewer bytes copied: at MTU 9001 both payloads are a single packet, so no syscalls are saved.

Two mistakes, one cause

Entry 08 called c6in /read "CPU-bound at 790k." It was client-bound, measured with a 32-core c6in client that couldn't generate more load. And the first gzip run "showed" gzip cutting server CPU; it only cut the requests reaching the server, because the same weak client choked on decompression. Both errors come from the same blind spot: not checking who the bottleneck is before reading the result. A benchmark where the client is the wall tells you about the client.
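One way to make that check explicit before reading a result is a sketch like the one below. It only makes sense for a row taken at the RPS plateau, and the two cutoffs are assumptions picked for illustration, not measured constants; `Sample` and `bottleneck` are hypothetical names:

```go
package main

import "fmt"

// Sample is one benchmark row at the RPS plateau; both fields are fractions (0..1).
type Sample struct {
	ServerCPU float64
	NIC       float64
}

// bottleneck names the likely wall for a plateau sample. cpuLimit and nicLimit
// are caller-supplied cutoffs for "near the limit".
func bottleneck(s Sample, cpuLimit, nicLimit float64) string {
	switch {
	case s.NIC >= nicLimit:
		return "NIC"
	case s.ServerCPU >= cpuLimit:
		return "CPU"
	default:
		// The server has headroom on both: suspect the load generator.
		return "client"
	}
}

func main() {
	const cpuLimit, nicLimit = 0.79, 0.88 // assumed cutoffs, for illustration only
	rows := []struct {
		name string
		s    Sample
	}{
		{"/read, c6in client", Sample{ServerCPU: 0.754, NIC: 0.55}},
		{"/read-gz, c6in client", Sample{ServerCPU: 0.397, NIC: 0.15}},
		{"/read, c8i client", Sample{ServerCPU: 0.825, NIC: 0.901}},
		{"/read-gz, c8i client", Sample{ServerCPU: 0.799, NIC: 0.608}},
	}
	for _, r := range rows {
		fmt.Printf("%-22s -> %s\n", r.name, bottleneck(r.s, cpuLimit, nicLimit))
	}
}
```

The useful part is the default branch: if neither server resource is near its limit and RPS has stopped climbing, the number describes the client.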

Next

/read-gz is now CPU-bound at 1.73M with the NIC at 61%, back in the regime where CPU-per-request tuning pays. The remaining lever from the top of this entry is huge pages: cut TLB misses on the pool, lower CPU per request. The catch: virtualized EC2 reports <not supported> for hardware TLB counters, so the mechanism can't be measured directly, only the RPS that comes out the other end.
