Entry 11 closed /read on c8i as NIC-bandwidth-bound: 4.5KB × 1.32M = 6.1 GB/s ≈ line rate. To go faster you have to put fewer bytes on the wire or spend less CPU per request. Three levers on the cost-efficient c6in.8xlarge: bigger frames (fewer header bytes per packet), huge pages (less CPU per request, same bytes), and compression (fewer bytes per response).
One of them was already done for us. ip link show ens5: MTU 9001. AWS ENA defaults to jumbo frames inside a VPC. Nothing to tune. Remember that number — it decides how much the third lever is worth.
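The same check can be done from Go at startup, so the MTU lands in the log next to the benchmark numbers. This is a minimal sketch, not code from the server: interfaceMTU is a hypothetical helper, and "ens5" is just the interface name from the command above.

```go
package main

import (
	"fmt"
	"net"
)

// interfaceMTU returns the MTU the kernel reports for the named interface.
// Hypothetical helper for illustration; not part of the benchmark server.
func interfaceMTU(name string) (int, error) {
	iface, err := net.InterfaceByName(name)
	if err != nil {
		return 0, fmt.Errorf("lookup %q: %w", name, err)
	}
	return iface.MTU, nil
}

func main() {
	// "ens5" is the interface name from the text; it will differ on other hosts.
	mtu, err := interfaceMTU("ens5")
	if err != nil {
		fmt.Println("could not read MTU:", err)
		return
	}
	fmt.Printf("ens5 MTU %d, jumbo frames: %v\n", mtu, mtu > 1500)
}
```
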
That leaves compression. Pre-gzip the static pool once at startup, serve the bytes with Content-Encoding: gzip, zero per-request compression cost.
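var fiberProductPoolGz [fiberPoolSize][]byte

// At startup, for each pool entry i, alongside the raw JSON pool (needs bytes, compress/gzip, log):
var buf bytes.Buffer
gw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
if err != nil {
	log.Fatalf("gzip writer: %v", err)
}
if _, err := gw.Write(fiberProductPoolJSON[i]); err != nil {
	log.Fatalf("gzip write: %v", err)
}
if err := gw.Close(); err != nil { // Close flushes the remaining data and the gzip trailer
	log.Fatalf("gzip close: %v", err)
}
fiberProductPoolGz[i] = append([]byte(nil), buf.Bytes()...) // copy out of buf

// Serve the pre-compressed bytes as-is; no per-request compression.
app.Get("/read-gz", func(c fiber.Ctx) error {
	c.Set("Content-Type", "application/json")
	c.Set("Content-Encoding", "gzip")
	return c.Send(fiberProductPoolGz[nextIdx()])
})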
/read (raw) and /read-gz (pre-gzipped) live on the same server, same data. A/B at a fixed client config. The startup log prints the ratio:
pool: raw avg 4520B gz avg 2289B ratio 1.97x
1.97x, not the 3x you’d hope. The descriptions are random word salad and the UUIDs and image URLs are high-entropy. gzip halves it. That’s all there is.
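The entropy effect is easy to reproduce in isolation. The sketch below compares the gzip ratio of a repetitive JSON fragment against pseudo-random bytes. Both inputs are invented stand-ins, not the benchmark pool, and gzipRatio, repetitiveSample and randomSample are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"math/rand"
	"strings"
)

// gzipRatio returns len(raw) / len(gzip(raw)) at BestCompression.
func gzipRatio(raw []byte) float64 {
	var buf bytes.Buffer
	gw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		panic(err)
	}
	if _, err := gw.Write(raw); err != nil {
		panic(err)
	}
	if err := gw.Close(); err != nil {
		panic(err)
	}
	return float64(len(raw)) / float64(buf.Len())
}

// repetitiveSample is an invented low-entropy payload: the same fragment 100 times.
func repetitiveSample() []byte {
	return []byte(strings.Repeat(`{"status":"in_stock","currency":"USD"},`, 100))
}

// randomSample returns n pseudo-random bytes, a stand-in for high-entropy
// content such as UUIDs and hashed URLs. Fixed seed keeps runs repeatable.
func randomSample(n int) []byte {
	r := rand.New(rand.NewSource(1))
	b := make([]byte, n)
	for i := range b {
		b[i] = byte(r.Intn(256))
	}
	return b
}

func main() {
	fmt.Printf("repetitive: %.2fx\n", gzipRatio(repetitiveSample()))
	fmt.Printf("random:     %.2fx\n", gzipRatio(randomSample(4520)))
}
```

The real pool sits between those two extremes: structured JSON keys compress well, while the random descriptions, UUIDs and URLs barely compress at all.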
c6in.8xlarge client, autocannon -w 32 --pipelining 1, 500 connections, 20s.
| endpoint | RPS | server CPU | NIC |
|---|---|---|---|
| /read | 794,310 | 75.4% | 55% |
| /read-gz | 417,619 | 39.7% | 15% |
Half the throughput. The bytes got smaller and RPS fell off a cliff.
The server CPU also fell — 75% to 40%. That looks like gzip worked: less server CPU. It didn’t. The server CPU fell because fewer requests reached it. Look at where the work went: Content-Encoding: gzip means the client decompresses every response. autocannon was already the bottleneck at 794k — the server sat at 75% CPU with the NIC half-empty, waiting on the client. Add decompression to the client’s job and it chokes sooner. 418k.
gzip didn’t make the server slower. It made the client slower, and the client was the wall.
If the c6in client is the wall at 794k, then 794k was never the server’s ceiling. And entry 08 called c6in /read “790k, CPU-bound at 85%.” That was the same weak client. We measured the load generator, not the server.
Swap the client for a c8i.32xlarge — 128 vCPU, four times the cores. Ramp /read (raw), -w 64:
| connections | RPS | server CPU | NIC |
|---|---|---|---|
| 200 | 817,852 | 69.7% | 53% |
| 350 | 1,128,260 | 80.3% | 78% |
| 500 | 1,309,321 | 83.1% | 89% |
| 1000 | 1,326,524 | 83.4% | 90% |
c6in /read does 1.33M, not 790k. And the wall is the NIC at 90% — 1.33M × 4520B = 6.0 GB/s ≈ 50 Gbps. Same bandwidth wall c8i hit in entry 11. The “CPU-bound” story was a client-bound artifact. The 32-core box is bandwidth-bound on /read, same as the 128-core box, because the payload and the NIC are the same on both.
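The bandwidth arithmetic, as a small sketch. wireGBps and wireGbps are hypothetical helpers. They count response payload bytes only, so HTTP, TCP and IP headers are not included.

```go
package main

import "fmt"

// wireGBps converts a request rate and a per-response byte count into GB/s
// (decimal gigabytes, payload only).
func wireGBps(rps, bytesPerResponse float64) float64 {
	return rps * bytesPerResponse / 1e9
}

// wireGbps is the same figure in Gbit/s, the unit NIC line rates are quoted in.
func wireGbps(rps, bytesPerResponse float64) float64 {
	return wireGBps(rps, bytesPerResponse) * 8
}

func main() {
	const rps = 1.33e6    // /read ceiling from the ramp table
	const payload = 4520  // raw avg bytes from the startup log
	fmt.Printf("%.1f GB/s, %.1f Gbps\n", wireGBps(rps, payload), wireGbps(rps, payload))
}
```
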
Now the wall is bandwidth, and gzip halves the bandwidth. c8i client, -w 96, 1000 connections, reproduced:
| endpoint | RPS | server CPU | NIC | bound by |
|---|---|---|---|---|
| /read | 1,311,710 | 82.5% | 90.1% | NIC |
| /read-gz | 1,721,105 | 79.9% | 60.8% | CPU |
| /read-gz | 1,741,517 | 80.5% | 61.7% | CPU |
+32%. 1.31M → 1.73M. gzip halved the bytes, the NIC load dropped from 90% to 61%, the bandwidth wall is gone, and RPS climbs until it hits the next wall: CPU at 80%. gzip didn’t make the server faster. It moved the bottleneck from the NIC to the CPU, and the CPU wall sits higher.
(Same gzip, opposite of the first run. There the client was the wall and gzip overloaded it. Here the client has 128 cores to spare and the wall is bandwidth, which gzip relieves. Whether gzip helps or hurts is decided entirely by what the bottleneck is and whether the client can afford to decompress.)
The bytes halved. RPS rose 32%. The gap is two things.
The NIC was not the only thing near its limit. At the raw ceiling the NIC was at 90% but CPU was already at 82%. gzip removed the NIC wall; CPU was right behind it. You only get back the headroom between the old wall and the next one.
MTU 9001 means both payloads are one packet. The raw response is ~4.7KB, the gz response ~2.4KB, and the MSS is ~8949. Both fit in a single TCP segment. gzip removed zero packets and zero write syscalls. The per-request server saving is only the bytes copied through the write path:
per-request server CPU at the ceiling:
/read 0.825 × 32 cores / 1.31M = 20.1µs
/read-gz 0.80 × 32 cores / 1.73M = 14.8µs → 26% cheaper
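The same calculation as a sketch, fed with the unrounded values from the first /read and /read-gz rows of the table, so the last digit can differ slightly from the rounded lines above. cpuMicrosPerRequest is a hypothetical helper, and it assumes all 32 vCPUs count equally toward the utilization figure.

```go
package main

import "fmt"

// cpuMicrosPerRequest spreads the busy CPU time across the requests served:
// utilization × cores gives core-seconds per second, divided by requests per second.
func cpuMicrosPerRequest(cpuUtil float64, cores int, rps float64) float64 {
	return cpuUtil * float64(cores) / rps * 1e6
}

func main() {
	raw := cpuMicrosPerRequest(0.825, 32, 1311710) // /read row
	gz := cpuMicrosPerRequest(0.799, 32, 1721105)  // first /read-gz row
	fmt.Printf("/read    %.1f µs\n", raw)
	fmt.Printf("/read-gz %.1f µs\n", gz)
	fmt.Printf("saving   %.0f%%\n", 100*(1-gz/raw))
}
```
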
26% less CPU per request, from copying about 2KB fewer bytes, not from skipping a syscall. If the MTU were 1500, the raw response would be 4 packets and gz would be 2, and gzip would have halved the packet rate too. (Syscall count is a separate matter: it is set by how the server buffers its writes, not by the MTU.) At 9001 it can’t. The jumbo frames that came for free capped the size of this win.
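The packet counts can be checked with a few lines. This sketch assumes 52 bytes of per-segment overhead (20 IPv4 + 20 TCP + 12 for TCP timestamps), which is what turns MTU 9001 into the ~8949 MSS quoted above. The response sizes are the approximate ones from the text, and segments is a hypothetical helper.

```go
package main

import "fmt"

// segments returns how many TCP segments a response of the given size needs.
// Assumption: 52 bytes of IPv4 + TCP + timestamp-option overhead per segment.
func segments(responseBytes, mtu int) int {
	const overhead = 52
	mss := mtu - overhead
	return (responseBytes + mss - 1) / mss // ceiling division
}

func main() {
	const rawBytes, gzBytes = 4700, 2400 // approximate on-wire response sizes
	for _, mtu := range []int{1500, 9001} {
		fmt.Printf("MTU %d: raw %d segment(s), gz %d segment(s)\n",
			mtu, segments(rawBytes, mtu), segments(gzBytes, mtu))
	}
}
```
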
The hardware TLB counters come back <not supported>, so the mechanism can’t be measured directly, only the RPS that comes out the other end.