The Cache Line
In the memory hierarchy, we saw that accessing RAM is slow. To survive, the CPU uses a trick: it never fetches just what you ask for.
If you ask for 1 byte, the CPU will fetch 64 bytes (the typical line size on modern hardware; a few chips use 128).
This block is called a Cache Line. It is the atomic unit of memory transfer. You cannot negotiate this. You take the whole line, or you take nothing.
The Granularity Tax
This mechanism creates a stark physical reality: **Spatial Locality is not optional.**
If you use only 1 byte out of that 64-byte line, you have wasted 63 bytes of bandwidth. You have clogged the memory bus with garbage. Every wasted byte steals bandwidth from something that mattered. This is called "Poor Utilization."
This is why Linked Lists die. Each node might be 16 bytes, but it pulls in a fresh 64-byte line from a random location. You are paying for 4x the data you actually need.
1. Read Array (Stride 1): We need four 4-byte numbers. They sit side by side and fit in 1 line. Waste is low.
2. Read List (Stride 16): We need the same four numbers, but each lives in a 16-byte node at a random address. We load 4 full lines: 256 bytes fetched for 16 bytes of payload. Waste is massive.
The Betrayal
You wrote code that looked efficient. You allocated only what you needed. But the hardware betrayed you.
Because of the Cache Line, adding a single boolean to a class can trigger a performance cliff if it pushes the structure slightly over the 64-byte boundary, forcing a second fetch.
Can’t I request fewer than 64 bytes?
No. The memory subsystem does not understand your intent. It moves cache lines or nothing.
If cache lines cause waste, why not shrink them?
Smaller lines increase metadata overhead and bus traffic. You trade one kind of waste for another. There is no free configuration — only compromises.
Isn’t overfetching bad for power?
Yes. Cache lines trade energy for latency hiding. Mobile CPUs suffer the most from bad locality.
Why does adding a small field to a struct sometimes kill performance?
Because crossing a cache-line boundary doubles fetches. Your code didn’t get slower — the hardware bill doubled.
What programmers usually get wrong here
They optimize memory size instead of memory shape. The machine doesn’t care how clever your compression is if access stays scattered.
This works, until the CPU starts guessing. We are fetching blocks, but the CPU gets bored waiting for you to ask. It starts fetching things you didn't ask for yet.