The Factory (The Pipeline)
Memory taught us that distance hurts. Data structures taught us that shape matters.
Now we enter the heart of the machine: The CPU.
The CPU exists for one reason only: to lie about time.
The Assembly Line
A CPU is not a thinker. It is a factory assembly line.
It breaks every instruction into 5 distinct stations (Stages):
- Fetch (IF): Get the bytes from memory.
- Decode (ID): Figure out what they mean.
- Execute (EX): Do the math.
- Memory (MEM): Read/Write to RAM.
- Writeback (WB): Save result to register.
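The overlap is easiest to see drawn out. A minimal sketch in Python (the stage abbreviations are the classic five-stage RISC names from the list above; the chart layout itself is just for illustration):

```python
# Print a cycle-by-cycle chart of a classic 5-stage pipeline.
# Instruction i enters Fetch at cycle i and finishes Writeback at cycle i + 4.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(n_instructions):
    rows = []
    total_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = ["  "] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i occupies stage s at cycle i + s
        rows.append(" | ".join(f"{c:>3}" for c in row))
    return rows

for line in pipeline_chart(4):
    print(line)
```

Read each row left to right: every instruction enters Fetch one cycle after the previous one, so once the line is full, one instruction completes Writeback every cycle.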
If we did these one at a time, a 3GHz CPU would only run at 600MHz effective speed (one instruction finished per 5 cycles). This is sequential, unpipelined execution. It is stupid.
In sequential mode, we handle 1 instruction at a time. The factory is idle 80% of the time: four of the five stations sit empty while one works.
In Pipelined mode, all 5 stations work at once. We finish 1 instruction every cycle, even though each one takes 5 cycles to build.
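The cycle counts behind that claim can be sketched directly. This assumes a perfect 5-stage pipeline with no stalls, an idealization the rest of the chapter dismantles:

```python
# Cycles to retire N instructions on a 5-stage machine.

def one_at_a_time_cycles(n, depth=5):
    # Unpipelined: each instruction occupies the machine for `depth` cycles.
    return n * depth

def pipelined_cycles(n, depth=5):
    # Fill the pipeline once (depth - 1 cycles), then retire one per cycle.
    return (depth - 1) + n

n = 1_000_000
print(one_at_a_time_cycles(n))  # 5000000
print(pipelined_cycles(n))      # 1000004: ~5x the throughput
```

The fill cost (4 cycles) is a rounding error at any realistic instruction count, which is why the pipeline's depth disappears from throughput.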
Throughput vs Latency
This is the critical distinction in modern computing.
- Latency: How long it takes to execute one instruction (5 cycles).
- Throughput: How many instructions we finish per cycle (1 IPC).
A slow instruction doesn't necessarily slow the CPU down — as long as it keeps moving. The pipeline hides the cost.
But notice the metric: 1 IPC. This is not a speed limit. It is a marketing lie sustained by perfect conditions.
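To put numbers on the distinction, a quick sketch at an assumed 3GHz clock:

```python
CLOCK_HZ = 3_000_000_000    # assumed 3 GHz clock
DEPTH = 5                   # pipeline depth in stages

# Latency: one instruction's full journey through all 5 stages.
latency_ns = DEPTH / CLOCK_HZ * 1e9

# Throughput at the ideal 1 IPC: retired instructions per second.
throughput = CLOCK_HZ * 1

print(latency_ns)   # ~1.67 ns for any single instruction
print(throughput)   # 3 billion instructions retired per second
```

Latency depends on depth; throughput, under perfect conditions, does not.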
The Three Ways a Pipeline Breaks
When the pipeline stalls, the factory doesn't slow down — it goes idle. A stalled pipeline is silicon doing nothing at 3 billion cycles per second.
Pipelines fail in exactly three ways (Hazards):
- Data Hazards: You need a value that doesn't exist yet (Instruction B needs the output of A).
- Control Hazards: You don't know which instruction comes next (Branches/Ifs).
- Structural Hazards: Two instructions want the same hardware at the same time.
Modern CPUs eliminate the third, suffer the first, and panic over the second.
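The first hazard is mechanical enough to check in a few lines. A toy sketch (the instruction encoding and register names are invented for illustration; real detection happens in hardware, in the Decode stage, against all in-flight instructions):

```python
# A toy data-hazard check: does instruction B read a register that A writes?
# This is the read-after-write (RAW) case; without forwarding, B must stall
# until A's result is written back.

def raw_hazard(a, b):
    """True if b reads a register that a writes (read-after-write)."""
    return a["dest"] in b["srcs"]

add  = {"op": "add",  "dest": "r1", "srcs": {"r2", "r3"}}
mul  = {"op": "mul",  "dest": "r4", "srcs": {"r1", "r5"}}  # needs r1 from add
load = {"op": "load", "dest": "r6", "srcs": {"r7"}}        # independent

print(raw_hazard(add, mul))   # True: mul must stall or forward
print(raw_hazard(add, load))  # False: load can proceed immediately
```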
Why can't we just make the pipeline 100 stages deep?
We tried that. Intel's Pentium 4 ("NetBurst") stretched the pipeline to 31 stages in its later revisions. It failed.
Deeper pipelines allow higher clock speeds, but they make Hazards far more expensive. Mispredict a branch in a 31-stage pipeline and you flush up to 31 stages of in-flight work. The penalty outweighs the clock-speed gain.
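The trade-off can be put in rough numbers. A sketch, assuming a mispredict throws away roughly one pipeline's worth of work and that branches miss 5% of the time (both numbers are illustrative):

```python
# Wasted cycles per branch as the pipeline gets deeper.

def penalty_per_branch(depth, miss_rate):
    # Assumption: each mispredict flushes roughly `depth` cycles of work.
    return miss_rate * depth

# classic 5-stage RISC, early NetBurst (~20 stages), late NetBurst (31)
for depth in (5, 20, 31):
    print(depth, penalty_per_branch(depth, 0.05))
```

At 5 stages, a 5% miss rate costs a quarter of a cycle per branch; at 31 stages it costs over a cycle and a half, and in typical code roughly one instruction in five is a branch.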
Does the compiler know about this?
Yes. Compilers try to reorder your instructions to avoid Data Hazards (Static Scheduling). But the compiler cannot see Cache Misses. Only the hardware sees memory latency at runtime. That is why we need Chapter 21.
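A sketch of what static scheduling does: hoist an independent instruction between a dependent pair so the consumer sits farther from its producer. This greedy version handles only the read-after-write case and is far simpler than a real compiler's scheduler; the instruction tuples are invented for illustration:

```python
# Each instruction is (dest_register, set_of_source_registers).

def schedule(instrs):
    """Greedy static scheduling sketch: when instr i+1 depends on instr i,
    look ahead for an independent instruction to slot between them.
    (Incomplete: a real scheduler checks all dependence types and latencies.)"""
    out = list(instrs)
    i = 0
    while i + 1 < len(out):
        dest, _ = out[i]
        _, srcs = out[i + 1]
        if dest in srcs:  # RAW dependency: try to fill the gap
            for j in range(i + 2, len(out)):
                d_j, s_j = out[j]
                # Candidate must not read the pending result, nor clobber
                # anything the dependent instruction needs.
                if dest not in s_j and d_j not in srcs and d_j != dest:
                    out.insert(i + 1, out.pop(j))
                    break
        i += 1
    return out

prog = [
    ("r1", {"r2", "r3"}),  # r1 = r2 + r3
    ("r4", {"r1", "r5"}),  # r4 = r1 * r5   (depends on r1)
    ("r6", {"r7", "r8"}),  # r6 = r7 - r8   (independent)
]
print(schedule(prog))  # the independent r6 instruction moves between the pair
```

The reordering is legal because it changes no dependency, only the distance between producer and consumer. What it cannot do is predict that a load will miss the cache, which is the hardware's problem.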
Code is dependency. What happens when Instruction B needs Instruction A, but A hasn't finished?
At this point, a naïve CPU would wait. Modern CPUs refuse to wait.
Instead, they break your program apart, reorder it, and promise to put it back together before you notice.