The Other Cache
We have spent 30 chapters optimizing Data. We aligned our structs, we avoided linked lists, we respected the cache line.
But there is a second stream of data flowing into the CPU that is just as critical: The Code Itself.
Instructions are not ghosts. They are bytes. They live in RAM. They must be fetched, cached, and fed into the pipeline.
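You can see this for yourself. The sketch below (function names are my own) casts a function pointer to a byte pointer and prints the first few opcode bytes. Strictly speaking this cast is implementation-defined in C, but it works on mainstream platforms (x86/ARM with GCC or Clang), and the exact bytes you see depend on your compiler and architecture.

```c
#include <stdio.h>

/* An ordinary function: after compilation, its body is just
 * machine-code bytes sitting somewhere in memory. */
int square(int x) { return x * x; }

/* Dump the first 8 bytes of square()'s machine code.
 * Casting a function pointer to a data pointer is
 * implementation-defined, but fine on common platforms. */
void dump_code(void) {
    const unsigned char *code = (const unsigned char *)(void *)&square;
    for (int i = 0; i < 8; i++)
        printf("%02x ", code[i]);
    printf("\n");
}
```

Those bytes travel the same road as your data: RAM, caches, pipeline.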
The I-Cache
Every CPU has a split personality. Its first-level cache comes in two halves: a D-Cache (for your data) and an I-Cache (for your instructions).
The 32KB Cap: While the outer data caches (L2, L3) can span megabytes, the L1 I-Cache is typically tiny (often just 32KB). That holds only a few thousand instructions. Code size matters.
If your program is huge ("Code Bloat"), it won't fit.
When you call a function that hasn't run in a while, the CPU knows what it wants to do next — but the bytes aren't here yet. It stalls. It waits for RAM.
1. Linear Scan: Click "Step (Linear)". Watch the IPC (Instructions Per Cycle). It stays high because the cache predicts the next line.
2. Spaghetti Code: Click "Step (Jump)". You are calling random functions all over memory. The Cache Miss rate spikes. The IPC crashes.
The Cost of Abstraction
We love small functions. We love "Clean Code" strategies that break logic into tiny, reusable pieces.
But physically, a function call is a Jump.
If you have a loop that calls 5 different functions, and each function lives in a different page of memory, you are thrashing the I-Cache.
Why Loops are King: A tight loop (e.g., matrix multiplication) fits entirely inside the I-Cache. The CPU fetches the code once and re-runs it millions of times at full speed (High IPC).
Spaghetti code forces the CPU to constantly evict and fetch new blocks. It turns the front-end into a revolving door.
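Here is a minimal sketch of the contrast (all names are hypothetical). Both functions compute the same result; the "spaghetti" version jumps through three helper functions per element, while the tight version fuses the logic into one small loop body that the I-Cache fetches once. Whether you actually measure a difference depends on your compiler (which may inline the helpers anyway) and your hardware.

```c
#include <stdint.h>

/* "Spaghetti" version: the hot loop bounces between several small
 * functions. Each call is a jump to a (potentially) cold region
 * of the I-Cache. */
static int64_t scale(int64_t x)  { return x * 3; }
static int64_t bias(int64_t x)   { return x + 7; }
static int64_t clampv(int64_t x) { return x > 1000 ? 1000 : x; }

int64_t sum_spaghetti(const int64_t *v, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += clampv(bias(scale(v[i])));  /* three jumps per element */
    return acc;
}

/* Tight-loop version: the same logic fused into one body.
 * The whole loop fits in a few I-Cache lines. */
int64_t sum_tight(const int64_t *v, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        int64_t x = v[i] * 3 + 7;
        acc += x > 1000 ? 1000 : x;
    }
    return acc;
}
```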
Why is "Inline" faster?
Inlining copies the function body directly into the caller. This removes the Jump. It improves spatial locality (the code is right there). The CPU loves linear code.
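A small sketch of the difference, using GCC/Clang attributes to force the compiler's hand (MSVC spells these `__forceinline` and `__declspec(noinline)`; the function names are my own). The outlined version pays a call and return every iteration; the inlined version has the addition copied straight into the loop body.

```c
#include <stdint.h>

/* noinline forces a real call: a jump out, a jump back. */
__attribute__((noinline))
static int64_t add_outlined(int64_t a, int64_t b) { return a + b; }

/* static inline invites the compiler to paste the body in place. */
static inline int64_t add_inlined(int64_t a, int64_t b) { return a + b; }

int64_t sum_calls(const int64_t *v, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = add_outlined(acc, v[i]);  /* call + return each iteration */
    return acc;
}

int64_t sum_inline(const int64_t *v, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = add_inlined(acc, v[i]);   /* body copied here: no jump */
    return acc;
}
```

The trade-off: inlining everything bloats code size, which hurts the very I-Cache you are trying to help. Compilers weigh this constantly.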
Is Java/C# Bytecode affected?
JIT (Just-In-Time) compilers try to compile "Hot Paths" into contiguous blocks of machine code to solve exactly this problem. They try to linearize your spaghetti logic.
The CPU has the bytes.
We fetched them successfully. We avoided the stall.
But having the bytes is not the end of the story. The CPU doesn't know what they mean yet. It must Decode them.
And decoding has its own limit: Bandwidth. The CPU can only decode 4 or 5 instructions per cycle.
If your instructions are complex, or variable-length, decoding becomes the bottleneck.