Libraries Are Weapons
A junior engineer says: "I'll write this matrix multiplication myself. It's just three for-loops."
A senior engineer says: "I'll use BLAS."
Libraries like NumPy, PyTorch, and Eigen are not just "convenient." They are Pre-Compiled Hardware Knowledge.
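To make the gap concrete, here is a minimal sketch (assuming NumPy is installed; the matrix size 128 is arbitrary) that pits the junior engineer's three for-loops against the BLAS call hiding behind NumPy's `@` operator:

```python
import time

import numpy as np

n = 128
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def naive_matmul(A, B):
    """The three for-loops, one Python float operation at a time."""
    n, p, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for k in range(p):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
C_blas = A @ B                      # dispatches to the BLAS gemm routine
t_blas = time.perf_counter() - t0

assert np.allclose(C_naive, C_blas)  # same answer, wildly different cost
print(f"naive: {t_naive:.3f}s  BLAS: {t_blas:.6f}s")
```

The exact speedup depends on your BLAS build, but three to four orders of magnitude is typical even at this small size.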
The Naive vs The Tuned
When you write for i, j, k, your indices move linearly. But memory doesn't. One of the two matrices gets scanned down its columns, and as we saw in Ch40, column-wise traversal of row-major 2D data means stride misses: every load lands in a fresh cache line.
Libraries don't iterate linearly. They Tile. They break the matrix into small 32x32 blocks that fit perfectly in L1 Cache. They compute one block entirely before moving to the next.
1. Naive (Col-Major): Scan down columns. EVERY access is a Cache Miss. Hit Rate is near 0%.
2. BLAS (Tiled): Scan small squares. The first access to each cache line misses; the rest HIT. Hit Rate jumps to 75%+.
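The loop structure behind tiling can be sketched in a few lines. This is not how a real BLAS is written (the inner kernel there is hand-tuned assembly; here I cheat and use NumPy's `@` for the per-block product just to keep the sketch short), but the block decomposition is the real idea:

```python
import numpy as np

TILE = 32  # block edge chosen so three TILE x TILE blocks fit in L1

def tiled_matmul(A, B, tile=TILE):
    """Blocked matrix multiply; assumes square n x n with n % tile == 0."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i0 in range(0, n, tile):        # walk C one block at a time
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # All work happens inside three small, cache-resident blocks.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile]
                    @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Note that the arithmetic is identical to the naive version; only the order of the work changed, so that each block is finished while it is still hot in L1.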
Why NumPy Beats Python
NumPy is not fast because it is "C". It is fast because it is Blocked, Aligned, and Vectorized.
It respects the Cache Line. It respects Arithmetic Intensity.
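You can feel the cache line from Python. The sketch below (sizes are arbitrary; it assumes 8-byte floats and the common 64-byte cache line) sums the same number of doubles twice: once contiguously, once with a stride that puts exactly one element on each cache line, dragging 8x the memory through the cache for the same arithmetic:

```python
import time

import numpy as np

big = np.random.rand(16_000_000)     # 128 MB of float64

contig = big[:2_000_000]             # 2M doubles, back to back
strided = big[::8]                   # 2M doubles, one per 64-byte cache line

def best_of(f, reps=5):
    """Best-of-n wall-clock timing to dampen noise."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        f()
        times.append(time.perf_counter() - t0)
    return min(times)

t_contig = best_of(contig.sum)
t_strided = best_of(strided.sum)     # same FLOP count, 8x the memory traffic
print(f"contiguous: {t_contig * 1e3:.2f} ms, strided: {t_strided * 1e3:.2f} ms")
```

Same number of additions, identical dtype; only the memory access pattern differs, and the strided version loses badly.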
If you try to beat a mature library, you are fighting against 30 years of micro-architecture tuning. Unless your problem is very special, you will lose.
Can't the compiler (`-O3`) do this automatically?
Compilers try (Loop Blocking, Auto-Vectorization). But they are conservative. They cannot optimize if there is even a 1% chance of breaking your logic (e.g., Pointer Aliasing). Libraries are written by humans who know the intent and can force the optimization.
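A toy analogue of the aliasing hazard, in plain Python (no real compiler is involved; this just shows why reordering reads and writes changes the answer when source and destination overlap):

```python
a = [1, 1, 1, 1]

# Serial loop: each iteration reads the value the PREVIOUS iteration
# just wrote. The result is the prefix sums.
serial = a[:]
for i in range(1, len(serial)):
    serial[i] = serial[i] + serial[i - 1]

# "Vectorized" rewrite: all reads happen before any write (the list
# comprehension is fully evaluated first), as if source and destination
# could never alias.
vectorized = a[:]
vectorized[1:] = [vectorized[i] + vectorized[i - 1]
                  for i in range(1, len(vectorized))]

print(serial)      # [1, 2, 3, 4]
print(vectorized)  # [1, 2, 2, 2]  -- a different answer!
```

A compiler that cannot prove the two pointers never overlap must assume the serial semantics and leave the loop scalar. A library author, knowing the buffers are distinct, can vectorize anyway.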
Is this why Game Engines use C++?
Yes. Not just because "C++ is fast," but because C++ allows you to control memory layout. You can force data to be contiguous. You can align structs to 64 bytes. You can talk to the cache directly.
Libraries are crystallized performance.
Use them. Or be prepared to replace them fully.
We have reached the end of the machine.
It is time for the final truth.