Shape Betrays Speed
We know that contiguity is power. But simply packing variables together isn't enough. You have to pack them in the right shape.
The CPU is a bit of a diva. It doesn't like picking up a 4-byte integer if it sits at address 0x1001. It wants it at an address that is a multiple of 4, such as 0x1000 or 0x1004.
This preference is called Alignment.
The Cost of Air
If you define a structure like this:
struct Bad {
    char a; // 1 byte
    int b;  // 4 bytes
    char c; // 1 byte
};
You might think this takes 1 + 4 + 1 = 6 bytes.
You are wrong. On most platforms it takes 12 bytes.
Because the integer b demands a 4-byte aligned address, the compiler inserts 3 bytes of invisible "padding" after a. And so that b stays aligned in every element of an array of Bad, it rounds the struct size up to a multiple of 4 by adding 3 more bytes after c. You are paying for 50% air.
Tetris for Memory
This is Memory Tetris. By simply reordering your variables (biggest to smallest), you can eliminate most of the padding.
struct Good {
    int b;  // 4 bytes
    char a; // 1 byte
    char c; // 1 byte
    // Padding: 2 bytes
};
This takes 8 bytes. We saved 33% of the memory just by moving lines of code. In a game with 1 million particles, that is about 4 MB of saved RAM. More particles also fit into each cache line, which means significantly fewer cache misses.
Aligned data allows CPUs to load and operate on many values at once using SIMD instructions.
Why does the CPU care about alignment?
Because the CPU does not fetch memory one byte at a time. It pulls in whole cache lines, typically 64 bytes each, so addresses 0x00 to 0x3F arrive in one go. If your integer straddles the boundary (starts at 0x3E and ends at 0x41), the hardware has to perform two fetches and stitch the bytes together. This is complex and slow, so some CPUs (older ARM cores, for example) don't bother and raise a fault instead.
Can I turn off padding?
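Yes. In C, most compilers (GCC, Clang, MSVC) support #pragma pack(1). This forces the compiler to squeeze everything together. But be warned: accessing these unaligned variables can be slower, and on CPUs that fault on unaligned access (some mobile and embedded chips), reading a packed field through an ordinary pointer can crash your app instantly (SIGBUS).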
Do Java/Python classes have padding?
Yes. Objects usually have a header (8-16 bytes of overhead) followed by the fields. The JVM often reorders fields automatically to minimize waste, so you don't have to worry about it as much as in C or C++, where the declared order is the layout. (Rust is also free to reorder struct fields unless you ask for #[repr(C)].) But the memory overhead per object is massive compared to a raw struct.
What programmers usually get wrong here
Ignoring padding. You define a struct, measure `sizeof`, and are shocked that it is far larger than the sum of its parts (Bad above is twice as large). This invisible waste destroys cache performance.
Teaser: Two independent variables in the same cache line can slow each other down — even on different cores. This is called False Sharing.
This works — until we scale it. We have optimized the single struct. Now we must ask: how do we organize millions of them?