Probability (Hashes)
We have explored Order (Stack), Time (Queue), and Fragmentation (List). Now we embrace Chaos.
The Hash Table is the greatest trick in computer science. It promises O(1) access to any data, anywhere, instantly.
It achieves this by replacing structure with Math.
Speed with Betrayal
To find data instantly, we must know exactly where it is. We use a Hash Function to turn a Key (string) into an Address (integer).
This function is designed to be chaotic. It scatters data across memory to avoid collisions. It maximizes local unpredictability.
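The Key-to-Address trick can be sketched in a few lines. This uses FNV-1a, a real, widely used non-cryptographic hash; the bucket count is an assumption for illustration.

```python
def fnv1a(key: str) -> int:
    """Hash a string key into a chaotic 64-bit integer (FNV-1a)."""
    h = 0xcbf29ce484222325                                # FNV-1a 64-bit offset basis
    for byte in key.encode("utf-8"):
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF      # FNV prime, wrapped to 64 bits
    return h

NUM_BUCKETS = 16                                          # illustrative table size

def address(key: str) -> int:
    """Fold the 64-bit hash into a bucket index: Key -> Address."""
    return fnv1a(key) % NUM_BUCKETS

# Similar keys land in unrelated buckets -- local unpredictability at work:
print(address("user:1"), address("user:2"), address("user:3"))
```

Note that `"user:1"` and `"user:2"` differ by one byte, yet their addresses are unrelated. That scattering is not a defect; it is the design goal.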
The Hidden Variable: Load Factor
Hash Tables are not O(1) by default. They are O(1) only if sparsity is maintained.
The Load Factor (α) is the ratio: items / buckets.
As α approaches 1:
- Collisions explode
- Linear probing turns into linear scanning
- Cache behavior collapses
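The check most tables run on every insert is tiny. A minimal sketch, assuming a 0.75 threshold (a common default, e.g. Java's HashMap, not a law):

```python
MAX_LOAD = 0.75   # assumed threshold; real tables pick values near here

def load_factor(items: int, buckets: int) -> float:
    """alpha = items / buckets"""
    return items / buckets

def needs_resize(items: int, buckets: int) -> bool:
    """Resize before alpha crosses the threshold and collisions explode."""
    return load_factor(items, buckets) > MAX_LOAD

print(load_factor(6, 16))      # 0.375 -- sparse, collisions rare
print(needs_resize(13, 16))    # True  -- alpha = 0.8125, time to double
```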
Hash Tables are fast because they waste space. Space is the bribe paid to probability.
The Worst Case Is Real
Hash Tables have a worst case of O(N). This is not academic.
If an attacker controls the keys, they can force every value into the same bucket. Your O(1) table becomes a linked list.
This is why modern languages randomize hash seeds. They are defending against an attacker weaponizing probability against you.
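The attack is easy to see with a deliberately weak hash (sum of bytes, an assumption for this sketch). Anyone who knows the function can mint unlimited keys that share one bucket:

```python
NUM_BUCKETS = 64

def weak_hash(key: str) -> int:
    """A toy hash: sum of bytes mod bucket count. Trivially predictable."""
    return sum(key.encode()) % NUM_BUCKETS

# Craft two-letter keys whose byte sums are all identical (65 + 90 = 155):
evil_keys = [chr(65 + i) + chr(90 - i) for i in range(10)]   # "AZ", "BY", "CX", ...

buckets = {weak_hash(k) for k in evil_keys}
print(buckets)   # {27} -- every key lands in one bucket: O(1) becomes O(N)
```

A randomized seed mixed into the hash breaks this: the attacker can no longer predict which bucket a key will land in, so they cannot aim every key at the same one.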
Probing keeps data in the array (Cache Friendly). Chaining creates a Linked List (Cache Killer). Modern CPUs prefer Probing.
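A minimal open-addressing sketch shows why probing stays cache friendly: a collision means stepping to the next slot of the same array, not chasing a pointer. Keys only, no deletion or resizing; Python's built-in `hash` is itself seed-randomized for strings, which suits this chapter.

```python
NUM_SLOTS = 8
slots = [None] * NUM_SLOTS

def insert(key: str) -> None:
    i = hash(key) % NUM_SLOTS
    for _ in range(NUM_SLOTS):
        if slots[i] is None or slots[i] == key:
            slots[i] = key
            return
        i = (i + 1) % NUM_SLOTS        # neighbor slot: same array, likely same cache line
    raise RuntimeError("table full -- a real table would have resized earlier")

def contains(key: str) -> bool:
    i = hash(key) % NUM_SLOTS
    for _ in range(NUM_SLOTS):
        if slots[i] is None:
            return False               # an empty slot ends the probe chain
        if slots[i] == key:
            return True
        i = (i + 1) % NUM_SLOTS
    return False
```

The probe loop is a linear scan over contiguous memory, exactly the pattern the hardware is built to serve. A chaining table would replace that loop with a pointer dereference per collision.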
The Four Lenses
Let's audit the Hash Table.
- Cache Locality (The Trade): To get O(1) logical access, we accept O(1) random physical access. We jump into the dark. If we collide, Linear Probing saves us by scanning neighbors (spatial locality). Chaining kills us by chasing pointers (randomness on top of randomness).
- TLB Pressure (High): Because hashes scatter data, they touch many pages. Large hash tables thrash the TLB.
- Prefetching (Dead): The hardware prefetcher relies on patterns (1, 2, 3...). A specific goal of hashing is to destroy patterns. The watcher is blind.
- Coherence (Nightmare): Resizing a hash table requires moving every single item. This stalls the world. Writes are globally disruptive.
Why is Linear Probing faster if it searches?
Scanning 10 slots across one or two contiguous cache lines is faster than fetching 1 pointer from main RAM. Math: 10 * ~1 cycle < 1 * ~200 cycles.
Why do we call it "Weaponized Randomness"?
Because precise order is slow (requires sorting). Precise structure is slow (requires pointers). Randomness allows us to be lazy. We let probability do the work of organization.
Why not use Hash Tables everywhere?
Because they destroy predictability. Hash Tables are fast on average, but terrible for iteration in order, range queries, real-time systems, and latency-sensitive paths.
They trade guarantees for probability. Sometimes that trade is unacceptable.
The Cost of Resizing
When the table fills, we must double it. This is not just a copy; every key must be re-hashed, because its bucket index depends on the table size. It is a catastrophic event for latency. This is why "Game Loops" never use dynamic Hash Tables: they preallocate, because they cannot afford a resize mid-frame.
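A sketch of what "doubling" really costs, using chaining for brevity. Every item is touched, and every key is re-hashed, because `hash(key) % size` changes when `size` does:

```python
def resize(old_buckets: list) -> list:
    """Double the table: re-hash and move every key. O(N), all at once."""
    new_size = len(old_buckets) * 2
    new_buckets = [[] for _ in range(new_size)]
    for bucket in old_buckets:
        for key in bucket:                                  # touch every item...
            new_buckets[hash(key) % new_size].append(key)   # ...and re-hash it
    return new_buckets
```

Averaged over many inserts the cost is amortized O(1), but the unlucky insert that triggers the resize pays for all N items in one latency spike.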
We have traded order for speed. Probability is fast, but it cannot reason. To reason, we must impose structure again.