Waiting Lines (Queues)
Stacks optimize for execution. Queues optimize for coordination.
In the real world, things happen at different speeds. The Network Card receives packets faster than the CPU can process them. The GPU draws frames slower than the Game Engine produces them.
If we forced them to lock-step, the faster component would stall. We need to decouple time.
We do this with a Queue. It is a buffer that absorbs bursts.
The Turnstile
The most common hardware queue is the Ring Buffer. It is a fixed-size array with two pointers: a Head (Write) and a Tail (Read). They chase each other in a circle.
It seems perfect. O(1) insert. O(1) remove. No allocation.
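The mechanism fits in a few lines. This is a minimal single-threaded sketch (names and the capacity are illustrative); the capacity is a power of two so the wrap-around is a cheap bit-mask instead of a division:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP 8 /* must be a power of two */

typedef struct {
    uint32_t buf[CAP];
    uint32_t head; /* next slot to write (Producer) */
    uint32_t tail; /* next slot to read (Consumer) */
} ring_t;

/* One slot is kept empty so "full" and "empty" are distinguishable. */
static bool ring_push(ring_t *r, uint32_t v) {
    uint32_t next = (r->head + 1) & (CAP - 1);
    if (next == r->tail) return false; /* full: reject, no allocation */
    r->buf[r->head] = v;
    r->head = next;
    return true;
}

static bool ring_pop(ring_t *r, uint32_t *out) {
    if (r->tail == r->head) return false; /* empty */
    *out = r->buf[r->tail];
    r->tail = (r->tail + 1) & (CAP - 1);
    return true;
}
```

No node allocation, no pointer chasing: push and pop are an index bump and an array store.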
But there is a hidden war.
The Danger: If the `Head` and `Tail` variables land on the same Cache Line, the Producer core and the Consumer core will fight over that line. This is False Sharing.
The Four Lenses
Let's audit the Queue.
- Cache Locality (Mixed): The data is contiguous (good). But the pointers (Head/Tail) are often stored together in the same struct. This kills performance.
- TLB Pressure (Good): Like Stacks, Ring Buffers sit on one (or few) pages. Low pressure.
- Prefetching (Good): The access pattern is linear (circular). The hardware watcher loves it.
- Coherence (The Killer): This is the weak point. The Producer needs the line holding the indices to write `Head` and to read `Tail` (to check for full). The Consumer needs that same line to write `Tail` and read `Head` (to check for empty). They constantly invalidate each other's cache lines.
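The adjacency is easy to see in the naive layout. This hypothetical struct, with a 64-byte cache line assumed, puts both indices a few bytes apart, guaranteeing they share a line:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive layout: both indices side by side. On a typical 64-byte
   cache line, the Producer's head and the Consumer's tail land on
   the SAME line, so every update invalidates the other core's copy. */
typedef struct {
    uint32_t head; /* written by the Producer */
    uint32_t tail; /* written by the Consumer */
} naive_indices_t;
```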
Backpressure: Where Pain Goes
A Queue does not eliminate overload. It delays it.
When the Producer is faster than the Consumer for too long, the Queue fills. At that moment, someone must suffer.
- Drop data (network packets)
- Block the producer (threads stall)
- Allocate more memory (latency explosion)
This is not a software problem. It is a physics problem. Queues obey conservation of work.
FIFO is not fairness. FIFO is obedience. You give up reordering, prioritization, and locality to preserve arrival order. The machine does not care about order. Humans do.
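The policy choice can be made explicit in the API. A hedged sketch of the "drop" option (names are illustrative): a bounded offer that refuses work when full and counts what it refused, instead of hiding the overload behind an allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QCAP 4

typedef struct {
    uint32_t buf[QCAP];
    uint32_t count;
    uint64_t dropped; /* overload is counted, not hidden */
} bounded_q_t;

/* Drop policy: when the queue is full, refuse the item and record it.
   A blocking policy would stall the Producer here instead. */
static bool q_offer(bounded_q_t *q, uint32_t v) {
    if (q->count == QCAP) { q->dropped++; return false; }
    q->buf[q->count++] = v;
    return true;
}
```

Whichever branch you pick, someone pays: the caller (drop), the Producer thread (block), or the allocator (growth). The `dropped` counter at least makes the bill visible.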
How do we fix the Coherence war?
Padding. We deliberately insert useless bytes between the `Head` and `Tail` variables in memory, forcing them into separate cache lines. We waste 64 bytes of RAM to save thousands of cycles of bus traffic.
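In C, the padding can be spelled out directly. A sketch (64 bytes is an assumption; query the real line size on your target):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Padded layout: force head and tail onto separate 64-byte cache
   lines so the Producer and Consumer cores stop invalidating each
   other. The pad bytes are deliberately wasted RAM. */
typedef struct {
    uint32_t head;
    char     pad[64 - sizeof(uint32_t)]; /* useless bytes, on purpose */
    uint32_t tail;
} padded_indices_t;
```

C11 `_Alignas(64)` (or C++ `alignas`) on each field achieves the same separation without hand-counting pad bytes.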
Why do high-performance queues assume one producer and one consumer?
Because adding more producers or consumers multiplies coherence traffic. Each additional writer introduces contention, atomic operations, and cache line invalidations. SPSC (Single-Producer Single-Consumer) queues can be almost lock-free. MPMC queues are never free — they just move the cost.
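The SPSC claim can be made concrete. Because each index has exactly one writer, a C11 store-release / load-acquire pair is enough; no locks, no compare-and-swap. A sketch under that assumption (capacity and names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SCAP 8 /* power of two */

typedef struct {
    uint32_t buf[SCAP];
    _Atomic uint32_t head; /* written only by the Producer */
    _Atomic uint32_t tail; /* written only by the Consumer */
} spsc_t;

/* Producer side: reads tail, writes head. One writer per index
   means plain acquire/release ordering suffices — no CAS, no lock. */
static bool spsc_push(spsc_t *q, uint32_t v) {
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t next = (h + 1) & (SCAP - 1);
    if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false; /* full */
    q->buf[h] = v;
    atomic_store_explicit(&q->head, next, memory_order_release);
    return true;
}

/* Consumer side: the mirror image. */
static bool spsc_pop(spsc_t *q, uint32_t *out) {
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (t == atomic_load_explicit(&q->head, memory_order_acquire))
        return false; /* empty */
    *out = q->buf[t];
    atomic_store_explicit(&q->tail, (t + 1) & (SCAP - 1), memory_order_release);
    return true;
}
```

Add a second producer and this breaks immediately: two writers to `head` would need an atomic read-modify-write, and the coherence bill comes due.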
Why fixed size?
Because dynamic resizing requires memory allocation (Malloc), which is slow and unpredictable. In high-performance systems (Network Drivers, Audio Engines), we cannot afford to pause. We allocate once, upfront.
What programmers usually get wrong here
Using `std::queue` or `LinkedList` for high-speed buffering. Those involve pointers and allocations for every node. A simple array-based Ring Buffer is orders of magnitude faster because it respects the Cache Line.
This works — until we need to search. We have handled stacking and waiting. But what if we need to find something? Queues and Stacks are blind. To find, we must abandon linearity. And the moment we abandon linearity, the machine stops helping us.