The Great Lock-Free Queue Shootout (Part Four)
Cache and Memory Analysis: Challenging Performance Intuition
One of the most counterintuitive findings from our large-scale benchmark is how poorly traditional performance metrics predict actual throughput. The conventional wisdom that cache efficiency and instruction-level metrics directly correlate with performance breaks down completely once a benchmark run moves hundreds of megabytes of data.
The Cache Miss Paradox
Traditional performance analysis would suggest that L1D cache miss rates should strongly predict throughput performance. After all, cache misses are expensive, and avoiding them should lead to faster execution. Our results tell a very different story.
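To ground the methodology: below is a minimal sketch of how per-run L1D miss counts like these can be collected on Linux through the perf_event_open interface. The run_workload function is a hypothetical stand-in for one benchmark pass; it is not the harness used in this series.

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// Hypothetical stand-in for one benchmark pass (e.g., moving 10 million
// items through a queue). The real harness in this series is elsewhere.
static void run_workload() {
    static volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < 10'000'000; ++i) sink = sink + i;
}

int main() {
    // Request a hardware cache counter: L1D read misses, user space only.
    perf_event_attr attr{};
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;        // start stopped; enable only around the workload
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    // There is no glibc wrapper for perf_event_open, so invoke it directly.
    int fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run_workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != static_cast<ssize_t>(sizeof(misses))) {
        perror("read");
        close(fd);
        return 1;
    }
    printf("L1D load misses: %llu\n", static_cast<unsigned long long>(misses));
    close(fd);
    return 0;
}
```

The same counter is also available without instrumentation by running `perf stat -e L1-dcache-load-misses` over the benchmark binary.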
Looking across our implementations, L1D cache miss counts cluster remarkably tightly: between 1,300 and 2,100 misses per 10 million items, a spread of barely 1.6x. Yet the same implementations show throughput variations of 50x. FastQueue2 and Deaod SPSC, despite achieving vastly different performance levels across configurations, show nearly identical cache miss patterns. Even more puzzling, some of the slower implementations (like WeakQueue Michael & Scott) actually show slightly lower cache miss rates than the fastest performers.
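The arithmetic behind that contrast is easy to check. The sketch below only recomputes the spreads from the ranges quoted in this paragraph; it introduces no new measurements.

```cpp
#include <cstdio>

int main() {
    // Figures quoted above: L1D miss counts per 10 million items.
    const double items   = 10'000'000.0;
    const double miss_lo = 1300.0;
    const double miss_hi = 2100.0;

    // Spread in miss counts across implementations: barely 1.6x.
    printf("miss-count spread: %.2fx\n", miss_hi / miss_lo);

    // Items served per single L1D miss, worst and best case:
    // roughly one miss per 4,800 to 7,700 items.
    printf("items per miss: %.0f to %.0f\n", items / miss_hi, items / miss_lo);
    return 0;
}
```

One L1D miss per several thousand items leaves the miss counter far too sparse to account for a 50x throughput gap on its own.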