Low Latency Trading Insights

The Great Lock-Free Queue Shootout (Part Four)

Cache and Memory Analysis: Challenging Performance Intuition

Henrique Bucher
Sep 10, 2025

One of the most counterintuitive findings from our large-scale benchmark is how poorly traditional performance metrics predict actual throughput. The conventional wisdom that cache efficiency and instruction-level metrics directly correlate with performance breaks down completely when we operate at the scale of hundreds of megabytes of data transfer.

The Cache Miss Paradox

Traditional performance analysis would suggest that L1D cache miss rates should strongly predict throughput performance. After all, cache misses are expensive, and avoiding them should lead to faster execution. Our results tell a very different story.
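The post doesn't show its measurement harness, but counters like these are typically collected on Linux with `perf stat`; the benchmark binary name and flag below are placeholders, not the actual harness.

```shell
# Hypothetical invocation: count L1 data-cache loads and misses while
# pushing/popping 10 million items through one queue implementation.
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./spsc_bench --items 10000000
```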

Looking across our implementations, L1D cache miss counts cluster remarkably tightly between 1,300 and 2,100 misses per 10 million items, a spread of roughly 60%. Yet the same implementations show throughput variations of up to 50x. FastQueue2 and Deaod SPSC, despite achieving vastly different performance levels across configurations, show nearly identical cache miss patterns. Even more puzzling, some of the slower implementations (like WeakQueue Michael & Scott) actually show slightly lower cache miss rates than the fastest performers.
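One way to make this disconnect concrete is to compute the correlation between per-implementation miss counts and throughput. The numbers below are illustrative placeholders chosen to sit in the ranges described above, not the actual benchmark data:

```python
# Illustrative sketch only: the values below are PLACEHOLDERS in the
# reported ranges (misses ~1300-2100 per 10M items, throughput varying
# widely), NOT the actual benchmark measurements.
misses = [1350, 1500, 1700, 1900, 2050, 1400]   # L1D misses per 10M items
throughput = [620, 48, 13, 95, 210, 30]         # hypothetical throughput units

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(misses, throughput)
print(f"Pearson r = {r:.2f}")  # weak correlation despite a huge throughput spread
```

With a 50x throughput spread against a tightly clustered miss count, any such correlation comes out weak, which is exactly the disconnect the benchmark exposes.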
