Low Latency Trading Insights

The Great Lock-Free Queue Shootout (Part Two)

Building a Research-Grade Lock-Free Queue Benchmark

Henrique Bucher
Sep 05, 2025

*How we built the testing infrastructure that revealed 50x performance variations and overturned conventional wisdom*

Performance benchmarking in systems programming is deceptively difficult. Most benchmarks tell comfortable lies—they show incremental differences, validate existing assumptions, and reinforce conventional wisdom. But when you're trying to understand the true performance characteristics of lock-free data structures at production scale, comfortable lies become dangerous obstacles to optimization.

This article chronicles the development of the benchmark framework that revealed shocking truths about lock-free queue performance: 50x performance spreads between implementations, 5x variations within single implementations across compiler-architecture combinations, and the complete failure of traditional performance metrics to predict real-world throughput.

The framework didn't just measure performance—it exposed fundamental gaps in how we think about high-performance computing at scale.

Why Traditional Benchmarks Fail

The Micro-Benchmark Trap

Most lock-free queue benchmarks follow a predictable pattern: transfer 100,000 to 1,000,000 small items, measure the time, report throughput numbers. These benchmarks are fast to run, easy to understand, and completely misleading about production performance.

The fundamental problem is scale. At small scales (1MB of data transfer), you're primarily measuring algorithmic efficiency and micro-optimizations. Cache effects are minimal, memory bandwidth is abundant, and thermal constraints don't exist. The performance characteristics you observe bear little resemblance to what happens when you're transferring hundreds of megabytes of data in production systems.

Consider what's missing from typical micro-benchmarks:

**Memory Bandwidth Saturation:** With small data volumes, memory bandwidth is never the constraint. You're measuring how fast the queue can process operations, not how efficiently it can utilize the memory subsystem. At production scale, memory bandwidth becomes the overwhelming bottleneck, completely changing the performance landscape.

**Cache Hierarchy Effects:** Micro-benchmarks often fit entirely in L1 cache, masking the cache miss patterns that dominate real-world performance. When your working set exceeds cache capacity, different queue designs show radically different performance characteristics.

© 2025 Henrique Bucher