Low Latency Trading Insights

The SIMD Instructions Your Compiler Won’t Generate (And Why That’s Costing You)

Part 2 of the Assembly Reading Series for HFT Developers

Henrique Bucher
Oct 25, 2025

Three years ago, a quantitative developer at one of the big market makers showed me code that was processing trade signals. Clean C++ code. Well-tested. Deployed to production. Running on brand new Skylake servers with all the latest AVX-512 extensions. The code was processing one signal at a time. The hardware could process sixteen signals at a time. They were leaving nearly 94% of their computational capacity on the floor.

When I pulled up the assembly in Godbolt, the problem was obvious. Every operation was scalar. No vector instructions anywhere. The compiler saw opportunities to vectorize—the optimization report said so—but then decided not to, because of one std::vector method call that made the compiler think pointers might alias.

Cost of that decision? About 180 microseconds per batch of signals. In their world, 180 microseconds meant the difference between being first in queue and being fourth. Being fourth meant worse fills. Worse fills meant giving back edge to the market. Giving back edge meant their model’s theoretical Sharpe ratio of 2.1 became a realized Sharpe ratio of 1.4.

All because the compiler wouldn’t vectorize.

The Lie About Auto-Vectorization

Compilers can auto-vectorize your code. This is true. Modern compilers are incredibly sophisticated at recognizing patterns and generating SIMD instructions. What they don’t tell you is how often they fail. And when they fail, they fail silently. Your code still works. It just runs at a fraction of the speed it could: a quarter, sometimes a sixteenth.
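One way to make those silent failures loud is to ask the compiler for a vectorization report. Assuming GCC or Clang, the flags `-fopt-info-vec-missed` (GCC) and `-Rpass-missed=loop-vectorize` (Clang) print every loop the vectorizer gave up on, and their positive counterparts (`-fopt-info-vec`, `-Rpass=loop-vectorize`) confirm the loops that were widened. A toy reduction to try them against:

```cpp
#include <cstddef>

// Compile at -O3 with a vectorizer report, e.g.:
//   g++     -O3 -fopt-info-vec-missed          signals.cpp   (GCC)
//   clang++ -O3 -Rpass-missed=loop-vectorize   signals.cpp   (Clang)
// A "missed" diagnostic on this loop means you are running scalar.
float sum_signals(const float* s, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += s[i];  // A floating-point reduction: compilers will
                        // usually refuse to reorder it into vector
                        // lanes without -ffast-math (or the narrower
                        // -fassociative-math), since reassociation
                        // changes rounding.
    return total;
}
```

The point is not this particular loop; it is the habit. Wiring one of these report flags into the build for your hot translation units turns "the compiler quietly gave up" into a diagnostic you can see in CI.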
