How AMD’s Prefetch Instructions Saved My Ass in Production: A 15ns Journey
It was 3:47 AM on a Tuesday when my phone started buzzing. Our Asia trading desk was losing money. Not the slow bleed kind - the hemorrhaging kind where every microsecond of delay translates to someone else eating your lunch. The latency graphs showed we’d gone from a comfortable 8-microsecond round trip to nearly 11 microseconds. In high-frequency trading, that’s the difference between printing money and burning it.
The kicker? Nothing had changed. Same code that had been running beautifully for three months. Same configuration. Same network path. The only difference was that we’d moved our primary trading system from an Intel Xeon Gold 6248R to a brand new AMD EPYC 7763. On paper, the AMD chip should have been faster - more cores, more cache, better memory bandwidth. In reality, we were getting our asses handed to us by firms still running five-year-old Intel boxes.
Let me walk you through the most expensive lesson I’ve learned about CPU-specific optimizations and why treating all x86-64 processors the same is a fantastic way to lose money.
The Crime Scene
Our order book processing looked innocent enough. We had a nice, clean data structure for tracking price levels, optimized to death over the years. The hot path was a simple loop that updated prices as market data arrived:
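As a minimal sketch of that kind of loop, assuming a hypothetical flat array of price levels (the names PriceLevel, LevelUpdate, and apply_updates are made up for illustration and are not the production code):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical price level; the fields are illustrative assumptions. */
typedef struct {
    int64_t price;     /* price in ticks */
    int64_t quantity;  /* resting quantity at this price */
} PriceLevel;

/* Hypothetical market data update targeting one slot in the book. */
typedef struct {
    uint32_t index;    /* which level to overwrite */
    int64_t  price;
    int64_t  quantity;
} LevelUpdate;

/* Apply a batch of updates in arrival order; returns how many were applied. */
static size_t apply_updates(PriceLevel *book, size_t depth,
                            const LevelUpdate *updates, size_t count)
{
    size_t applied = 0;
    for (size_t i = 0; i < count; ++i) {
        const LevelUpdate *u = &updates[i];
        if (u->index >= depth) {
            continue;  /* ignore updates that fall outside the book */
        }
        book[u->index].price = u->price;
        book[u->index].quantity = u->quantity;
        ++applied;
    }
    return applied;
}
```

Nothing in a loop like this is vendor-specific, which is the point: each update lands on whatever slot the feed names, so the time per iteration depends mostly on how fast the memory system can deliver that slot.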


