Low Latency Trading Insights

Intel's Instruction Parallelism is 3.67× Better Than AMD's. It Still Lost.

What CPU Cache Really Reveals About Intel vs AMD Performance (Part 2)

Henrique Bucher
Jan 26, 2026

Parts 1 and 2 covered the foundation and the core tests. Part 1 showed that memory dominates and how latency was measured. Part 2 revealed the cache boundaries (AMD’s 2.4× cliff), Intel’s 3.67× ILP advantage, a failed associativity test, and the devastating false sharing penalty: 13× on Intel versus 6× on AMD.

The key findings so far:

  • Test 1: Memory access patterns matter more than arithmetic (300 cycles vs 1 cycle)

  • Test 2: Pure latency is similar (Intel 1.49, AMD 1.57 cycles/element)

  • Test 3: AMD has steeper cache cliffs but better L3 performance

  • Test 4: Intel’s instruction parallelism is 80% better than AMD’s

  • Test 5: Failed to measure associativity—prefetchers are too smart

  • Test 6: False sharing penalty: Intel 13.25× vs AMD 6.40× (the most critical difference; see the sketch after this list)
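
For readers who haven’t seen false sharing in code, here is a minimal sketch of the effect (my own illustration, not the article’s benchmark): two threads increment counters that share a cache line, then counters padded onto separate lines. The padded version avoids the cache-line ping-pong behind the 13×/6× penalties above.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two counters packed into one cache line: writes from different
// cores force the line to bounce between them (false sharing).
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same counters padded so each sits on its own 64-byte line.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double time_ms(Counters& c) {
    constexpr long iters = 20'000'000;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    Packed packed;
    Padded padded;
    std::printf("shared line: %.1f ms\n", time_ms(packed));
    std::printf("padded     : %.1f ms\n", time_ms(padded));
}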

Now for the final test, the verdict, and what all this means for your code.

Test 7: Hardware is Weird

The final test demonstrated something more philosophical: hardware is too complex to predict.

I accessed fields in a struct in different orders:

Code Block 45

Same number of operations, just different access orders.
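
As a minimal sketch of the kind of test described (the struct layout and field names are my own assumptions, not necessarily the article’s code): two loops that read the same fields and do the same arithmetic, differing only in the order the fields are touched.

#include <cstddef>
#include <cstdint>

// Hypothetical record; the fields are illustrative, not the article's struct.
struct Record {
    std::uint64_t id;
    double        price;
    std::uint32_t quantity;
    std::uint32_t flags;
};

// Variant A: read fields in declaration order (increasing offsets).
std::uint64_t sum_forward(const Record* r, std::size_t n) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += r[i].id + static_cast<std::uint64_t>(r[i].price)
             + r[i].quantity + r[i].flags;
    return acc;
}

// Variant B: same fields, same additions, read back to front.
std::uint64_t sum_reverse(const Record* r, std::size_t n) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += r[i].flags + r[i].quantity
             + static_cast<std::uint64_t>(r[i].price) + r[i].id;
    return acc;
}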

Assembly Verification: Identical Structure, Different Offsets

Do these patterns compile differently? Let’s examine the assembly:
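
One way to reproduce this check (a suggestion, not necessarily the author’s setup) is to compile both variants above with, say, g++ -O2 -S and diff the generated assembly, or view them side by side in Compiler Explorer. The expectation, per the heading above, is an identical instruction sequence with only the memory-operand offsets differing between the two field orders.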
