Now that we have covered our bases in Part One and Part Two, we have sufficient background to make sense of ChatGPT’s conversation.
Recapitulating the story: the CEO asks ChatGPT for a string reversal algorithm written in C++. Once ChatGPT presents its answer - which is not shown - the CEO gives his “optimized” solution, upon which ChatGPT responds with
I used index calculations to reverse manually. Your XOR swap reversal is more efficient and is a clever way to reverse.
That is, ChatGPT is comparing these two swap implementations:
void swapTrivial( char& c1, char& c2 ) {
    char ch = c1;
    c1 = c2;
    c2 = ch;
}

void swapXor( char& c1, char& c2 ) {
    c1 ^= c2;
    c2 ^= c1;
    c1 ^= c2;
}
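As a quick sanity check (a minimal snippet of my own, not part of the original conversation), both functions swap two distinct bytes in exactly the same way:

#include <cassert>

// Assumes the swapTrivial and swapXor definitions above are in scope.
int main() {
    char a = 'A', b = 'B';
    swapTrivial(a, b);
    assert(a == 'B' && b == 'A');
    swapXor(a, b);                  // swap back
    assert(a == 'A' && b == 'B');
}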
The first function, swapTrivial, is the one typically used in standard library implementations. For example, on Ubuntu 24.04, under /usr/include/c++/13/bits/move.h you may find this section:
template< typename _Tp>
  typename enable_if<__and_<__not_<__is_tuple_like<_Tp>>,
                            is_move_constructible<_Tp>,
                            is_move_assignable<_Tp>>::value>::type
  swap(_Tp& __a, _Tp& __b)
  noexcept(__and_<is_nothrow_move_constructible<_Tp>,
                  is_nothrow_move_assignable<_Tp>>::value)
  {
    // concept requirements
    __glibcxx_function_requires(_SGIAssignableConcept<_Tp>)
    _Tp __tmp = _GLIBCXX_MOVE(__a);
    __a = _GLIBCXX_MOVE(__b);
    __b = _GLIBCXX_MOVE(__tmp);
  }
In other words, what this is saying is: only use this implementation of std::swap if the type _Tp is not tuple-like, and if an object of type _Tp can have another _Tp’s resources moved into it by assignment as well as through one of its constructors. Removing all that C++ yada yada, we can see the trivial swap.
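Stripped of that machinery, the core of this std::swap is essentially the following sketch (my own simplified paraphrase, not the literal libstdc++ source):

#include <utility>   // std::move

template <typename T>
void swap( T& a, T& b ) {
    T tmp = std::move(a);   // move a's value into a temporary
    a = std::move(b);       // move b's value into a
    b = std::move(tmp);     // move the old a (held in tmp) into b
}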
I found that assertion from ChatGPT very strange to hear, and that is concerning. ChatGPT is promoted as an authoritative voice in certain areas of knowledge, and people - even very senior technical folks - do take it into consideration, sometimes more than a human expert. If ChatGPT is just agreeing with the user - and learning from that - then we have a trust problem.
I have a hunch from practice that an XOR swap would only make sense on very primitive machines. On modern machines, it would be just a plain waste of time. But I don’t like to speculate, so I compiled the two functions and disassembled the resulting binaries.
Looking at the generated assembly, we can see that a straightforward compilation of these functions simply mirrors the source code, which by itself does not settle which one is better.
swapTrivial(char&, char&):
    movzbl (%rdi), %eax
    movzbl (%rsi), %ecx
    movb   %cl, (%rdi)
    movb   %al, (%rsi)
    retq

swapXor(char&, char&):
    movzbl (%rdi), %eax
    xorb   (%rsi), %al      ;; XOR
    movb   %al, (%rdi)
    xorb   (%rsi), %al      ;; XOR
    movb   %al, (%rsi)
    xorb   %al, (%rdi)      ;; XOR
    retq
Machine Code Analysis
We could analyze these functions further using tools like OSACA or llvm-mca, below, to gather more evidence as to which is better. From inspection it already looks like the XOR method will waste more CPU cycles, but that is beside the point.
===============================================================
#uOps  Latency  RTR   Load  Store  SideEff  Instructions
===============================================================
swapTrivial:
  1      5      0.50    *                   movzbl (%rdi), %eax
  1      5      0.50    *                   movzbl (%rsi), %ecx
  1      1      0.50          *             movb   %cl, (%rdi)
  1      1      0.50          *             movb   %al, (%rsi)
Explanation of Columns
The table from the LLVM Machine Code Analyzer (llvm-mca) output provides detailed performance metrics for a sequence of instructions in the swapTrivial function, which swaps the byte values at memory locations pointed to by rdi and rsi. It includes columns for #uOps, Latency, RTR (Reciprocal Throughput), Load, Store, and SideEff (Side Effects), and lists four instructions.
The first column means micro-operations. uops are the smallest executable units within a microprocessor. Complex instructions from a processor's instruction set architecture (ISA), like x86 or ARM, are often broken down into simpler uops by the processor's front-end (e.g., during decoding). This decomposition allows modern superscalar, out-of-order processors to optimize execution. uops are easier to schedule, reorder, and execute in parallel across multiple execution units, improving performance and efficiency. A single x86 instruction like ADD EAX, EBX might be translated into one or more uops, such as fetching operands, performing the addition, and storing the result. Complex instructions (e.g., MUL or memory operations) may generate multiple uops.
Latency is the delay (measured in clock cycles or time) between when an operation starts and when its result is available for use by subsequent operations. For uops, it’s the time from when a uop is dispatched to an execution unit until its output is ready. Latency is expressed in clock cycles (e.g., an ADD uop has a latency of 1 cycle).
RTR means Reciprocal Throughput. Throughput refers to the rate at which a processor can execute micro-operations (uops) or instructions over time, typically measured in uops or instructions per clock cycle (IPC). It represents how many operations the CPU can complete in parallel or in a given period, leveraging features like pipelining, superscalar execution, and out-of-order processing.
Reciprocal Throughput (RTR) is then the time (in cycles) between starting successive uops of the same type. For example, if a uop has a throughput of 1 uop per cycle, the reciprocal throughput is 1 cycle. For a division uop with a throughput of 1 uop every 10 cycles, the reciprocal throughput is 10 cycles.
The Load column in the llvm-mca output indicates whether an instruction may perform a memory load operation, meaning it reads data from memory (e.g., RAM or cache) into a register. If the column shows a marker (e.g., *), the instruction is identified as potentially accessing memory to load data. This is distinct from instructions that only operate on registers or perform stores.
The Store column indicates whether an instruction may perform a memory store operation, meaning it writes data from a register (or immediate value) to a memory location (e.g., RAM or cache).
swapTrivial Analysis
Back to the machine code, in the context of x86-64 architecture, the RDI and RSI registers are general-purpose registers used primarily for passing function arguments according to the System V AMD64 Application Binary Interface (ABI), which is the calling convention used by most Unix-like operating systems (e.g., Linux, macOS).
In the x86-64 System V AMD64 ABI, RDI is used to pass the first integer or pointer argument to a function, while RSI is used to pass the second integer or pointer argument to a function. In x86-64, RDI and RSI are 64-bit registers, so they hold 64-bit memory addresses. RAX and RCX in the table are used for temporary storage (eax and ecx hold loaded values, al and cl for stores), as they are caller-saved registers in the ABI and commonly used for intermediate results. Unlike RDI and RSI, which hold addresses, RAX and RCX hold the data being swapped.
In the llvm-mca output, RDI holds a pointer to the first memory address involved in the swap operation. For example, in movzbl (%rdi), %eax, the instruction loads a byte from the memory address stored in RDI into eax. Therefore, RDI points to one of the memory locations being swapped, and RSI holds a pointer to the second memory address involved in the swap.
The swapTrivial function swaps the byte values at two memory locations:
- movzbl (%rdi), %eax: loads a byte from the address in RDI into eax (zero-extending to 32 bits).
- movzbl (%rsi), %ecx: loads a byte from the address in RSI into ecx (zero-extending to 32 bits).
- movb %cl, (%rdi): stores the byte in cl (the low 8 bits of ecx) to the address in RDI.
- movb %al, (%rsi): stores the byte in al (the low 8 bits of eax) to the address in RSI.
The use of (%rdi) and (%rsi) indicates indirect addressing, where the registers hold memory addresses.
As for performance, the movzbl instructions use RDI and RSI for load operations, each generating 1 uop and taking 5 cycles of latency (likely due to L1 cache access plus zero-extension). The movb instructions use RDI and RSI for store operations, each generating 1 uop with 1 cycle of latency to the store buffer. The RTR of 0.50 cycles suggests the CPU can issue 2 loads or stores per cycle, but dependencies (e.g., eax needed for movb %al, (%rsi)) may serialize execution.
Note that llvm-mca assumes L1 cache hits for all memory accesses (loads and stores), which simplifies latency and throughput estimates but may not reflect real-world scenarios with cache misses.
swapXor Analysis
===============================================================
#uOps  Latency  RTR   Load  Store  SideEff  Instructions
===============================================================
swapXor:
  1      5      0.50    *                   movzbl (%rdi), %eax
  2      6      0.50    *                   xorb   (%rsi), %al
  1      1      0.50          *             movb   %al, (%rdi)
  2      6      0.50    *                   xorb   (%rsi), %al
  1      1      0.50          *             movb   %al, (%rsi)
  3      7      0.50    *     *             xorb   %al, (%rdi)
The swapXor function swaps the byte values at memory locations pointed to by RDI and RSI using the XOR swap algorithm. This algorithm uses XOR operations to swap two values without a temporary variable, as follows:
Let A = [RDI] and B = [RSI]. The steps are:
A = A ^ B, B = B ^ A, A = A ^ B.
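To see why the three XORs actually swap the values, here is a minimal worked example (my own sketch, with arbitrary byte values), relying on the identities X ^ X = 0 and X ^ 0 = X:

#include <cassert>

int main() {
    unsigned char A = 0x5A, B = 0xC3;   // arbitrary example values
    unsigned char a = A, b = B;
    a ^= b;   // a == A ^ B
    b ^= a;   // b == B ^ (A ^ B) == A, because B ^ B == 0
    a ^= b;   // a == (A ^ B) ^ A == B, because A ^ A == 0
    assert(a == B && b == A);
}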
In the code:
- movzbl (%rdi), %eax: load [RDI] into eax.
- xorb (%rsi), %al: XOR [RSI] with al and store the result back in al.
- movb %al, (%rdi): store al to [RDI].
- xorb (%rsi), %al: XOR [RSI] with al again.
- movb %al, (%rsi): store al to [RSI].
- xorb %al, (%rdi): XOR al with [RDI] and store the result back to [RDI].
Side-by-Side Comparison
Compared to swapTrivial, which has 4 instructions - two loads (movzbl) and two stores (movb) - swapXor has 6: one load (movzbl), three XORs with memory operands (xorb), and two stores (movb). swapXor is more complex due to the arithmetic (XOR) and the combined load/store operations (e.g., xorb %al, (%rdi)), which increase decoding and execution cost. swapTrivial is therefore simpler, with fewer instructions (4 vs. 6), reducing front-end pressure (e.g., on the decoder or uop cache) and potentially improving performance in the instruction fetch and decode stages.
In terms of micro-ops, each movzbl and movb in swapTrivial generates 1 uop, indicating efficient decoding, for a total of 4 uops. Stores typically require 2 uops (address + data) on Intel CPUs (e.g., Skylake), but a 1-uop store is common on AMD Zen-like architectures with fused store uops. The low uop count (4) minimizes pressure on the CPU’s dispatch/retire width (e.g., 4–6 uops per cycle on Skylake).
On swapXor, there are 10 uops in total:
- movzbl: 1 uop (load with zero-extension).
- xorb (%rsi), %al: 2 uops each (load + XOR), issued twice.
- movb: 1 uop each (store, possibly simplified), issued twice.
- xorb %al, (%rdi): 3 uops (load + XOR + store).
Higher uop count (10) increases pressure on the decoder, uop cache, and dispatch/retire pipeline, potentially causing bottlenecks if the CPU’s uop throughput is limited (e.g., Skylake’s 4–6 uops per cycle).
Therefore, swapTrivial uses 60% fewer uops (4 vs. 10), reducing front-end and execution unit contention. swapXor’s complex xorb %al, (%rdi) (3 uops) is particularly costly, stressing the pipeline.
Critical Paths
The critical path for swapTrivial consists of two load-store pairs: movzbl (%rdi), %eax → movb %al, (%rsi) (5 + 1 = 6 cycles), and movzbl (%rsi), %ecx → movb %cl, (%rdi) (also 6 cycles).
- movzbl: 5 cycles (load from [RDI] or [RSI]).
- movb: 1 cycle (store to the store buffer).
If [RDI] and [RSI] don’t alias, the pairs can execute in parallel, yielding a total latency of ~6 cycles (assuming out-of-order execution). If aliasing occurs, execution may serialize, increasing latency to ~12 cycles (6 + 6).
Critical Path for swapXor
There is a tight dependency chain through al:
- movzbl (%rdi), %eax: 5 cycles.
- xorb (%rsi), %al: 6 cycles (depends on al).
- movb %al, (%rdi): 1 cycle (depends on al).
- xorb (%rsi), %al: 6 cycles (depends on al).
- movb %al, (%rsi): 1 cycle (depends on al).
- xorb %al, (%rdi): 7 cycles (depends on al).
The total latency is 5 + 6 + 1 + 6 + 1 + 7 = 26 cycles, assuming serial execution due to al dependencies.
Out-of-order execution may overlap some operations, but the repeated modification of al limits parallelism: every instruction except the first depends on the previous result, creating a serial chain.
Therefore, swapTrivial has a much shorter critical path (~6–12 cycles vs. 26 cycles), making it 2–4x faster in terms of latency. swapXor’s XOR operations and serial dependencies significantly increase execution time.
Verdict: swapXor vs. swapTrivial
swapTrivial is significantly more efficient than swapXor:
- Fewer instructions and uops (4 vs. 6 instructions, 4 vs. 10 uops), reducing front-end and execution unit pressure.
- Lower latency (~6–12 cycles vs. ~26 cycles, a 2–4x speedup), thanks to simpler operations and potential parallelism.
- Lower resource contention, since it uses only load/store ports, avoiding ALU contention and fitting within CPU dispatch limits.
- Robustness: it handles the case where [RDI] == [RSI], unlike swapXor, which fails due to XOR properties.
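To make that last point concrete, here is a minimal sketch (my own, not from the original post) of what happens when both references name the same byte; the swapXor definition from earlier is repeated so the snippet is self-contained:

#include <cassert>

void swapXor( char& c1, char& c2 ) {
    c1 ^= c2;
    c2 ^= c1;
    c1 ^= c2;
}

int main() {
    char a = 'x';
    swapXor(a, a);        // both references alias the same byte
    // The first XOR zeroes the byte (x ^ x == 0), and zero stays zero
    // through the remaining XORs, so the original value is destroyed.
    assert(a == 0);
}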
Compiler Smart Optimization
But none of this matters if the compiler understands what the routine is trying to do. Can the compiler see that both routines accomplish the same thing? Let’s try to “use” those swap functions in a larger function.
So I took one of the previous reversal functions and templated it on both the size and the swap function - instead of only on the function, as before. Then I explicitly instantiated it for sizes 3 and 8, for both swap functions, and disassembled the result.
#include <cstddef>   // std::size_t
#include <cstdint>

template< std::size_t N, void FN(char&,char&) >
void swapit( char str[N] ) {
    for ( std::size_t j = 0; j < N/2; ++j ) {
        FN(str[j], str[N-j-1]);
    }
}

template void swapit<8,swapTrivial>(char str[8]);
template void swapit<8,swapXor>(char str[8]);
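As a quick sanity check (my own snippet, assuming the swap functions and the swapit template above are in the same translation unit), both instantiations reverse an 8-byte buffer in place:

#include <cassert>
#include <cstring>

int main() {
    char buf[9] = "abcdefgh";              // 8 characters plus the terminating NUL
    swapit<8, swapTrivial>(buf);
    assert(std::strcmp(buf, "hgfedcba") == 0);
    swapit<8, swapXor>(buf);               // reversing again restores the original
    assert(std::strcmp(buf, "abcdefgh") == 0);
}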
GCC Implementation
Looking at the generated assembly, we notice that both implementations are identical - apart from a minor difference in the order of the final moves, which is insignificant. It is clear that GCC understood the intent of both swap functions and generated SIMD instructions to perform the operation.
void swapit<8ul, &swapTrivial(char&, char&)>(char*):
    vmovd   (%rdi), %xmm0
    vmovd   4(%rdi), %xmm1
    vmovdqa .LC0(%rip), %xmm2
    vpshufb %xmm2, %xmm1, %xmm1
    vpshufb %xmm2, %xmm0, %xmm0
    vmovd   %xmm1, (%rdi)
    vmovd   %xmm0, 4(%rdi)
    ret

void swapit<8ul, &swapXor(char&, char&)>(char*):
    vmovd   (%rdi), %xmm1
    vmovd   4(%rdi), %xmm0
    vmovdqa .LC0(%rip), %xmm2
    vpshufb %xmm2, %xmm1, %xmm1
    vpshufb %xmm2, %xmm0, %xmm0
    vmovd   %xmm1, 4(%rdi)
    vmovd   %xmm0, (%rdi)
    ret
Both functions take a pointer in %rdi to the memory location containing the string to be reversed (8 bytes total). They use SSE/AVX instructions: loading the two 32-bit halves into %xmm0 and %xmm1 with vmovd, applying a byte-shuffle mask (loaded from .LC0 into %xmm2) via vpshufb to reverse each integer’s byte order (e.g., converting little-endian to big-endian), and storing the results back to memory in swapped positions using vmovd.
Breakdown
vmovd (%rdi), %xmm0
Loads a 32-bit value (4 bytes) from the memory address in %rdi into the lower 32 bits of the %xmm0 register (128-bit SSE register). The upper bits of %xmm0 are zeroed. Assumes %rdi points to a valid, aligned memory address. Misalignment could increase latency or cause a fault.
vmovd 4(%rdi), %xmm1
Loads a 32-bit value from the memory address %rdi + 4 into the lower 32 bits of %xmm1, zeroing the upper bits.
The offset 4(%rdi) indicates the next 32-bit integer in memory.
vmovdqa .LC0(%rip), %xmm2
Loads a predefined byte-shuffle mask for the vpshufb instruction, which will reorder bytes in %xmm0 and %xmm1.
The mask at .LC0 is [3, 2, 1, 0, ...] to reverse the byte order of a 32-bit value (e.g., for endianness conversion from little-endian to big-endian).
vpshufb %xmm2, %xmm1, %xmm1
Reorders the bytes of the 32-bit value in %xmm1 (e.g., reverses byte order for endianness conversion).
Only the lower 32 bits of %xmm1 are relevant, as the upper bits are zeroed from vmovd.
vpshufb %xmm2, %xmm0, %xmm0
Same as above, but shuffles the bytes of %xmm0 using the mask in %xmm2, storing the result in %xmm0.
Identical operation to the previous instruction, applied to %xmm0.
vmovd %xmm1, (%rdi)
Writes the byte-swapped second integer back to the original memory location.
Overwrites the first 32-bit integer in memory.
vmovd %xmm0, 4(%rdi)
Writes the byte-swapped first integer to the second integer’s memory location.
Completes the swap by placing the first integer’s byte-swapped value in the second slot.
Summary of GCC’s implementation
The function generates approximately 7-8 micro-operations (uops), including 3 loads, 2 shuffles, 2 stores, and 1-2 for the return, with a total latency of about 7-9 cycles (assuming L1 cache hits) due to parallel processing of the two integers.
The critical path involves load (4-5 cycles), shuffle (1 cycle), and store (1 cycle) per integer, with throughput limited by memory bandwidth to about 4-5 cycles per call on modern CPUs like Intel Skylake or AMD Zen.
From a microarchitectural perspective, the generated swapit code leverages out-of-order execution to dispatch uops as soon as their dependencies (e.g., loaded data or the shuffle mask) are available, with the retirement stage ensuring in-order completion.
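In scalar C++ terms, what GCC generated is roughly equivalent to the following sketch (my own paraphrase, using the GCC/Clang builtin __builtin_bswap32; the real code does the same work in SSE registers):

#include <cstdint>
#include <cstring>

void reverse8_like_gcc( char* str ) {
    std::uint32_t lo, hi;
    std::memcpy(&lo, str,     4);     // vmovd (%rdi), %xmm0
    std::memcpy(&hi, str + 4, 4);     // vmovd 4(%rdi), %xmm1
    lo = __builtin_bswap32(lo);       // vpshufb: reverse the bytes of each half
    hi = __builtin_bswap32(hi);
    std::memcpy(str,     &hi, 4);     // vmovd %xmm1, (%rdi): the halves swap places
    std::memcpy(str + 4, &lo, 4);     // vmovd %xmm0, 4(%rdi)
}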
However, I was disappointed: I expected the code to boil down to a single BSWAP (or MOVBE) instruction. Instead, GCC chose a convoluted string of SSE/AVX instructions.
Clang Implementation
When I ran the same code through Clang, though, it generated smarter code using a single PSHUFB instruction, shown below, although still much more complex than our old friend BSWAP. I only display one instantiation (for swapTrivial) since the swapXor version is, again, identical.
.LCPI0_0:
    .byte 7
    .byte 6
    .byte 5
    .byte 4
    .byte 3
    .byte 2
    .byte 1
    .byte 0
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1

void swapit<8ul, &swapTrivial(char&, char&)>(char*):
    vmovq   (%rdi), %xmm0
    vpshufb .LCPI0_0(%rip), %xmm0, %xmm0
    vmovq   %xmm0, (%rdi)
    retq
This implementation is functionally equivalent to GCC’s, reversing the byte order of an 8-byte value at %rdi using vmovq, vpshufb, and vmovq (~4-5 uops, ~7-9 cycles latency, ~3-4 cycles/call throughput). It is, however, more efficient than GCC’s version due to fewer uops and memory accesses, achieving the same result with a simpler, single-shuffle approach.
Breakdown
vmovq (%rdi), %xmm0
Reads the 8-byte input (e.g., a 64-bit integer or two 32-bit integers) for processing. Assumes %rdi points to a valid, 8-byte-aligned address. Misalignment may incur a penalty.
vpshufb .LCPI0_0(%rip), %xmm0, %xmm0
Shuffles the bytes of %xmm0 using the mask at .LCPI0_0, storing the result in %xmm0. The mask [7, 6, 5, 4, 3, 2, 1, 0, 0, ...] reverses the lower 8 bytes, setting the upper 8 bytes to zero. The mask load is typically fast due to caching in read-only data.
vmovq %xmm0, (%rdi)
Stores the lower 64 bits of %xmm0 to the memory address in %rdi. Overwrites the original 8 bytes.
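In scalar terms, Clang’s version corresponds to a single 64-bit byte reverse, roughly like this sketch (again my own paraphrase, using the GCC/Clang builtin __builtin_bswap64):

#include <cstdint>
#include <cstring>

void reverse8_like_clang( char* str ) {
    std::uint64_t v;
    std::memcpy(&v, str, 8);       // vmovq (%rdi), %xmm0
    v = __builtin_bswap64(v);      // vpshufb with the [7, 6, 5, 4, 3, 2, 1, 0] mask
    std::memcpy(str, &v, 8);       // vmovq %xmm0, (%rdi)
}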
Performance
vmovq prefers 8-byte alignment; misalignment may add latency.
vpshufb uses specific ports (e.g., port 5 on Skylake), but with only one shuffle, contention is minimal.
Comparison between all implementations
Comparing the original two swap implementations with the (fixed-size) GCC-optimized version for 8 bytes and the Clang-optimized version, we arrive at the table below.
However, I was still not happy with the previous implementations. I wanted a simpler implementation with BSWAP, which I could only achieve when I made a temporary copy of the input string, as in:
#include <cstring>   // memcpy

template< std::size_t N, void FN(char&,char&) >
void swapit3( char str[N] ) {
    char tmp[N];
    memcpy(tmp, str, N);
    for ( std::size_t j = 0; j < N/2; ++j ) {
        FN(tmp[j], tmp[N-1-j]);
    }
    memcpy(str, tmp, N);
}
However, I was able to get this result only with Clang 12+, not with GCC.
void swapit3<8ul, &swapTrivial(char&, char&)>(char*):
    movq   (%rdi), %rax
    movbeq %rax, (%rdi)
    retq
The swapit3 function reverses the byte order of an 8-byte value pointed to by %rdi using two instructions: movq to load the value into %rax, and movbeq to store it back with the bytes reversed.
It generates ~3-4 uops (1 load, 1 store, 1-2 for the return) with a latency of ~7-10 cycles (4-5 for the load, 2-3 for movbeq, 1-2 for ret) and a throughput of ~2-3 cycles/call on modern CPUs (e.g., Skylake, Zen 3). The load is ready to run as soon as %rdi is available, movbeq as soon as %rax is ready, and the return after the store commits.
Its simplicity (no mask, no SSE) minimizes resource use and memory traffic, but it requires movbe support (Intel Nehalem+, AMD Bobcat+), which should cover the vast majority of the current installed base, since those microarchitectures are 15+ years old.
Comparison with swapit<N,FN>
Compared to swapit<N,FN>, which uses vmovq, vpshufb with the [7, 6, 5, 4, 3, 2, 1, 0, ...] mask, and vmovq (~4-5 uops, ~7-9 cycles latency, ~3-4 cycles/call), swapit3 is more efficient due to fewer uops, no mask load, and no vector-port contention.
Both achieve identical results, but swapit3’s use of general-purpose registers and movbe reduces complexity and memory accesses, improving throughput. The original templated swapit is slightly more portable (it requires SSSE3 rather than movbe) and lends itself to vectorization for larger buffers, but for a single 64-bit swap, swapit3 is optimal due to its minimal resource footprint.
Next Steps
And just like that, we are postponing the ChatGPT confrontation once again. The reason is that the current analysis went deeper than planned and hit (my) length limit for Substack posts.
Be ready for the fourth episode of this article.