In the first article we stopped after reviewing the first half of the string reversal algorithm. In this second one, we will get into the contentious part of ChatGPT’s solution.
The Reversal Algorithm
Having spent O(N) to retrieve the string's length, we will now measure the performance of three string reversal algorithms: (1) IndexWhile, from ChatGPT's solution; (2) ForLoop, from Grok's solution; and (3) ForPointer, the solution provided by the CEO - read the 1st article for background on this one.
All three algorithms take the form of a templated function that receives another function, SWAPFN,
as a template parameter. We left this hook open so we can plug in either a standard swap or the XOR swap that the CEO claims is faster.
The IndexWhile loop function from ChatGPT looks like this:
template< void SWAPFN( char&, char& ) >
void stringReverseIndexWhile(char *str, size_t length)
{
    size_t start = 0;
    size_t finish = length - 1;
    while ( finish > start ) {
        SWAPFN( str[start], str[finish] );
        start++;
        finish--;
    }
}
The ForLoop function from Grok looks like this:
template< void SWAPFN( char&, char& ) >
void stringReverseIndexForLoop(char *str, size_t length )
{
    for ( size_t j = 0; j < length/2; ++j ) {
        SWAPFN( str[length-1-j], str[j] );
    }
}
And the CEO's pointer loop is:
template< void SWAPFN( char&, char& ) >
void stringReversePointer(char *str, size_t length)
{
    char* p1 = str;
    char* p2 = str + length - 1;
    for (; p2 > p1; ++p1, --p2)
    {
        SWAPFN( *p1, *p2 );
    }
}
Notice that at this point we are able to test the efficiency of each solution without the added complexity of the strlen call.
For this first analysis we are going to use all these string reversal algorithms with the “trivial” swap function below.
void swapTrivial( char& c1, char& c2 ) {
    char ch = c1;
    c1 = c2;
    c2 = ch;
}
The benchmark registrations, using Google Benchmark, are:
BENCHMARK_TEMPLATE( reverseBenchmark,
    stringReversePointer<swapTrivial> )->ArgsProduct(args);
BENCHMARK_TEMPLATE( reverseBenchmark,
    stringReverseIndexWhile<swapTrivial> )->ArgsProduct(args);
BENCHMARK_TEMPLATE( reverseBenchmark,
    stringReverseIndexForLoop<swapTrivial> )->ArgsProduct(args);
Running these benchmarks we get the timings below. This run was taken on an AWS c6a instance.
---------------------------------------------------------------------
Benchmark time/Byte
---------------------------------------------------------------------
reverseB...<stringReverseIndexWhile<swapTrivial>>/4 392.552 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/16 330.648 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/256 769.142 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/4096 794.134 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/262144 741.373 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/4 401.463 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/16 277.306 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/256 67.447 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/4096 88.1894 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/262144 56.1823 ps
reverseB...<stringReversePointer<swapTrivial>>/4 428.849 ps
reverseB...<stringReversePointer<swapTrivial>>/16 295.342 ps
reverseB...<stringReversePointer<swapTrivial>>/256 67.050 ps
reverseB...<stringReversePointer<swapTrivial>>/4096 100.724 ps
reverseB...<stringReversePointer<swapTrivial>>/262144 55.7631 ps
We notice that the smaller strings (4 and 16 bytes) are the costliest per byte, at around 300 to 400 picoseconds, and at those sizes the choice of algorithm matters little.
For larger strings, the index-while loop from ChatGPT severely under-performs the other two methods. Instead of taking advantage of the scale, it adds another 400 ps per byte.
This is particularly puzzling to me, since the algorithm structure of stringReverseIndexWhile and stringReversePointer
is essentially the same! The only difference is that one accesses the characters by index while the other uses pointers directly!
Puzzled by that behavior, I pasted that function into a standalone binary and disassembled it.
void stringReverseIndexWhile<&swapTrivial(char&, char&)>(char*, unsigned long):
decq %rsi
je .LBB6_3
movl $1, %eax
.LBB6_2:
movzbl -1(%rdi,%rax), %ecx
movzbl (%rdi,%rsi), %edx
movb %dl, -1(%rdi,%rax)
movb %cl, (%rdi,%rsi)
decq %rsi
leaq 1(%rax), %rcx
cmpq %rax, %rsi
movq %rcx, %rax
ja .LBB6_2
.LBB6_3:
retq
From inspection we learn that the way the code is written does not let the compiler vectorize or unroll the inner loop, which still runs byte by byte. Worse, the loop body consists of nothing but single-byte loads and stores (movzbl/movb): no vector moves, no block moves, no optimization at all. This has to be very slow.
When I looked at the generated assembly for stringReversePointer, to my surprise GCC 13.3 was able to recognize the pattern and vectorize the loop. The function was now composed of hundreds of lines of AVX-512 instructions (vpermb and vmovdqu8 are AVX-512 VBMI/VL mnemonics), a clear indication that the loop was vectorized and unrolled properly. I had to remove dozens of lines to be able to paste at least part of the function in here. If you are interested in seeing the full version, here is the link.
void stringReversePointer<&swapTrivial(char&, char&)>(char*, unsigned long):
movq %rdi, %rdx
leaq -1(%rsi), %rcx
movq %rsi, %rdi
leaq (%rdx,%rcx), %rax
cmpq %rax, %rdx
jnb .L105
leaq -2(%rsi), %rsi
cmpq $29, %rsi
jbe .L91
movq %rsi, %r10
shrq %r10
leaq 1(%r10), %r9
subq %r10, %rcx
...
.L87:
vmovdqu8 (%rcx), %ymm0
vpermb (%rsi), %ymm1, %ymm2
addq $32, %rcx
subq $32, %rsi
vpermb %ymm0, %ymm1, %ymm0
vmovdqu8 %ymm2, -32(%rcx)
vmovdqu8 %ymm0, 32(%rsi)
cmpq %r8, %rcx
jne .L87
movq %r8, %rcx
subq %r11, %rax
cmpq %r9, %r11
je .L103
...
movzbl -14(%rax), %esi
movb %sil, 14(%rcx)
movb %dl, -14(%rax)
ret
.L86:
xorl %r11d, %r11d
movq %rdx, %r8
jmp .L93
.L104:
vzeroupper
jmp .L89
.L103:
vzeroupper
ret
So the question arises: did GCC simply latch onto a pattern and optimize it? Would clang be able to optimize both? Curious about that, I built one of the latest LLVM releases (19.1.6, Dec 2024) and compiled the benchmarks with it.
To my disbelief, clang 19 not only failed on the index-while case, it also did not optimize either of the two remaining implementations. As one can see from the results below, the time per byte for stringReverseIndexWhile is roughly the same as in the previous version compiled with GCC 13.3.
---------------------------------------------------------------------
Benchmark time/Byte
---------------------------------------------------------------------
rev...rk<stringReverseIndexWhile<swapTrivial>>/4 467.833 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/16 353.718 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/256 776.428 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/4096 798.71 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/262144 746.997 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/4 623.805 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/16 320.415 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/256 796.04 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/4096 789.242 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/262144 738.624 ps
rev...rk<stringReversePointer<swapTrivial>>/4 470.287 ps
rev...rk<stringReversePointer<swapTrivial>>/16 354.627 ps
rev...rk<stringReversePointer<swapTrivial>>/256 761.139 ps
rev...rk<stringReversePointer<swapTrivial>>/4096 794.557 ps
rev...rk<stringReversePointer<swapTrivial>>/262144 743.981 ps
Moreover, the performance of the other two implementations is now similar to the index-while version, meaning that clang was not able to vectorize them either. The difference between the implementations the compiler managed to optimize and the ones it did not is significant: at 256 KiB, roughly 56 ps versus 740 ps per byte, over ten times faster.
Conclusion and next steps
In this second article of the series, we verified that implementation matters - and that the compiler matters too.
GCC was able to pick up on the reversal pattern and vectorize the loop, while clang flunked on all three variants.
Whether each byte is accessed through a pointer or through an index did not seem to matter at all.
Judging from the code the compilers generate, the for loop with indices seems to me the better bet that a compiler will match the pattern and vectorize/unroll the inner loop.
In the next article we will finally confront ChatGPT about giving us false information.