In the first article we stopped after reviewing the first half of the string reversal algorithm. In this second one, we will get into the contentious part of ChatGPT’s solution.
The Reversal Algorithm
Having spent O(N) to retrieve the string's length, we will now measure the performance of three string reversal algorithms: (1) IndexWhile, from ChatGPT's solution; (2) ForLoop, from Grok's solution; and (3) ForPointer, the solution provided by the CEO - read the 1st article for background on this one.
All three algorithms take the form of a templated function that receives another function, SWAPFN,
as a template parameter. We left this hook open so we can plug in either a standard swap or the XOR swap that the CEO claims is faster.
The IndexWhile loop function from ChatGPT looks like this:
template< void SWAPFN( char&, char& ) >
void stringReverseIndexWhile(char *str, size_t length)
{
    size_t start = 0;
    size_t finish = length - 1;
    while ( finish > start ) {
        SWAPFN( str[start], str[finish] );
        start++;
        finish--;
    }
}
The ForLoop function from Grok looks like this:
template< void SWAPFN( char&, char& ) >
void stringReverseIndexForLoop(char *str, size_t length )
{
    for ( size_t j = 0; j < length/2; ++j ) {
        SWAPFN( str[length-1-j], str[j] );
    }
}
And the CEO's pointer loop is:
template< void SWAPFN( char&, char& ) >
void stringReversePointer(char *str, size_t length)
{
    char* p1 = str;
    char* p2 = str + length - 1;
    for (; p2 > p1; ++p1, --p2)
    {
        SWAPFN( *p1, *p2 );
    }
}
Notice that at this point we are able to test the efficiency of each solution without the added complexity of the strlen call.
For this first analysis we are going to use all these string reversal algorithms with the “trivial” swap function below.
void swapTrivial( char& c1, char& c2 ) {
    char ch = c1;
    c1 = c2;
    c2 = ch;
}
The benchmark registrations, using Google Benchmark, are:
BENCHMARK_TEMPLATE( reverseBenchmark,
    stringReversePointer<swapTrivial> )->ArgsProduct(args);
BENCHMARK_TEMPLATE( reverseBenchmark,
    stringReverseIndexWhile<swapTrivial> )->ArgsProduct(args);
BENCHMARK_TEMPLATE( reverseBenchmark,
    stringReverseIndexForLoop<swapTrivial> )->ArgsProduct(args);
Running these benchmarks we get the timings below. This run was taken on an AWS c6a instance.
---------------------------------------------------------------------
Benchmark time/Byte
---------------------------------------------------------------------
reverseB...<stringReverseIndexWhile<swapTrivial>>/4 392.552 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/16 330.648 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/256 769.142 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/4096 794.134 ps
reverseB...<stringReverseIndexWhile<swapTrivial>>/262144 741.373 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/4 401.463 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/16 277.306 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/256 67.447 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/4096 88.1894 ps
reverseB...<stringReverseIndexForLoop<swapTrivial>>/262144 56.1823 ps
reverseB...<stringReversePointer<swapTrivial>>/4 428.849 ps
reverseB...<stringReversePointer<swapTrivial>>/16 295.342 ps
reverseB...<stringReversePointer<swapTrivial>>/256 67.050 ps
reverseB...<stringReversePointer<swapTrivial>>/4096 100.724 ps
reverseB...<stringReversePointer<swapTrivial>>/262144 55.7631 ps
We notice that the smaller strings (4 and 16 bytes) are the costliest per byte, at around 300 to 400 picoseconds, and at those sizes the choice of algorithm matters little.
For larger strings, the index-while loop from ChatGPT severely under-performs the other two methods. Instead of taking advantage of the scale, it adds another 400 ps per byte.
This is particularly puzzling to me, since the algorithm structure of stringReverseIndexWhile and stringReversePointer
is essentially the same! The only difference is that one accesses the characters by index while the other uses pointers directly!
Puzzled by that behavior, I pasted that function into a standalone binary and disassembled it.
void stringReverseIndexWhile<&swapTrivial(char&, char&)>(char*, unsigned long):
decq %rsi
je .LBB6_3
movl $1, %eax
.LBB6_2:
movzbl -1(%rdi,%rax), %ecx
movzbl (%rdi,%rsi), %edx
movb %dl, -1(%rdi,%rax)
movb %cl, (%rdi,%rsi)
decq %rsi
leaq 1(%rax), %rcx
cmpq %rax, %rsi
movq %rcx, %rax
ja .LBB6_2
.LBB6_3:
retq
From inspection we learn that the way the code is written does not let the compiler vectorize or unroll the inner loop, which still runs byte by byte. Worse, the loop body consists of nothing but single-byte loads and stores (movzbl/movb): no vector moves, no block moves, no optimization at all. This has to be very slow.
When I looked at the generated assembly for stringReversePointer, to my surprise GCC 13.3 was able to recognize the pattern and vectorize the loop. The function was now composed of hundreds of lines of AVX-512 instructions (vpermb and vmovdqu8 are AVX-512 VBMI/VL mnemonics), a clear indication that the loop was vectorized and unrolled properly. I had to remove dozens of lines to be able to paste at least part of the function in here. If you are interested in seeing the full version, here is the link.
void stringReversePointer<&swapTrivial(char&, char&)>(char*, unsigned long):
movq %rdi, %rdx
leaq -1(%rsi), %rcx
movq %rsi, %rdi
leaq (%rdx,%rcx), %rax
cmpq %rax, %rdx
jnb .L105
leaq -2(%rsi), %rsi
cmpq $29, %rsi
jbe .L91
movq %rsi, %r10
shrq %r10
leaq 1(%r10), %r9
subq %r10, %rcx
...
.L87:
vmovdqu8 (%rcx), %ymm0
vpermb (%rsi), %ymm1, %ymm2
addq $32, %rcx
subq $32, %rsi
vpermb %ymm0, %ymm1, %ymm0
vmovdqu8 %ymm2, -32(%rcx)
vmovdqu8 %ymm0, 32(%rsi)
cmpq %r8, %rcx
jne .L87
movq %r8, %rcx
subq %r11, %rax
cmpq %r9, %r11
je .L103
...
movzbl -14(%rax), %esi
movb %sil, 14(%rcx)
movb %dl, -14(%rax)
ret
.L86:
xorl %r11d, %r11d
movq %rdx, %r8
jmp .L93
.L104:
vzeroupper
jmp .L89
.L103:
vzeroupper
ret
So the question arises: did GCC simply latch onto a pattern and optimize it? Would clang be able to optimize both? Curious about that, I built one of the latest LLVM releases (19.1.6, Dec 2024) and compiled the benchmarks with it.
To my disbelief, clang 19 not only failed on the index-while case, it also did not optimize either of the two remaining implementations. As one can see from the results below, the time per byte for stringReverseIndexWhile is roughly the same as in the previous version compiled with GCC 13.3.
---------------------------------------------------------------------
Benchmark time/Byte
---------------------------------------------------------------------
rev...rk<stringReverseIndexWhile<swapTrivial>>/4 467.833 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/16 353.718 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/256 776.428 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/4096 798.71 ps
rev...rk<stringReverseIndexWhile<swapTrivial>>/262144 746.997 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/4 623.805 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/16 320.415 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/256 796.04 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/4096 789.242 ps
rev...rk<stringReverseIndexForLoop<swapTrivial>>/262144 738.624 ps
rev...rk<stringReversePointer<swapTrivial>>/4 470.287 ps
rev...rk<stringReversePointer<swapTrivial>>/16 354.627 ps
rev...rk<stringReversePointer<swapTrivial>>/256 761.139 ps
rev...rk<stringReversePointer<swapTrivial>>/4096 794.557 ps
rev...rk<stringReversePointer<swapTrivial>>/262144 743.981 ps
Moreover, the performance of the other two implementations is now similar to the index-while version, meaning that clang was not able to vectorize them either. The difference between the implementations the compiler managed to optimize and the ones it did not is significant: at 256 KiB, roughly 56 ps versus 740 ps per byte, over ten times faster.
Conclusion and next steps
In this second article of the series, we verified that implementation matters - and that the compiler matters too.
GCC was able to pick up on the reversal pattern and vectorize the loop, while clang flunked on all three variants.
Whether each byte is accessed through a pointer or through an index did not seem to matter at all.
Judging from the code the compilers generate, the for loop with indices seems to me the better bet that a compiler will match the pattern and vectorize/unroll the inner loop.
In the next article we will finally confront ChatGPT about giving us false information.