In recent years, artificial intelligence has swept through the programming landscape like a wildfire, fundamentally transforming how developers write, debug, and optimize code. From intelligent code completion tools to automated testing and even generating entire applications from natural language prompts, AI has become an indispensable ally for programmers. This seismic shift, driven by advancements in machine learning and natural language processing, has democratized coding, boosted productivity, and sparked a new era of innovation. As AI continues to evolve, its integration into programming is not just a trend but a revolution that is redefining the boundaries of software development.
In fact, this very introduction was written by Grok, which is not something I’m very proud of. I feel like writing with AI is cheating. I’m old school, I guess, but I do like to give my readers the pleasure of constantly finding flaws in my writing - it makes reading more human, after all.
Today I want to dive into an episode that showcases the opposite - how AI is far from perfect at optimizing critical C++ code - and how it cheats on you by confidently giving the wrong answer.
Motivation
Let’s start with a post on LinkedIn where a CEO asks ChatGPT to write a string reversal routine in C/C++. Then, when the CEO presents his own “optimized” version, ChatGPT agrees instantly, and even finds a possible reason for his version to be better. I found that very strange and set off to investigate.
And here is what ChatGPT wrote to me when I asked it to write the code:
// Function to reverse the string
void reverseString(char *str) {
    int start = 0;
    int end = strlen(str) - 1;
    char temp;

    // Swap characters from both ends of the string
    while (start < end) {
        temp = str[start];
        str[start] = str[end];
        str[end] = temp;
        start++;
        end--;
    }
}
A very simple algorithm: strlen(str)-1 gets the index of the last character (ignoring the null terminator), and the while loop swaps characters from the beginning and the end until the string is reversed.
And for completeness, here is what Grok wrote for the same prompt:
void reverseString(char* str) {
    // Find string length
    int length = 0;
    while (str[length] != '\0') {
        length++;
    }

    // Swap characters from start and end
    for (int i = 0; i < length / 2; i++) {
        char temp = str[i];
        str[i] = str[length - 1 - i];
        str[length - 1 - i] = temp;
    }
}
Similar setup, but ChatGPT uses strlen while Grok manually counts characters until \0 is found. Both “should” be O(n), although I suspect glibc has some SIMD tricks behind strlen to make it faster. We will investigate that later.
Comparison
ChatGPT’s while loop swaps characters from both ends until the pointers meet, performing swaps for n/2 iterations. Each swap is O(1), so the reversal is O(n/2) ≈ O(n). For Grok, the for loop iterates n/2 times, with each iteration performing a swap (O(1)). Thus, the reversal is O(n/2) ≈ O(n). Identical - on paper.
Both algorithms access memory sequentially from opposite ends, so cache performance is similar. ChatGPT’s two-pointer approach may have a slight edge in clarity for the CPU's branch predictor, as it avoids repeated index calculations.
Modern compilers (e.g., GCC, Clang) are likely to generate nearly identical machine code for both algorithms, as they can unroll loops and optimize index calculations. The use of strlen in ChatGPT’s version might give it a slight edge if the standard library implementation is highly optimized.
As far as readability goes, ChatGPT’s is slightly more concise and intuitive due to the two-pointer approach, which is a common idiom for in-place reversal. Grok’s algorithm is slightly more verbose due to the manual length calculation and index-based swapping.
In terms of maintenance, ChatGPT relies on strlen, which is standard and less prone to errors. Grok's manual length calculation could be error-prone in edge cases (e.g., if the string isn't null-terminated, though this is rare for C-style strings).
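To illustrate that edge case, a hardened variant could cap the scan with strnlen so a missing terminator cannot walk off the end of the buffer. This is my own sketch, not code from either model; note that strnlen is POSIX (standardized only in C23), so availability may vary:

```cpp
#include <cstring>   // strnlen (POSIX / C23)
#include <cstddef>

// Hypothetical hardened variant: caps the scan at `cap` bytes so a
// missing terminator cannot walk off the end of the buffer.
void reverseStringBounded(char* str, std::size_t cap) {
    std::size_t len = strnlen(str, cap);   // stops at cap if no '\0' is found
    for (std::size_t i = 0; i < len / 2; ++i) {
        char temp = str[i];
        str[i] = str[len - 1 - i];
        str[len - 1 - i] = temp;
    }
}
```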
As for portability, ChatGPT’s version requires the standard header for strlen, while Grok’s is self-contained. This is a minor difference, as strlen is standard in C++. However, it might matter for embedded software or High-Level Synthesis (FPGA) flows.
It looks fairly clear that we have two distinct focus regions - finding the string length and performing the reversal itself.
String length
ChatGPT uses strlen, a standard library function that is typically optimized (often implemented in assembly or highly tuned C). However, it may introduce a small overhead due to function call setup.
Grok manually computes the length with a while loop. This is straightforward but may be slightly less optimized than strlen on some platforms, as it lacks the potential for low-level optimizations in the standard library.
Google Benchmark
To stay out of the speculation realm, I decided to create a Google Benchmark project. I opted for a templated benchmark that I could reuse across the several tests I had in mind at this point. The basic test looks like this:
template< void Fn( char* str, size_t ) >
void reverseBenchmark(benchmark::State& state) {
    // Computes the number of strings (N) and the working-set size (SIZE)
    // based on the given string size (L)
    const size_t L = state.range(0);
    const size_t NMIN = 64;
    const size_t MAXMEM = 512*1024*1024;
    const size_t SIZE = std::max(L*NMIN, MAXMEM);
    const size_t N = SIZE/L;

    // Creates a memory block holding all N strings
    std::vector<char> vec(N*L, 0);

    // Fills memory with random ASCII characters
    randomize( vec.data(), vec.size() );

    // Puts a C-string terminator at the proper places
    for ( size_t j=0; j<N; ++j ) {
        vec[(j+1)*L-1] = '\0';
    }

    // Benchmark loop
    for (auto _ : state) {
        for ( size_t j=0; j<N; ++j ) {
            Fn( &vec[j*L], L );
            benchmark::DoNotOptimize(&vec[j*L]);
        }
    }

    // Prevents the memory from being optimized out
    benchmark::DoNotOptimize(vec);

    // Creates a derived rate counter as time per byte processed
    state.counters["Bytes"] = benchmark::Counter(N*L,
        benchmark::Counter::kIsIterationInvariantRate|
        benchmark::Counter::kInvert,
        benchmark::Counter::OneK::kIs1024);
}
The benchmark function is templated on a function that accepts a string pointer and length. We will use this routine to benchmark our two separate sections: (1) computing the string length, in which case it will dutifully ignore the passed length and (2) reverse the string using the length passed. This will allow us to pinpoint pain points better than computing the two sections at the same time.
The two string-length functions to be evaluated are as follows.
void strlenWhile(char *str, size_t length)
{
    char* start = str;
    while ( *str != '\0' ) str++;
    size_t len = str - start;
    benchmark::DoNotOptimize( len );
}

void strlenBuiltin(char *str, size_t length)
{
    size_t len = ::strlen(str);
    benchmark::DoNotOptimize( len );
}
The first one computes the string length as Grok does. The second follows ChatGPT’s choice. Then we wrote the benchmark-calling code.
// Create an array of string sizes to be tested
std::vector< std::vector<long int> > args {{4,16,256,4*1024,256*1024}};

// Benchmark for the while loop
BENCHMARK_TEMPLATE( reverseBenchmark, strlenWhile )->ArgsProduct(args);

// Benchmark for the builtin strlen
BENCHMARK_TEMPLATE( reverseBenchmark, strlenBuiltin )->ArgsProduct(args);

BENCHMARK_MAIN();
This makes Google Benchmark test each of the algorithms for string sizes from 4 bytes up to 256 kB.
Hardware Setup
As the number of hardware platforms out there is extensive these days, I decided to run these tests on two wildly different x86-64 machines: my old ThinkPad laptop and a recent Amazon c6a instance powered by AMD 3rd Gen EPYC (Milan) processors. Both binaries were compiled for the native platform (-march=native) so as to fully use the latest hardware enhancements, if available, with GCC 13.3 on Ubuntu 24.04 LTS.
Results for the trivial while loop (Grok)
On the AWS server, the results were these for the trivial While loop:
------------------------------------------------------
Benchmark                                      ps/Byte
------------------------------------------------------
reverseBenchmark<strlenWhile>/4              389.582ps
reverseBenchmark<strlenWhile>/16             331.112ps
reverseBenchmark<strlenWhile>/256            364.662ps
reverseBenchmark<strlenWhile>/4096           315.437ps
reverseBenchmark<strlenWhile>/262144         311.899ps
The times were roughly the same for all string sizes: around 330 picoseconds per byte, which surprised me. I expected the compiler to do a better job of optimizing it. It turns out, looking at the generated assembly, that the result is just a simple, boring loop:
strlenWhile(char*, unsigned long):
        cmpb    $0, (%rdi)      ;; compare the byte at (RDI) with zero
        je      .L41            ;; equal to zero? jump to the end
        movq    %rdi, %rax      ;; move pointer to RAX register
.L40:
        incq    %rax            ;; increment pointer by one byte
        cmpb    $0, (%rax)      ;; compare pointed memory with zero byte
        jne     .L40            ;; if not equal, jump back to L40
        subq    %rdi, %rax      ;; subtract original pointer from current
        movq    %rax, -8(%rsp)  ;; move result to the local stack
        ret
.L41:
        xorl    %eax, %eax      ;; set RAX to zero
        movq    %rax, -8(%rsp)  ;; move RAX to the local stack
        ret
Note that the routine moves the end result to a variable on the current stack. This is because we used benchmark::DoNotOptimize(len), which forces the code to store the result somewhere, even if it is never used.
Clang 20.0 emits similar, equally boring linear code:
strlenWhile(char*, unsigned long):
        movq    $-1, %rax          ;; move -1 to RAX
.LBB2_1:
        cmpb    $0, 1(%rdi,%rax)   ;; compare str[RAX+1] with zero
        leaq    1(%rax), %rax      ;; increment RAX (LEA leaves the flags intact)
        jne     .LBB2_1            ;; not equal, jump back to LBB2_1
        movq    %rax, -8(%rsp)     ;; move result to the stack
        retq
That byte-by-byte loop explains why the algorithm does not scale.
Results for the builtin strlen benchmark (ChatGPT)
This is where things get very interesting. Look at the numbers for the builtin strlen benchmark:
------------------------------------------------------
Benchmark ps/Byte
------------------------------------------------------
reverseBenchmark<strlenBuiltin>/4 669.746ps
reverseBenchmark<strlenBuiltin>/16 177.886ps
reverseBenchmark<strlenBuiltin>/256 53.8608ps
reverseBenchmark<strlenBuiltin>/4096 47.5174ps
reverseBenchmark<strlenBuiltin>/262144 47.9541ps
It looks like for very small strings strlen presents a high overhead, while blasting past the trivial implementation for larger strings. The best result of 47 picoseconds per byte is seven times faster than our trivial implementation!
The GLIBC implementation is clearly using some clever optimizations here. I can think of simple ones, like loading 16 bytes at a time and checking any of them for a zero using the 17-year-old SSE 4.2 PCMPESTRI instruction.
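To get a feel for how a word-at-a-time strlen gains its speed, here is a minimal SWAR sketch of my own (not GLIBC’s actual code) that scans eight bytes per iteration using the classic zero-byte bit trick. Unlike GLIBC’s aligned assembly loads, the memcpy here must stay inside readable memory, so it is only a toy illustration that assumes at least 8 valid bytes past every probed position:

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// True if any byte of v is zero: a zero byte borrows in the subtraction
// and has its top bit clear in v, lighting up the 0x80 mask.
static bool hasZeroByte(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

// Toy word-at-a-time strlen: requires at least 8 readable bytes past
// each position it probes, unlike the byte-safe library version.
std::size_t strlenSwar(const char* s) {
    const char* p = s;
    std::uint64_t word;
    std::memcpy(&word, p, 8);
    while (!hasZeroByte(word)) {
        p += 8;
        std::memcpy(&word, p, 8);
    }
    while (*p != '\0') ++p;   // pinpoint the exact terminator
    return static_cast<std::size_t>(p - s);
}
```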
As strlen is baked into GLIBC’s shared library (libc.so.6), as shown below, we cannot tell from the binary alone which code GLIBC will call. This is because GLIBC uses a trampoline jump that is only resolved at process start, depending on the capabilities of the CPU it is running on.
$ ldd ./bm_reverse
linux-vdso.so.1 (0x00007ffff7383000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6
/lib64/ld-linux-x86-64.so.2
$ readelf -a /lib/x86_64-linux-gnu/libc.so.6 | grep strlen
1153: 129 IFUNC GLOBAL DEFAULT 17 strlen@@GLIBC_2.2.5
To determine which one is being used, we have to start the benchmark under GDB. Once it gets to the strlen benchmarks, we hit ^C and stop execution. We then set a breakpoint at the strlen PLT stub and continue.
$ gdb ./bm_reverse
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
...
reverseBenchmark<strlenBuiltin>/4 Bytes=552.33ps
reverseBenchmark<strlenBuiltin>/16 Bytes=158.252ps
reverseBenchmark<strlenBuiltin>/256 Bytes=60.888ps
^C
(gdb) break strlen@plt
(gdb) continue
Breakpoint 1.1, 0x000055555555aa50 in strlen@plt ()
(gdb) bt
#0 0x000055555555aa50 in strlen@plt ()
#1 0x0000555555561dc5 in void reverseBenchmark<&(strlenBuiltin(char*, unsigned long))>(benchmark::State&) ()
#2 0x00005555555baac8 in benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*, benchmark::ProfilerManager*) const ()
...
#9 0x000055555555efcd in main ()
When the breakpoint traps, we produce a backtrace to show that we are inside the strlen function at the PLT section. We can also see that frame #1 is exactly inside our benchmark function reverseBenchmark.
Then we disassemble to show the trampoline, which is just an indirect jump into the address resolved at startup (and stored in the GOT section).
A couple of step-instruction (si) commands later, we finally jump into the executing routine, inside strlen-avx2.S. By browsing the GLIBC source code, we can check that this function is highly optimized for cache using AVX2, which allows block loads and comparisons of 32 bytes at a time.
(gdb) disas
Dump of assembler code for function strlen@plt:
=> 0x000055555555aa50 <+0>: endbr64
0x000055555555aa54 <+4>: jmp *0x721a6(%rip) # 0x5555555ccc00 <strlen@got.plt>
0x000055555555aa5a <+10>: nopw 0x0(%rax,%rax,1)
End of assembler dump.
(gdb) si
0x000055555555aa54 in strlen@plt ()
(gdb) si
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:52
52 in ../sysdeps/x86_64/multiarch/strlen-avx2.S
This is why the strlen implementation is much faster at larger strings, where SIMD instructions make a huge difference, yet borks at very small strings. The big problem with SIMD is that it delivers great throughput for large datasets but carries a significant startup cost.
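One way to dodge that startup cost - purely my own sketch, not something either model or GLIBC proposes here - is a hybrid that byte-scans tiny strings and only pays for the library call on longer ones. The 16-byte cutoff is a guess that would itself need benchmarking:

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical hybrid: cheap byte scan for short strings, library
// strlen (with its SIMD fast path) for everything longer.
std::size_t strlenHybrid(const char* s) {
    for (std::size_t i = 0; i < 16; ++i) {
        if (s[i] == '\0') return i;   // short string: found the terminator
    }
    return 16 + std::strlen(s + 16);  // long string: let the library finish
}
```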
Laptop Results
The results on the old Thinkpad laptop with an Intel(R) Core(TM) i7-8550U CPU are remarkably close to the AMD EPYC server.
-----------------------------------------------------------
Benchmark
-----------------------------------------------------------
reverseBenchmark<strlenWhile>/4 Bytes=399.69ps
reverseBenchmark<strlenWhile>/16 Bytes=406.374ps
reverseBenchmark<strlenWhile>/256 Bytes=460.297ps
reverseBenchmark<strlenWhile>/4096 Bytes=489.647ps
reverseBenchmark<strlenWhile>/262144 Bytes=439.324ps
reverseBenchmark<strlenBuiltin>/4 Bytes=591.648ps
reverseBenchmark<strlenBuiltin>/16 Bytes=170.199ps
reverseBenchmark<strlenBuiltin>/256 Bytes=63.1613ps
reverseBenchmark<strlenBuiltin>/4096 Bytes=55.6196ps
reverseBenchmark<strlenBuiltin>/262144 Bytes=58.7308ps
The plateau time for the naive while loop was just 30% higher than on the AMD server.
The dreadful elapsed time for the builtin strlen on a 4-byte string was actually lower - 591ps against 670ps! The per-byte time for large strings was just marginally higher - 60ps against 50ps.
The main reason is that the laptop also supports AVX2, so there was no substantial difference in effective processing power.
Conclusions
We asked ChatGPT and Grok to write a string reversal algorithm in C. We saw that the AI-presented implementations are correct but can be naive and far from optimal.
We benchmarked the first part of the string reversal algorithm - the part that calculates the length of the string - and found that the trivial while loop was faster than strlen for tiny strings (~4 bytes) but seven times slower otherwise.
In the second part of this article, we will benchmark the body of the string reversal algorithm and check whether the CEO’s algorithm was indeed better than ChatGPT’s!