Shared, LTO, PLT: Friends or Foes?
Compiler settings you need to know
In the last post, I dwelled on the question of whether function pointers and virtual calls are, in fact, slow. I posted the article on social media and got butchered with nonsense comments. However, some good insights came up in the middle of the rubble.
Some users tried to replicate my results and, as they could not, asked for my source, which I gladly made available on my RedditHelp repository, which I use to post answers to questions on social media. From those interactions, good points were made and noted:
Low Latency Trading Insights is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
The original test put all the code in a shared library. Functions in shared libraries pay a price for the extra indirection if compared to functions linked statically.
LTO would make a difference in the benchmarks.
Clang implements switch cases differently, so it was worth trying with a different compiler - I used gcc for those tests.
The choice of architecture and instruction set might affect the tests in a non-obvious way, clobbering the conclusions.
ELF Semantic Interposition could be used to improve the baseline tests further.
All these are good points. None of them changed the heart of the article, though. Looking ahead at the results, you will notice that function pointers and virtual calls were never more than 1.5 nanoseconds away from the most optimized case, which is a fully inlined function body.
Note that a fully inlined function is not an apples-to-apples comparison with a virtual call or a function pointer. If an inline occurs, you lose the ability to select different behavior at run time, i.e., you lose dynamic polymorphism. However, that’s the baseline we chose, and we need to take that aspect into account in the conclusions.
At the risk of being tediously lengthy, I need to cover some basic terminology, even though most of my audience has known it for a long time. This section might seem very basic to my average reader, but it is needed so that the rest of the article is accessible to everyone.
Inlining means inserting the body of a called function directly at the call site, which removes the call overhead and usually makes the code run faster. Inlining can be done by the compiler or by the linker.
An interesting curiosity is that, contrary to what many think, inlining in Clang is not done by the C++-specific part of the compiler but by the internal passes common to the entire LLVM infrastructure. Therefore, the inlining logic is shared by all languages with an LLVM-based compiler, such as C, C++, Rust, Fortran (via Flang), Kotlin/Native, etc. This is because inlining is not done at the AST level but at the lower IR levels.
Let’s look at one example. The code below defines one variable, “var,” and two functions, “add” and “doit.” “doit” calls “add” twice with parameters 1 and 3 and then returns the current value of “var.”
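For readers following along without the original listing, here is a minimal sketch of what such a file could look like. It is reconstructed from the description above, so the return type of “add” and the initial value of “var” are assumptions.

```cpp
// same_unit.cpp - reconstructed from the prose, not the original listing.
// Assumption: "add" returns void and "var" starts at zero.
int var = 0;                  // global state modified by add()

void add(int x) {             // adds x to the global
    var += x;
}

int doit() {                  // calls add twice, then returns var
    add(1);
    add(3);
    return var;
}
```

With optimizations on, a compiler that sees both bodies in the same unit is free to fold the two calls into a single addition of 4.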
If compiled without linking “-c” and without optimizations “-O0”, we obtain an object file “same_unit.o”. Before being spoiled with Compiler Explorer, we used objdump to print that object file. The printout has the assembly code in the “.text” section, and the functions “add” and “doit” and the variable “var” in the symbol table. Both functions have defined bodies, and I annotated them in green so you can see what’s happening even if you don’t understand assembly well.
Note in particular that the function “doit” is calling “add(int)” explicitly twice.
If I compile with optimizations enabled “-O3” (but O1 would do it too), then you can see clearly that there is no call to “add” anymore and that the contents of the two calls were magically merged into a single side effect: adding 4 to “var”.
So what happens now if we place the variable “var” and function “add” in another compilation unit (.cpp file) as in the code below?
Compiling this code with optimizations enabled, we will see that the compiler could not inline anything, not least because we haven’t written the body of “add” yet!
Let’s define a body for the callee as below, and let’s compile it - I’m not showing the compiled code right now because it is irrelevant, but we will revisit it later in this article.
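As a rough sketch of the split described above, the two compilation units could look like the listing below. The file names caller.cpp and callee.cpp are the ones used later in this article; the exact contents are an assumption reconstructed from the prose. Both units are shown in one listing, which also compiles as a single file, but the point is to split them at the marker. The caller's main(), which would simply call doit(), is left out.

```cpp
// ---- caller.cpp ----
// The caller only sees declarations; the bodies live elsewhere.
extern int var;
void add(int x);

int doit() {
    add(1);       // without LTO these stay as two real calls
    add(3);
    return var;
}

// ---- callee.cpp ----
// Assumption: same body as in the single-unit example.
int var = 0;

void add(int x) {
    var += x;
}
```

When the two halves are compiled separately, the optimizer working on caller.cpp has no body for “add” to inline.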
Compiling and linking the caller and dumping the executable, we see that the main function still has two calls to “add(int).”
But inlining is still possible in this case. Inlining at the binary level can be done by various methods, one of which is to store the compiler’s intermediate bitcode in the object files so that they can be “recompiled” together by the linker. This is done with the option “-flto=thin” on clang.
Now we see that the compiler could inline the code even if the caller and the callee are in two distinct compilation units!
When using CMake to build your code, refrain from adding LTO options directly to the command line with, for example, CMAKE_CXX_FLAGS. The reason is that CMake knows which compiler is in use and will add the proper flags when necessary. It is much better to enable interprocedural optimization through a CMake variable instead (CMAKE_INTERPROCEDURAL_OPTIMIZATION). In the above case, had you passed plain “-flto” to clang, it would not have inlined the example.
Procedure Linkage Table
When software is linked as a shared library, it has to be assembled with position-independent semantics. Otherwise, all the shared libraries in the system would have to be coordinated so that their absolute addresses never clash, which was the case 20 years ago! To avoid that, relocatable code was introduced to Linux.
When an executable is loaded in memory, all its references to shared libraries are replaced with the resolved addresses. Right? Well, it just so happens that it would take too long. Going through every reference in an executable and resolving each of them adds a delay big enough to be noticeable to the user. For those cases, lazy loading (also called lazy binding) was created.
Instead of the resolved address, the address on a call is replaced with a call to a function in charge of resolving the address! This intermediary function is located in the Procedure Linkage Table or PLT. This is now the default on Linux - unless you disable it.
Let’s compile our previous example as a shared library and see what happens. We need the compiler options “-fPIC” to generate relocatable code and “-shared” to generate a shared library. If you don’t provide “-fPIC,” the compiler will still work, but a linker error will be displayed.
I cleaned up the output above so only the relevant sections were displayed. The next three paragraphs attempt to describe what these are in the least amount of words. However, I understand if you will find it complex, so feel free to skim through the next three paragraphs and then jump into the example.
Notice that there are the usual sections “.text” for assembled code and the references to “var”, “main,” and “add(int)” as usual. In addition, there are four extra sections “.plt”, “.plt.got”, “.got” and “.got.plt”.
The “.got” section is the Global Offset Table, which is the place where the resolved addresses of variables and functions will be stored. The “.got.plt” is the same but is always writeable, while the “.got” can be made read-only for security purposes. These sections are just plain arrays so there is nothing to show.
The “.plt” section is where the lazy jumps are located, i.e., it contains stubs that, when executed, resolve that function call address and place it in the “.got.plt” section. The “.plt.got” section is used when lazy jumps are not allowed or available.
A call into these stubs is made when a function gets called (as below). Notice that the calls are into “add(int)@plt” while the actual function resides in “add(int).” This means that main() is not calling “add(int)” directly - it is jumping into a resolver subroutine that updates the GOT with the resolved address.
This extra resolving process adds extra time to the first call into a lazy-linked routine (and also some time to all calls thereafter), but the startup time of the executable will be much shorter in the average case. It is a tradeoff.
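The mechanism can be modeled in plain C++ with a function pointer that starts out pointing at a resolver. This is only an illustrative sketch: every name in it (got_slot_add, resolve_add, add_plt, real_add) is hypothetical, and the real PLT is simpler in some ways and more involved in others, since the real slot initially points back into the stub, which then pushes a relocation index and jumps to the dynamic linker.

```cpp
// Toy model of lazy binding. All names are hypothetical; this mimics the
// idea of a patched table slot, not the actual ELF layout.
using AddFn = void (*)(int);

int var = 0;
int resolver_calls = 0;       // counts how often the slow path runs

// Stands in for add(int) inside the shared library.
void real_add(int x) { var += x; }

void resolve_add(int x);      // stands in for the dynamic linker's resolver

// Stands in for the ".got.plt" slot: it starts out pointing at the resolver.
AddFn got_slot_add = resolve_add;

void resolve_add(int x) {
    ++resolver_calls;
    got_slot_add = real_add;  // patch the slot with the resolved address
    real_add(x);              // then complete the original call
}

// Stands in for "add(int)@plt": always an indirect jump through the slot.
void add_plt(int x) { got_slot_add(x); }
```

Calling add_plt(1) and then add_plt(3) goes through the resolver only the first time; the second call pays just for the extra indirect jump, which is the residual cost mentioned above.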
Better work out an example. We recompile callee.cpp as a shared library and caller.cpp as an executable and then link both together using all default options. Then I start the caller inside the debugger, setting the library path to the current folder, because searching for libraries in the current path was disabled a long time ago for security purposes.
I then set a breakpoint in main and run the program. When it stops, I disassemble the code. Note that “_Z3addi@plt” is the C++ mangled name for “add(int)@plt”. Note that this function address is called twice, once with the argument “1” and once with the argument “3”. Remember that %edi is the register taking the first integer argument in a call according to the System V AMD64 ABI.
Stepping into the code twice, we enter the call and disassemble it again. We are now inside the “.plt” section we described before.
The first jump “jmpq *0x2fe2(%rip)” is effectively just a no-op, leading to the second line “pushq $0x0”. The address pointed to by the indirect reference “*0x2fe2(%rip)” is inside the GOT, and it is pointing to the next line. We can examine the GOT and check that this is true.
We can confirm this by stepping into the code once. Note the location of the current instruction marked with the “=>” sign to the left. That’s the next line.
Stepping again towards the second jump, we enter a code section with no associated function. We are now inside the resolver.
We step out of it with “fin” and disassemble. We are now back into main after the call was complete. Nothing changed in the main because the resolver stored the resolved address in the GOT tables without modifying our assembly.
As we step twice, we end up again inside the PLT stub.
However, as we step again into the jump, we end up in our callee function and not the resolver anymore. This is because the address pointed to by “*0x2fe2(%rip)” was previously updated.
Examining the GOT table again, we can confirm that the GOT table changed and got updated with the address of the function “_Z3addi,” which is the C++ mangled name for “add(int).”
This time, when we step in again into the indirect jump, we end up in libcallee.so at the function “add(int)”.
This process is called lazy binding, which is the default for shared libraries. Lazy binding adds a performance penalty to the first call and to every subsequent call, although the penalty on subsequent calls is much smaller.
It is possible to disable lazy binding - and the PLT overhead altogether - by passing the option “-fno-plt” while creating the shared library. It is also possible to force this behavior in code by adding the function attribute “noplt” to the function prototype, but it works only on gcc as the clang patch seems to be on hold.
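As a sketch of the attribute form, the declaration seen by the caller could be written as below. The NOPLT macro is a hypothetical helper added here so that compilers without the attribute do not warn; the definitions at the bottom would live in the shared library in the real setup and are included only so the listing is self-contained.

```cpp
// "noplt" is a GCC function attribute; the macro is a hypothetical helper
// that applies it only when building with GCC proper (not clang).
#if defined(__GNUC__) && !defined(__clang__)
#define NOPLT __attribute__((noplt))
#else
#define NOPLT
#endif

extern int var;
NOPLT void add(int x);        // declaration as the caller sees it

int doit() {
    add(1);                   // in PIC code, GCC calls through the GOT here
    add(3);                   // instead of going through add(int)@plt
    return var;
}

// In the article's setup these definitions live in libcallee.so.
int var = 0;
void add(int x) { var += x; }
```

The attribute only changes code generation when the callee actually ends up in another shared object; in a single file like this one it is harmless.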
When we recompile with the new flag and examine the resulting assembly, we can see that the address of our function “add(int)” is already known, and it is loaded into the register %rbx. The call is still an indirect one, but no extra stub/trampoline call will be made. This change significantly impacts performance, as seen in the benchmarks below.
ELF Semantic Interposition
Semantic interposition means that if a function is compiled in a shared library, the caller should not rely on information other than the function prototype and agreed semantics. This allows the use of LD_PRELOAD to hijack functions. A classic example is how Solarflare’s ONLOAD interposes its version of the socket libraries (read, write, poll, etc.) to offer kernel bypass.
However, this also means that the compiler cannot optimize further, even if it has all the information about the caller and the callee up front.
With GCC 5, we were offered the ‘-fno-semantic-interposition’ option, which does allow the compiler to break that rule and make assumptions that lead to further optimizations like inlining.
It is said that Python can get a boost of up to 30% with this flag alone, so we have to analyze it. However, for the sake of example, I could not find a single toy example that would make a difference to illustrate here. I’ll keep that on my search list.
I am using the previous article’s benchmarks as they already cover quite a lot of ground in the sense of diversity of calls. There are straight calls, function pointers, switch statements, and virtual calls with multiple parents.
As platforms, I’m using an AMD Zen 2 Threadripper 3960X with 24 physical cores (6 of them isolated) and one Intel Xeon Platinum 8259CL (AWS t3.micro) to represent each family. Granted, it’s narrow, but beyond this point, the combinatorial explosion would be unwarranted.
In all cases, I use clang 18.0 (trunk) or GCC 12.1.
There were quite a few tradeoffs to analyze for this article:
type of test: baseline, switch, virtual, etc (see previous article)
static vs. shared
native vs. x86-64 (architecture)
LTO vs no-LTO
clang vs. gcc
PLT vs no-PLT
Intel Xeon vs AMD Zen
Semantic Interposition vs no semantic interposition
That would have been easy to run, but some options have collateral effects when used with other options. For example, I knew a priori that GCC does a very good job with link-time optimizations, but would that matter if I chose a simpler instruction set? Would all this change if I ran these tests on an Intel rather than on AMD? How would semantic interposition play with all the other options?
The solution to accommodate the combinatorial explosion was to rely on the graph below, which is the result for running all tests on the AMD platform. First, I start with the all-default case:
compiler is gcc
x86-64 architecture (vanilla)
Linking as a shared library
This is the first yellow line, which displays results as nanoseconds per call. This is the only line where results are displayed as time. The results will be a percentage delta from the reference run in all the other lines.
In the second line, we change a single option: we compile the library as static instead of shared. The numbers now show the relative delta against the shared-library numbers, so we are looking at how much performance changed when we changed this single option.
We return to the shared library in the third line but compile with the native machine architecture. The delta displayed on this line is relative to the line that differs from it by that single option, which is the first line.
We change no-lto to lto on the fourth line and use the first line as a reference. The fifth line uses the second line as a reference, and so on. In every line, the reference is a previous line that differs from the current one by a single option.
This way, we keep generating all possible combinations for a total of 32 cases.
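The figure in each non-reference line boils down to a relative change. A minimal sketch of that arithmetic follows, assuming the convention that a negative value means the changed option made the call faster; the function name and the sign convention are my assumptions, not something stated in the graphs.

```cpp
// Relative change of a run against its reference run, in percent.
// Assumed convention: negative means the changed option made calls faster.
double delta_percent(double reference_ns, double current_ns) {
    return (current_ns - reference_ns) / reference_ns * 100.0;
}
```

For instance, with made-up timings of 2.0 ns for the reference and 1.5 ns for the changed build, the line would show -25%.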
So let’s jump into the interpretation of the results:
Across the board, there is a significant improvement in execution time when static linkage is chosen versus shared.
The gains with static linkage are only meaningful and effective if combined with LTO, which makes sense since LTO can’t do much in shared linkage.
clang has a horrible showing versus gcc. It is a loss across the board except for the specific case of LTO in static linkage, in which case the gains are significant.
clang behaves worse than gcc when function pointers are heavily used.
there is only a marginal effect of all these options on virtual calls.
disabling PLT has a significant effect on improving performance in all cases.
compiling the code for advanced architectures versus vanilla x86-64 had very little to no impact on performance, sometimes making it worse.
Now, let’s see if all these conclusions hold when we move all the tests from the AMD machine to the Intel box.
Again, static linkage comes out ahead of shared linkage.
The comment on static linkage with LTO remains true. LTO only makes sense for static linkage - which is obvious and expected.
clang (versus gcc) is an even bigger loss, particularly for function pointers and virtual calls.
disabling PLT significantly boosted performance, except when clang was used, in which case it didn’t make much difference.
As for semantic interposition, we didn’t find any strong and significant effect of its use on the benchmarks at hand. The two graphs below, the top for the AMD and the bottom for Intel, show negligible impact on the tests. Unless I missed it, there is no pattern to make sense of here.
Throughout these conclusions, we must remember that all the results above do not represent an actual binary in production. They only test the performance of function calls, function pointer calls, and virtual calls. Those are typically a small percentage of the total time spent in a typical application.
The first big conclusion is that linking statically to your libraries has a significant impact on performance, particularly if combined with link-time optimizations (LTO).
The second takeaway is that disabling PLT (with -fno-plt) positively impacts performance in pretty much any combination with the remaining options.
The third conclusion is that Clang does a bad job overall. I am unsure if that’s in a tradeoff with other metrics we did not notice, but replacing gcc with clang led to a 50% increase in execution time on average.
Finally, I touched on this in a previous article. Still, it makes sense to repeat: unless you are using some particular algorithm known to benefit from SIMD-heavy computation, there is nothing to gain from using “-march=native” as opposed to “-march=x86-64”. According to Linus Torvalds, the new instruction sets are only marketing sales material.
I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on. - Linus Torvalds
So the recommendations from this article seem to be:
Prefer gcc whenever possible
Statically link your code and enable LTO
Disable PLT (and lazy loading)
Compile for x86-64 for general processing