Are Function Pointers and Virtual Functions Really Slow?
There are so many false “truths” repeated everywhere that sometimes I think it’s my patriotic duty to dispel the misinformation. I am not sure where these myths originated - whether they were made up by lazy devs who never ran a single benchmark and wanted to look good to their peers, or whether they are 20-year-old benchmark results that reflected the hardware and software environment of the time and just stuck around in the collective mindset.
One of these crazy urban legends is that calling through “C-style” function pointers is very expensive and must be avoided at all costs. I have also seen claims that even old-style (pre-C++11) OOP constructs, like virtual function calls, are slow.
I have a real reason to get my facts straight: I am a consultant, and if I make a public mistake, it has a real impact on my reputation and, consequently, on my revenue. So before I write an article - and this is why I don’t publish one per day like many other bloggers out there - I take my time to think about the many ways I could be fooling myself and presenting the wrong conclusions.
Low Latency Trading Insights is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
OOP in C?
There are several ways to do Object-Oriented Programming in C. Yes, you read that right: we can do OOP in C. It’s not as built into the language as in C++ and Java, but you can nonetheless implement virtually any abstraction seen in C++ (single and multiple inheritance, methods, etc.) in C as well - it will just cost more lines of code and a somewhat awkward syntax.
The most interesting example is the Linux kernel. In the kernel source, include/linux/fs.h defines a struct that holds the file operations. In Unix, everything is a file, from a regular file in a filesystem to a socket, so each of these types needs to implement different access methods like read, write, seek, poll, etc.
This struct is filled with actual function pointers when a driver or module is implemented, as for the ext4 filesystem. In that way, structs in C can hold not only distinct data but also different behaviors, thus achieving true OOP polymorphism. In some sense, this C-style polymorphism is even more flexible than C++’s because it allows individual methods to be swapped out - at runtime!
As another example, the most popular desktop environment for Linux - GNOME - uses Gtk to manage the Graphical User Interface (GUI). Both are written in C, and Gtk follows its own object-oriented programming paradigm (GObject). Windows uses a similar pattern in its COM interfaces.
Alternatives to OOP dispatching
So, with all this extensive usage of function pointers, it begs the question - how do they compare to “modern” C++ techniques like templates and virtual function calls? Should we ever use function pointers in C++ programs? The collective understanding is that function pointers and virtual function calls are extremely slow and should never be used. Is that true?
Let’s see what happens at the assembly level when we use a function pointer, so we are clear about the consequences. The snippet below was written in Compiler Explorer, which turned ten years old last month.
In our snippet, we defined three functions that each take a single parameter and increment an external variable by the amount passed. On x86-64, with its rich addressing modes, the whole load/add/store is done with a single ADDL instruction. On ARM64, with its simpler load/store architecture, the variable’s address is loaded with ADRP, the current value is loaded with LDR, the value is incremented with ADD, and the result is stored with STR. Mind you, this is not necessarily slower than the single x86-64 instruction, as x86-64 instructions get decoded internally into micro-ops equivalent to the ARM ones.
For the doit function, on x86-64 the compiler stores the first parameter (fn, in %rdi) in the %rax register, loads the second argument (value, in %esi) into the first parameter-passing register (%edi), and then jumps to the address held in %rax. Notice that for x86-64 I used the GAS (AT&T) assembly display, which has left-to-right semantics, so “movq %rdi, %rax” means “move the contents of %rdi into %rax.” I do so because I don’t like the Intel display.
In the ARM64 assembly, the drill is about the same. The first parameter (x0) is temporarily saved in register x2, the second parameter (w1) is moved into the first parameter register (w0) for the subsequent call, the saved value in x2 is moved into register x16, and then the program branches (BR, Branch to Register) to the address held in x16. Note that the ARM64 display is (unfortunately) in Intel-style syntax, which I hate but can’t change. In that syntax the direction is backward, right to left, i.e., “mov x2, x0” means “move x0 into x2”.
Straight Function Calls
How is that different from a simple function call? Let’s write the equivalent code using a simple dispatching switch. The code is available on Godbolt.
We can see two main effects up front:
The called functions were inlined into the switch because they are in the same translation unit, which is good.
The size of the resulting assembly got much bigger. In particular, the switch statement was converted into a chain of three comparisons and their respective jumps, which is horrible from a performance perspective.
So, we see a tremendous advantage of function pointers when they replace switch dispatch. This difference gets even bigger when more functions are involved, although it is partially mitigated by the compiler using jump tables to avoid chained comparisons. However, jump tables incur the very penalty that function pointers would: loading an address from memory!
Virtual Function Calls
What would happen if we replaced this code with virtual function calls? The code is available on Godbolt.
First, let’s tackle the elephant in the room: no bodies for the virtual functions appear in the assembly because the compiler realized that, with these classes defined entirely inside this compilation unit, those implementations could never be reached from outside. Second, the dispatch function is considerably simpler than the respective version with function pointers. On the bad side, there is an extra memory reference, which can lead to a potential cache miss or, in rare cases, even trigger a full memory page load. That said, any memory reference can have these consequences.
Virtual function calls are arguably the most insanely vilified construct in the history of computer science. This vilification is largely unwarranted. Virtual function calls can take a few more cycles than a normal function call or a templated call, which might get inlined under the right conditions.
However, this is not an apples-to-apples comparison. We use virtual calls when we need to dispatch to a function unknown at compile time, or to a function/method defined in another compilation unit (cpp file). If we account for the time spent selecting the proper call, we will see in the benchmark section below that virtual function calls, function pointers, and regular direct/inlined calls are equivalent in speed. It is misguided to count the intrinsic cost of virtual function calls without considering how much work they save.
Because OOP and virtual inheritance are old features by now, compilers have grown mechanisms to mitigate the original performance penalties. In particular, compilers can now “devirtualize” function calls if they know up front which object will be called. Look at the example below:
Notice how the main() function got devirtualized and inlined as if the virtual function call never existed in the first place. If you are interested in learning more about devirtualization, there is a very good blog article about that by Piotr Padlewski.
But as always, I hate dwelling on speculation as my intuition has often proven wrong. Let’s test it and measure it.
I will use Google Benchmark, which you should know by now. The CMakeLists.txt below is pretty much all boilerplate: define the minimum version, set the C++ standard, name the project, enable C++ and ASM (the latter is actually not used in this particular project), and then find the Google Benchmark package.
Then, we ensure that no same-translation-unit inlining occurs in our code, which would unfairly skew results in special cases that are not representative of our test universe. Inlining could still occur through Link Time Optimization (LTO), so we explicitly avoid those flags.
We then put all the actual functions in a shared library, “func.” The executable with the benchmark code, “bm_fnpointer,” is created and linked against the func shared library. Note that I defined the variables as double to give the CPU extra work converting ints to doubles.
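A sketch of what such a CMakeLists.txt could look like, following the description above - this is an illustrative reconstruction, not the repository’s actual file:

```cmake
# Illustrative reconstruction; file and target names follow the text.
cmake_minimum_required(VERSION 3.16)
set(CMAKE_CXX_STANDARD 17)
project(bm_fnpointer LANGUAGES CXX ASM)   # ASM enabled but unused here
find_package(benchmark REQUIRED)          # Google Benchmark

# All benchmarked functions live in a shared library so they cannot be
# inlined into the benchmark translation unit (and we do not enable LTO).
add_library(func SHARED func.cpp)

add_executable(bm_fnpointer bm_fnpointer.cpp)
target_link_libraries(bm_fnpointer PRIVATE func benchmark::benchmark)
```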
The first test is the trivial baseline, where we directly call functions well known to the compiler in a loop. Because Google Benchmark does not use RDTSC for micro-benchmarking, I run a loop of 1,000,000 iterations inside which these functions are called sequentially.
The second case dispatches using a switch statement. Note that it is a bit unfair that we added a division inside the loop: integer divisions are notoriously expensive, more so than floating-point divisions. However, the denominator of this division is known at compile time, so it is optimized away and cheap.
If you want to know how that works: you take the number you want to divide, “a”, multiply it by a magic number, and keep the upper half of the double-width product (discarding the low bits), possibly with a final shift. This page from Hacker’s Delight can be used to compute these magic numbers. If you want a more descriptive, formal treatment, look at this paper from the gmplib team.
The third benchmark case uses a vector to store function pointers, which are called in a loop inside the benchmarked section.
Function Pointer in an Array
Just for peace of mind, to measure the impact of the vector, we built another test using std::array. The main difference is that the array’s size is known at compile time, so it would potentially lead to better loop unrolling.
Switch + Vector
It bothered me that the switch test was too biased by the inner division, so I built another one using a vector, where the division is computed outside.
Switch + Array
And to be 100% complete, I also built the same example using a std::array instead of a vector.
Virtual Function Call
I created three classes with a virtual abstract method “func” with functionality parallel to the one used in the function pointer cases.
Then, I created three objects and a factory function. Note that all this code lives in func.cpp, so it is not in the same compilation unit as the benchmark code. This makes it impossible for the compiler to devirtualize these calls in the benchmarks, so we are ensuring that in all these tests the virtual function calls are dispatched in the most expensive way.
Now I can call the factory function and populate my vector of object pointers.
Virtual Function Call with New Layer
I have seen many cases where simple one-level inheritance gets optimized away by the compiler, so I built a second virtual-function-call test with a second layer of classes, as follows.
I am going the extra mile here because I advocate for the (reasonable) use of virtual function calls and for dispelling their demonization. In this case, Func4 derives from Func1, which derives from Func, so we have two layers of inheritance.
Then, I can create a second factory function with all these classes mixed up:
The benchmark code is then very similar to the previous case.
For comparison, I ran these benchmarks on a relatively modern AMD Threadripper (Zen 2 architecture) at 3.8 GHz and on a cloud-based Intel Xeon at 2.5 GHz.
AMD ThreadRipper 3960X (arch=znver2)
Note that this machine is nominally rated at 4.5 GHz but runs at 3.8 GHz since Turbo is disabled. The AMD results are shown below.
First, we acknowledge that the baseline is the fastest of all benchmark cases, clocking in at 1.75 nanoseconds per call, equivalent to 6.6 cycles at 3.8 GHz (remember, we run a loop of 1,000,000 items). Note that those 1.75 nanoseconds include not only the call itself but the loop control as well. But the point is not to time each component precisely - we are more interested in the differences between the cases.
Next, the switch case is the slowest at 2.46 nanoseconds, 40% slower than the baseline. This is not due to the inner division; the other switch benchmarks (SwitchVector and SwitchArray), where the division was pre-computed, display essentially the same timing, between 2.37 and 2.46 nanoseconds. Converted to cycles, these numbers are close together, between 9.0 and 9.35.
Then, we get to the function pointers. The vector and array versions are about the same at 1.84 nanoseconds, only 0.09 nanoseconds slower than the baseline - at 3.8 GHz, just 0.34 cycles. The choice of vector or array did not impact the results much. This is equivalent to 7.0 cycles.
At last, the virtual function calls. They show the same timing of 1.84 nanoseconds per item, or 7.0 cycles, pretty much the same as the function pointers. Again, this is extremely close to the highly optimized baseline case, adding less than half a cycle of overhead.
Intel Xeon Platinum 8259CL (arch=skylake-avx512)
I ran the same tests on an Intel Xeon Platinum 8259CL (an Amazon AWS t3.micro instance) to ensure our results were not biased toward a single architecture.
The baseline test ran at 2.10 nanoseconds, equivalent to 5.25 cycles at 2.5 GHz. This is in the same ballpark as the 6.6 cycles on the AMD.
The Switch, SwitchVector, and SwitchArray cases were again the worst, at 2.94 to 3.60 nanoseconds. This is equivalent to 7.35 and 9.00 cycles, respectively, the upper end matching the 9.0 cycles we obtained on the AMD.
The function pointer cases came in at about 2.1 nanoseconds, which is 5.25 cycles. This was the only result where the value in cycles was substantially lower than the one obtained on the AMD, which was 7.0 cycles.
The virtual calls took 2.27 and 2.52 nanoseconds, indicating a slower time for the case with the deeper inheritance tree. I have observed progressively slower timings in more complex cases, although I was surprised that there was no such difference on the AMD. I need to dig into this further. The respective figures are 5.7 and 6.3 cycles, quite close to the 7.0 cycles on the AMD.
The most important conclusion is that there is not much penalty for using function pointers or virtual function calls compared to the most optimized case. Of course, there is the penalty of a pipeline break compared to the fully inlined case, but that is unavoidable for all the use cases we cover in this article.
The difference between the baseline and the function pointer benchmark was just about one cycle per call, which is negligible in the big picture. Therefore, the hysteria around demonizing function pointers is unwarranted - according to these tests.
The same applies to virtual function calls. They do not necessarily add much latency over the extremely optimized baseline case, and they provide ungodly help in breaking down code complexity. It is obtuse to throw this technique away because someone told you on Reddit that vIrTuAl FuNcTiOnS aRe BaD.
I am not proposing that function pointers and virtual function calls will always be as performant as direct calls. There are edge cases where I can see virtual calls behaving badly - cases with extremely deep inheritance trees and very large classes. However, even going the extra mile to try to trigger that effect, it did not quite show up in these tests. Perhaps I should extend this; I have done it in the past. A few years ago I chained thousands of virtual function calls and timed the result, and it came out at no more than five cycles per call. However, that is a degenerate case. The cases in this article are more representative of what people run into in real life.
My last thought is about how close the results in cycles are for two machines of different architectures and frequencies. This is why I strongly advocate measuring in cycles, not time. I covered this in a previous post below.
All the code is available on GitHub.