The True Cost of Calls

The cost of virtual functions is often invoked as a reason for C++’s supposedly poor performance compared to other languages, especially C. This is an enduring myth that, like most myths, has always bugged me. C++ myths are propagated by individuals who did not know C++ very well, tried it one weekend in 1996, used a bad compiler, knew nothing about optimization switches, and peremptorily declared C++ fundamentally broken. Well, I must agree that C++ compilers in the mid-90s weren’t all that hot, but in the last fifteen years, a lot has been done. Compilers are now rather good at generating efficient C++ code.

However, the cost of calls, whether or not they are virtual, is not dominated by the call itself (getting the address to jump to and jumping) but by everything else surrounding the call, such as the stack setup and argument passing. Let us debunk that myth by looking at what types of calls are available in C and C++, how they translate to machine code, and how much faster or slower they are relative to each other.

Function call methods. C and C++ offer the following call methods:

  • Direct. This is the normal function call, where the address of the function is known at compile time (or patched in at link time); in either case, the address is fixed in the code at run time. The call itself consists of stack setup (we will come back to this) and of a call to a constant address.
  • Indirect. This is usually known as a call by function pointer, where the address of the function is unknown at compile time but held in a variable that may be modified by the program at run time. Examples of these are callbacks, and object oriented programming, which is entirely possible (if painful) to do in C. Function pointers find their natural use in a number of design patterns such as visitor, observer, or just plain event handling.
  • Inline. inline is a compiler directive instructing the compiler to eliminate the function call by placing the function’s body at the point of call. By doing so, the compiler eliminates the overhead of setting up the stack and performing the call, but at the possible cost of code expansion. The compiler takes inline as a hint, as it applies heuristics to determine whether or not inlining the function will result in a performance gain. If the compiler estimates a positive gain, the function is inlined; if it estimates a negative gain, the function is called normally. This is indeed what the standard says the compiler should do (C++ ISO 14882:1998 § 7.1.2). In C and C++, inline also modifies the function’s storage semantics, preventing the function from having external storage (C99, ISO 9899:1999 § 6.7.4). Compilers remain free, however, to instantiate a hidden copy of the inline function in order to make normal calls to it.

    Basically, the compiler does what it wants with the keyword inline, treating it like register: a hint for optimization, and nothing more. However, compilers are rather smart when you use the proper optimization switches; they may well inline the function you asked for, and even other functions. Generally, a function with one or two arguments and just two or three lines of code will be inlined automagically.

  • Virtual. In object oriented programming, inheritance and polymorphism warrant the use of indirect function calls so that the correct function is called even though a given object is manipulated as a base type (through an is-a relation). Virtual functions ensure that the right function is called for the actual object type, regardless of the type through which it is manipulated, greatly simplifying code and helping to ensure correctness.

    Implementations may vary from language to language, but in C++, each object having virtual functions holds a hidden pointer to a table created for its class, the VMT, the virtual method table that holds the function pointers. Each time a virtual method (method is object oriented lingo for “member function”) is called on an object, the address of the function is retrieved from the table. This is clearly just a special case of indirect calls, except that the function pointers are hidden and the compiler generates all the necessary code and storage to manipulate them transparently. In particular, the compiler generates the code that sets the object’s VMT pointer when the constructor is called. A small sketch contrasting all four flavors follows this list.
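
To make the four flavors concrete, here is a minimal C++ sketch (the names add, shape, and square are invented for illustration) contrasting a direct call, an indirect call through a function pointer, an inline candidate, and a virtual call:

     #include <cstdio>

     int add(int a, int b) { return a + b; }               // direct call target
     inline int add_inline(int a, int b) { return a + b; } // inline candidate

     struct shape
     {
        virtual ~shape() { }
        virtual int area() const = 0;          // virtual call target
     };

     struct square: public shape
     {
        int side;
        square(int s): side(s) { }
        int area() const { return side*side; } // overrides shape::area
     };

     int main()
     {
        int (*fptr)(int,int) = add;   // function pointer, modifiable at run-time

        int w = add(1,2);             // direct: fixed address in the code
        int x = fptr(1,2);            // indirect: address fetched from fptr
        int y = add_inline(1,2);      // inline: call likely eliminated
        square sq(3);
        shape * s = &sq;
        int z = s->area();            // virtual: address fetched through the VMT

        std::printf("%d %d %d %d\n", w, x, y, z);
        return 0;
     }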

The mechanics of function calls. Although the particulars may vary somewhat from processor to processor, and from compiler to compiler, calls are performed more or less the same way because C and C++ enforce a particular type of stack management. In C and C++, stack management is split between the caller (the one that performs the call) and the callee (the function that is called).

The caller pushes the arguments on the stack in reverse order, then performs the call. The call instruction pushes the current instruction address on the stack, and this address will become the return address, where the code is to resume execution when the function call returns. Upon return, it is the caller’s task to clean up the stack by popping (unstacking) the arguments.

The callee sets up its own stack environment, ensuring that the stack is restored when it exits, so that the return address is found at the top of the stack, allowing the code to resume normally, returning to the caller. The callee will only modify the stack in its own frame, using it to store its local variables.

Stack manipulation is supported by a set of specialized instructions which vary depending on the processor, but they basically offer push, pop, and clean-up (a massive pop that simply discards pushed data). The call instruction is basically a push-jump instruction, where the current address (or rather, the address of the next instruction) is pushed on the stack before jumping to the specified address. The return instruction is therefore a pop-jump instruction, where the target address is retrieved from the top of the stack (popped) before jumping to the retrieved address.

So, to perform a call, the caller evaluates the arguments and pushes the resulting values in reverse order on the stack before calling the function. The function creates its own local storage, accesses the arguments, cleans up its local storage, and returns. The caller then cleans up the arguments from the stack, and program execution resumes.
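
As a sketch (register names and frame sizes are illustrative, and the exact code depends on compiler, platform, and calling convention), a call such as f(a,b) under the classic x86 cdecl convention looks like:

     // on the caller's side:
     push b            // arguments pushed in reverse order
     push a
     call f            // pushes the return address, jumps to f
     add esp,8         // caller discards the two arguments

     // on the callee's side:
     f:
     push ebp          // save the caller's frame pointer
     mov ebp,esp       // set up this call's own frame
     sub esp,16        // reserve room for local variables
     ...               // body: arguments found at [ebp+8] and [ebp+12]
     mov esp,ebp       // discard the locals, restoring the stack
     pop ebp
     ret               // pops the return address, jumps back to the caller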

You may ask yourself why the caller must push the arguments in reverse order. Well, the reason is that C (and C++) allows the use of the ellipsis, the feared and troublesome ... that one finds in functions such as printf, and that is used to pass an indefinite number of arguments to a function. The prototype of printf is:

int printf(const char *format, ...);

Only the position of the first argument must be known. Pushing it last means that it is at a known position, that is, at the top of the stack (just under the return address).
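
A minimal sketch of why this matters (sum_ints is an invented name): a variadic function can only locate its unnamed arguments relative to the last named one, so that named argument must sit at a known, fixed position.

     #include <stdarg.h>

     // sums 'count' ints passed through the ellipsis
     int sum_ints(int count, ...)
     {
        va_list args;
        va_start(args, count);        // start just past the last named argument
        int sum = 0;
        for (int i = 0; i < count; i++)
           sum += va_arg(args, int);  // fetch the next unnamed argument
        va_end(args);
        return sum;                   // sum_ints(3, 10, 20, 30) yields 60
     }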

The cost of calls. So, clearly, the cost of a call, by itself, is not much:

  • Direct. If the address is known at compile time, the call is merely a push and a direct jump to a known address embedded in the instruction. The penalty incurred is processor-specific: whether it breaks address prediction, flushes the pipeline, jumps to an address for which the code is not in cache, etc., but those penalties apply to all types of calls.
  • Indirect. If the call is indirect, we add the extra cost of fetching the address from a memory location. Clearly, using a simple function pointer is not going to cost much more than adding a single read from memory. Typically, a function call such as
     function_ptr(arg1,arg2);
    

    will translate into this x86 code:

     push arg2
     push arg1
     call function_ptr // loads the value from
                       // memory location
    

    where function_ptr may expand to a complex address generation operand, like ds:[ebx+some_offset], but that is not very costly as processors are optimized to handle those efficiently. Of course, if you get the idea of doing something like:

     function_ptr[thingie->method[arg3]](arg1,arg2);
    

    the cost increases dramatically, as a lot of operations are needed to get the destination address. This may be useful, but you can’t complain about the performance loss.

  • Inline. Inlined functions have no call cost. They may perturb the optimization of your code and have an adverse effect, but the call code itself is removed. However, that does not dispense one from evaluating the arguments, which remains as costly as ever. Also, the compiler may decide not to inline the function at all, in which case it behaves exactly like a normal direct call.
  • Virtual Functions. Virtual functions are just indirect calls, except that the pointer is stored in a table for the class rather than in some other location in memory. The following:
     an_object->virtual_function(arg1,arg2);
    

    will compile to something similar to:

     push arg2 
     push arg1
     // esi almost always contains THIS already
     mov edx,[esi+vmt_offset] // loads the vmt pointer (*)
     call [edx+method_offset*sizeof_ptr] // calls from memory location
    

    where vmt_offset is a constant, determined at compile time. The instructions may even be optimized further using the fact that the this pointer is already in a register, or using more complex instructions such as lea (load effective address, a “do it all” address calculation instruction provided by x86 CPUs). Moreover, the line marked (*) may be simplified if the VMT pointer is stored at the beginning of the object, as it is with g++. A C-style emulation of this mechanism follows the list.
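
To see that this really is nothing more than tabled function pointers, here is a hedged C-style emulation of what the compiler generates behind the scenes (the names are invented, and real layouts are compiler- and ABI-specific):

     struct animal;

     struct animal_vmt   // one table per class, shared by all instances
     {
        int (*speak)(struct animal * self);
     };

     struct animal
     {
        const struct animal_vmt * vmt;   // the hidden per-object pointer
        int volume;
     };

     int dog_speak(struct animal * self) { return 2*self->volume; }

     const struct animal_vmt dog_vmt = { dog_speak };

     // plays the role of the compiler-generated constructor code
     void dog_init(struct animal * a, int volume)
     {
        a->vmt = &dog_vmt;
        a->volume = volume;
     }

     // a "virtual call": fetch the table, fetch the pointer, call indirectly:
     //
     //    int loudness = a->vmt->speak(a);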

But the cost of a call, whatever the flavor, is dominated by the evaluation of its arguments. We saw that indirect calls, whether C-style or C++ virtual methods, are inherently inexpensive. A call to a virtual method is not any more expensive than an indirect call through a struct member (something->function(arg1,arg2)), so deeming virtual functions incredibly slow is just misinformed.
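
To illustrate (f, thingie, and get_width are invented names), compare the two calls below: the call mechanics are identical, but the second one must perform a member call, a multiplication, and an addition before anything is even pushed on the stack.

     f(0);                          // argument evaluation is trivial
     f(y*thingie->get_width()+x);   // argument evaluation dominates the call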

However, care must be exercised when comparing performance between languages. One must understand exactly what the programs in the different languages do, and how the semantics (the subtle, tenuous differences) impact program performance. Of course, if a minimal translation of a program from language A to language B doesn’t do the same thing, any conclusions drawn on performance are moot.

Further readings. Read more on the stack data structure, and on the call stack. Read more about inline expansion. Read more on object oriented programming. Read how to do Object Oriented Programming in ANSI C.

19 Responses to The True Cost of Calls

  1. Paul de Vrieze says:

    Actually, virtual method tables are only created per class, not per object. They are created by the compiler (in the gcc case in the translation unit containing the implementation of the first non-inline virtual method).

  2. Steven Pigeon says:

    After verification, you are right about g++. I have updated the text to reflect that. Thanks for pointing that out.

    It seems that the VMT pointer is also right at the start of the object (instance) so the call code may be optimized a bit further, to something like

     push arg2
     push arg1
     // assuming esi is already the this pointer
     mov edx,[esi] // to keep esi
     call [edx+method_offset*sizeof_ptr]
    
  3. Paranoid says:

    Alas, you are mistaken when it comes to the cost of indirect calls. Stack manipulation is one of the most heavily optimized operations on modern CPUs. (the instruction decoder rewrites stack manipulation to something very efficient internally)

    Familiarize yourself with modern CPU design (pipelined, out-of-order, superscalar execution) and especially branch prediction. (more than you have so far)

    http://en.wikipedia.org/wiki/Pipeline_(computing)

    The assumption that an instruction always takes n cycles is ancient history.

    The cost of virtual calls is dominated by the indirect branch instruction, which on pipelined, non-branch-predicted CPUs takes about [the length of the pipeline] cycles. (a huge simplification: Pentium 4 pipeline = 20 stages, PIII/Core pipeline ~ 10-12 stages). The processor has to wait until the branch address is calculated before it can branch, and the instruction that calculates the branch address will take [~the length of the pipeline] cycles to complete.

    If the virtual call always branches to the same location, modern CPUs have a cache called the BTB (branch target buffer) which records the destination address of branches. When the branch address is in the BTB, the CPU speculatively executes instructions from the branch target and does not wait for the branch address calculation to complete.

    http://en.wikipedia.org/wiki/Branch_target_predictor (lacking, unfortunately Wikipedia is not a great resource for CPU design)

    On Core 2 processors, in the best case direct branches take 0 cycles (CPU designers are clever), making them effectively free, and direct call instructions are heavily optimized also (using return address caches etc.).

    Indirect branches though, take [cost of BTB lookup/administrative overhead] cycles in the best case, and [~length of pipeline] cycles in the worst case, but out-of-order execution might erase the penalty, if your code allows it.

    So, to conclude: Virtual calls are almost as fast (slow) as indirect calls, but indirect calls in general *can* be a lot slower than direct calls. (in programming language parlance: polymorphic (when the class of the call target changes) virtual calls = slow, monomorphic = ok). They also can be almost as fast as normal calls, given the right circumstances.

    Also, the Java VM optimizes virtual calls to direct calls based on runtime information when possible, while C++ does not.

    And never write a performance post without a benchmark.

  4. Steven Pigeon says:

    Using:

    class A { public: virtual int get_a() const { return 3; } };
    class B: public A { public: virtual int get_a() const { return 4; } };
    int simple1(void *z) { return (long)z+3; } 
    int simple2(void *z) { return (long)z+4; } 
    

    Here are benchmarks, for 10 000 000 calls (nothing fits in cache), compiling with g++ -O3 -fno-inline:

    (no-inline) direct : elapsed: 86.376 ms
    (simple) indirect : elapsed: 82.838 ms
    (double) indirect : elapsed: 97.046 ms
    (triple) indirect : elapsed: 142.162 ms
    (pointer to) virtual : elapsed: 120.323 ms

    To make sure the comparison is fair (a virtual method is passed the this pointer even though it is not used) a pointer is passed to the simple C-style functions. The objects have been created using placement new to make sure they are contiguous in memory just as are the other arrays.

    Between the triple indirect call (the equivalent of a pointer to the object that points to the VMT that points to the function) and pointers to objects using virtual functions, there’s a significant difference. It is due to the compiler collapsing the memory accesses, because the function pointer is at a known offset in the VMT (in this case, 0), and pushing the this pointer, placing the cost of calling a virtual function somewhere between a double and a triple indirection.

    (btw, g++ does use direct calls when a virtual is called on a terminal class (one that has no descendants). I’m not sure how useful an optimization it is in general, since we tend to use classes higher in the hierarchy.)

    * * *

    Of course, you’re right about the cost of pipeline stalls due to mispredictions. I was not assuming anything like an instruction takes “n cycles” to complete. I know about superscalar/superpipeline execution, it is just that in practice, it is very hard to get any accurate timing, especially on a multi-task OS, where there’s always stuff going on: other processes, interrupts.

    Moreover, execution characteristics change dramatically between processors. Not only are there the numerous Intel processors, there are also AMD (I run an aging AMD 4000+), and other brands/ISAs as well, all with different branch prediction algorithms, cache sizes, etc. Arguments about branch prediction and out-of-order execution are moot on a shallow-pipelined, in-order processor like a micro-controller (or older Intel CPUs). In that case, execution would be (save for interrupts) deterministic.

  5. Paranoid says:

    Did a benchmark myself (benchmark in one object file, implementation functions in another object file to ensure no inlining)

    On a Core 2 system (executing 10^9 calls), compilation options -O3, virtual calls are polymorphic:

    C calling convention:

    time ./direct
    4.74user 0.00system 0:04.75elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+197minor)pagefaults 0swaps
    time ./virtual
    6.75user 0.00system 0:06.77elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+197minor)pagefaults 0swaps
    

    __attribute__((regparm(3))) calling convention (parameters passed in registers):

    time ./direct
    4.18user 0.00system 0:04.18elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+196minor)pagefaults 0swaps
    time ./virtual
    5.09user 0.00system 0:05.10elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+196minor)pagefaults 0swaps
    

    Signatures of functions:

    API int direct_call1(int i);
    API int direct_call2(int i);
    
    class Base {
    public:
        API virtual int call(int i) = 0;
    };
    
    class Concrete1 : public Base  {
    public:
        API virtual int call(int i);
    };
    
    class Concrete2 : public Base {
    public:
        API virtual int call(int i);
    };
    

    What to conclude? Passing parameters in registers helps, but not very much.
    Benchmark is dominated by calling and loop overhead.
    Virtual functions have 25-40% overhead compared to normal functions on modern architectures (Core 2, I assume we’d see similar results on K8, Opteron).

    Strangely, I can duplicate your result of monomorphic indirect calls being faster than direct calls!

    Shows how complex and variable modern CPU architecture really is.

    • Tom says:

      Allow me to say that, and please excuse my bloomy language, this benchmark is utter bullshit.

      The cost of virtual function calls comes (roughly in that order) from:
      – double TLB and double cache misses (upwards of 1500 clocks in the very worst case)
      – single TLB and/or cache misses (few hundred clocks)
      – the call being a call, instead of inline (a dozen or two clocks)
      – badly predicted branch harming out of order execution (a dozen clocks or less)

      If you perform a benchmark with 10^9 virtual function calls in a loop, you are *guaranteed* that both the vtable and the code are in the TLB and the level-1 cache (except for the very first call). Also, all recent CPUs (during the last 5-8 years) have some notion of “branch prediction cache” which can predict a virtual function call in the same way as a non-virtual one, if the last (or one of the last 2-3, on some) call went the same way. Again, in your benchmark, this is pretty much guaranteed to be the case.

      In a real application where there’s more than two subclasses, more than a dozen virtual addresses involved, and more than a few hundred cache lines, none of this holds true.
      You will certainly have /some/ cache hits (otherwise cache would be useless), but it will not even be remotely similar to the conditions in your benchmark. Mind you, the TLB on Core 2 is 16 entries(!), and the level-1 cache is 2x32kB (data and code, both of which are used in a virtual call). Neither of these is an awfully big number for a real, non-trivial program.
      Therefore, your benchmark has exactly no bearing.

      • kiran says:

        Re-read Paranoid’s first post. He is claiming exactly what you are arguing. Compare the “C calling convention” results to the “__attribute__((regparm(3)))” results, NOT the direct results to the virtual results.

        The claim is that argument evaluation from the stack is only slightly slower than passing arguments directly through registers. The dominating factor is whether there is indirection involved.

        Unfortunately we don’t know how the testbed is set up so we can’t make any conclusions about cache/TLB hits. I would love to see results where the working set clearly oversaturates the cache… or possibly results for varying working set sizes. Also it would be prudent to prime the cache into some known state before timing (it looks like the timing is done cold in the examples).

        • Steven Pigeon says:

          With trivial argument lists, the cost will of course be negligible. What I meant is not pushing vs. passing by register, but the time to actually evaluate the values to push or pass by register. One would expect f(0) to take negligibly more time if the argument is pushed onto the stack; but the call f(y*thingie->getwidth()+x) will be dominated by the cost of evaluating the expression before the push or pass by register.

      • Gary says:

        Dudes, you’re splitting hairs… the bottom line (as you’ve already mentioned in a previous post) is that it seriously depends on what’s going on in the machine at the time. Yours and Paranoid’s arguments are subjectively right. So to be fair, from opposite perspectives you can both be perceived as missing the point. From where I’m sitting I can see that neither of you are wrong in terms of the arguments you’re framing.

  6. Paranoid says:

    It gets more fun, when I specify -march=native (optimize for Core 2), I get this:

    With register calling convention:

    time ./direct
    3.74user 0.00system 0:03.75elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+196minor)pagefaults 0swaps
    time ./virtual
    12.89user 0.00system 0:12.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+197minor)pagefaults 0swaps
    

    With C calling convention:

    time ./direct
    4.31user 0.00system 0:04.34elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+196minor)pagefaults 0swaps
    time ./virtual
    7.63user 0.00system 0:07.63elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+197minor)pagefaults 0swaps
    
  7. I used -m64 on my box and it gives similar results to the first test.

    In my case, the direct call is probably a tiny bit slower because I randomized the call sequence (and used the same randomization in all tests: exactly the same “random” order for each series of tests).

      start=now();
      for (int i=0;i<a_lot;i++)
       sum+=choices[i] ? simple1(0) : simple2(0);
      stop=now();
    

    That’s where, I think, the tiny slowdown comes from.

  8. [...] myself wondering: how much is it costing me to perform all these vtable lookups and indirect calls? The usual truism is that computers are so fast now that it doesn’t matter and that the idea of virtuals being a problem is just another [...]

  9. I was also wondering about the cost of vtable calls relative to switch statements and jump-table style calls so I performed an experiment too. I used C because vtable lookup was only one style of dynamic call I wanted to examine. I don’t think it’s in the post but I benchmarked C++ virtual calls and they performed exactly the same as vtable calls in C (duh, I guess).

    http://samdanielson.com/2009/5/4/comparing-schemes-of-dynamic-dispatch-in-c

    Direct calls are consistently fast so I prefer them. In tight loops where virtual call overhead matters it is often possible to lift the virtual call out of the loop by splitting the loop into several, one for each type. Small switch statements can be faster than vtables but big switches are slow if they cannot be made into a jump table. It’s still best to lift types out of a tight loop if possible.

    IMHO, C++ has the most flexible and generally fastest style of dynamic dispatch because of the vtable (maybe not always the best choice for a C design though). In a good C++ design the base class with virtual functions will only be used where polymorphism is needed. Everywhere else the compiler will generate direct calls. If the vtable overhead turns out to be a burden one can always create vectors of pointers to objects of the derived types and loop over each vector with a direct call. If the indirection can’t be lifted out of a loop then it is probably genuinely needed anyway.

    My two cents.

  10. [...] components. While doing some research, I fell up on the CRTP wiki page while researching the enduring battle between the performance of polymorphic code and the complexities of modern CPU [...]

  11. Ken says:

    How come neither the original article nor a single response to it mentions any of the costs associated with allocation from heap storage for object oriented message passing? These are essentially additional “hidden” calls to obtain memory blocks that persist beyond the call and may not be freed for some time. The efficiency of the particular heap allocation subroutine determines a significant proportion of the true cost of calls involving message passing.

    • Steven Pigeon says:

      Probably because “message passing” in C++ doesn’t actually use a “message”. When an object’s method is called, it is not a message (in the obvious, traditional meaning of the word) that is sent; it is just a function that is called with one hidden parameter, the pointer to the object on which it is called, and one or many other parameters that define what you call the “message”.

      Very often, for performance reasons, if the “message” is bulky, it will be offered to the object by reference, that is, through a method of the form int A::thingie(const gizmo & this_gizmo), reducing the message-passing to pointer-passing. Removing the &, you’d get a copy constructor call for the parameter each time, but that is usually and easily avoided by keeping the reference & in there.

  12. [...] The C call is basically always faster than the other two. I could write a whole blog post (easily) about the speed of these three. It gets complicated, but the simple version is that the C++ and Obj-C calls are often comparable in speed (reference), and the C call could be anywhere from 20% faster to twice as fast (reference, see the comments). [...]
