Quite a while ago, I presented the Collatz conjecture and I was then interested in the graphical representation of the problem—and not really going anywhere with it.
In this entry, let us have a look at the implementation of the Collatz function.
Of course, all kind of people complained that I couldn’t compare a dynamic, interpreted language with static, compiled languages. First, let met tell you that I sure can. First, the goal was to measure speed, and not the effects of type system of the language (although logically correlated) nor the programming paradigm: the amount of CPU used to solve a given problem was the primary (if not only) point in interest.
Implementing data structures in a way that uses efficiently memory should always be on your mind. I do not mean going overboard and micro-optimizing memory allocation right down to the bit. I mean organize data structures in memory so that you can avoid pointers, so that you can use contiguous memory segments, etc. Normally, minimizing storage by avoiding extra pointers when possible will benefit your program in at least two ways.
First, the reduced memory requirement will make your data structure fit in cache more easily. Remember that if pointers are 4 bytes long in 32 bits programming, they are 8 bytes long in 64 bits environments. This yields better run time performance because you maximize your chances of having the data you need in cache.
Second, contiguous memory layouts also allow for efficient scans of data structures. For example, if you have a classical binary tree, implemented using nodes having each two pointers, you will have to use a tree traversal algorithm, possibly recursive, to enumerate the tree’s content. If you don’t really care about the order in which the nodes are visited, what’s quite cumbersome.
It turns out that for special classes of trees, complete trees, there is a contiguous, and quite simple, layout.
The cost of virtual functions is often invoked as a reason to C++’s poor performance compared to other languages, especially C. This is an enduring myth that, like most myths, have always bugged me. C++ myths are propagated by individuals that did not know C++ very well, tried it one weekend in 1996, used a bad compiler, knew nothing about optimization switches, and peremptorily declared C++ as fundamentally broken. Well, I must agree that C++ compilers in the mid-90s weren’t all that hot, but in the last fifteen years, a lot have been done. Compilers are now rather good at generating efficient C++ code.
However, the cost of calls, whether or not they are virtual, is not dominated by the the call itself (getting the address to jump to and jumping) but by everything else surrounding the call, like the stack setup and argument passing. Let us debunk that myth by looking at what types of calls are available in C and C++, how they translate to machine code, and see how faster or slower they are relative to each other.
Remember the old days where you had five or six “memory models” to choose from when compiling your C program? Memory models allowed you to chose from a mix of short (16 bits) and long (32 bits) offsets and pointers for data and code. The tiny model, if I recall correctly, made sure that everything—code, data and stack—would fit snugly in a single 16 bits segment.
With the advent of 64 bits computing on the x86 platform with the AMD64 instruction set, we find ourselves in a somewhat similar position. While the tiny memory model disappeared (phew!), we still have to chose between different compilation models although this time they do not support mixed offset/pointer sizes. The new models, such as LP32 or ILP64, specify what are the data sizes of int, long and pointers. Linux on AMD64 uses the LP64 model, where int is 32 bits, and long and pointers are 64 bits.
Using 64 bits pointers uses a bit more memory for the pointer itself, but it also opens new possibilities: more than 4GB allocated to a single process, the capability of using virtual memory mapping for files that exceed 4GB in size. 64 bits arithmetic also helps some applications, such as cryptographic software, to run twice as fast in some cases. The AMD64 mode doubles the number of SSE registers available enabling, potentially, significant performance enhancement into video codecs and other multimedia applications.
However one might ask himself what’s the impact of using LP64 for a typical C program. Is LP64 the best memory compilation model for AMD64? Will you get a speedup from replacing int (or int32_t) by long (or int64_t) in your code?