The LP64 model and the AMD64 instruction set

Remember the old days when you had five or six “memory models” to choose from when compiling your C program? Memory models allowed you to choose from a mix of short (16-bit) and long (32-bit) offsets and pointers for data and code. The tiny model, if I recall correctly, made sure that everything (code, data and stack) would fit snugly in a single 16-bit segment.

With the advent of 64-bit computing on the x86 platform with the AMD64 instruction set, we find ourselves in a somewhat similar position. While the tiny memory model disappeared (phew!), we still have to choose between different compilation models, although this time they do not mix offset and pointer sizes. The new models, such as LP64, LLP64, or ILP64, specify the sizes of int, long, and pointers. Linux on AMD64 uses the LP64 model, where int is 32 bits, and long and pointers are 64 bits.

Using 64-bit pointers consumes a bit more memory for the pointer itself, but it also opens new possibilities: more than 4 GB allocated to a single process, and the capability of memory-mapping files that exceed 4 GB in size. 64-bit arithmetic also helps some applications, such as cryptographic software, run twice as fast in some cases. AMD64 mode also doubles the number of SSE registers available, potentially enabling significant performance improvements in video codecs and other multimedia applications.

However, one might ask what the impact of using LP64 is on a typical C program. Is LP64 the best compilation model for AMD64? Will you get a speedup from replacing int (or int32_t) with long (or int64_t) in your code?

Let us examine a simple example. What code does g++ generate in AMD64 mode for a simple function like the following?

int_t function(int_t u)
{
  int_t x = 0;
  do x++; while (u >>= 1);
  return x;
}

I used g++ -O3 -m64 -c function.cpp to generate the object code. With objdump --disassemble function.o, you can inspect the code generated by the compiler, including the opcode bytes. If, in the above C/C++ code, int_t is defined as int64_t, we get the following code:

0000:  31 c0                  xor    %eax,%eax
0002:  48 83 c0 01            add    $0x1,%rax
0006:  48 d1 ff               sar    %rdi
0009:  75 f7                  jne    2 <_Z8functionl+0x2>
000b:  f3 c3                  repz   retq

The opcode bytes with leading byte 48 on lines 2 and 3 are REX.W prefixes, which instruct the CPU to interpret the following instruction’s operands as 64-bit values. This is quite surprising if you expect the operand size to already be 64 bits in 64-bit mode.

Repeating the exercise but now defining int_t as int32_t, we obtain the following code:

0000:  31 c0                  xor    %eax,%eax
0002:  83 c0 01               add    $0x1,%eax
0005:  d1 ff                  sar    %edi
0007:  75 f9                  jne    2 <_Z8functioni+0x2>
0009:  f3 c3                  repz   retq

So, despite the processor running in 64-bit mode, the default operand size for instructions is 32 bits. The second version of the code is two bytes shorter, which may not seem like much, but the two missing prefixes do have an important cumulative impact on code speed. First, the code being smaller, albeit not tremendously so, more of it fits in the instruction caches. Second, the prefixes have a cost at decode and execution time.

Another non-negligible cost of 64-bit computing is its impact on memory bandwidth. Using 64-bit integers basically means using twice as much memory whenever an integer is stored. While this has a moderate impact when the value lives in a register (the instruction may or may not need the REX.W prefix, byte 48), it can prove quite a problem when large amounts of data are read from and written to memory, whether in cache or in main memory.

Going back to my old sorting algorithm suite (for which you can find a tutorial here, but in French), I compared performance using first int32_t, then int64_t, as the type of the values to sort.

Using int64_t for the value type, I observed a significant performance degradation. In 64-bit mode with 32-bit integers, the suite runs the series of tests in about 9.2 s on average, while with 64-bit integers, performance drops noticeably to an average of 9.7 s. In this first test, I used arrays of 10,000 items, which fit in cache memory. Boosting the array sizes to 1,000,000 items, the arrays clearly no longer fit in cache, and the performance difference is even more striking: the cost of using 64-bit integers grows from a mere 5% to roughly 17%, as the run time went from 1m23s to 1m37s! 1

So what is the moral of the story? Using int64_t carelessly 2 incurs a performance penalty on AMD64 processors. Not only does it use more memory, potentially causing further delays due to cache thrashing, it also makes instructions longer, and therefore the code larger (more thrashing, although probably not very severe) and somewhat slower. That’s a bummer because it mitigates the boons of 64-bit computing in a way I didn’t expect, at least on the x86 platform; I surmise that other, saner architectures, like MIPS or the Power Architecture, deal much better with 64-bit instructions than the AMD64 instruction set does.

To make sure you’re using a smart data type for your application, use the platform-safe basic type definitions from C99’s <stdint.h> and from <stddef.h>. In particular, <stdint.h> provides definitions such as int_fastN_t, where N is 8, 16, 32, or 64. These definitions provide the fastest platform-dependent integer types with at least the requested number of bits. You may get more, but never less. These types, while they may appear cumbersome to use, are a great way to write portable yet fast software.

1 I also removed the bubble-sort-like algorithms, because otherwise it’s likely I’d still be waiting for the results.

2 Some software does get a massive speed-up from 64-bit arithmetic. For example, OpenSSH encryption/decryption is twice as fast.

