Yet More __builtins

Last week we saw how to use some of GCC’s built-ins. This week, let’s look at how we can create our own when needed, say because you need access to some instruction and GCC does not offer the corresponding built-in.


To do so, we’ll use a bit of the C preprocessor and GCC’s inline assembly extension.

First, we cannot really make real built-ins, just simulate some. We can either wrap them in an inline function (which works correctly with more recent C and C++ compilers), or we can use a macro. Let’s say we want to use macros to wrap asm statements.

The asm statement is a GCC extension that allows inserting assembly language code in C or C++ and provides an interface to the language, making it easier to use C/C++ variables and values in assembly. However, it’s only really convenient if the assembly code is small and interacts simply with the rest of the code. If the assembly code is large and complex, one should favor putting it in a .s file, assembling it separately, and then linking it with the main C or C++ program. That being said, there are a number of good introductions to GCC’s asm extension out there.

So basically, an asm clause introduces a few instructions and we have “constraints” that help interface C or C++ with assembly language. In the following example:

#define __read_eax(x) asm("movl %%eax,%0":"=m"(x):"r"(x):)

We use the preprocessor as a wrapper and as a means to provide a useful name. In this case, the __read_eax(x) intrinsic does just what its name says: it reads the contents of the eax register and writes it to the variable x. The three trailing fields, separated by :, are particularly interesting. The first one declares the outputs. Here, we declare x as “=m”, a memory location where the result is written. The second field declares the inputs: here it asks that the value of x be put in some register (we don’t care which one), although this particular instruction does not actually use it. Lastly, the third one lists the registers that would be clobbered (or overwritten) by our code. It is empty here because the instruction only reads eax; the only other register involved is the one the compiler itself picks for the input.
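To make the three fields concrete, here is a minimal sketch of a second macro that fills in all of them. The name my_add_u32 is hypothetical (it is not a GCC built-in), the spelling __asm__ is used because GCC accepts it even in strict ISO modes, and the plain C branch is only there so the example still compiles on non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical macro (not a GCC built-in): acc += val with one x86 addl.
   acc must be a 32-bit lvalue, val a 32-bit value. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define my_add_u32(acc, val)                                               \
    __asm__("addl %1, %0"                                                  \
            : "+r"(acc) /* output: read and written, kept in a register */ \
            : "r"(val)  /* input: any general-purpose register */          \
            : "cc")     /* clobber: addl modifies the flags */
#else
#define my_add_u32(acc, val) ((acc) += (val)) /* plain C fallback */
#endif
```

Used as my_add_u32(total, 2); the compiler is free to choose both registers, and the "cc" clobber tells it not to rely on the flags surviving the statement.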

The syntax is unwieldy, inelegant, and not very intuitive. However, when used correctly, it allows the compiler to remap registers as it wishes and to blend the code with its surroundings, that is, to apply as many code generation optimizations as possible. This would be impossible if we used the classic function-call approach where we have a separate file containing the assembly code, hidden in a separate object linked later on with the main code (because it would use explicit register assignments, extra code to save registers one needs to preserve, etc.).
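The inline-function alternative mentioned earlier could look like the sketch below. The wrapper name my_bswap32 is hypothetical, and GCC already offers __builtin_bswap32 for this job; the bswap instruction is used here only because its effect is easy to check. The shift-based branch is a fallback for non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical wrapper: reverse the byte order of a 32-bit value. */
static inline uint32_t my_bswap32(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r": x is both read and written, in a register the compiler picks. */
    __asm__("bswap %0" : "+r"(x));
#else
    /* Portable fallback: move each byte to its mirrored position. */
    x = (x >> 24) | ((x >> 8) & 0x0000ff00u)
      | ((x << 8) & 0x00ff0000u) | (x << 24);
#endif
    return x;
}
```

Because the function is static inline and no register is named explicitly, the compiler can inline the single instruction at the call site and keep x wherever it already lives, which is exactly the freedom a separately assembled function would take away.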

*
* *

OK, let’s now consider a not completely trivial case. Let’s say we want to read the timestamp counter. The timestamp counter is a register that increments by 1 every clock cycle. If your processor is a 3.6GHz i7, then the counter increments by 3.6 billion every second. It’s very convenient for measuring time quite precisely.
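The arithmetic behind that figure can be sketched as a small helper that converts a difference of two counter readings into seconds. The function name and the tsc_hz parameter are assumptions for illustration: the frequency is supplied by the caller (3.6e9 for the example above) and the result is only as good as the assumption that the counter really ticks at that rate.

```c
#include <stdint.h>

/* Convert the difference between two timestamp-counter readings to seconds.
   tsc_hz is an assumption supplied by the caller (e.g. 3.6e9 for a 3.6GHz
   part); nothing here reads the frequency from the hardware. */
static inline double cycles_to_seconds(uint64_t start, uint64_t end,
                                       double tsc_hz)
{
    return (double)(end - start) / tsc_hz;
}
```

With the 3.6GHz example, a difference of 3,600,000,000 ticks corresponds to one second and 1,800,000,000 ticks to half a second.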

The rdtsc instruction is a 32-bit era instruction that puts a 64-bit counter in the register pair edx:eax (or in rdx:rax, with the upper halves set to zero, in 64-bit mode). Together, these registers hold a single 64-bit value: the high 32 bits in edx and the low 32 bits in eax. Of course, we do not have access to the timestamp counter directly from C, and we will have to hack our own intrinsic function to get it.

I propose:

#include <stdint.h> /* for uint64_t */

#define __rdtsc(x)                                \
 do                                               \
  {                                               \
   uint64_t lo,hi;                                \
   /* volatile: each use must re-read the TSC */  \
   asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));   \
   x=(hi<<32)|lo;                                 \
  } while (0)

The first thing you might have noticed is the do {...} while (0) construct around the code. It turns the macro body into a single statement, so the macro obeys pretty much every syntactic rule in C and C++ when it is used like a function call followed by ; (inside an unbraced if/else, for instance). The second is the list of outputs. Here, we have “=a” (which indicates eax as the register, or rax if you’re in 64-bit mode) written to lo, and “=d” (corresponding to edx or rdx) written to hi. In 32-bit mode, we should be able to write "=A"(x), which designates the combination edx:eax to be written in x, but that simply does not work in 64-bit mode. (This is documented behavior rather than a compiler bug: in 64-bit mode a 64-bit value fits in a single register, so "=A" lets the compiler put x in either rax or rdx instead of assembling the two 32-bit halves side by side, which makes no sense here for us.)
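The same read can also be wrapped in an inline function rather than a macro. The sketch below is one possible shape, not the only one: the names join_halves and my_read_tsc are hypothetical, the halves are declared as 32-bit values so that “=a” and “=d” mean eax and edx in both modes, and volatile is added so the compiler does not merge or drop repeated reads. The non-x86 branch returns a placeholder so the file still compiles elsewhere.

```c
#include <stdint.h>

/* Join the two 32-bit halves delivered by rdtsc (edx = high, eax = low). */
static inline uint64_t join_halves(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical wrapper: read the timestamp counter. */
static inline uint64_t my_read_tsc(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    /* volatile: every call must execute rdtsc again. */
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return join_halves(hi, lo);
#else
    return 0; /* placeholder: no timestamp counter on this target */
#endif
}
```

A typical use reads the counter before and after the code being timed and subtracts the two values; a function also avoids the macro's risk of clashing with a caller's own variables named lo or hi.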

*
* *

Now we know how to write our own intrinsics that let the compiler perform its various optimizations. However, a word of caution: use sparingly. Sure, they give performance or access to specific instructions when needed, but they will ultimately prove hard to maintain because the next programmer might not be familiar with assembly language and the intricacies of GCC. Assembly language is indeed a rare skill in this time of Java, Python, and C#.

Also, if your code is becoming a soup of intrinsics, you may consider writing larger assembly language functions in an external file. The compiler is very good, but probably not as clever as a good programmer at performing large-scale optimizations. The compiler won’t figure out how to reorder your data to use vectorization: you will. Also, the compiler obeys rules from C and C++ that prevent it from doing certain optimizations. For example, it will not change the order of evaluation around the comma operator (,) because the standard forbids it. You can.

*
* *

Lastly, the other problem I see with the GCC asm extension is that it tries maybe a bit too hard to be architecture-independent and, as such, isn’t very good at mixing 32- and 64-bit registers. For example, in 32-bit mode “a” means eax, but the same “a” in 64-bit mode means rax, while “A” refers to the a and d registers (edx:eax or rdx:rax). That is a bit half-baked: while it made sense to return a 64-bit value in two 32-bit registers in 32-bit mode, in 64-bit mode the same value should be returned in a single 64-bit register, rax (see, for example, X86 calling conventions).

Conclusion: cool, but use sparingly.

2 Responses to Yet More __builtins

  1. Arnulfo F. Wise says:

    In your article, you mention the compiler intrinsics that can be used in place of inline assembly. With the loss of inline assembly for 64-bit development, I’m looking for quick and efficient alternatives for a few things… Is there a compiler intrinsic or equivalent that accomplishes the same as the FPU instruction “FRNDINT”? By the way, nice article. Very informative and well written.

  2. […] at least getting . Well, in fact, we can. We can estimate using one of GCC‘s builtin functions. We can then use for the lower bound, and for the upper […]
