In the discussion of The Speed of GCD, Daniel Lemire remarked that one could use the compiler-specific intrinsic function __builtin_ctz to get a good speed-up on binary GCD. That remark made me look into all the other intrinsics and built-ins offered by GCC.
Let’s have a look at a few of them!
The current list can be found here.
Some built-ins are useful for compile-time decisions, others help control the state of the machine, and others still help us perform low-level operations.
The compile-time predicates (that’s what the _p at the end of the intrinsic names is for) will help you write better C code by exploiting the compile-time environment. For example, int __builtin_types_compatible_p (type1, type2) can tell you, at compile time, whether the two arguments are the same type (not merely compatible types, but the very same type).
I’m not sure I’m totally for those. If you want to write portable code, you’ll always have the problem of dealing with machine specifics such as compiler switches, defined symbols, and other things that may affect your code. While meddling with the preprocessor (and compile-time predicates) is useful, and more or less inevitable, in C, I feel very differently about them in C++. In C++, one should really prefer inlines, basic overloading, and template-based metaprogramming techniques to achieve the desired results, whenever possible.
There are a few built-ins to help with the program or machine state. For the program state, there’s __builtin_trap, which will cause the program to exit abnormally in an implementation-specific way. The other two intrinsics that may be worthwhile exploring are __builtin___clear_cache (char *begin, char *end) and __builtin_prefetch (const void *addr, ...), which may help with cache memory management. __builtin_prefetch takes up to three arguments:
The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value of locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
We can use this to control cache usage, especially with streaming data. Note that the default is to assume that the data will stay in cache, which may result in thrashing. (Thrashing is when an operation causes everything in the cache to be evicted and replaced by data that you will use only once, only to be reloaded again when normal program operation resumes. Thrashing, of course, results in a severe performance penalty.)
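A minimal sketch of a streaming loop with explicit prefetch hints (the function name and the prefetch distance of 64 elements are illustrative assumptions, not tuned values; GCC or a compatible compiler is assumed):

```c
#include <stddef.h>

/* Sum a large array that is read exactly once. We pass rw = 0
   (we only read) and locality = 0 (no temporal locality: the data
   need not be kept in cache after the access), which avoids
   evicting useful cache lines for single-use data. Prefetching
   64 elements ahead is a guess; a real workload would need
   measurement to pick a good distance. */
long sum_stream(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 0);
        total += a[i];
    }
    return total;
}
```

On targets without a prefetch instruction, GCC simply evaluates the address expression and emits no prefetch, so the hint is free to leave in portable-ish code.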
The last category of intrinsics deals with low-level operations such as counting leading zero bits, counting the number of bits set to one, counting trailing zeros, representing machine-specific values such as NaNs, and other operations we rarely need, like computing the parity of a value.
The interesting thing about these primitives is that they map to machine-specific instructions whenever possible, thus giving interesting speed-ups on machines that support these specialized instructions (which implies that if you have a somewhat recent CPU, you’re going to benefit from them).
At first, I did not think that an intrinsic such as __builtin_ctz would yield a real speed-up over the equivalent C function. Or, more exactly, I expected the compiler to figure it out and use the specialized instruction. But experiments, such as those here, show that the compiler can’t always figure it out and that intrinsics have their uses.