Modern processors are equipped with sophisticated branch prediction algorithms (the Pentium family, for example, can predict a vast array of patterns of jumps taken/not taken) but if they, for some reason, mispredict the next jump, the performance can take quite a hit. Branching to an unexpected location means flushing the pipelines, prefetching new instructions, etc, leading to a stall that lasts for many tens of cycles. In order to avoid such dreadful stalls, one can use a branchless equivalent, that is, a code transformed to remove the if-then-elses and therefore jump prediction uncertainties.
Let us start by a simple function, the integer abs( ) function. abs, for absolute value, returns… well, the absolute value of its argument. A straightforward implementation of abs( ) in the C programming language could be
inline unsigned int abs(int x) { return (x<0) ? -x : x; }
Which is simple enough but contains a hidden if-then-else. As the argument, x, isn’t all that likely to follow a pattern that the branch prediction unit can detect, the simple function becomes potentially costly as the jump will be mispredicted quite often. How can we remove the if-then-else, then?
Let us first introduce the sex( ) helper function—I still use the mnemonic sex to amuse and chock friends and coworkers, but it comes from a Motorola 6809 instruction, sign extend. The sex function will return an integer where the sign bit of its argument have been copied in all the bits. For example, sex(321)=0, but sex(-3)=0xff...ff. This function is ideal to generate a mask based on the sign of the argument. Of course, sex must be branchless to be of any use to us. At the assembly language level the instruction exists on most processors (it is one of the cbw (convert byte to word), cwd (convert word to double word), etc, instructions on x86/AMD64), but what can we do at the C language level to force the compiler to use the specialized instruction, or at least an efficient replacement? One can use the right shift operator:
inline unsigned int sex(int x) { return x >> (CHAR_BIT*sizeof(int)-1); }
where the (compile-time) safe expression (CHAR_BIT*sizeof(int)-1) evaluates to 15, 31, or 63 depending on the size of integers on the target computer (CHAR_BIT comes from limits.h, and is worth 8, most of the times). However, this one-liner relies on the underlying processor’s shift instruction which, in some case, can be dreadfully slow (a few cycles for each bit shifted in micro-controllers) or very fast (one cycle simultaneously executed with other instructions in bigger processors). One can also use an union, which will compile to memory manipulation instructions, completely removing shifts from the function:
inline int sex(int x) { union { // let us suppose long is twice as wide as int long w; // should be hi,lo on a big endian machine struct { int lo, hi; } } z = { .w=x }; return z.hi; }
This will basically force the compiler to use the cbw family of instructions. Let us rewrite abs using sex:
inline unsigned int abs(int x) { return (x ^ sex(x)) - sex(x); }
Now, how does that work? If x is negative, sex(x) will be 0xff...ff, what is, filled with ones. If x is not negative (zero or positive), sex(x) will be zero. So, if the number of negative, it computes its two’s complement, otherwise leaves it unchanged. For example, if x is negative, say -3 (no point in using large, weird, numbers here), sex(-3) is 0xff...ff and -3 ^ 0xff...ff is the same as ~(-3), the bitwise negation of -3. Then, we subtract -1 (which is the same as adding 1), computing ~(-3)+1 which is the correct two’s complement. If on the other hand x is positive (or null), sex(x) evaluates to zero, and lo! (x ^ 0)-0 = x, which leaves the value of x unchanged!
Of course, when compiling the above abs function the compiler generates very little code, especially when one uses the union version of sex. For example, on Intel x86, it could compile down to
abs: cdq eax xor eax,edx sub eax,edx
assuming the value is already in (and returned by) eax. The cdq instruction sign-extends eax into the edx register: it promotes a 32 bits value to a 64 bits value held in edx:eax.
Now, we can use sex for other if-then-else type function. Take min and max for example. The pair is usually implemented as
inline int min(int a, int b) { return (a<b) ? a : b; } inline int max(int a, int b) { return (a>b) ? a : b; }
Using sex, the pair becomes
inline int min(int a, int b) { return b + ((a-b) & sex(a-b)); } inline int max(int a, int b) { return a + ((b-a) & ~sex(b-a)); }
which are now thoroughly branchless. Hurray!
Can you think of other common, simple functions, what would benefit from branch removal?
You have “(x ^ 0)=0 = x” at one point; it should be “(x ^ 0)-0 = x”
Lots and lots of these here:
http://graphics.stanford.edu/~seander/bithacks.html
Nice posting! Thanks.
You should look at teh booke Hacker’s Delight, by Henry S. Warren Jr. (Addison-Wesley, 2003). Its companion website is http://www.hackersdelight.org/.
It has non-branching methods for integral comparisons, including methods optimized for PowerPC machine language. It features odd and stunning ways to reverse the bits in a byte or word. And, it has non-branching ways to do integral division by small integers using only bit shifts and addition.
hi,
I think the expression (x ^ 0)=0 = x,
should be corrected to (x ^ 0)-0 = x
(x^0)=0 have been corrected. Thanks you two for spotting the typo.
Is there a reason you’re not just setting the sign bit to 0?
I know about the book, I have it on my shelve! For sex, for example, he proproses (x>0)-(x<0) which cannot be efficiently implemented if your processor lacks predicated movs.
But while the book is a treasure trove, I find it somewhat not as well explained as it could be, nor as visual as it should be. I intend to present some of the stuff in a much more visual fashion.
I’m not sure in reference to what step you ask why we can’t just turn the sign bit to 0, but the general reason is that the number representation is two’s-complement. In two’s complement, the sign bit is set if the number is negative, but setting the bit to zero doesn’t give you abs(x), but abs(x-1).
Two’s-complement arithmetic is based on modular arithmetic. On n bits, -1 is not 0x800…0, buf 0xff…ff. so, any number + 0xff…fff (modulo 2^n) is the same as the number minus 1 (modulo 2^n).
See, for example: http://en.wikipedia.org/wiki/Two's_complement
Thanks for the article, i enjoyed it. You should make sure abs returns an unsigned otherwise abs(INT_MIN) will fail. And, as you probably know, the code is not portable as C (stupidly, IMO) allows for three different implementations of signed integers and leaves signed integer overflow undefined …which then gets abused by gcc language lawyers.
see:
http://www.airs.com/blog/archives/120
To avoid any misunderstandings, i used the words ‘not portable’ in a very wide sense, as i believe all currently used processors and C compilers use 2’s-complement so in practice it is a non-issue. It’s just one of my pet peeves about C.
You’re right on both counts (I corrected it, including the other one).
While you are correct at pointing this, compilers however revert to expected behavior when undefined behaviors occurs. I’ve never seen a compiler not wrapping around, nor, for exemple, not honoring signed/unsigned shift-rights. This kind of pitfall will, I think, play an important role when I’ll be writing about (non-)portability.
That is what i always thought, too. Unfortunately, this is not true for e.g. gcc.
See:
http://www.airs.com/blog/archives/120
http://gcc.gnu.org/ml/gcc/2006-08/msg00513.html
http://gcc.gnu.org/ml/gcc/2006-12/msg00460.html
I think my previous comment was eaten up.
Gcc unfortunately considers it ok for undefined behavior to result in unexpected behavior:
http://www.airs.com/blog/archives/120
http://gcc.gnu.org/ml/gcc/2006-08/msg00513.html
Hacker’s Delight (http://books.google.com/books?id=iBNKMspIlqEC&dq=Hacker's+Delight&pg=PP1&ots=UTozQTQwJG&sig=RrYNUKr81zvOkti-yYxxV9K99xM&hl=en&sa=X&oi=book_result&resnum=1&ct=result) has a lot of this stuff, among other (more esoteric) things.
Read the link i posted above, you will see that this is not the case for gcc.
Well, you’re right (and so is the other blog’s author).
Gcc/g++ will “optimize” the function to false (or 0), but gcc/g++ is inconsistent. For example, it is readily verified that a piece of code like
x=0x7ffffff0;
cout << hex << x << endl;
x+=0xff;
cout << x << endl;
produces
7fffffff0
800000ef
which is a wrap around behavior. Icc (Intel’s C Compiler) makes the function return true, however, as it uses (correctly?) wrapping. Icc’s behavior is, maybe, more consistant. But that raises the question about how this difference affects larger program such as, say, codecs.
One must wonder if the compiler would automatically do a better job using CMOV (on Intel)…
cmov is a possibily too, but I think it’s about the same as the cwd-based solution. Using cmov, you’d get something like
[…] “sex” funkcija ir izvilkta no šī bloga. […]
[…] Further detailed reading about this technique here: Branchless Equivalents of Simple Functions. […]
Thank you for this explanation! I couldn’t help but notice the example code is broken in the first section, as it contains some of your blog text.
Thanks for noticing. I think it’s OK now.
[…] rather long time ago, I wrote a blog entry on branchless equivalents of simple functions such as sex, abs, min, max. The […]
[…] when such instructions are not available, there are techniques to convert a piece of code into a branchless version. Performance may suffer, but in this scenario it’s not the primary goal — running in […]
[…] https://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/ […]
[…] https://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/ […]
[…] Branchless Equivalents of Simple Functions […]