Modern processors are equipped with sophisticated branch prediction algorithms (the Pentium family, for example, can predict a vast array of patterns of jumps taken/not taken) but if they, for some reason, mispredict the next jump, the performance can take quite a hit. Branching to an unexpected location means flushing the pipelines, prefetching new instructions, etc, leading to a stall that lasts for many tens of cycles. In order to avoid such dreadful stalls, one can use a *branchless equivalent*, that is, a code transformed to remove the if-then-elses and therefore jump prediction uncertainties.

Let us start by a simple function, the integer `abs( )` function. `abs`, for absolute value, returns… well, the absolute value of its argument. A straightforward implementation of `abs( )` in the C programming language could be

inline unsigned int abs(int x) { return (x<0) ? -x : x; }

Which is simple enough but contains a hidden if-then-else. As the argument, `x`, isn’t all that likely to follow a pattern that the branch prediction unit can detect, the simple function becomes potentially costly as the jump will be mispredicted quite often. How can we remove the if-then-else, then?

Let us first introduce the `sex( )` helper function—I still use the mnemonic `sex` to amuse and chock friends and coworkers, but it comes from a Motorola 6809 instruction, `s`ign `ex`tend. The `sex` function will return an integer where the sign bit of its argument have been copied in all the bits. For example, `sex(321)=0`, but `sex(-3)=0xff...ff`. This function is ideal to generate a mask based on the sign of the argument. Of course, `sex` must be branchless to be of any use to us. At the assembly language level the instruction exists on most processors (it is one of the `cbw` (convert byte to word), `cwd` (convert word to double word), etc, instructions on x86/AMD64), but what can we do at the C language level to force the compiler to use the specialized instruction, or at least an efficient replacement? One can use the right shift operator:

inline unsigned int sex(int x) { return x >> (CHAR_BIT*sizeof(int)-1); }

where the (compile-time) safe expression `(CHAR_BIT*sizeof(int)-1)` evaluates to 15, 31, or 63 depending on the size of integers on the target computer (`CHAR_BIT` comes from `limits.h`, and is worth 8, most of the times). However, this one-liner relies on the underlying processor’s shift instruction which, in some case, can be dreadfully slow (a few cycles for each bit shifted in micro-controllers) or very fast (one cycle simultaneously executed with other instructions in bigger processors). One can also use an `union`, which will compile to memory manipulation instructions, completely removing shifts from the function:

inline int sex(int x) { union { // let us suppose long is twice as wide as int long w; // should be hi,lo on a big endian machine struct { int lo, hi; } } z = { .w=x }; return z.hi; }

This will basically force the compiler to use the `cbw` family of instructions. Let us rewrite `abs` using `sex`:

inline unsigned int abs(int x) { return (x ^ sex(x)) - sex(x); }

Now, how does *that* work? If `x` is negative, `sex(x)` will be `0xff...ff`, what is, filled with ones. If `x` is not negative (zero or positive), `sex(x)` will be zero. So, if the number of negative, it computes its two’s complement, otherwise leaves it unchanged. For example, if `x` is negative, say -3 (no point in using large, weird, numbers here), `sex(-3)` is `0xff...ff` and `-3 ^ 0xff...ff` is the same as `~(-3)`, the bitwise negation of `-3`. Then, we subtract `-1` (which is the same as *adding* `1`), computing `~(-3)+1` which is the correct two’s complement. If on the other hand `x` is positive (or null), `sex(x)` evaluates to zero, and lo! `(x ^ 0)-0 = x`, which leaves the value of `x` unchanged!

Of course, when compiling the above `abs` function the compiler generates very little code, especially when one uses the `union` version of `sex`. For example, on Intel x86, it could compile down to

abs: cdq eax xor eax,edx sub eax,edx

assuming the value is already in (and returned by) `eax`. The `cdq` instruction sign-extends `eax` into the `edx` register: it promotes a 32 bits value to a 64 bits value held in `edx:eax`.

Now, we can use `sex` for other if-then-else type function. Take `min` and `max` for example. The pair is usually implemented as

inline int min(int a, int b) { return (a<b) ? a : b; } inline int max(int a, int b) { return (a>b) ? a : b; }

Using `sex`, the pair becomes

inline int min(int a, int b) { return b + ((a-b) & sex(a-b)); } inline int max(int a, int b) { return a + ((b-a) & ~sex(b-a)); }

which are now thoroughly branchless. Hurray!

Can you think of other common, simple functions, what would benefit from branch removal?

You have “(x ^ 0)=0 = x” at one point; it should be “(x ^ 0)-0 = x”

Lots and lots of these here:

http://graphics.stanford.edu/~seander/bithacks.html

Nice posting! Thanks.

You should look at teh booke Hacker’s Delight, by Henry S. Warren Jr. (Addison-Wesley, 2003). Its companion website is http://www.hackersdelight.org/.

It has non-branching methods for integral comparisons, including methods optimized for PowerPC machine language. It features odd and stunning ways to reverse the bits in a byte or word. And, it has non-branching ways to do integral division by small integers using only bit shifts and addition.

hi,

I think the expression (x ^ 0)=0 = x,

should be corrected to (x ^ 0)-0 = x

(x^0)=0 have been corrected. Thanks you two for spotting the typo.

Is there a reason you’re not just setting the sign bit to 0?

I know about the book, I have it on my shelve! For sex, for example, he proproses (x>0)-(x<0) which cannot be efficiently implemented if your processor lacks predicated movs.

But while the book is a treasure trove, I find it somewhat not as well explained as it could be, nor as visual as it

shouldbe. I intend to present some of the stuff in a much more visual fashion.I’m not sure in reference to what step you ask why we can’t just turn the sign bit to 0, but the general reason is that the number representation is two’s-complement. In two’s complement, the sign bit is set if the number is negative, but setting the bit to zero doesn’t give you abs(x), but abs(x-1).

Two’s-complement arithmetic is based on modular arithmetic. On n bits, -1 is not 0x800…0, buf 0xff…ff. so, any number + 0xff…fff (modulo 2^n) is the same as the number minus 1 (modulo 2^n).

See, for example: http://en.wikipedia.org/wiki/Two's_complement

Thanks for the article, i enjoyed it. You should make sure abs returns an unsigned otherwise abs(INT_MIN) will fail. And, as you probably know, the code is not portable as C (stupidly, IMO) allows for three different implementations of signed integers and leaves signed integer overflow undefined …which then gets abused by gcc language lawyers.

see:

http://www.airs.com/blog/archives/120

To avoid any misunderstandings, i used the words ‘not portable’ in a very wide sense, as i believe all currently used processors and C compilers use 2’s-complement so in practice it is a non-issue. It’s just one of my pet peeves about C.

You’re right on both counts (I corrected it, including the other one).

While you are correct at pointing this, compilers however revert to

expectedbehavior when undefined behaviors occurs. I’ve never seen a compilernotwrapping around, nor, for exemple,nothonoring signed/unsigned shift-rights. This kind of pitfall will, I think, play an important role when I’ll be writing about (non-)portability.That is what i always thought, too. Unfortunately, this is not true for e.g. gcc.

See:

http://www.airs.com/blog/archives/120

http://gcc.gnu.org/ml/gcc/2006-08/msg00513.html

http://gcc.gnu.org/ml/gcc/2006-12/msg00460.html

I think my previous comment was eaten up.

Gcc unfortunately considers it ok for undefined behavior to result in unexpected behavior:

http://www.airs.com/blog/archives/120

http://gcc.gnu.org/ml/gcc/2006-08/msg00513.html

Hacker’s Delight (http://books.google.com/books?id=iBNKMspIlqEC&dq=Hacker's+Delight&pg=PP1&ots=UTozQTQwJG&sig=RrYNUKr81zvOkti-yYxxV9K99xM&hl=en&sa=X&oi=book_result&resnum=1&ct=result) has a lot of this stuff, among other (more esoteric) things.

Read the link i posted above, you will see that this is not the case for gcc.

Well, you’re right (and so is the other blog’s author).

Gcc/g++ will “optimize” the function to false (or 0), but gcc/g++ is inconsistent. For example, it is readily verified that a piece of code like

`x=0x7ffffff0;`

cout << hex << x << endl;

x+=0xff;

cout << x << endl;

produces

`7fffffff0`

800000ef

which is a wrap around behavior. Icc (Intel’s C Compiler) makes the function return true, however, as it uses (correctly?) wrapping. Icc’s behavior is, maybe, more consistant. But that raises the question about how this difference affects larger program such as, say, codecs.

One must wonder if the compiler would automatically do a better job using CMOV (on Intel)…

cmov is a possibily too, but I think it’s about the same as the cwd-based solution. Using cmov, you’d get something like

[…] “sex” funkcija ir izvilkta no šī bloga. […]

[…] Further detailed reading about this technique here: Branchless Equivalents of Simple Functions. […]

Thank you for this explanation! I couldn’t help but notice the example code is broken in the first section, as it contains some of your blog text.

Thanks for noticing. I think it’s OK now.

[…] rather long time ago, I wrote a blog entry on branchless equivalents of simple functions such as sex, abs, min, max. The […]

[…] when such instructions are not available, there are techniques to convert a piece of code into a branchless version. Performance may suffer, but in this scenario it’s not the primary goal — running in […]

[…] https://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/ […]

[…] https://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/ […]

[…] Branchless Equivalents of Simple Functions […]