Branchless Equivalents of Simple Functions

Modern processors are equipped with sophisticated branch prediction algorithms (the Pentium family, for example, can predict a vast array of patterns of jumps taken/not taken) but if they, for some reason, mispredict the next jump, the performance can take quite a hit. Branching to an unexpected location means flushing the pipelines, prefetching new instructions, etc, leading to a stall that lasts for many tens of cycles. In order to avoid such dreadful stalls, one can use a branchless equivalent, that is, a code transformed to remove the if-then-elses and therefore jump prediction uncertainties.

Let us start by a simple function, the integer abs( ) function. abs, for absolute value, returns… well, the absolute value of its argument. A straightforward implementation of abs( ) in the C programming language could be

inline unsigned int abs(int x)
 {
  return (x<0) ? -x : x;
 }

Which is simple enough but contains a hidden if-then-else. As the argument, x, isn’t all that likely to follow a pattern that the branch prediction unit can detect, the simple function becomes potentially costly as the jump will be mispredicted quite often. How can we remove the if-then-else, then?

Let us first introduce the sex( ) helper function—I still use the mnemonic sex to amuse and chock friends and coworkers, but it comes from a Motorola 6809 instruction, sign extend. The sex function will return an integer where the sign bit of its argument have been copied in all the bits. For example, sex(321)=0, but sex(-3)=0xff...ff. This function is ideal to generate a mask based on the sign of the argument. Of course, sex must be branchless to be of any use to us. At the assembly language level the instruction exists on most processors (it is one of the cbw (convert byte to word), cwd (convert word to double word), etc, instructions on x86/AMD64), but what can we do at the C language level to force the compiler to use the specialized instruction, or at least an efficient replacement? One can use the right shift operator:

inline unsigned int sex(int x)
 {
  return x >> (CHAR_BIT*sizeof(int)-1);
 }

where the (compile-time) safe expression (CHAR_BIT*sizeof(int)-1) evaluates to 15, 31, or 63 depending on the size of integers on the target computer (CHAR_BIT comes from limits.h, and is worth 8, most of the times). However, this one-liner relies on the underlying processor’s shift instruction which, in some case, can be dreadfully slow (a few cycles for each bit shifted in micro-controllers) or very fast (one cycle simultaneously executed with other instructions in bigger processors). One can also use an union, which will compile to memory manipulation instructions, completely removing shifts from the function:

inline int sex(int x)
 {
  union
   {
    // let us suppose long is twice as wide as int
    long w; 

    // should be hi,lo on a big endian machine
    struct { int lo, hi; }
   } z = { .w=x };

  return z.hi;
 }

This will basically force the compiler to use the cbw family of instructions. Let us rewrite abs using sex:

inline unsigned int abs(int x)
 {
  return (x ^ sex(x)) - sex(x);
 }

Now, how does that work? If x is negative, sex(x) will be 0xff...ff, what is, filled with ones. If x is not negative (zero or positive), sex(x) will be zero. So, if the number of negative, it computes its two’s complement, otherwise leaves it unchanged. For example, if x is negative, say -3 (no point in using large, weird, numbers here), sex(-3) is 0xff...ff and -3 ^ 0xff...ff is the same as ~(-3), the bitwise negation of -3. Then, we subtract -1 (which is the same as adding 1), computing ~(-3)+1 which is the correct two’s complement. If on the other hand x is positive (or null), sex(x) evaluates to zero, and lo! (x ^ 0)-0 = x, which leaves the value of x unchanged!

Of course, when compiling the above abs function the compiler generates very little code, especially when one uses the union version of sex. For example, on Intel x86, it could compile down to

abs: cdq eax
     xor eax,edx
     sub eax,edx

assuming the value is already in (and returned by) eax. The cdq instruction sign-extends eax into the edx register: it promotes a 32 bits value to a 64 bits value held in edx:eax.

Now, we can use sex for other if-then-else type function. Take min and max for example. The pair is usually implemented as

inline int min(int a, int b) { return (a<b) ? a : b; }
inline int max(int a, int b) { return (a>b) ? a : b; }

Using sex, the pair becomes

inline int min(int a, int b)
 {
  return b + ((a-b) & sex(a-b));
 }

inline int max(int a, int b)
 {
  return a + ((b-a) & ~sex(b-a));
 }

which are now thoroughly branchless. Hurray!

Can you think of other common, simple functions, what would benefit from branch removal?

29 Responses to Branchless Equivalents of Simple Functions

  1. You have “(x ^ 0)=0 = x” at one point; it should be “(x ^ 0)-0 = x”

  2. George says:

    Nice posting! Thanks.

  3. Eric Jablow says:

    You should look at teh booke Hacker’s Delight, by Henry S. Warren Jr. (Addison-Wesley, 2003). Its companion website is http://www.hackersdelight.org/.

    It has non-branching methods for integral comparisons, including methods optimized for PowerPC machine language. It features odd and stunning ways to reverse the bits in a byte or word. And, it has non-branching ways to do integral division by small integers using only bit shifts and addition.

  4. Nikhil says:

    hi,

    I think the expression (x ^ 0)=0 = x,

    should be corrected to (x ^ 0)-0 = x

  5. Steven Pigeon says:

    (x^0)=0 have been corrected. Thanks you two for spotting the typo.

  6. Is there a reason you’re not just setting the sign bit to 0?

  7. Steven Pigeon says:

    I know about the book, I have it on my shelve! For sex, for example, he proproses (x>0)-(x<0) which cannot be efficiently implemented if your processor lacks predicated movs.

    But while the book is a treasure trove, I find it somewhat not as well explained as it could be, nor as visual as it should be. I intend to present some of the stuff in a much more visual fashion.

  8. Steven Pigeon says:

    I’m not sure in reference to what step you ask why we can’t just turn the sign bit to 0, but the general reason is that the number representation is two’s-complement. In two’s complement, the sign bit is set if the number is negative, but setting the bit to zero doesn’t give you abs(x), but abs(x-1).

    Two’s-complement arithmetic is based on modular arithmetic. On n bits, -1 is not 0x800…0, buf 0xff…ff. so, any number + 0xff…fff (modulo 2^n) is the same as the number minus 1 (modulo 2^n).

    See, for example: http://en.wikipedia.org/wiki/Two's_complement

  9. mcm says:

    Thanks for the article, i enjoyed it. You should make sure abs returns an unsigned otherwise abs(INT_MIN) will fail. And, as you probably know, the code is not portable as C (stupidly, IMO) allows for three different implementations of signed integers and leaves signed integer overflow undefined …which then gets abused by gcc language lawyers.
    see:
    http://www.airs.com/blog/archives/120

  10. mcm says:

    To avoid any misunderstandings, i used the words ‘not portable’ in a very wide sense, as i believe all currently used processors and C compilers use 2’s-complement so in practice it is a non-issue. It’s just one of my pet peeves about C.

  11. Steven Pigeon says:

    You’re right on both counts (I corrected it, including the other one).

    While you are correct at pointing this, compilers however revert to expected behavior when undefined behaviors occurs. I’ve never seen a compiler not wrapping around, nor, for exemple, not honoring signed/unsigned shift-rights. This kind of pitfall will, I think, play an important role when I’ll be writing about (non-)portability.

  12. mcm says:

    I think my previous comment was eaten up.

    Gcc unfortunately considers it ok for undefined behavior to result in unexpected behavior:

    http://www.airs.com/blog/archives/120
    http://gcc.gnu.org/ml/gcc/2006-08/msg00513.html

  13. mcm says:

    Read the link i posted above, you will see that this is not the case for gcc.

  14. Steven Pigeon says:

    Well, you’re right (and so is the other blog’s author).

    Gcc/g++ will “optimize” the function to false (or 0), but gcc/g++ is inconsistent. For example, it is readily verified that a piece of code like

    x=0x7ffffff0;
    cout << hex << x << endl;
    x+=0xff;
    cout << x << endl;

    produces
    7fffffff0
    800000ef

    which is a wrap around behavior. Icc (Intel’s C Compiler) makes the function return true, however, as it uses (correctly?) wrapping. Icc’s behavior is, maybe, more consistant. But that raises the question about how this difference affects larger program such as, say, codecs.

  15. Raulin Brook says:

    One must wonder if the compiler would automatically do a better job using CMOV (on Intel)…

  16. Steven Pigeon says:

    cmov is a possibily too, but I think it’s about the same as the cwd-based solution. Using cmov, you’d get something like

    mov edx,eax
    neg  edx        // updates ZF,SF...
    cmovns eax,edx // overwrites eax if edx is >=0
    
  17. […] “sex” funkcija ir izvilkta no šī bloga. […]

  18. […] Further detailed reading about this technique here: Branchless Equivalents of Simple Functions. […]

  19. MPelletier says:

    Thank you for this explanation! I couldn’t help but notice the example code is broken in the first section, as it contains some of your blog text.

  20. […] rather long time ago, I wrote a blog entry on branchless equivalents of simple functions such as sex, abs, min, max. The […]

  21. […] when such instructions are not available, there are techniques to convert a piece of code into a branchless version. Performance may suffer, but in this scenario it’s not the primary goal — running in […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: