Much Ado About Nothing

A rather long time ago, I wrote a blog entry on branchless equivalents of simple functions such as sex, abs, min, max. The Sing EXtension instruction propagates the sign bit in the upper bits, and is typically used in the promotion of, say, a 16 bits signed value into a 32 bits variable.

But this time, I needed something a bit different: I only wanted the sign-extended part. Could I do much better than last time? Turns out, the compiler has a mind of its own.

Long ago, I proposed:

////////////////////////////////////////
int32_t sex(int32_t x)
 {
  // this relies on automagic promotion
  // (which is about the same as
  // (int32_t)((int16_t)x))
  union
   {
    int64_t w;

    struct { int32_t lo, hi; }; // should be hi,lo on a big endian machine
   } z = { .w=x };

  return z.hi;
 }

This should rely on some variant of movsx (move with sign extend), cdqe (convert double to quadword in accumulator, i.e., RAX), and a second move instruction to get the high part of the register into another register’s lower part. Well, nope. Disassembly (g++ -O3 ) reveals

400ed0:   48 63 c7        movsxd rax,edi
400ed3:   48 c1 f8 20     sar    rax,0x20
400ed7:   c3              ret

The compiler does a shift right of 32 positions to get to the higher part. So that’s basically the same as

int32_t sex_shift(int32_t x)
 {
  return (x>>31);
 }

neglecting the fact that it will overwrite/discard the argument. This disassembles to

400ef0:   89 f8           mov    eax,edi
400ef2:   c1 f8 1f        sar    eax,0x1f
400ef5:   c3              ret

This variant just propagates the sign bit, using expected signed-shift behavior. On some CPU, that’s fine because the execution time of a shift doesn’t depend on the number of bits shifted, but on some other architecture, that might be n cycles per position shifted. That’d be pretty inefficient. So that got me thinking that there ought to be some other way to propagate the sign bit than using shift.

Using a bunch o’shifts like this:

int32_t sex_shift_3(int32_t x)
 {
  x&=0x80000000;
  x|=(x>>1);
  x|=(x>>2);
  x|=(x>>4);
  x|=(x>>8);   // maybe some low-level
  x|=(x>>16);  // byte-copy can help?
  return x;
 }

Uses a bunch of instructions, most of which are short shifts, some of which could be replaced by register-level movs instructions. This time the compiler doesn’t seem to know what to do with it:

400f00:   89 f8               mov    eax,edi
400f02:   25 00 00 00 80      and    eax,0x80000000
400f07:   89 c2               mov    edx,eax
400f09:   d1 fa               sar    edx,1
400f0b:   09 d0               or     eax,edx
400f0d:   89 c2               mov    edx,eax
400f0f:   c1 fa 02            sar    edx,0x2
400f12:   09 d0               or     eax,edx
400f14:   89 c2               mov    edx,eax
400f16:   c1 fa 04            sar    edx,0x4
400f19:   09 d0               or     eax,edx
400f1b:   89 c2               mov    edx,eax
400f1d:   c1 fa 08            sar    edx,0x8
400f20:   09 d0               or     eax,edx
400f22:   89 c2               mov    edx,eax
400f24:   c1 fa 10            sar    edx,0x10
400f27:   09 d0               or     eax,edx
400f29:   c3                  ret

So that’s not that good, is it? We’re not heading in the right direction at all. Let’s see what else we can do:

int32_t sex_shift_4(int32_t x)
 {
  return ~((x<0)-1);
 }

This sets a value, exactly 0 if x<0 is false, and exactly 1 if it is true. If it is true, ~(1-1)=~0=0xff..ff, if it is false, ~(0-1)=~(-1)=~0xff..ff=0. That’s what we want. However, this disassemble to…

400f30:   89 f8           mov    eax,edi
400f32:   c1 f8 1f        sar    eax,0x1f
400f35:   c3              ret

Oh g*d d*ammit!

*
* *

This is the perfect example of how “optimizations” are quite relative. relative to the underlying machine and to the compiler. While ~((x<0)-1) is branchless, and should rely on cute instructions like test and setcc, the compiler sees through it and replaces it by a shift. On my machine, that’s probably indeed much faster than the alternative, naïve, implementation of the same function. Oh well. Time well wasted, I guess.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: