More __builtins

Last week we discussed GCC intrinsics a bit. This week, let’s have a look at what kind of speed-ups we can get, and how the use of intrinsics affect code generation.

coin-obverse-small

Sometimes, I do strange things. I mean, my experiments aren’t necessarily self-evident, and sometimes, I need performance from primitives that usually are not bottlenecks—such as computing the GCD. This time, I need to get k and b in n=2^k+b as fast as possible. Let’s have a look at how intrinsics help us.

To compute the decomposition n=2^k+b from n, we must find k. In terms of binary representation of n, finding k is equivalent to finding the position of the most significant bit of n. In Hacker’s Delight, Henry Warren proposes the following solution to count the number of leading zeros:

////////////////////////////////////////
//
// hacker's delight's version (thus the _hd)
//
unsigned clz_hd(unsigned x)
 {
  unsigned n=0;
  if (x<=0x0000ffff) n+=16, x<<=16;
  if (x<=0x00ffffff) n+=8, x<<=8;
  if (x<=0x0fffffff) n+=4, x<<=4;
  if (x<=0x3fffffff) n+=2, x<<=2;
  if (x<=0x7fffffff) n++;
  return n;
 }

The technique is clever, multiplying x by a power of two each round, allowing a number of step proportional to the logarithm (base 2) of the register size. For 32 bits, we need 5 iterations. However, each iteration implies a comparison, a conditional branch, an addition, and a shift. Except for the conditional branch, none of these expressions are very expensive, so the overall procedure should be rather fast. Certainly more than a naive loop that could do up to 32 iterations (with essentially the same operations).

Computing the decomposition n=2^k+b would look something like:

////////////////////////////////////////
//
// decompose, based on the clz_hd
//
void decompose_hd(unsigned x, unsigned & k, unsigned & b)
 {
  if (x)
   {
    k=31-clz_hd(x);
    b=x-(1<<k);
   }
  else
   {
    k=0;
    b=0;
   }
 }

We need a test for zero because clz_hd can’t deal with zero (and that the decomposition should be 0=2^0-1, which cannot be computed with this function; not a problem since in my problem space n=0 makes no sense).

If we use the __builtin_clz(), we can just replace clz_hd above; and we get

//////////////////////////////////////////
//
// decompose, based on the builtin
//
void decompose_builtin(unsigned x, unsigned & k, unsigned & b)
 {
  if (x)
   {
    k=31-__builtin_clz(x); 
    b=x-(1<<k);
   }
  else 
   {
    k=0;
    b=0;
   }
 }

*
* *

Let’s compare performance. I’ve tested against two configurations: one uses full optimization, the other prevents inlining. In both cases, n ranges from 0 to 1000000000, in a simple for-loop, with a dummy s+= to prevent dead code elimintatino.

With full optimization, the Hacker’s delight’s version takes 1.58s to compute all decompositions from 0 to 1000000000. The version using the intrinsic takes 1.01s, a speed-up of about 1.5:1, or more or less 60% faster.

When we inhibit inlining, we find that the first version takes about 5.2s for the same task, while the intrinsic-based version takes 2.04s, a speed-up of 2.5:1, or 2.5 times faster. Not bad.

*
* *

And now, let’s compare generated code.

The code based on the Hacker’s Delight generates the following (with -O3 -fno-inline):

0000000000400da0 <_Z6clz_hdj>:
  400da0:       81 ff ff ff 00 00       cmp    edi,0xffff
  400da6:       77 48                   ja     400df0 <_Z6clz_hdj+0x50>
  400da8:       c1 e7 10                shl    edi,0x10
  400dab:       ba 18 00 00 00          mov    edx,0x18
  400db0:       b8 10 00 00 00          mov    eax,0x10
  400db5:       81 ff ff ff ff 00       cmp    edi,0xffffff
  400dbb:       77 05                   ja     400dc2 <_Z6clz_hdj+0x22>
  400dbd:       c1 e7 08                shl    edi,0x8
  400dc0:       89 d0                   mov    eax,edx
  400dc2:       81 ff ff ff ff 0f       cmp    edi,0xfffffff
  400dc8:       77 06                   ja     400dd0 <_Z6clz_hdj+0x30>
  400dca:       83 c0 04                add    eax,0x4
  400dcd:       c1 e7 04                shl    edi,0x4
  400dd0:       81 ff ff ff ff 3f       cmp    edi,0x3fffffff
  400dd6:       77 06                   ja     400dde <_Z6clz_hdj+0x3e>
  400dd8:       83 c0 02                add    eax,0x2
  400ddb:       c1 e7 02                shl    edi,0x2
  400dde:       81 ff 00 00 00 80       cmp    edi,0x80000000
  400de4:       83 d0 00                adc    eax,0x0
  400de7:       c3                      ret    
  400de8:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
  400def:       00 
  400df0:       ba 08 00 00 00          mov    edx,0x8
  400df5:       31 c0                   xor    eax,eax
  400df7:       eb bc                   jmp    400db5 <_Z6clz_hdj+0x15>
  400df9:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

0000000000400e00 <_Z12decompose_hdjRjS_>:
  400e00:       41 54                   push   r12
  400e02:       85 ff                   test   edi,edi
  400e04:       49 89 f4                mov    r12,rsi
  400e07:       55                      push   rbp
  400e08:       48 89 d5                mov    rbp,rdx
  400e0b:       53                      push   rbx
  400e0c:       89 fb                   mov    ebx,edi
  400e0e:       75 18                   jne    400e28 <_Z12decompose_hdjRjS_+0x28>
  400e10:       5b                      pop    rbx
  400e11:       5d                      pop    rbp
  400e12:       c7 06 00 00 00 00       mov    DWORD PTR [rsi],0x0
  400e18:       c7 02 00 00 00 00       mov    DWORD PTR [rdx],0x0
  400e1e:       41 5c                   pop    r12
  400e20:       c3                      ret    
  400e21:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  400e28:       e8 73 ff ff ff          call   400da0 <_Z6clz_hdj>
  400e2d:       b9 1f 00 00 00          mov    ecx,0x1f
  400e32:       29 c1                   sub    ecx,eax
  400e34:       b8 01 00 00 00          mov    eax,0x1
  400e39:       d3 e0                   shl    eax,cl
  400e3b:       41 89 0c 24             mov    DWORD PTR [r12],ecx
  400e3f:       29 c3                   sub    ebx,eax
  400e41:       89 5d 00                mov    DWORD PTR [rbp+0x0],ebx
  400e44:       5b                      pop    rbx
  400e45:       5d                      pop    rbp
  400e46:       41 5c                   pop    r12
  400e48:       c3                      ret    
  400e49:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

Note the ifs that became jumps in the above code. There are indeed quite a few of them.

The code based on the intrinsic is as follows:

0000000000400e50 <_Z17decompose_builtinjRjS_>:
  400e50:       85 ff                   test   edi,edi
  400e52:       75 14                   jne    400e68 <_Z17decompose_builtinjRjS_+0x18>
  400e54:       c7 06 00 00 00 00       mov    DWORD PTR [rsi],0x0
  400e5a:       c7 02 00 00 00 00       mov    DWORD PTR [rdx],0x0
  400e60:       c3                      ret    
  400e61:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  400e68:       0f bd c7                bsr    eax,edi
  400e6b:       b9 1f 00 00 00          mov    ecx,0x1f
  400e70:       83 f0 1f                xor    eax,0x1f
  400e73:       29 c1                   sub    ecx,eax
  400e75:       b8 01 00 00 00          mov    eax,0x1
  400e7a:       d3 e0                   shl    eax,cl
  400e7c:       89 0e                   mov    DWORD PTR [rsi],ecx
  400e7e:       29 c7                   sub    edi,eax
  400e80:       89 3a                   mov    DWORD PTR [rdx],edi
  400e82:       c3                      ret    
  400e83:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400e8a:       00 00 00 
  400e8d:       0f 1f 00                nop    DWORD PTR [rax]

Where we notice that there is only one jump (the one that tests for zero), with the rest of the code being quite linear. The intrinsic was converted to bsr, an instruction that should be somewhat efficient on current processors.

*
* *

While one may think in this case that the main advantage of using an intrinsic function is the speed-up, one should also consider the fact that it also means less code. In my case, I can safely assume that __builtin_clz is always available because I use GCC and, for now, care somewhat little about portability to other compilers (while I usually take great care in using the correct types such as size_t and the others to avoid surprises). This eliminates a function, a few lines of code, but that’s a few lines of code for which I will not worry ever again.

4 Responses to More __builtins

  1. […] last week we saw how to use some of GCC’s built-ins, this week, let’s have a look at how we can […]

  2. Zhihua Lai says:

    unsigned leading_zero_naive(int x)
    {
    unsigned n = 0;
    for (int i = 1; i < 32; i ++) {
    if (x < 0) break;
    n ++;
    x <<= 1;
    }
    return n;
    }
    This is faster on average, because the loop will terminate once a '1' is met.

  3. […] efficiently, at least getting . Well, in fact, we can. We can estimate using one of GCC‘s builtin functions. We can then use for the lower bound, and for the upper […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: