## More __builtins

Last week we discussed GCC intrinsics a bit. This week, let’s have a look at what kind of speed-ups we can get, and how the use of intrinsics affect code generation.

Sometimes, I do strange things. I mean, my experiments aren’t necessarily self-evident, and sometimes, I need performance from primitives that usually are not bottlenecks—such as computing the GCD. This time, I need to get $k$ and $b$ in $n=2^k+b$ as fast as possible. Let’s have a look at how intrinsics help us.

To compute the decomposition $n=2^k+b$ from $n$, we must find $k$. In terms of binary representation of $n$, finding $k$ is equivalent to finding the position of the most significant bit of $n$. In Hacker’s Delight, Henry Warren proposes the following solution to count the number of leading zeros:

```////////////////////////////////////////
//
// hacker's delight's version (thus the _hd)
//
unsigned clz_hd(unsigned x)
{
unsigned n=0;
if (x<=0x0000ffff) n+=16, x<<=16;
if (x<=0x00ffffff) n+=8, x<<=8;
if (x<=0x0fffffff) n+=4, x<<=4;
if (x<=0x3fffffff) n+=2, x<<=2;
if (x<=0x7fffffff) n++;
return n;
}
```

The technique is clever, multiplying x by a power of two each round, allowing a number of step proportional to the logarithm (base 2) of the register size. For 32 bits, we need 5 iterations. However, each iteration implies a comparison, a conditional branch, an addition, and a shift. Except for the conditional branch, none of these expressions are very expensive, so the overall procedure should be rather fast. Certainly more than a naive loop that could do up to 32 iterations (with essentially the same operations).

Computing the decomposition $n=2^k+b$ would look something like:

```////////////////////////////////////////
//
// decompose, based on the clz_hd
//
void decompose_hd(unsigned x, unsigned & k, unsigned & b)
{
if (x)
{
k=31-clz_hd(x);
b=x-(1<<k);
}
else
{
k=0;
b=0;
}
}
```

We need a test for zero because clz_hd can’t deal with zero (and that the decomposition should be $0=2^0-1$, which cannot be computed with this function; not a problem since in my problem space $n=0$ makes no sense).

If we use the __builtin_clz(), we can just replace clz_hd above; and we get

```//////////////////////////////////////////
//
// decompose, based on the builtin
//
void decompose_builtin(unsigned x, unsigned & k, unsigned & b)
{
if (x)
{
k=31-__builtin_clz(x);
b=x-(1<<k);
}
else
{
k=0;
b=0;
}
}
```

*
* *

Let’s compare performance. I’ve tested against two configurations: one uses full optimization, the other prevents inlining. In both cases, $n$ ranges from 0 to 1000000000, in a simple for-loop, with a dummy s+= to prevent dead code elimintatino.

With full optimization, the Hacker’s delight’s version takes 1.58s to compute all decompositions from 0 to 1000000000. The version using the intrinsic takes 1.01s, a speed-up of about 1.5:1, or more or less 60% faster.

When we inhibit inlining, we find that the first version takes about 5.2s for the same task, while the intrinsic-based version takes 2.04s, a speed-up of 2.5:1, or 2.5 times faster. Not bad.

*
* *

And now, let’s compare generated code.

The code based on the Hacker’s Delight generates the following (with -O3 -fno-inline):

```0000000000400da0 <_Z6clz_hdj>:
400da0:       81 ff ff ff 00 00       cmp    edi,0xffff
400da6:       77 48                   ja     400df0 <_Z6clz_hdj+0x50>
400da8:       c1 e7 10                shl    edi,0x10
400dab:       ba 18 00 00 00          mov    edx,0x18
400db0:       b8 10 00 00 00          mov    eax,0x10
400db5:       81 ff ff ff ff 00       cmp    edi,0xffffff
400dbb:       77 05                   ja     400dc2 <_Z6clz_hdj+0x22>
400dbd:       c1 e7 08                shl    edi,0x8
400dc0:       89 d0                   mov    eax,edx
400dc2:       81 ff ff ff ff 0f       cmp    edi,0xfffffff
400dc8:       77 06                   ja     400dd0 <_Z6clz_hdj+0x30>
400dca:       83 c0 04                add    eax,0x4
400dcd:       c1 e7 04                shl    edi,0x4
400dd0:       81 ff ff ff ff 3f       cmp    edi,0x3fffffff
400dd6:       77 06                   ja     400dde <_Z6clz_hdj+0x3e>
400dd8:       83 c0 02                add    eax,0x2
400ddb:       c1 e7 02                shl    edi,0x2
400dde:       81 ff 00 00 00 80       cmp    edi,0x80000000
400de4:       83 d0 00                adc    eax,0x0
400de7:       c3                      ret
400de8:       0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
400def:       00
400df0:       ba 08 00 00 00          mov    edx,0x8
400df5:       31 c0                   xor    eax,eax
400df7:       eb bc                   jmp    400db5 <_Z6clz_hdj+0x15>
400df9:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]

0000000000400e00 <_Z12decompose_hdjRjS_>:
400e00:       41 54                   push   r12
400e02:       85 ff                   test   edi,edi
400e04:       49 89 f4                mov    r12,rsi
400e07:       55                      push   rbp
400e08:       48 89 d5                mov    rbp,rdx
400e0b:       53                      push   rbx
400e0c:       89 fb                   mov    ebx,edi
400e0e:       75 18                   jne    400e28 <_Z12decompose_hdjRjS_+0x28>
400e10:       5b                      pop    rbx
400e11:       5d                      pop    rbp
400e12:       c7 06 00 00 00 00       mov    DWORD PTR [rsi],0x0
400e18:       c7 02 00 00 00 00       mov    DWORD PTR [rdx],0x0
400e1e:       41 5c                   pop    r12
400e20:       c3                      ret
400e21:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
400e28:       e8 73 ff ff ff          call   400da0 <_Z6clz_hdj>
400e2d:       b9 1f 00 00 00          mov    ecx,0x1f
400e32:       29 c1                   sub    ecx,eax
400e34:       b8 01 00 00 00          mov    eax,0x1
400e39:       d3 e0                   shl    eax,cl
400e3b:       41 89 0c 24             mov    DWORD PTR [r12],ecx
400e3f:       29 c3                   sub    ebx,eax
400e41:       89 5d 00                mov    DWORD PTR [rbp+0x0],ebx
400e44:       5b                      pop    rbx
400e45:       5d                      pop    rbp
400e46:       41 5c                   pop    r12
400e48:       c3                      ret
400e49:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
```

Note the ifs that became jumps in the above code. There are indeed quite a few of them.

The code based on the intrinsic is as follows:

```0000000000400e50 <_Z17decompose_builtinjRjS_>:
400e50:       85 ff                   test   edi,edi
400e52:       75 14                   jne    400e68 <_Z17decompose_builtinjRjS_+0x18>
400e54:       c7 06 00 00 00 00       mov    DWORD PTR [rsi],0x0
400e5a:       c7 02 00 00 00 00       mov    DWORD PTR [rdx],0x0
400e60:       c3                      ret
400e61:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
400e68:       0f bd c7                bsr    eax,edi
400e6b:       b9 1f 00 00 00          mov    ecx,0x1f
400e70:       83 f0 1f                xor    eax,0x1f
400e73:       29 c1                   sub    ecx,eax
400e75:       b8 01 00 00 00          mov    eax,0x1
400e7a:       d3 e0                   shl    eax,cl
400e7c:       89 0e                   mov    DWORD PTR [rsi],ecx
400e7e:       29 c7                   sub    edi,eax
400e80:       89 3a                   mov    DWORD PTR [rdx],edi
400e82:       c3                      ret
400e83:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
400e8a:       00 00 00
400e8d:       0f 1f 00                nop    DWORD PTR [rax]
```

Where we notice that there is only one jump (the one that tests for zero), with the rest of the code being quite linear. The intrinsic was converted to bsr, an instruction that should be somewhat efficient on current processors.

*
* *

While one may think in this case that the main advantage of using an intrinsic function is the speed-up, one should also consider the fact that it also means less code. In my case, I can safely assume that __builtin_clz is always available because I use GCC and, for now, care somewhat little about portability to other compilers (while I usually take great care in using the correct types such as size_t and the others to avoid surprises). This eliminates a function, a few lines of code, but that’s a few lines of code for which I will not worry ever again.

### 4 Responses to More __builtins

1. […] last week we saw how to use some of GCC’s built-ins, this week, let’s have a look at how we can […]

2. Zhihua Lai says: