Being Shifty

Hacks with integer arithmetic are always fun, especially when they’re not too hard to understand (because some are really strange and make you wonder what the author was thinking when he wrote that). One such simple hack is to replace divisions or multiplies by series of shifts and additions.

However, these hacks make a lot of assumptions that aren’t necessarily verified on the target platform. The assumptions are that complex instructions such as mul and div are very slow compared to adds, right shifts or left shifts, that execution time of shifts only depends very loosely on the number of bit shifted, and that the compiler is not smart enough to generate the appropriate code for you.

Of course, these assumptions may or mayn’t be verified on your target platform and it is important to assess the goodness of any optimization with a stopwatch in hand—anyway, more on this later. Let us come back to our shifts.

Let’s say we want to multiply 16 bits integers by $\frac{2}{7}$. This isn’t a nice fraction we can replace by a simple shift right away; it’s conspicuously not a simple power of 2. But it’s a series of sums of power of two. Indeed, if we look at it differently,

$\displaystyle\frac{2}{7}\approx 0.285714 = 0.01001001..._2$

To each 1 in the binary expansion of $\frac{2}{7}$ corresponds a shifted value of $x$:

$\displaystyle \frac{2}{7}x = 2^{-2}x+2^{-5}x+2^{-8}x+2^{-11}x+\cdots$

Which suggests the direct (but naïve) approximation (x>>2)+(x>>5)+.... That actually works, but may not be very precise because bits that should contribute to the sum get shifted out. C shift operators will just truncate the fractional bits—it’s an integer operation, after all—so you loose bits that should contribute to the sum. The first impression is that it’s just a rounding error (meaning, the last bit or so) but it may yield a large error.

The second thing you may try is to scale x before shifting. One way of doing a ‘free’ scaling is to use a trick similar to what I used for an alternative line algorithm. That is, we’ll use the language to have a shift free scaling:

  typedef
union
{
// endian-dependant!
struct { uint16_t lo, hi; };
uint32_t s;
} fixed_t;


Using this structure, setting lo to 0 and hi to
x promotes x to a 32 bits version shifted by 16 bits. The
whole method could look like:

inline int shifts_more(uint16_t x)
{
typedef
union
{
// endian-dependant!
struct { uint16_t lo, hi; };
uint32_t s;
} fixed_t;

fixed_t scaled={0,x}; // C99 extensions?
fixed_t sum={0};

sum.s+=scaled.s >> 2;
sum.s+=scaled.s >> 5;
sum.s+=scaled.s >> 8;
sum.s+=scaled.s >> 11;
sum.s+=scaled.s >> 14;
// the next one, >> 17, gives zero!

return sum.hi;
}


Using this version, all bits contribute to the results and the (relative) error is reduced to its minimum. We could have used an actual shift rather than an union:

inline int shifts_less(uint16_t x)
{
const int scale=4;
uint32_t sx = (x << scale);
uint32_t s=0;

s+=sx >>= 2;
s+=sx >>= 3;
s+=sx >>= 3;
s+=sx >>= 3;
s+=sx >>= 3;

return (s>>scale);
}


Where you can control time versus precision by changing the value of scale. Or you can look at it the other way ’round:

inline int shifts_left(uint16_t x)
{
int tx=x; // promote to 32 bits

int s=tx; // 14
s+=(tx <<=3); // 11
s+=(tx <<=3); // 8
s+=(tx <<=3); // 5
s+=(tx <<=3); // 2

return (s >> 14);
}


Or you can use an integer multiply:

inline int shifts_mult(uint16_t x)
{
// 01 0010 0100 1001 = 0x1249
return (int(x) * 0x1249) >> 14;
}


So which one’s better? Precision-wise, they’re all the same, except for the naïve version that has a higher error and for shift_less that depends on a scaling constant. On all 16 bits integers, the average error (difference between the float version and the integer version) is given by:

 Method Average error Naïve 2.07144 others 0.214264

So the other methods have basically 10 times smaller errors. Which is good. What about speed? Well that’s where the catch is. Let us compare, again, g++ against icc, the two compilers I use. On a T9600 Core2 CPU (the one from by Macbook Pro), the times (in micro-seconds by 65536 calls):

 method icc gcc Naïve 192 179 More 229 192 Less 225 204 Left 189 142 Mult 166 165 Float 236 236

Let’s see what we have here. G++ seems to be a bit more efficient than icc on these series of tests. The Float method is merely using a float to perform the multiply and truncating to int, and is included as a fair comparison, the control. The Mult method is also the same using both compilers, unsurprisingly, they generate exactly the same code. G++ has the clear advantage for the Left method. Why? What happened?

Examining the code, we make a stunning discovery. Icc generates the following:

<_Z11shifts_leftt>:
0f b7 c7                movzx  eax,di
8d 14 c5 00 00 00 00    lea    edx,[rax*8]
89 c6                   mov    esi,eax
c1 e6 06                shl    esi,0x6
89 c1                   mov    ecx,eax
c1 e1 09                shl    ecx,0x9
c1 e0 0c                shl    eax,0xc
c1 e8 0e                shr    eax,0xe
c3                      ret


which is somewhat expected, while we don’t know why it prefers to do long shifts rather than a series of shorter shift lefts. Gcc figured it out a lot better:

<_Z11shifts_leftt>:
0f b7 c7                movzx  eax,di
69 c0 49 12 00 00       imul   eax,eax,0x1249
c1 f8 0e                sar    eax,0xe
c3                      ret


(the thing I can’t explain right now is why the shifts_left version and the shifts_mult version, which compiles to the exact same code, have different timings; and it’s not OS noise because the timings are quite different and the results are repeatable).

Despite GCC generating the same code on my AMD4000 box (yes, I know, I should get a newer computer), the run-time characteristics are quite different. Of course, it is slower, but the float version us much slower, about twice the time of More.

*
* *

So what is the morale of all this? First, don’t trust your compiler. The “optimizing” compiler didn’t optimize all that well, but the “non-optimizing” compiler did. Second, implementations that you think are roughly equivalent may be quite different speed-wise, such as, say, the Less and Left methods; turns out Left is much faster with both compilers. Third, test on the target platform(s), you may be surprised.

8 Responses to Being Shifty

1. pdq says:

Nice article.

Are you compiling with gcc with -m32?

With -m32, I see the same performance for shifts_left and shifts_mult. Otherwise when I compile it replaces the function call with inline MMX ops, and gcc seems to optimize differently between the two functions.

• Steven Pigeon says:

No, I used -m64 (all my machines are in 64bit mode). You also have to be careful about those results (and I mention it in the post as the final warning) : they’re quite sensitive to the machine you will use. AMD64 (the AMD4000+) vs T9600 (Intel Core2) will exhibit different run-times because beyond their GHz difference, their execution engines are different.

Probably that using your processor I would observe exactly what you describe. What CPU are you using?

2. pdq says:

Core i7 (Xeon 3580) 3.33 GHz

3. Paolo Bonzini says:

Calling GCC a non-optimizing compiler is not really correct. It is a bit weaker in its instruction selection and (obviously) in knowledge of the microarchitecture, and loop optimizations are worse. Also, GCC only had link-time optimization in GCC 4.5 while icc had it for a while (this is probably the reason why it beat GCC on quite a few benchmarks).

However, the scalar optimizations of GCC (arithmetic simplification, elimination of C++ abstraction penalty, jump threading) are _very_ good and _very_ hard to beat. If the best implementation is a multiply, but your scalar optimizations cannot transform shifts into multiplies, a perfect knowledge of the core will not help.

Besides, microarchitecture knowledge is easy to add to GCC when bugs are reported that hint to the problem. It’s unfortunately a slow process, because usually it’s found by chance when an older version generated the better code for some mysterious reason. However, it does happen.

• Steven Pigeon says:

You’re quite right… I should have written “non-optimizing” with ironic quotation marks. I’ll fix that right away.

So, the next question is: what IS the process in the GCC/G++ community to include micro-architectural details?

• Paolo Bonzini says:

Microarchitectural details that are public are already in, usually.

For the others, since we don’t have a crystal ball :) the easiest thing to do is to find some code where a small, apparently innocuous assembly change has a big performance impact. We can then instruct GCC to favor the better code, usually with some tweak to the backend or a new peephole optimization. As I said, most of the time these reports come from comparison with older versions of GCC when the newer version regresses.

See for example http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827#c32 (just one comment in a huge audit trail). A similar peephole is now in for SSE too.

4. nlp says:

Although I have read the above comments and noted your correction that now refers to GCC as a “non-optimizing” compiler, I’ve never before heard it described so, “ironically” or otherwise.

GCC has always been an optimizing compiler, right from the first beta release in 1987: