Being Shifty

Hacks with integer arithmetic are always fun, especially when they’re not too hard to understand (because some are really strange and make you wonder what the author was thinking when he wrote that). One such simple hack is to replace divisions or multiplies by series of shifts and additions.

However, these hacks make a lot of assumptions that aren’t necessarily verified on the target platform. The assumptions are that complex instructions such as mul and div are very slow compared to adds, right shifts or left shifts, that execution time of shifts only depends very loosely on the number of bit shifted, and that the compiler is not smart enough to generate the appropriate code for you.

Of course, these assumptions may or mayn’t be verified on your target platform and it is important to assess the goodness of any optimization with a stopwatch in hand—anyway, more on this later. Let us come back to our shifts.

Let’s say we want to multiply 16 bits integers by \frac{2}{7}. This isn’t a nice fraction we can replace by a simple shift right away; it’s conspicuously not a simple power of 2. But it’s a series of sums of power of two. Indeed, if we look at it differently,

\displaystyle\frac{2}{7}\approx 0.285714 = 0.01001001..._2

To each 1 in the binary expansion of \frac{2}{7} corresponds a shifted value of x:

\displaystyle \frac{2}{7}x = 2^{-2}x+2^{-5}x+2^{-8}x+2^{-11}x+\cdots

Which suggests the direct (but naïve) approximation (x>>2)+(x>>5)+.... That actually works, but may not be very precise because bits that should contribute to the sum get shifted out. C shift operators will just truncate the fractional bits—it’s an integer operation, after all—so you loose bits that should contribute to the sum. The first impression is that it’s just a rounding error (meaning, the last bit or so) but it may yield a large error.

The second thing you may try is to scale x before shifting. One way of doing a ‘free’ scaling is to use a trick similar to what I used for an alternative line algorithm. That is, we’ll use the language to have a shift free scaling:

  typedef
  union
   {
    // endian-dependant!
    struct { uint16_t lo, hi; };
    uint32_t s;
   } fixed_t;

Using this structure, setting lo to 0 and hi to
x promotes x to a 32 bits version shifted by 16 bits. The
whole method could look like:

inline int shifts_more(uint16_t x)
 {
  typedef
  union
   {
    // endian-dependant!
    struct { uint16_t lo, hi; };
    uint32_t s;
   } fixed_t;

  fixed_t scaled={0,x}; // C99 extensions?
  fixed_t sum={0};

  sum.s+=scaled.s >> 2;
  sum.s+=scaled.s >> 5;
  sum.s+=scaled.s >> 8;
  sum.s+=scaled.s >> 11;
  sum.s+=scaled.s >> 14;
  // the next one, >> 17, gives zero!

  return sum.hi;
 }

Using this version, all bits contribute to the results and the (relative) error is reduced to its minimum. We could have used an actual shift rather than an union:

inline int shifts_less(uint16_t x)
 {
  const int scale=4;
  uint32_t sx = (x << scale);
  uint32_t s=0;

  s+=sx >>= 2;
  s+=sx >>= 3;
  s+=sx >>= 3;
  s+=sx >>= 3;
  s+=sx >>= 3;

  return (s>>scale);
 }

Where you can control time versus precision by changing the value of scale. Or you can look at it the other way ’round:

inline int shifts_left(uint16_t x)
 {
  int tx=x; // promote to 32 bits

  int s=tx; // 14
  s+=(tx <<=3); // 11
  s+=(tx <<=3); // 8
  s+=(tx <<=3); // 5
  s+=(tx <<=3); // 2

  return (s >> 14);
 }

Or you can use an integer multiply:

inline int shifts_mult(uint16_t x)
 {
  // 01 0010 0100 1001 = 0x1249
  return (int(x) * 0x1249) >> 14;
 }

So which one’s better? Precision-wise, they’re all the same, except for the naïve version that has a higher error and for shift_less that depends on a scaling constant. On all 16 bits integers, the average error (difference between the float version and the integer version) is given by:

Method Average error
Naïve 2.07144
others 0.214264

So the other methods have basically 10 times smaller errors. Which is good. What about speed? Well that’s where the catch is. Let us compare, again, g++ against icc, the two compilers I use. On a T9600 Core2 CPU (the one from by Macbook Pro), the times (in micro-seconds by 65536 calls):

method icc gcc
Naïve 192 179
More 229 192
Less 225 204
Left 189 142
Mult 166 165
Float 236 236

Let’s see what we have here. G++ seems to be a bit more efficient than icc on these series of tests. The Float method is merely using a float to perform the multiply and truncating to int, and is included as a fair comparison, the control. The Mult method is also the same using both compilers, unsurprisingly, they generate exactly the same code. G++ has the clear advantage for the Left method. Why? What happened?

Examining the code, we make a stunning discovery. Icc generates the following:

<_Z11shifts_leftt>:
  0f b7 c7                movzx  eax,di
  8d 14 c5 00 00 00 00    lea    edx,[rax*8]
  03 d0                   add    edx,eax
  89 c6                   mov    esi,eax
  c1 e6 06                shl    esi,0x6
  03 f2                   add    esi,edx
  89 c1                   mov    ecx,eax
  c1 e1 09                shl    ecx,0x9
  c1 e0 0c                shl    eax,0xc
  03 f1                   add    esi,ecx
  03 c6                   add    eax,esi
  c1 e8 0e                shr    eax,0xe
  c3                      ret    

which is somewhat expected, while we don’t know why it prefers to do long shifts rather than a series of shorter shift lefts. Gcc figured it out a lot better:

<_Z11shifts_leftt>:
  0f b7 c7                movzx  eax,di
  69 c0 49 12 00 00       imul   eax,eax,0x1249
  c1 f8 0e                sar    eax,0xe
  c3                      ret

(the thing I can’t explain right now is why the shifts_left version and the shifts_mult version, which compiles to the exact same code, have different timings; and it’s not OS noise because the timings are quite different and the results are repeatable).

Despite GCC generating the same code on my AMD4000 box (yes, I know, I should get a newer computer), the run-time characteristics are quite different. Of course, it is slower, but the float version us much slower, about twice the time of More.

*
* *

So what is the morale of all this? First, don’t trust your compiler. The “optimizing” compiler didn’t optimize all that well, but the “non-optimizing” compiler did. Second, implementations that you think are roughly equivalent may be quite different speed-wise, such as, say, the Less and Left methods; turns out Left is much faster with both compilers. Third, test on the target platform(s), you may be surprised.

8 Responses to Being Shifty

  1. pdq says:

    Nice article.

    Are you compiling with gcc with -m32?

    With -m32, I see the same performance for shifts_left and shifts_mult. Otherwise when I compile it replaces the function call with inline MMX ops, and gcc seems to optimize differently between the two functions.

    • Steven Pigeon says:

      No, I used -m64 (all my machines are in 64bit mode). You also have to be careful about those results (and I mention it in the post as the final warning) : they’re quite sensitive to the machine you will use. AMD64 (the AMD4000+) vs T9600 (Intel Core2) will exhibit different run-times because beyond their GHz difference, their execution engines are different.

      Probably that using your processor I would observe exactly what you describe. What CPU are you using?

  2. pdq says:

    Core i7 (Xeon 3580) 3.33 GHz

  3. Paolo Bonzini says:

    Calling GCC a non-optimizing compiler is not really correct. It is a bit weaker in its instruction selection and (obviously) in knowledge of the microarchitecture, and loop optimizations are worse. Also, GCC only had link-time optimization in GCC 4.5 while icc had it for a while (this is probably the reason why it beat GCC on quite a few benchmarks).

    However, the scalar optimizations of GCC (arithmetic simplification, elimination of C++ abstraction penalty, jump threading) are _very_ good and _very_ hard to beat. If the best implementation is a multiply, but your scalar optimizations cannot transform shifts into multiplies, a perfect knowledge of the core will not help.

    Besides, microarchitecture knowledge is easy to add to GCC when bugs are reported that hint to the problem. It’s unfortunately a slow process, because usually it’s found by chance when an older version generated the better code for some mysterious reason. However, it does happen.

    • Steven Pigeon says:

      You’re quite right… I should have written “non-optimizing” with ironic quotation marks. I’ll fix that right away.

      So, the next question is: what IS the process in the GCC/G++ community to include micro-architectural details?

      • Paolo Bonzini says:

        Microarchitectural details that are public are already in, usually.

        For the others, since we don’t have a crystal ball :) the easiest thing to do is to find some code where a small, apparently innocuous assembly change has a big performance impact. We can then instruct GCC to favor the better code, usually with some tweak to the backend or a new peephole optimization. As I said, most of the time these reports come from comparison with older versions of GCC when the newer version regresses.

        See for example http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827#c32 (just one comment in a huge audit trail). A similar peephole is now in for SSE too.

  4. nlp says:

    Although I have read the above comments and noted your correction that now refers to GCC as a “non-optimizing” compiler, I’ve never before heard it described so, “ironically” or otherwise.

    GCC has always been an optimizing compiler, right from the first beta release in 1987:

    http://groups.google.com/group/comp.lang.misc/msg/32eda22392c20f98

    And of course, there have been 23 years of further improvements made since that time. There are very few compilers that are competitive with GCC in the realm of code quality/optimization. Therefore, it should not be surprising to find that GCC produced superior results in this test. Of course, ICC is also an excellent compiler and might perform better on other tests — no compiler has a monopoly on good optimizations.

    • Steven Pigeon says:

      I am aware that all mainstream compilers are “optimizing compilers”, but ICC, for example, claims to be better at it, and, on the average, it is confirmed by my observations; code compiled using ICC is often faster, sometimes slower. That’s not to say that gcc/g++ is a bad compiler, quite the contrary: I think it’s a world-class compiler.

      I wonder if someone did a really comprehensive survey on the code generated from different compilers… I have heard that the last few version of the Microsoft compiler is also quite good, although I did not have a chance of testing it (not my work environment).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: