Bundling Memory Accesses (Part I)

There’s always a question of whether having “more bits” in a CPU will help. Is 64 bits better than 16? If so, how? Is it only that you have bigger integers to count further? Or maybe more addressable memory? Well, quite obviously, being able to address more memory or to perform arithmetic on larger numbers is useful because, well, 640KB isn’t all that much, and counting on 16 bits doesn’t get you very far.

[Image: an AMD Phenom]

But there are other advantages to using the widest registers available for computation. Often, algorithms that scan memory in small chunks—bytes or words—can be sped up quite a bit by bundling reads and writes. Let us see how.

The first simple example is about moving memory around. In the olden days, we used the x86 instruction rep movsb to move bytes around. This instruction used the register pairs ds:si and es:di as source and destination pointers, respectively, and moved cx bytes from the source to the destination. This complex instruction was quite helpful on systems with a data bus only 8 bits wide, but rep movsw was already much faster if the bus was 16 bits wide. Generalizing this, you can use wider and wider registers to reduce the number of times the move instruction is repeated.
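To make the idea concrete, here is a minimal sketch of a bundled copy in C (the name bundled_copy is mine, purely for illustration; in a real program you would just call memcpy, which already does this and more):

#include <stdint.h>
#include <string.h>

void * bundled_copy(void * dst, const void * src, size_t length)
 {
  uint8_t * d=(uint8_t*)dst;
  const uint8_t * s=(const uint8_t*)src;
  size_t i;

  /* move as many 8-byte chunks as possible */
  for (i=0;i+8<=length;i+=8)
   {
    uint64_t chunk;
    memcpy(&chunk,s+i,8);  /* memcpy keeps the access well-defined         */
    memcpy(d+i,&chunk,8);  /* (alignment, aliasing) and typically compiles */
   }                       /* down to a single 64-bit load/store           */

  /* the last few bytes, if any, one at a time */
  for ( ;i<length;i++)
   d[i]=s[i];

  return dst;
 }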

The performance in such a case is bound to machine-specific details such as the bus width and how well the CPU performs multi-byte reads. If you ask for a read that is wider than the bus, either the CPU simply does not allow it, or it has to perform multiple bus-wide reads (for example, several 32-bit reads on a 32-bit bus).

What helps you is that those multiple reads are managed by the CPU itself (or its memory management unit) without any intervention from the user’s program, so they are still faster than issuing the same number of reads with explicit instructions.

*
* *

Moving data around is not the only thing that can be done with wider registers. You can do all sorts of stuff, and that’s why SIMD instructions have been added to most CPU lines, whether it’s Intel’s SSE (2, 3, 4) or PowerPC’s AltiVec.

But even without using SIMD instructions, there are things that can be done. Let us look back at memfrob—I still think it’s stupid—and use it as an example. A basic memfrob implementation must look something like:

#include <stdint.h>
#include <stddef.h>

void * my_memfrob(void * buffer, size_t length)
 {
  /* xor every byte with 42, one byte at a time */
  for (size_t i=0;i<length;i++)
   ((uint8_t*)buffer)[i]^=42;

  return buffer;
 }

This is a direct implementation, but it reads and writes memory one byte at a time. And, given the timing results we’ll discuss in a moment, I’m pretty sure that’s how it’s implemented in the GNU C library as well. Using objdump -D -M intel, we see that the generated code is straightforward:

<my_memfrob>:
   48 85 f6     test   rsi,rsi
   74 0f        je     done
   31 c0        xor    eax,eax
xor_loop:
   80 34 38 2a  xor    BYTE PTR [rax+rdi],0x2a
   48 83 c0 01  add    rax,0x1
   48 39 f0     cmp    rax,rsi
   75 f3        jne    xor_loop
done:
   48 89 f8     mov    rax,rdi
   c3           ret   

I have added labels for readability. So how fast is it? It’s the same speed as the GNU-supplied function. On my AMD 4000+, both take about 0.000014s (14μs) to frobnicate a block of 10000 bytes. Now, let us bundle the reads and writes into 32-bit registers. The naïve implementation now becomes:

void * aligned32_memfrob(void * buffer, size_t length)
 {
  size_t stop_offset= length & ~((size_t)0x3);   /* largest multiple of 4 <= length */
  size_t i;

  /* frobnicate 4 bytes at a time */
  for (i=0;i<stop_offset;i+=4)
   *(uint32_t*)((uint8_t*)buffer+i)^=UINT32_C(0x2a2a2a2a);

  /* the last few bytes, if any */
  for ( ;i<length;i++)
   ((uint8_t*)buffer)[i]^=42;

  return buffer;
 }

This implementation reads/writes as many 32-bit memory chunks as possible without exceeding the buffer’s length; the last few bytes, if any, are handled in the second loop. Each iteration of the first loop reads 4 bytes, xors them all at once with the constant 0x2a2a2a2a (0x2a being hexadecimal for 42), and writes them back together. The generated code looks like:

<aligned32_memfrob>:
  48 89 f2              mov    rdx,rsi
  31 c0                 xor    eax,eax
  48 83 e2 fc           and    rdx,0xfffffffffffffffc
  74 15                 je     second_stage
  0f 1f 44 00 00        nop    DWORD PTR [rax+rax+0x0]
first_loop:
  81 34 38 2a 2a 2a 2a  xor    DWORD PTR [rax+rdi],0x2a2a2a2a
  48 83 c0 04           add    rax,0x4
  48 39 c2              cmp    rdx,rax
  77 f0                 ja     first_loop
second_stage:
  48 39 c6              cmp    rsi,rax
  76 17                 jbe    done
  48 8d 04 07           lea    rax,[rdi+rax]
  48 8d 14 37           lea    rdx,[rdi+rsi]
  0f 1f 00              nop    DWORD PTR [rax]
second_loop:
  80 30 2a              xor    BYTE PTR [rax],0x2a
  48 83 c0 01           add    rax,0x1
  48 39 d0              cmp    rax,rdx
  75 f4                 jne    second_loop
done:
  48 89 f8              mov    rax,rdi
  c3                    ret    

This version generates more code, but it is also immensely faster: it takes about 2.2μs to perform the same task as my_memfrob. We have traded a bit of code complexity for far fewer memory accesses, and it paid off quite a bit! We could repeat the same exercise with 64 bits, but the gain, on my machine anyway, is smaller: 1.7μs vs 2.2μs, a modest further speed-up. Still, it is immensely faster than the byte-oriented routine. In summary:

Method               Time per Block   Speed-up
memfrob              14.7μs           none
my_memfrob           15.0μs           none
aligned32_memfrob     2.2μs           6.7:1
aligned64_memfrob     1.7μs           8.6:1
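
The 64-bit variant is not listed above, but it follows exactly the same pattern as aligned32_memfrob; a minimal sketch (the code actually timed may differ in small details) would be:

void * aligned64_memfrob(void * buffer, size_t length)
 {
  size_t stop_offset= length & ~((size_t)0x7);   /* largest multiple of 8 <= length */
  size_t i;

  /* frobnicate 8 bytes at a time */
  for (i=0;i<stop_offset;i+=8)
   *(uint64_t*)((uint8_t*)buffer+i)^=UINT64_C(0x2a2a2a2a2a2a2a2a);

  /* the last few bytes, if any */
  for ( ;i<length;i++)
   ((uint8_t*)buffer)[i]^=42;

  return buffer;
 }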

*
* *

Let us say something about unit testing—again! I have verified that all the proposed implementations are strictly equivalent to the GNU C library version by using it as the inverse (frobnicating twice restores the original data). The original buffer is copied, then one of the alternative implementations is applied to the copy, then the original memfrob is applied to get back the original contents in the secondary buffer. I then compare the original buffer to the defrobnicated buffer. To make sure it works for real, I initialize the buffer with random contents each time, and the test is repeated a few thousand times.
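
In code, the round-trip test looks something like this (a sketch; the helper name test_roundtrip and the buffer handling are illustrative, not the exact harness used):

#define _GNU_SOURCE     /* for memfrob() */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

void test_roundtrip(size_t length)
 {
  uint8_t * original=malloc(length);
  uint8_t * work=malloc(length);

  /* random contents each time */
  for (size_t i=0;i<length;i++)
   original[i]=rand() & 0xff;

  memcpy(work,original,length);

  aligned32_memfrob(work,length);   /* implementation under test              */
  memfrob(work,length);             /* GNU C library version as the inverse   */

  /* the defrobnicated buffer must match the original */
  assert(memcmp(original,work,length)==0);

  free(work);
  free(original);
 }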

*
* *

There are a few resources around the net that will help you transform naïve algorithms—I don’t say naïve in a derogatory way; it’s just to say that the implementation is straightforward, direct, and has no frills whatsoever—into bundled, memory-bandwidth-efficient algorithms.

7 Responses to Bundling Memory Accesses (Part I)

  1. rmn says:

    Nice post.

    Just a side note: when you transform your original one loop into two loops, you’re adding at least one branch prediction error – which costs (very little) performance.

    Also, Duff’s Device can be used to implement what your other loop did, in a more elegant (although complex) manner: http://en.wikipedia.org/wiki/Duff's_device

    • Steven Pigeon says:

      Yes, there’s an extra branch prediction cost, but since it is immensely faster with the bundled read/write loops, the cost is most certainly negligible.

      As for Duff’s device (also attributed to Thompson, IIRC, even if http://en.wikipedia.org/wiki/Duff%27s_device doesn’t seem to mention him), that’s probably a more hackerish way of implementing this; as for performance, it depends on how switch statements are managed by the compiler.

      • Steven Pigeon says:

        OK, I’ve done some tests, and it turns out that gcc isn’t very smart with the switches/cases. Optimizing with -O3, I find that the switch with n=4 cases is compiled to a mere if cascade:

        83 e6 03          and    esi,0x3
        48 83 fe 02       cmp    rsi,0x2
        74 1c             je     case2
        48 83 fe 03       cmp    rsi,0x3
        74 0e             je     case3 
        48 83 fe 01       cmp    rsi,0x1
        74 18             je     case1
        5b                pop    rbx
        c3                ret    
        

        The n=8 version doesn’t seem any smarter. At any rate, it doesn’t try to get O(lg n) comparisons. In fact, it results in a small pessimization: 2.3µs vs 1.7µs for the 64-bit Duff version vs the 64-bit non-Duff version. (I changed compilers in the meantime, from gcc 4.2.x to 4.4.1 as I upgraded to Koala since I wrote the first entry; for some reason, someone thought it’d be cool to optimize memfrob, and the new version is now considerably faster than the one distributed with 8.04 LTS… It performs similarly to aligned64_memfrob.)
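
        For reference, the kind of switch-based tail I tried looks something like this (a sketch, with an illustrative name; the code I actually timed may have differed in details):

        void * duff32_memfrob(void * buffer, size_t length)
         {
          size_t stop_offset= length & ~((size_t)0x3);
          size_t i;
          for (i=0;i<stop_offset;i+=4)
           *(uint32_t*)((uint8_t*)buffer+i)^=UINT32_C(0x2a2a2a2a);

          /* handle the 0 to 3 remaining bytes with fall-through cases
             instead of a second loop */
          uint8_t * p=(uint8_t*)buffer+i;
          switch (length & 0x3)
           {
            case 3: *p++^=42; /* fall through */
            case 2: *p++^=42; /* fall through */
            case 1: *p^=42;
           }

          return buffer;
         }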

        • rmn says:

          I’m a little surprised that the switch/case approach is transformed into the simplest (and most stupid) list of ifs. It could either compute the O(log n) solution (like you mentioned), or even use a branch table (although that could be a little slower, since predicting it will be a lot harder).

          It’s indeed pretty funny that memfrob has been optimized. I believe you now have a challenge – make an even faster version :) Perhaps you could go with SSE and its 128-bit registers?

          Thanks for the interesting read.

          • Steven Pigeon says:

            Further optimizing memfrob with SSE would be SStupid ;)

            But the if cascade is not that stupid: we notice that case 2 is tested first, maybe because the compiler understands it to be the most likely?

  2. Steven Pigeon says:

    I like to think that the guys who write compilers are more often smart than not. I suppose someone, a long time ago, studied the code generated for typical switch statements and came up with these heuristics?

    Maybe I could have used profile-guided optimization, but that’s a bit out of the scope of the current post ;)
