Of small copies and templates.

Sometimes, it’s the little things that make a big difference. For example, moving small, not machine-sized, pieces of data may be a problem. If it happens to be machine-sized (that is, the size of something like uint32_t), then the copy is readily done by a single (integer) copy. But what if it’s, say, 5 bytes?

The first solution that comes to mind is either a for-loop or a call to a library function like memcpy. The for-loop would give something like:

char * dest = ...
char * src = ...
for (std::size_t i=0;i<5;i++) dest[i]=src[i];

If you’re lucky, the compiler understands the small copy and optimizes. More likely, it will merely unroll the loop and generate something like:

14d4:   0f b6 07    movzx  eax,BYTE PTR [rdi]
14df:   88 04 24    mov    BYTE PTR [rsp],al
14e2:   0f b6 47 01 movzx  eax,BYTE PTR [rdi+0x1]
14e6:   88 44 24 01 mov    BYTE PTR [rsp+0x1],al
14ea:   0f b6 47 02 movzx  eax,BYTE PTR [rdi+0x2]
14ee:   88 44 24 02 mov    BYTE PTR [rsp+0x2],al
14f2:   0f b6 47 03 movzx  eax,BYTE PTR [rdi+0x3]
14f6:   88 44 24 03 mov    BYTE PTR [rsp+0x3],al
14fa:   0f b6 47 04 movzx  eax,BYTE PTR [rdi+0x4]
14fe:   88 44 24 04 mov    BYTE PTR [rsp+0x4],al

Let’s see how we can fix that!

If the number of bytes to copy is unknown at compile time, there’sn’t much we can do but to rely on memcpy that has quite good performance. I once spent an afternoon trying to get better performance than memcpy with the same kind of tricks and only managed a shadow of a ghost of a flea of a speedup. But if the number of bytes is known at compile time, we can use templates to force the compiler to generate better code!

Let’s start with something simple: copy the first byte, and if there’s more to copy, call (recursively) again:

template<int n>
inline void copy(char *src, char *dest)
 {
  *dest=*src;
  copy<n-1>(src+1,dest+1);
 }

// base case!
template<> inline void copy<0>(char *,char *) {}

The base case is needed to stop the recursion “at the bottom”. This unrolls the copy and basically produce the same code as the previous for-loop. To speed things up, we must copy in bigger—as big as possible—chunks. Let’s suppose that the largest basic type is uint64_t (while in fact it may be given by max_align_t, probably long double, if you have a 64-bits CPU). Then while there are more than 64 bits left to copy, copy 64 bits, decrease the number of bytes to copy by 8, and do it again. Then repeat (once!) with 32 bits, then (once) with 16, and finally, the last byte.

template <std::size_t size>
void copy(char *s, char *d)
  {
   if (size>=8) // 8 bytes, 64 bits!
    {
     *(std::uint64_t*)d=*(std::uint64_t*)s;
     copy<size-8>(s+8,d+8);
    }
   else
    if (size>=4) // 4, 32 bits!
     {
      *(std::uint32_t*)d=*(std::uint32_t*)s;
      copy<size-4>(s+4,d+4);
     }
    else
     if (size>=2) // 4, 16 bits!
      {
       *(std::uint16_t*)d=*(std::uint16_t*)s;
       copy<size-2>(s+2,d+2);
      }
     else
      *d=*s; // last char.
  }

// aha-a. base caseS.
template <> void copy<              0>(char *, char *) {}
template <> void copy<(std::size_t)-1>(char *, char *) {}
template <> void copy<(std::size_t)-2>(char *, char *) {}
template <> void copy<(std::size_t)-3>(char *, char *) {}
template <> void copy<(std::size_t)-4>(char *, char *) {}
template <> void copy<(std::size_t)-5>(char *, char *) {}
template <> void copy<(std::size_t)-6>(char *, char *) {}
template <> void copy<(std::size_t)-7>(char *, char *) {}

Why so many base cases? Well, it seems the compiler can’t quite figure out that the recursive function is greedy and therefore tries to unroll all cases simultaneously, and therefore if size is zero, size-4 may be 0, -1, -2, -3, … And since we have size-8 in the template, there may be up to 8 bases cases!

The generated code is now, for 5 bytes:

14d4:   8b 07       mov   eax,DWORD PTR [rdi]
14de:   89 04 24    mov   DWORD PTR [rsp],eax
14e1:   0f b6 47 04 movzx eax,BYTE PTR [rdi+0x4]
14e5:   88 44 24 04 mov   BYTE PTR [rsp+0x4],al

*
* *

With the template, we went from 10 instructions to only 4, and also to simpler instructions. I do not know that’s the (real) speed difference between a simple mov and a movzx (move with zero extend), but the instruction takes one fewer byte. Maybe it’s inconsequential.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: