Bit blenders: Move bits around quite a bit

Efficient bit manipulation is often found at the lower levels of many applications, let it be data compression, game programming, graphics manipulations, embedded control, or even efficient data representations, including databases bitmaps and indexes.

In some cases, one might rely on the often limited features of one’s favorite language—C++ in my case—to manipulate bits. In C++ (and in C), there are bit-fields that allow you to define how a piece of memory will be chopped up in regions each having a specific number of bits. Bit-fields can bring you only so far as to describe how bits are mapped onto a chunk of memory; they are of little or no help when you need to efficiently move bits around in a somewhat complicated fashion.

Fortunately, C/C++ provides a number of integer/bit-wise operators that will allow you to manipulate bits. These are the shifts operators, <<, >>, the bit-wise and &, the bit-wise or |, the exclusive or ^ (that exist only for bit-wise operations, there are no operator ^^ for logical exclusive or), and the negation ~. Strictly speaking, only the shifts, & and ~ are really necessary—indeed, consider the numerous applications of de Morgan‘s laws.

To extract a bit-string from a memory location, hardly anything more than shifting and masking (applying a bit-wise and with a a priori chosen value) is needed. For example, to set and test a bit within a bitmap, we only have to use

int test_bit(const unsigned char map[],int offset)
  return (map[offset/8] >> offset%8) & 1;

int set_bit(unsigned char map[], int offset)
  map[offset/8] |= 1 << offset%8;

To extract a bit-string (a series of bits) one can use essentially the same trick, provided that one can also map an integer at an arbitrary location in memory (which, for the time being, we will suppose is possible), shift the resulting integer by the desired amount and apply the mask:<br><br>

// Arbitrary alignment,
// small-endian version (more
// about endian in a later post)
unsigned read_bitstring( char * some_location,
                         int offset,
                         int length )
 int bitstring = *(int *)some_location; // a ptr
 return (bitstring >> offset) & 
             (~0u >> BITS_PER_INT-length);

The kind of operations shown above are commonplace in a number of applications. Bitmaps are often used as an alternative to tree-based set data structures when the range is not too big and when speed is more important than the memory expended; bitstring extraction and insertion are found in just about any audio and video codec there is. Where intuition (or instinct) fails us, is where we have to reorder bits within a machine-sized register in a non-trivial way. In C, machine-sized register means a char, int, long or long long.

Why would one want to reorder bits within a register in the first place you might ask. That’s a valid question and there are many answers. Let me tell you about an almost amusing story (because it wasn’t nearly so at the time). In a previous job, the board designer hooked the data port upside down to the micro-controller port bits. That is, data bit 0 was wired on wire 15. Of course, it took us a while to find why my control bits written to the bus weren’t quite having the expected effect. Needless to say, rather than rewriting all the constant data to match the defective wiring, I just wrote a small routine that reversed the data being written to the port. At first, like everybody, I wrote a for-loop that shifted bits from the original data and shifted them back, opposite direction, into the new value:

unsigned reverse(unsigned x)
unsigned t=0;
for (int i=0;i<16;=1) t= (t<<1) | (x&1); return t; } [/sourcecode] In fact, this code is wrong on so many levels it's not even funny—despite the fact that it computes the right thing. First, it uses two shifts by 1 bit, 16 times. On a processor with poor shift instruction performance, this is very slow. Second, and worse, it uses a for-loop, which means it has to use an extra variable (i), and that it must have a compare instruction, and a conditional jump. Jumps are very, very bad for performance.

So I began wondering if there was a much better way to do that. So, first thing, I checked the instruction set manual to see if, by any chance, there was a reverse register instruction. Of course, no such luck. After a few minutes, I started doodling on a piece of paper operations such as rotates, shifts, etc. that could move bunch of bits near, or at, the place they would be if completely reverse. So I finally drew something like this

And that was pretty much the solution. While the drawing represent a byte size register, the solution generalizes nicely to any register size: for a register of size n (n being a power of two on virtually all processors) you will need exactly lg n stages to complete the reversal.
Let us go step by step. The first step:

swaps the higher and the lower half of the register:

x = (x >> 4) | (x << 4); [/sourcecode] Which, assuming x is unsigned char, produces efghabcd. This could be performed with a ror (rotate right) instruction, but bit-wise rotations aren’t available to C nor C++.

The next step swaps groups of two bits. Notice that the group ef and ab shift in the same direction by the same amount of bits. So do gh and cd. Indeed, grouping them allows us to write:

x=((x & 0xcc)>>2)|((x & 0x33)<<2); [/sourcecode] The masking enables only parts of the register. 0xcc masks only ef and ab. Similarly, 0x33 masks only gh and cd. Combining the two shifted values with a bit-wise or produces the correct value ghefdcab.

Now we repeat the trick one last time to shift odd bits into even bits and vice versa:

x = ((x & 0xaa) >> 1) | ((x & 0x55) << 1); [/sourcecode] and voilà, we're done: the reversal becomes: x=(x>>4) | (x<<4); x=((x & 0xcc) >> 2) | ((x & 0x33)<<2); x=((x & 0xaa) >> 1) | ((x & 0x55)<<1); [/sourcecode] So, that was an easy one, because the operations, at each level, are very regular. To reverse a register, we just mask half of the bits and shift them by smaller and smaller amounts. Let us consider another problem, that also happens often in practice: the perfect shuffle. Perfect shuffle (the type we’re interested in) occurs naturally into fast Fourier transforms and hash functions. The perfect shuffle is named in analogy to shuffling a deck of card. In each round of shuffling, you interleave cards held in one hand with those held in the other in a way that’s credibly randomized. Perfect shuffling interleaves the card in such a way that right-hand cards and left-hand cards strictly alternates. If I asked you to come up with a flow diagram just like we did for the register reversal, after some thinking, you would come up with the following diagram:

Now, let us decompose the motion again. The first steps merely swaps cd and ef, leaving ab and gh in place.

This leads to the following expression:

x=(x & 0xc3) | ((x & 0x30)>>2) | ((x & 0x0c)<<2); [/sourcecode] Finally, the last stage

swaps b and e on one side, and d and g
on the other, leaving a,fc, and h unchanged.

This yields

x=(x & 0x99) | ((x & 0x44)>>1) | ((x & 0x22)<<1); [/sourcecode] The perfect shuffle becomes [sourcecode language='c']

8 Responses to Bit blenders: Move bits around quite a bit

  1. Rod McMahon says:

    Hi, Good Blog….
    Can you help please. I am using itoa( base2) to convert an integer to an array so I can extract the bits.
    If say I convert 3 then the answer is 11 and 6 nulls, not 00000011 as I need. Interesting enough it does work if I reverse it twice using your formula even though there are null characters in the last 6 array locations.
    Is there an easy way to do this for any integer up to 255

    • Steven Pigeon says:

      If I understand your question correctly… The algorithm is meant to be performed into a single contiguous register, say a 32 bits integer. If you want to reverse 256 bits long integers, you will need to first swap around the largest possible integers as whole blocks (that is, reversing 256 bits by 64 bits chunks) then proceed with the above method for the blocks.

  2. Rod McMahon says:

    No Not guite
    Itoa() converts a numeric number to a binary in an array. If I say define the target array as A[7] then the 3 becomes
    A[6] = 1
    A[5] = 1
    A[4] to A[0] are all null characters
    What I need is the array to be 0000011 ir A[0]=1 A[1} = 1
    In other words move the bits to the other end.
    What is needed is an easy generalised method that will work for any number of bits in the array./ Try as I might I can only crack it by complicated code, not the methods you so nicely use

  3. Steven Pigeon says:

    Still not sure if you want reverse or just “left justified”. If the goal is to shift the most significant bit to bit 7, something like

    while (!(z & 0x40)) z<<=1;

    would do it. If you really want reverse, then above recipe would do what’s intended. Of course, the above works only on 8 bits numbers. Say flip8 reverses 8 bits numbers

    uint8_t flip8(uint8_t x)
      x=((x & 0xcc) >> 2) | ((x & 0x33)<<2);
      x=((x & 0xaa) >> 1) | ((x & 0x55)<<1);
      return x;
    uint16_t flip16(uint16_t x)
      return flip8(x>>8) | (flip8(x&0xff)) << 8; 

    would flip 16 bits numbers.

    Not sure if this answers your question better.

  4. […] can use a bit blender. A bit blender, at least how I defined it in a previous post, is an operation that move bits around […]

  5. Joe says:

    // Note: lookup1/lookup2 are just bit-reversed tables
    y = lookup1[x && 0xF0] || lookup2 [x && 0x0F];

    That’s probably the easiest way. Now if only I had an efficient way to reorder a 256-bit word into any arbitrary order. Most CPU instruction sets I’ve seen make that easily a rather ‘slow’ problem to solve. It’s one reason that AES uses byte-level swaps. :/

    I have a feeling that there’s a way to use something like matrix operators in linear algebra to automatically reduce the number of OR/XORs/ANDs/Shifts but there has to be some instruction that supports ‘blending’ in a sane manner, on some common CPU…

    • I think there are two different complexities here: sequential and parallel. On a sequential machine, that is, a “ordinary” CPU with only one instruction at a time, then we should arrive to O(n \lg n) operations, since we can see arbitrary shuffling as some kind of directed Quicksort, with depth O(\lg n). If we have some kind of machine that operates on n-bit wide registers, and that we can shift, or, xor, etc., on them in parallel, then we have a complexity of O(n + \lg n), since we can do O(n) shift+masks and do O(\lg n) ORs to get the final result. Or, if you have some parallel machine that ORs values when they are written at the same location, you compute everything in O(1)!

  6. […] discussed moving bits around before. The first function (that I called “blender”) moves bits around […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: