YoU CanT MaKE BuBBleSorT FaSTER With ASseMbLY

In one of the classes I teach, we end up writing assembly language programs. And while I explain the (sometimes very relative) benefits of writing assembly language, I use bubble sort as an example where even carefully crafted assembly language doesn’t mean much: it’s a bad algorithm to start with.

YoU CanT MaKE BuBBleSorT FaSTER With ASseMbLY

Except that it’s not quite true.

First, let’s consider (standard) vanilla bubble sort:

#include <utility>  // std::swap

template <template <typename...> typename C,
          typename... Ts>
void bubble_sort( C<Ts...> & coll )
 {
  if (coll.size())
   {
    bool swapped;
    typename C<Ts...>::iterator last=coll.end()-1; // end of the unsorted part
    typename C<Ts...>::iterator i;
    do
     {
      swapped=false;
      i=coll.begin();
      while (i!=last)
       {
        if (*i>*(i+1))
         {
          std::swap(*i,*(i+1));
          swapped=true;
         }
        ++i;
       }
      --last; // the largest element is now in place
     }
    while (swapped);
   }
 }

So nothing fancy here, except that it should work on any container with random-access iterators (we use end()-1 and i+1), and on types for which operator> is defined. To write that in assembly language, we should simplify things a bit. Let's say the "containers" are flat arrays and the type T is int.
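In C++, that specialized version might look like this (a sketch of what the assembly below implements, using the signature from the listing's comment):

#include <cstddef>
#include <utility>

// Specialized bubble sort: flat array of int.
void bubble(std::size_t nb, int items[])
 {
  if (nb > 1)
   {
    bool swapped;
    int * last = items + nb - 1;   // last element of the unsorted part
    do
     {
      swapped = false;
      for (int * i = items; i != last; ++i)
       if (*i > *(i + 1))
        {
         std::swap(*i, *(i + 1));
         swapped = true;
        }
      --last;                      // the largest element is now in place
     }
    while (swapped);
   }
 }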


# void bubble(size_t nb, int items[]);

        .intel_syntax noprefix
        .global _Z6bubblemPi

_Z6bubblemPi:
.LFB0:
        .cfi_startproc

        ## rdi nb
        ## rsi items[]

        mov rcx,rsi             # i = items
        xor rdx,rdx             # bool 'swapped'
        lea rdi,[rsi+rdi*4-4]   # last = &items[nb-1]

.bubble_while:

        cmp rcx,rdi
        jge .bubble_while_done

        mov eax,[rcx]
        mov r9d,[rcx+4]

        cmp eax,r9d
        jle .bubble_while_next

        mov [rcx],r9d
        mov [rcx+4],eax
        mov rdx,1

.bubble_while_next:

        add rcx,4
        jmp .bubble_while

.bubble_while_done:

        or rdx,rdx
        jz .bubble_done

        mov rcx,rsi
        xor rdx,rdx

        sub rdi,4               # --last (one int)
        jmp .bubble_while

.bubble_done:

        ret

        .cfi_endproc

This piece of assembly language does exactly what you expect it to do: it makes passes that scan the array and swap adjacent items when they are out of order, pushing the largest one to the end of the array, and stopping whenever a pass performs no swaps.

This version is longer than it would need to be if we had memory-to-memory swap instructions (which we don't, on x86), and we could make it a bit faster if we replaced the conditional jump with conditional moves. But that wouldn't change much, because we would still scan the array two items at a time. But… what if we compared 4? 8?
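For example, the compare-and-swap in the inner loop can be written without a branch; mainstream compilers typically turn these ternaries into cmov instructions (a sketch, not part of the original listing):

#include <cstddef>

// Branchless compare-and-swap of items[i] and items[i+1]:
// the two ternaries typically compile to cmov on x86-64.
inline void cswap(int items[], std::size_t i)
 {
  const int a = items[i];
  const int b = items[i + 1];
  items[i]     = (b < a) ? b : a;  // min
  items[i + 1] = (b < a) ? a : b;  // max
 }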

The newer instruction sets allow just that! With AVX, the xmm registers are 128 bits wide and can hold 4 ints, the ymm registers are 256 bits wide (and hold 8)… and if you can do AVX-512, there are the 512-bit-wide zmm registers (which hold 16)! We can't easily sort the values within a register, but we can compare them, pairwise, with another register. For example, we can compute the element-wise minimum of two registers:

min(xmmi,xmmj) :=
  xmmi[0]=min(xmmi[0],xmmj[0])
  xmmi[1]=min(xmmi[1],xmmj[1])
  xmmi[2]=min(xmmi[2],xmmj[2])
  xmmi[3]=min(xmmi[3],xmmj[3])

We can also compute the maximum of two registers. If we replace the
conditional swap by:

a=min(t[i],t[i+1])
b=max(t[i],t[i+1])
t[i]=a;
t[i+1]=b;

…then we can use that to put two items in order. If we use the parallel AVX version, we'd do the same thing with t[i…i+3] and t[i+4…i+7]. The two blocks wouldn't end up completely in order, but each step increases order a lot.
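With compiler intrinsics, that pairwise min/max exchange looks like this (a sketch; _mm_min_epi32 and _mm_max_epi32 are the SSE4.1 intrinsics behind (v)pminsd and (v)pmaxsd):

#include <immintrin.h>

// Lane-wise compare-exchange: afterwards, each of the four lanes of lo
// holds the smaller value and the corresponding lane of hi the larger.
inline void lane_cswap(__m128i & lo, __m128i & hi)
 {
  const __m128i mn = _mm_min_epi32(lo, hi);  // (v)pminsd
  const __m128i mx = _mm_max_epi32(lo, hi);  // (v)pmaxsd
  lo = mn;
  hi = mx;
 }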

In fact, if we use 4- (or 8-) int-wide min/max, we end up bubble sorting the array as if it were 4 separate interleaved columns. After a few passes, all columns (0,4,8,…), (1,5,9,…), (2,6,10,…), (3,7,11,…) are sorted. That's a Shell sort-like kick start. We can then finish the job with our basic bubble sort, hoping that no element is too far from its final location.
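Before going down to assembly, here is roughly that two-phase scheme with intrinsics (a sketch, assuming SSE4.1/AVX and the scalar bubble above; the assembly listing below is the real thing):

#include <cstddef>
#include <immintrin.h>

void xmm_bubble(std::size_t nb, int items[])
 {
  const std::size_t n = nb & ~std::size_t(7);       // multiple of 8 ints (two xmm)
  for (std::size_t last = n; last >= 8; last -= 4)  // shrink the range each pass
   for (std::size_t i = 0; i + 8 <= last; i += 4)
    {
     __m128i lo = _mm_loadu_si128(reinterpret_cast<__m128i const *>(items + i));
     __m128i hi = _mm_loadu_si128(reinterpret_cast<__m128i const *>(items + i + 4));
     // lane-wise compare-exchange of items[i..i+3] and items[i+4..i+7]
     _mm_storeu_si128(reinterpret_cast<__m128i *>(items + i),     _mm_min_epi32(lo, hi));
     _mm_storeu_si128(reinterpret_cast<__m128i *>(items + i + 4), _mm_max_epi32(lo, hi));
    }
  bubble(nb, items);  // finish with the ordinary bubble sort
 }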


# void xmmbubble(size_t nb, int items[]);

        .intel_syntax noprefix
        .global _Z9xmmbubblemPi

_Z9xmmbubblemPi:
.LFB0:
        .cfi_startproc

        ## rdi nb
        ## rsi items[]

        mov r8,rsi
        mov r9,rdi

        and r9b,0xf8            # ~0x7: round nb down to a multiple of 8
        shl r9,2                # *sizeof(int)
        add r9,r8               # end of the region sorted by SIMD

.xmmbubble_while:

        lea r10,[r8+16]
        cmp r10,r9
        jge .xmmbubble_next

        vmovdqu xmm0,[r8]       # VEX loads, matching vpminsd/vpmaxsd
        vmovdqu xmm1,[r8+16]

        vpminsd xmm2,xmm0,xmm1  # xmm2=min(xmm0,xmm1)
        vpmaxsd xmm3,xmm0,xmm1  # xmm3=max(xmm0,xmm1)

        vmovdqu [r8],xmm2
        vmovdqu [r8+16],xmm3

        add r8,16
        jmp .xmmbubble_while

.xmmbubble_next:

        mov r8,rsi
        sub r9,16
        cmp r9,r8
        jbe .xmmbubble_done     # jbe, not je: r9 goes below r8 when nb<8

        jmp .xmmbubble_while

.xmmbubble_done:

        # finishes with ordinary
        # bubble sort
        call _Z6bubblemPi

        ret

        .cfi_endproc

* * *

That’s a lot of work. Do we get a good speed-up, then?

Size (×16)    Naïve      XMM
1000          0.28s      0.025s
10000         38s        1.8s
100000        3813s      183s

Surprisingly, we do. A lot! The Shell sort-like first pass moves most of the values close to their final position, so the classic bubble sort finishes the job in only a few passes. (I would need to work out the details of the inversions, but I conjecture that since there are (number-of-columns) times fewer of them, the sort is (number-of-columns)-squared times faster!)
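One way to probe that conjecture is to count inversions before and after the SIMD phase (a sketch; count_inversions is a hypothetical helper, written naïvely in O(n²) since it is only instrumentation):

#include <cstddef>

// Hypothetical helper: counts pairs (i,j), i<j, with items[i]>items[j].
std::size_t count_inversions(std::size_t nb, const int items[])
 {
  std::size_t inversions = 0;
  for (std::size_t i = 0; i < nb; ++i)
   for (std::size_t j = i + 1; j < nb; ++j)
    if (items[i] > items[j])
     ++inversions;
  return inversions;
 }

Since bubble sort performs exactly one swap per inversion, whatever fraction of the inversions the SIMD passes remove is work the scalar finish never has to do.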

4 Responses to YoU CanT MaKE BuBBleSorT FaSTER With ASseMbLY

  1. jp says:

    Data-oriented SIMD operations on the metal make for AMAZING speed-ups! Love it, great post.

  2. jobleonard says:

    Wouldn’t this also speed up other sorting algorithms that exploit existing order? Natural merge sort and Timsort come to mind
