## YoU CanT MaKE BuBBleSorT FaSTER With ASseMbLY

In one of the classes I teach, we end up writing assembly language programs. And while I explain the (sometimes very relative) benefits of writing assembly language, I use bubble sort as an example where even carefully crafted assembly language doesn’t mean much: it’s a bad algorithm to start with.

YoU CanT MaKE BuBBleSorT FaSTER With ASseMbLY

Except that it’s not quite true.

First, let’s consider (standard) vanilla bubble sort:

```template <template <typename...> typename C,
typename... Ts>
void bubble_sort( C<Ts...> & coll )
{
if (coll.size())
{
bool swapped;
typename C<Ts...>::iterator last=coll.end()-1;
typename C<Ts...>::iterator i;
do
{
swapped=false;
i=coll.begin();
while (i!=last)
{
if (*i>*(i+1))
{
std::swap(*i,*(i+1));
swapped=true;
}
++i;
}
--last;
}
while (swapped);
}
}
```

So nothing fancy here, except that it should work on any container with bidirectional iterators, and types for which operator> is defined. To make write that in assembly language, we should simplify things a bit. Let’s say “containers” are flat arrays and the T types is int.

```
# void bubble(size_t nb, int items[]);

_Z6bubblemPi:
.LFB0:
.cfi_startproc

## rdi nb
## rsi items[]

mov rcx,rsi             # itemsp[
xor rdx,rdx             # bool 'swapped'
lea rdi,[rsi+rdi*4-4]   # last

.bubble_while:

cmp rcx,rdi
jge .bubble_while_done

mov eax,[rcx]
mov r9d,[rcx+4]

cmp eax,r9d
jle .bubble_while_next

mov [rcx],r9d
mov [rcx+4],eax
mov rdx,1

.bubble_while_next:

jmp .bubble_while

.bubble_while_done:

or rdx,rdx
jz .bubble_done

mov rcx,rsi
xor rdx,rdx

dec rdi
jnz .bubble_while

.bubble_done:

ret

.cfi_endproc

```

This piece of assembly language does exactly what you expect it to do: passes that scans the array and swaps items when they are out of order; pushing the largest one at the end of the array, and stopping whenever no swaps were performed.

This version is longer than would be necessary if we had memory-to-memory swap instructions (which we don’t, on x86), and we could make it a bit faster if we changed the jump by conditional moves. But that wouldn’t change much because we still scan the array two items at a time. But… what if we compared 4? 8?

The newer instruction sets allow just that! With AVX, the xmm registers are 128-bits wide and can hold 4 ints, the ymm registers are 256-bits wide (and hold 8)… and if you can do AVX-512, there are the zmm 512-bits wide registers! We cant easily sort the values within a register, but we can compare them, pairwise, with another register. For example, we can compute the minimum of two registers:

```min(xmmi,xmmj) :=
xmmi[0]=min(xmmi[0],xmmj[0])
xmmi[1]=min(xmmi[1],xmmj[1])
xmmi[2]=min(xmmi[2],xmmj[2])
xmmi[3]=min(xmmi[3],xmmj[3])
```

We can also compute the maximum of two registers. If we replace the
conditional swap by:

```a=min(t[i],t[i+1])
b=max(t[i],t[i+1])
t[i]=a;
t[j]=b;
```

…then we can use that to swap, in order, two items. If we use the parallel AVX version, we’d do that with t[i...i+3] and t[i+4...i+7]. They wouldn’t be completely in order, but that increases order a lot.

If fact, if we use 4- (or 8-) int wide min/max, we end up bubble sorting the array as if it was 4 separate columns. After a we passes, then all columns (0,4,8,…) (1,5,8,…) (2,6,10,…) (3,9,11,…) are sorted. That a Shell-sort like kick start. We can then finish the job with our basic bubble sort, hoping that no element is too far from its final location.

```
# void xmmbubble(size_t nb, int items[]);

_Z9xmmbubblemPi:
.LFB0:
.cfi_startproc

## rdi nb
## rsi items[]

mov r8,rsi
mov r9,rdi

and r9b,0xf1            # ~0x7
shl r9,2                # *sizeof(int)

.xmmbubble_while:

lea r10,[r8+16]
cmp r10,r9
jge .xmmbubble_next

movdqu xmm0,[r8]
movdqu xmm1,[r8+16]

vpminsd xmm2,xmm0,xmm1  # xmm2=min(xmm0,xmm1)
vpmaxsd xmm3,xmm0,xmm1  # xmm3=min(xmm0,xmm1)

movdqu [r8],xmm2
movdqu [r8+16],xmm3

jmp .xmmbubble_while

.xmmbubble_next:

mov r8,rsi
sub r9,16
cmp r9,r8
je .xmmbubble_done

jmp .xmmbubble_while

.xmmbubble_done:

# finishes with ordinary
# bubble sort
call _Z6bubblemPi

ret

.cfi_endproc

```

*
* *

That’s a lot of work. Do we get a good speed-up, then?

 Size × 16 Naïve XMM 1000 0.28s 0.025s 10000 38s 1.8s 100000 3813s 183s

Surprisingly, we do. A lot! The Shell Sort -like first pass moves most of the values around the right position, so the classic bubble sort finishes sorting by only making a few passes (I would need to work out of the details of the inversions, but I conjecture that since they are (number-of-columns) times smaller, the sort is (number-of-columns)-squared times faster!)

### 4 Responses to YoU CanT MaKE BuBBleSorT FaSTER With ASseMbLY

1. jp says:

Data-oriented SIMD operations on the metal make for AMAZING speed-ups! Love it, great post.

• Thanks. It was a bonus question for an assignment (a course on assembly language). (But shh, don’t tell.)

2. jobleonard says:

Wouldn’t this also speed up other sorting algorithms that exploit existing order? Natural merge sort and Timsort come to mind

• Probably. I guess we could sort a block using a sorting network-thingie (using min, max, and shuffle), and that would be a good start for merge sort.