In one of the classes I teach, we end up writing assembly language programs. And while I explain the (sometimes very relative) benefits of writing assembly language, I use bubble sort as an example where even carefully crafted assembly language doesn’t mean much: it’s a bad algorithm to start with.

Except that it’s not quite true.

First, let’s consider (standard) vanilla bubble sort:

template <template <typename...> typename C, typename... Ts> void bubble_sort( C<Ts...> & coll ) { if (coll.size()) { bool swapped; typename C<Ts...>::iterator last=coll.end()-1; typename C<Ts...>::iterator i; do { swapped=false; i=coll.begin(); while (i!=last) { if (*i>*(i+1)) { std::swap(*i,*(i+1)); swapped=true; } ++i; } --last; } while (swapped); } }

So nothing fancy here, except that it should work on any container with bidirectional iterators, and types for which `operator>` is defined. To make write that in assembly language, we should simplify things a bit. Let’s say “containers” are flat arrays and the `T` types is int.

# void bubble(size_t nb, int items[]); _Z6bubblemPi: .LFB0: .cfi_startproc ## rdi nb ## rsi items[] mov rcx,rsi # itemsp[ xor rdx,rdx # bool 'swapped' lea rdi,[rsi+rdi*4-4] # last .bubble_while: cmp rcx,rdi jge .bubble_while_done mov eax,[rcx] mov r9d,[rcx+4] cmp eax,r9d jle .bubble_while_next mov [rcx],r9d mov [rcx+4],eax mov rdx,1 .bubble_while_next: add rcx,4 jmp .bubble_while .bubble_while_done: or rdx,rdx jz .bubble_done mov rcx,rsi xor rdx,rdx dec rdi jnz .bubble_while .bubble_done: ret .cfi_endproc

This piece of assembly language does exactly what you expect it to do: passes that scans the array and swaps items when they are out of order; pushing the largest one at the end of the array, and stopping whenever no swaps were performed.

This version is longer than would be necessary if we had memory-to-memory swap instructions (which we don’t, on x86), and we could make it a bit faster if we changed the jump by conditional moves. But that wouldn’t change much because we still scan the array two items at a time. But… what if we compared 4? 8?

The newer instruction sets allow just that! With AVX, the `xmm` registers are 128-bits wide and can hold 4 `int`s, the `ymm` registers are 256-bits wide (and hold 8)… and if you can do AVX-512, there are the `zmm` 512-bits wide registers! We cant easily sort the values within a register, but we can compare them, pairwise, with another register. For example, we can compute the minimum of two registers:

min(xmmi,xmmj) := xmmi[0]=min(xmmi[0],xmmj[0]) xmmi[1]=min(xmmi[1],xmmj[1]) xmmi[2]=min(xmmi[2],xmmj[2]) xmmi[3]=min(xmmi[3],xmmj[3])

We can also compute the maximum of two registers. If we replace the

conditional swap by:

a=min(t[i],t[i+1]) b=max(t[i],t[i+1]) t[i]=a; t[j]=b;

…then we can use that to swap, in order, two items. If we use the parallel AVX version, we’d do that with `t[i...i+3]` and `t[i+4...i+7]`. They wouldn’t be completely in order, but that increases order a lot.

If fact, if we use 4- (or 8-) int wide min/max, we end up bubble sorting the array as if it was 4 separate columns. After a we passes, then all columns (0,4,8,…) (1,5,8,…) (2,6,10,…) (3,9,11,…) are sorted. That a Shell-sort like kick start. We can then finish the job with our basic bubble sort, hoping that no element is too far from its final location.

# void xmmbubble(size_t nb, int items[]); _Z9xmmbubblemPi: .LFB0: .cfi_startproc ## rdi nb ## rsi items[] mov r8,rsi mov r9,rdi and r9b,0xf1 # ~0x7 shl r9,2 # *sizeof(int) add r9,r8 # last item .xmmbubble_while: lea r10,[r8+16] cmp r10,r9 jge .xmmbubble_next movdqu xmm0,[r8] movdqu xmm1,[r8+16] vpminsd xmm2,xmm0,xmm1 # xmm2=min(xmm0,xmm1) vpmaxsd xmm3,xmm0,xmm1 # xmm3=min(xmm0,xmm1) movdqu [r8],xmm2 movdqu [r8+16],xmm3 add r8,16 jmp .xmmbubble_while .xmmbubble_next: mov r8,rsi sub r9,16 cmp r9,r8 je .xmmbubble_done jmp .xmmbubble_while .xmmbubble_done: # finishes with ordinary # bubble sort call _Z6bubblemPi ret .cfi_endproc

*

* *

That’s a lot of work. Do we get a good speed-up, then?

Size × 16 | Naïve | XMM |

1000 | 0.28s | 0.025s |

10000 | 38s | 1.8s |

100000 | 3813s | 183s |

Surprisingly, we do. A lot! The Shell Sort -like first pass moves most of the values *around* the right position, so the classic bubble sort finishes sorting by only making a few passes (I would need to work out of the details of the inversions, but I conjecture that since they are (number-of-columns) times smaller, the sort is (number-of-columns)-squared times faster!)

Data-oriented SIMD operations on the metal make for AMAZING speed-ups! Love it, great post.

Thanks. It was a bonus question for an assignment (a course on assembly language). (But shh, don’t tell.)

Wouldn’t this also speed up other sorting algorithms that exploit existing order? Natural merge sort and Timsort come to mind

Probably. I guess we could sort a block using a sorting network-thingie (using min, max, and shuffle), and that would be a good start for merge sort.