Fixed-point arithmetic combines the ability to represent fractional values with very simple operations. Unlike floating point, fixed-point arithmetic can be done entirely with integer operations, which explains its popularity on DSPs of all sorts.

Basically, a fixed-point number is just a fixed-width group of bits with an unmoving, virtual point. For example, in a 16-bit register, four bits could be devoted to the fractional part and 12 to the integer part, making a Q12.4 fixed-point number (the Qn.m notation says that we have n integer bits and m fractional bits). In a fixed-point number, the fractional bits behave as you would expect: the first after the point is worth 1/2, the next one 1/4, etc. Furthermore, the integers can be two’s-complement.

What makes it especially nice is the arithmetic that comes with fixed-point numbers:

**Addition/subtraction**. It behaves exactly as two’s-complement integer addition and subtraction. Therefore, adding/subtracting two fixed-point numbers can be done directly by adding/subtracting the registers as signed integers.

**Multiplication**. Multiplying together two Qn.m fixed-point numbers results in a Q2n.2m fixed-point number, which we must restore to Q2n.m at least (as always, if it overflows, that’s your problem). To multiply two Qn.m fixed-point numbers, we use integer multiplication and shift back (right) by m bits to restore the result to Qx.m. The nice thing is that it results in the correct truncation.

**Division**. Division is the tricky one (but isn’t much harder). Let

$a = \frac{x}{2^m}$ and $b = \frac{y}{2^m}$

be two Qn.m values, where $x$ and $y$ are the integers held in the registers. Then

$\frac{a}{b} = \frac{x/2^m}{y/2^m} = \frac{x}{y}$.

It’s as if we lost the fractional bits. That’s a problem because the quotient isn’t an integer, and it should be represented in $2^m$ths. To express the result correctly in $2^m$ths, we multiply the dividend (the number being divided) by $2^m$, which is only a shift left.

So basically, fixed-point numbers are numbers expressed in $2^m$ths.
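To make the three rules concrete, here is a minimal sketch of Q12.4 arithmetic on raw `int16_t` registers. The helper names (`to_q`, `from_q`, `q_add`, `q_mul`, `q_div`) are made up for this illustration, and the right shift of a negative value is assumed to be arithmetic (guaranteed only from C++20 on, but true of mainstream compilers).

```cpp
#include <cstdint>

// Q12.4 stored in an int16_t: the represented value is raw / 2^m.
constexpr int m = 4; // number of fractional bits

// Conversions, only for getting values in and out.
constexpr std::int16_t to_q(double v) { return static_cast<std::int16_t>(v * (1 << m)); }
constexpr double from_q(std::int16_t r) { return static_cast<double>(r) / (1 << m); }

// Addition is plain signed integer addition on the registers.
constexpr std::int16_t q_add(std::int16_t a, std::int16_t b)
{
    return static_cast<std::int16_t>(a + b);
}

// Multiplication: widen, multiply, then shift right by m to drop
// the m extra fractional bits (assumes an arithmetic right shift).
constexpr std::int16_t q_mul(std::int16_t a, std::int16_t b)
{
    return static_cast<std::int16_t>((static_cast<std::int32_t>(a) * b) >> m);
}

// Division: scale the dividend by 2^m first, so the quotient
// keeps its m fractional bits.
constexpr std::int16_t q_div(std::int16_t a, std::int16_t b)
{
    return static_cast<std::int16_t>((static_cast<std::int32_t>(a) * (1 << m)) / b);
}
```

With 2.5 stored as 40 and 1.25 stored as 20, the product register is (40·20)>>4 = 50, that is 3.125, and the quotient register is (40·16)/20 = 32, that is 2.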

*

* *

So let’s implement that. First, we need a cute little template to get, given an integer type, the type containing the double number of bits (for multiplication and division):

    // enough.hpp
    #ifndef __MODULE_ENOUGH__
    #define __MODULE_ENOUGH__

    #include <cstdint>

    namespace enough {

     template <typename T> struct __twice; // incomplete type
     template <> struct __twice<int8_t> { using type = int16_t; };
     template <> struct __twice<int16_t> { using type = int32_t; };
     template <> struct __twice<int32_t> { using type = int64_t; };
     template <> struct __twice<int64_t> { using type = int64_t; }; // fails silently

     template <typename T> using twice=typename __twice<T>::type;

    } // namespace enough

    #endif // __MODULE_ENOUGH__

We could use C++2a concepts to make sure that the fixed-point template has consistent arguments (I’m just beginning to look into C++20), and make sure we overload the basic operators:

    // fixed.hpp
    #ifndef __MODULE_FIXED__
    #define __MODULE_FIXED__

    #include <cstdint>
    #include <type_traits>
    #include <enough.hpp>

    template <typename T, int scale=4>
     requires ((std::is_signed<T>::value==true) // c++2a
               && (scale<sizeof(T)*8))
    class fixed
     {
      private:

       T number; // the raw register, in 2^scale-ths

       // the dummy int tags construction from a raw register
       fixed(int, T n) :number(n) {}

      public:

       // with another fixed
       fixed operator+(const fixed & f) const { return fixed(0,number+f.number); }
       fixed operator-(const fixed & f) const { return fixed(0,number-f.number); }
       fixed operator*(const fixed & f) const { return fixed(0,(enough::twice<T>{number}*f.number) >> scale); }
       fixed operator/(const fixed & f) const { return fixed(0,(enough::twice<T>{number}<<scale)/f.number); }

       // with a float
       fixed operator+(float f) const { return *this+fixed(f); }
       fixed operator-(float f) const { return *this-fixed(f); }
       fixed operator*(float f) const { return *this*fixed(f); }
       fixed operator/(float f) const { return *this/fixed(f); }

       fixed & operator=(const fixed & f) { number=f.number; return *this; }
       bool operator==(const fixed & f) const { return number==f.number; }

       operator float() const { return number/(float)(1<<scale); }

       fixed(): number(0) {}
       fixed(const fixed<T,scale> & other) : number(other.number) {}
       fixed(float f) : number(f * (1<<scale)) {}
     };

    #endif // __MODULE_FIXED__

We just use it:

    #include <iostream>
    #include <fixed.hpp>

    int main()
     {
      fixed<int16_t> a(-4),b(3);

      std::cout
       << a+b << std::endl
       << a*b << std::endl
       << a/b << std::endl
       << a-b << std::endl
       ;

      return 0;
     }

*

* *

The complexity is visibly much, much lower than for floating-point numbers. Addition/subtraction is just signed integer add/sub; multiplication and division ask for an integer promotion (the “cast”) and a shift. If you choose the number of bits after the point wisely, you may even get away with no shifts and only memory operations. No wonder it’s used in low-complexity, low-power DSP chips.


If you can’t mirror the image in post-production, and your camera can’t do it either (in Zoom, it seems that the mirror option applies only to *your* display, not to the image you send, and is therefore perfectly useless), you need a bit of help. What you want is a simple mirror:

I built a rack with two sliding windows and a t-nut (apparently, that’s what they’re called). Cameras (all types, it seems) use ¼″-20 screws (¼″ diameter, 20 threads per inch), and they’re easy to get in any length from your favorite hardware store (I merely rummaged in my miscellaneous screw box to find a few). What’s harder to find in a misc. box are the t-nuts:

The t-nut will be used to screw the camera tripod onto the rail (maybe by replacing the usually very short default screw by a longer one). The teeth go on the opposite side so that as you fasten the screw, it holds the block of wood. Pretty much everything else is bits of scrap wood, wood glue, and generic black spray paint.

So there are three holes: one for the t-nut that will fasten the whole thing onto the camera tripod, and two for adjusting the camera and the mirror block. In principle, you would not need a slot for the camera, because you could measure everything exactly. But if you change the camera, or measure wrong, you’re … screwed. So a slot. Same for the mirror, which could also be glued in the optimal position, but, here also, a slot. Slots are made by drilling a ¼″ hole at each end, and then cutting away the wood in between with a chisel. Two small blocks are placed to raise the camera a bit: they are merely glued in place.

The mirror is glued to a 2×4 block (scraps from the scrap box). The mirror itself is a cut-off from some larger mirror, but you could use a mirror from a dollar-store handheld mirror. The hole in the block is smaller than ¼″ (3/16″) so that the screw actually stays screwed.

The finished product:

Once you’re happy with the setup and the glue has dried, it’s time for paint. That’s just generic black spray paint. Once the paint is very dry, you can add some felt on the block so that when screwed in, the camera won’t stick to the paint. You can see the felt in red, if you look closely:

Finally, the block being mobile, you can adjust the mirror whenever you set up your lightboard at home. A still from the camera, seen through the mirror:

*

* *

This post pretty much concludes my series on the lightboard. In these three posts, we’ve seen how to build the lightboard itself, how to set up the studio, and now how to build the mirror assembly. So all that’s left to do is actually use it!


First, let’s have a global look at the setup:

This is a panorama stitched from 3 or 4 individual pictures, since I don’t have a wide-enough lens to show all of it at once. Let’s have a second look, now with numbered items:

We have (in no particular order) (pay no attention to the misc stuff on the shelves: it’s the basement!):

- Computer. A ~10-year old MacBook Pro running Ubuntu 20.04 (Focal Fossa) for Zoom and recording.
- Spots. A set of Neewer BiColor 660 LED spots. They are positioned (see diagram below) so that they do not reflect in the glass from the point of view of the camera. Two are on the sides and one is in the center, below.
- Generic microphone stand reused to hold 3rd spot.
- Lectern. I put notes on what to do in what order during lectures or recording. Put at an angle so that it doesn’t show too much I’m reading from it when I write stuff on the board.
- Larger screen to show Zoom participants. Higher, the image would reflect on the glass. It sits on a standard milk crate.
- Coffee table (from Goodwill). Serves as hub. Not visible: USB 8-port hub, camera battery charger, extra sets of speakers (so that the lightboard doesn’t cut the sound from the laptop). Misc. thingies holder.
- Focus pattern to give the camera something to focus on during setup. Might also use a “standard test card” (not necessarily the one with the “Indian head”, but maybe this one).
- Ceiling light. Should be off.
- Bundle of 6′ electrical extensions.
- Crop marks. I adjust the view from the camera so that these marks are just barely outside the field of view. It also helps you stay within the field of view.
- HDMI to “fake USB Camera” adapter. Used to convert the movie camera HDMI output to the computer as a USB camera (for Zoom). Actually, on the image, it’s the USB cable that leads to the hub with the converter. Also HDMI output to second screen.
- A “puzzle play mat”, black and white. That helps if you’re going to stand up all day.

Not shown/not highlighted:

- Canon HF R800. A basic 1080p video camera with zoom, HDMI output and about 2 hours’ worth of battery life. It’s uncomplicated, and you can control some of the exposure, frame rate, and color balance. It’s a basic consumer-level thingie and it’s grainy in mid- to low-light, but it works well for hours on end.
- Canon Rebel T8i (alias EOS 850D, alias EOS Kiss-X10i) for quality recording. It does 4K (that nobody uses, but that may eventually turn out to be useful in post-prod, by shooting in 4K, applying image processing, then downsampling to 1080p), but I use 1080p for now.
- Good Manfrotto Tripod (maybe different from my model; I had mine for the last 20 years or so).
- Audiotechnica Lavaliere microphones.
- A bunch of extensions & power bars:
  - Audio cables with 2.5mm jacks (one going to the camera, the other to the computer).
  - HDMI extensions: one 25′ for the camera, one 6′ for the second screen.
  - At least two power bars: one for the computer and screen, one for the spots.
- Green masking tape marks. To align stuff. Once you’re happy with your setup, use masking tape (easy to remove) to mark what goes where.
- Curtains behind the lightboard but also behind the camera. It keeps the rest of the room from reflecting in the lightboard glass.

The general arrangement is (as seen from above):

The spots are placed to light the screen evenly, but not reflect in it. Two are placed at eye height (for me, 1m70 / 5′8″ or so) and the center one as low as possible but also as close as possible to the lightboard. For that one, you’ll need something other than the default stand that comes with it (I use a microphone stand).

*

* *

Finally, you’ll have to experiment. Exposure settings on your camera will vary (because you have a different lens, different brand, etc.). Light will vary. Maybe you’ll find that spots at 50% are good. Maybe you’ll want a darker background. Or not.

The markers I use are Expo Neons, but other styles work well too (liquid chalk works well visually, but isn’t really dry-erasable); some are just really bad (everything that’s vaguely wax-based, like Crayola “dry erase”).

You’ll also have to experiment with sound. Boom mike or Lavaliere? Record sound separately and resync in post-prod or record on the camera itself? I prefer recording on the camera (thus the audio cable extensions).

Finally, the software. As an enthusiastic penguinist™, I avoid proprietary software. For now, I use Kdenlive (a KDE application that runs just fine in Gnome). I still have to figure out everything in it.

The rest is up to you.

An example: Des nombres à l’égyptienne.

But let’s take the problem the other way around: what about definitions that give you the smallest integer type, not from the number of bits (because that’s trivial with `intXX_t`) but from the maximum value you need to represent?

There are plenty of reasons to use the smallest possible integer to store values with known bounds: smaller files, smaller data structures, more useful data in the same cache line. The C99 <stdint.h> isn’t much use here because it lacks metaprogramming, or more exactly, the C preprocessor and old-style macros are too weak to provide the kind of metaprogramming we need here:

- A compile-time function that tells us how many bits are needed to represent a max value;
- A template that takes a number of bits and decides on the smallest integer accommodating it.

The first `constexpr` function gives the number of bits needed to represent a *maximum value*, that is, it doesn’t think you need 5000 values (0-4999) but that you need to represent the value 5000 (0-5000). So if the maximum value is 4, you do need 3 bits (because 4 is $100_2$); so it’s not quite log-base-2 (and don’t come whine in the comments).

    #include <cstddef>

    ////////////////////////////////////////
    // c++14 and +
    constexpr std::size_t bits_from_value(std::size_t n)
     {
      if (n)
       return (n<2)?1:(1+bits_from_value(n/2));
      else
       return 0;
     }

Now, let’s create a template that takes a number of bits and decides on the smallest integer that accommodates that number of bits:

    #include <cstddef>
    #include <cstdint>

    ////////////////////////////////////////
    // should also do signed...
    template <int x> struct __just_enough_uint; // incomplete type
    template <> struct __just_enough_uint<64> { using type = uint64_t; };
    template <> struct __just_enough_uint<32> { using type = uint32_t; };
    template <> struct __just_enough_uint<16> { using type = uint16_t; };
    template <> struct __just_enough_uint<8> { using type = uint8_t; };

    // rounds the bit count up to the next available width
    template <const int x>
    using just_enough_uint =
     typename __just_enough_uint<(x>32) ? 64 : ((x>16) ? 32 : ((x>8) ? 16 : 8))>::type;

    ////////////////////////////////////////
    template <typename T>
    struct bits_from_type
     {
      constexpr static std::size_t value = sizeof(T)*8;
     };

You may have to extend the above if your architecture has more possible types; but you should be fine with 8, 16, 32, and 64 bits. To create a variable of the `just_enough_uint` type, you use:

just_enough_uint<bits_from_value(13)> z; //enough for max value 13 (0...13)
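Since everything happens at compile time, the choices can be checked with `static_assert`. The sketch below is self-contained: it repeats the bit-counting recursion and uses `smallest_uint`, a hypothetical alternative spelling of the same selection written with `std::conditional_t` (it is not the template above, just an equivalent for illustration).

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Same recursion as bits_from_value above: bits needed to store the value n.
constexpr std::size_t bits_from_value(std::size_t n)
{
    return n == 0 ? 0 : (n < 2 ? 1 : 1 + bits_from_value(n / 2));
}

// Hypothetical equivalent of just_enough_uint, picking the narrowest
// unsigned type that has at least `bits` bits.
template <std::size_t bits>
using smallest_uint =
    std::conditional_t<(bits > 32), std::uint64_t,
    std::conditional_t<(bits > 16), std::uint32_t,
    std::conditional_t<(bits > 8),  std::uint16_t,
                                    std::uint8_t>>>;

static_assert(bits_from_value(4) == 3, "4 is 100 in binary");
static_assert(bits_from_value(13) == 4, "13 is 1101 in binary");
static_assert(bits_from_value(5000) == 13, "4096 <= 5000 < 8192");
static_assert(std::is_same<smallest_uint<bits_from_value(13)>, std::uint8_t>::value,
              "values up to 13 fit in one byte");
static_assert(std::is_same<smallest_uint<bits_from_value(5000)>, std::uint16_t>::value,
              "values up to 5000 need two bytes");
static_assert(sizeof(smallest_uint<33>) == 8, "33 bits need a 64-bit type");
```

If any of the assumptions were wrong, the translation unit would simply fail to compile, which is about the cheapest unit test there is.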

*

* *

This is but a small brick in a much wider scheme to save memory (or storage). Compact data structures have been a research interest for a while now and I’ve done a few things before. However, a more algorithmic approach is needed, as a couple of clever tricks, while helpful, aren’t a complete theory. More on this later.

So, the problem is, where do you get a lightboard on short notice? Well, you can build one. Let’s see how:

Basically, a lightboard is just a glass panel with legs; there are many ways you can build one. I opted for something very simple: a frame, transverse legs and a 4′×6′ pane of glass:

What you’ll need:

- 5 2×4 8′,
- 6 quarter round 1/2″, 8′
- 4 heavy duty shelving brackets,
- 4 T-shaped “flat angle” brackets,
- 4 “flat corner” brackets,
- 4 4″ wood screws,
- a large number of 1″ (¼) wood screws,
- 4 lockable swivel wheels,
- some black paint,
- 20 ¼”×2½” rectangle felts.

It took a small afternoon to cut everything to size and assemble it:

The quarter rounds will hold the sheet of glass (as shown in the hand drawing). I also used felts between the glass and the quarter rounds (4 on 6′ lengths, 3 on 4′ lengths) to hold the glass gently, as I feared that pressing the wood directly against the glass and then screwing it to the frame might break the glass. The assembled frame looks like this:

I painted the inside and the front of the frame black, to avoid glare and reflection into the glass. With the glass, it looks like this, with yours truly:

Total assembly time is about a week: an afternoon for the frame, another day for the paint, a couple of days drying and waiting for the glass to be delivered, and a few minutes assembling the last quarter rounds to hold the glass in place.

*

* *

Aside from the lightboard itself, you need to light the glass. Some use in-sheet lighting, but it seems to be only marginally useful from my tests. What works best are studio lights, placed in front of the glass and outside of view (you can see the glare on the frame in the picture above). You can get some from your favorite online store for more or less 100$ apiece.

The curtains behind the glass *and* in front (to prevent reflection of the surrounding room) are from online. They are 10’×12′ black muslin.

To write on the glass, basic whiteboard pens won’t work: they’re too transparent to be useful. You’ll need some thick, preferably fluorescent, pens, such as Expo Neons and the like. Some use liquid chalk pens, but I haven’t found any locally. I may order some later.

You’ll also need to figure out the type of video you want, the general brightness of the scene, etc. Anyway, you’ll want to maximize readability, and what I found worked best (for me) is to disable the camera’s auto-exposure and use a fixed ISO and aperture. For my camera, ISO 400 and f/4 does the trick:

*

* *

For sound, a “lavaliere” (or lavalier, or tie-clip…) type of clip-on microphone works well. I’m still experimenting with a microphone above me on a boom, and I’m not sure what works best yet.

*

* *

I’ll do another entry where I detail the rest of the “studio” more carefully. Until then, I’ll keep experimenting!


A polynomial is an expression of the form

$a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$.

A naïve approach to evaluating a polynomial would be to compute, independently, each monomial $a_i x^i$, each one at a cost of $i$ products ($i-1$ for $x^i$, plus one for $a_i$). When you sum over all $i$, you get $\frac{1}{2}n(n+1)$ products (and $n$ additions) for a polynomial of degree $n$. That’s way too much.

Someone, in the early 19th century, Horner, remarked that you could rewrite any polynomial as

$a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x a_n)\cdots))$,

giving us $n$ products and $n$ additions. That’s much better. It also turns out that if there’s nothing special about the polynomial (no zero coefficients), it’s optimal.

But Horner’s method (or Horner’s formula, depending on where you read about it) is inherently *sequential*, because all the products are nested. It’s not amenable to parallel processing, even in its simpler SIMD form.

However, while preparing lecture notes, I found that Estrin proposed a parallel algorithm to evaluate polynomials that minimizes both the wait time and the total number of products [1]. The scheme is shown here:

The original paper presents a method for polynomials with a degree of $2^k-1$, but you can easily adapt the splitting for an arbitrary degree. The first step is to split the polynomial into binomials, each of which is evaluated in parallel. You then combine those into new binomials, also evaluated in parallel, and again, until you have only one term left, that is, the answer.
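For an arbitrary degree, both schemes can be sketched as loops over a coefficient vector. This is an illustration under assumptions, not the code from the paper: the function names are made up, coefficients are stored lowest degree first, and a missing partner in an odd-sized round is treated as a zero coefficient.

```cpp
#include <cstddef>
#include <vector>

// Coefficients are stored lowest degree first: c[0] + c[1]*x + c[2]*x^2 + ...

// Horner: one product and one addition per coefficient, strictly sequential.
long long horner_eval(const std::vector<long long>& c, long long x)
{
    long long r = 0;
    for (std::size_t i = c.size(); i-- > 0; )
        r = r * x + c[i];
    return r;
}

// Estrin: each round pairs neighbours into binomials c[2i] + c[2i+1]*x,
// then squares x. The products within a round do not depend on each other.
long long estrin_eval(std::vector<long long> c, long long x)
{
    if (c.empty()) return 0;
    while (c.size() > 1)
    {
        const std::size_t half = (c.size() + 1) / 2;
        for (std::size_t i = 0; i < half; ++i)
        {
            // an odd-sized round has one term without a partner
            const long long hi = (2 * i + 1 < c.size()) ? c[2 * i + 1] : 0;
            c[i] = c[2 * i] + hi * x;
        }
        c.resize(half);
        if (c.size() > 1) x *= x; // next round works in powers of x^2
    }
    return c[0];
}
```

The inner `for` loop is where the parallelism lives: its iterations are independent, so a compiler (or a SIMD unit) is free to overlap them, while the Horner loop carries a dependency from one iteration to the next.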

What immediately comes to mind is a SIMD implementation where we use wide registers to do all products in parallel, but maybe just relying on instruction-level parallelism and the compiler is quite efficient for low-degree polynomials. Let’s try with degree seven. Let’s say, with

$9x^7+5x^6+7x^5+4x^4+5x^3+3x^2+3x+2$,

because why not. Let’s test an implementation:

    ////////////////////////////////////////
    int pow_naif(int x, int n)
     {
      int p=1;
      for (int i=0;i<n;i++,p*=x);
      return p;
     }

    ////////////////////////////////////////
    int eval_naif(int x)
     {
      return
       9*pow_naif(x,7)
       +5*pow_naif(x,6)
       +7*pow_naif(x,5)
       +4*pow_naif(x,4)
       +5*pow_naif(x,3)
       +3*pow_naif(x,2)
       +3*x
       +2;
     }

    /////////////////////////////////////////
    int eval_horner(int x)
     {
      return ((((((9*x+5)*x+7)*x+4)*x+5)*x+3)*x+3)*x+2;
     }

    ////////////////////////////////////////
    int eval_estrin(int x)
     {
      int t0=3*x+2;
      int t1=5*x+3;
      int t2=7*x+4;
      int t3=9*x+5;
      x*=x;
      int t4=t1*x+t0;
      int t5=t3*x+t2;
      x*=x;
      return t5*x+t4;
     }

We compile with all optimizations enabled, and evaluate the polynomial for $x$ from 0 to 1000000000. Times are:

| Method | Time (s) |
|---|---|
| Naïve | 1.93 |
| Horner | 1.66 |
| Estrin | 1.41 |

So despite not being explicitly parallel, Estrin’s version performs significantly better because of instruction-level parallelism. Let’s look at the generated code:

    0000000000000e00 <_Z11eval_estrini>:
     e00: 8d 04 fd 00 00 00 00   lea    eax,[rdi*8+0x0]
     e07: 89 f9                  mov    ecx,edi
     e09: 0f af cf               imul   ecx,edi
     e0c: 8d 54 07 05            lea    edx,[rdi+rax*1+0x5]
     e10: 29 f8                  sub    eax,edi
     e12: 0f af d1               imul   edx,ecx
     e15: 8d 44 02 04            lea    eax,[rdx+rax*1+0x4]
     e19: 89 ca                  mov    edx,ecx
     e1b: 0f af d1               imul   edx,ecx
     e1e: 0f af c2               imul   eax,edx
     e21: 8d 54 bf 03            lea    edx,[rdi+rdi*4+0x3]
     e25: 0f af d1               imul   edx,ecx
     e28: 8d 4c 7f 02            lea    ecx,[rdi+rdi*2+0x2]
     e2c: 01 ca                  add    edx,ecx
     e2e: 01 d0                  add    eax,edx
     e30: c3                     ret

We see that the clever use of `lea` allows different pipelines to compute address independently and that it is also used to multiply by the coefficients. Such magic wouldn’t occur if the coefficients were much less cooperative (say 27, or something).

*

* *

What about an actual SIMD implementation? Well, I gave it a try and my implementation has the same number of instructions as the sequential version generated by the compiler. It turns out that even if you can do a couple of multiplies in parallel, the butterfly-like structure asks you to shuffle the values around (using `pshufd`) and that negates any gain you get from parallelism (on some of my machines, it’s even slower!). Maybe there’s a better way of doing this. Questions for later.

[1] Gerald Estrin —

First, we have the most known of these approximations, the famous “Stirling formula”:

,

where the terms at the right are known as the Stirling series (the numerators are given by A046968 and the denominators by A046969). If you evaluate the complete series, it’s truly equal to $n!$.

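However, you may not quite want to evaluate an infinite series, and we find truncated versions:

$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$,

that we will call “Stirling” from now on; a “Stirling more” version could be

$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1+\frac{1}{12n}\right)$,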

or even a “Stirling most”:

.

The literature is fraught with approximations. For example, we find Gosper’s:

$n! \approx \sqrt{\left(2n+\frac{1}{3}\right)\pi}\left(\frac{n}{e}\right)^n$.

Everybody refers to [1] as the source of this approximation. However, while it can be a consequence of that paper, it’s *not* in it (also, its typography is a train wreck).

We have Burnside’s [2]:

$n! \approx \sqrt{2\pi}\left(\frac{n+\frac{1}{2}}{e}\right)^{n+\frac{1}{2}}$.

Then Mortici’s [3]:

.

There are plenty more, but let’s consider one more. Mohanty and Rummens’ [4]:

.

*

* *

How do those compare? Asymptotically, when $n$ is very large, the ratio of any of these approximations to $n!$ goes to 1. On smaller $n$, they also all kind-of-work OK:

From the figure above, we see that Gosper’s does very well, just a bit worse than “Stirling More”. Mohanty and Rummens’ does best, but it’s also quite a bit more complex than Gosper’s. What if we have a look at the digits that are output? The following shows the result (in *Mathematica*, which computes in “infinite precision”) rounded:

But what’s more telling, is the ratio to the real value:

But we can also have a look at the number of correct leading digits:

*

* *

So Gosper’s approximation is much better than just the truncated Stirling and compares to “Stirling More”, which is not that surprising because it’s very close to what you get by distributing the $\left(1+\frac{1}{12n}\right)$ into the square root; so it’s a good numerical trade-off. However, “Stirling More” and Mohanty & Rummens’ are comparable, with a slight advantage to the latter.
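For readers who want to reproduce the comparison without *Mathematica*, here is a minimal double-precision sketch. It assumes the usual textbook forms of the formulas: Stirling as $\sqrt{2\pi n}(n/e)^n$, “Stirling more” as the same with the extra factor $(1+\frac{1}{12n})$, Gosper as $\sqrt{(2n+\frac{1}{3})\pi}(n/e)^n$, and Burnside as $\sqrt{2\pi}\left((n+\frac{1}{2})/e\right)^{n+\frac{1}{2}}$. The function names are made up for the example, and `std::tgamma(n + 1)` stands in for the exact factorial.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;
const double kE = std::exp(1.0);

// Truncated Stirling: sqrt(2 pi n) (n/e)^n
double stirling(double n)
{
    return std::sqrt(2 * kPi * n) * std::pow(n / kE, n);
}

// Stirling with the first correction term (assumed form of "Stirling more").
double stirling_more(double n)
{
    return stirling(n) * (1 + 1 / (12 * n));
}

// Gosper: the correction is folded into the square root.
double gosper(double n)
{
    return std::sqrt((2 * n + 1.0 / 3) * kPi) * std::pow(n / kE, n);
}

// Burnside: sqrt(2 pi) ((n + 1/2)/e)^(n + 1/2)
double burnside(double n)
{
    return std::sqrt(2 * kPi) * std::pow((n + 0.5) / kE, n + 0.5);
}

// Reference value, n! = Gamma(n + 1).
double factorial_ref(double n)
{
    return std::tgamma(n + 1);
}

// Ratio of an approximation to the reference; 1 means perfect.
double ratio(double (*approx)(double), double n)
{
    return approx(n) / factorial_ref(n);
}
```

Calling `ratio(gosper, n)` for a few small `n` gives the same kind of curve as the figures: every ratio tends to 1 as `n` grows, and the interesting part is how fast. Doubles run out of range a little past 170!, so this is only good for small arguments.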

[1] R. William Gosper, Jr. — *Decision Procedure for Indefinite Hypergeometric Summation* — Procs. Nat. Acad. Science USA, vol 75 no 1 (1978) p. 42–46.

[2] W. Burnside — *A Rapidly Convergent Series for Log N!* — Messenger Math., vol 46 no ? (1917) p. 157–159

[3] Cristinel Mortici — *An Ultimate Extremely Accurate Formula for Approximation of the Factorial Function* — Archiv der Mathematik, vol 93 (2009) p. 37–45

[4] S. Mohanty, F. H. A. Rummens — *Comment on “An Improved Analytical Approximation to n!”* — J. Chem. Physics, vol 80 (1984) p. 591.

Let’s see what hypotheses are useful, and how we can use them to get a good idea on the number of bits needed.

**For Sound**

I have shown how dB and bits are related, here and also here. Basically, adding one bit to a code adds about 6 dB to the resulting signal. Now, by definition, the threshold of hearing is set at 0 dB. This corresponds to the weakest sound you can distinguish from true silence. The threshold of pain (the point where you kind of expect your ears to start bleeding) is somewhere above 120 dB. Much louder sounds (explosions, rocket launches, etc.) lead to actual hearing damage. If we assume that we stay in the 0 to 120 dB range, the useful range for safe sound reproduction, at about 6 dB per bit,

$\frac{120\ \mathrm{dB}}{6\ \mathrm{dB/bit}} = 20\ \mathrm{bits}$.

So about 20 bits would be enough. If you consider 0 dB as the threshold of hearing, you might want to use 1 or 2 more bits to account for people with much finer hearing (as the loudness contour chart would suggest). Rounding up to the next byte, you get 24 bits, which is what pros suggest you use.

**For Images**

That one took a bit more research to find good references. Some report that the total visual dynamic range is about 10 orders of magnitude (in appropriate luminosity units); others, like Fein and Szuts [1], report 16. Depending on the range, that’d yield

$\log_2 10^{10} \approx 33.2$ bits,

or

$\log_2 10^{16} \approx 53.2$ bits.

However, while the human eye *can* see luminosity on that range, it can’t do it *simultaneously*. The following figure (from Gonzalez & Woods, [2]) shows that around a base value (average scene luminosity), shown as $B_a$ in the figure, only a certain range can be perceived (with the lower end of the range marked as $B_b$). That range seems to be only 4 or 5 orders of magnitude, so only between

$\log_2 10^4 \approx 13.3$ and $\log_2 10^5 \approx 16.6$ bits.

So if we consider the simultaneously perceivable range around some standard average-but-bright-enough luminosity, we might get away with 16 bits per color component (maybe less?).
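Both the sound and the image estimates are the same computation, a base-2 logarithm of a ratio. A small sketch makes the arithmetic explicit; the function names are invented here, and the “about 6 dB per bit” figure is taken as $20\log_{10} 2 \approx 6.02$ dB.

```cpp
#include <cmath>

// Bits needed to span `db` decibels, at 20*log10(2) (about 6.02) dB per bit.
double bits_from_db(double db)
{
    return db / (20.0 * std::log10(2.0));
}

// Bits needed to span `k` orders of magnitude, that is, a ratio of 10^k.
double bits_from_decades(double k)
{
    return k * std::log2(10.0);
}
```

With these, 120 dB comes out just under 20 bits, 10 and 16 orders of magnitude come out around 33 and 53 bits, and 4 to 5 orders of magnitude come out around 13 to 17 bits.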

*

* *

The numbers we get are pretty much in line with what we find in audio and video. 24 bits is considered “professional” (but not necessarily useful, depending on the quantity of noise in the original source) for audio. HDMI supports up to 48 bits per pixel (16 bits per component) while digital cameras often sport 10, 12 or 14 bits per component.

[1] Alan Fein, Ete Zoltan Szuts — *Photoreceptors: Their Role in Vision* — Cambridge University Press (1982)

[2] Rafael C. Gonzalez, Richard E. Woods — *Digital Image Processing* — 2nd ed, Prentice Hall (2002)

Except that it’s not quite true.

First, let’s consider (standard) vanilla bubble sort:

    #include <utility>

    template <template <typename...> typename C, typename... Ts>
    void bubble_sort( C<Ts...> & coll )
     {
      if (coll.size())
       {
        bool swapped;
        typename C<Ts...>::iterator last=coll.end()-1;
        typename C<Ts...>::iterator i;

        do
         {
          swapped=false;
          i=coll.begin();
          while (i!=last)
           {
            if (*i>*(i+1))
             {
              std::swap(*i,*(i+1));
              swapped=true;
             }
            ++i;
           }
          --last;
         }
        while (swapped);
       }
     }

So nothing fancy here, except that it should work on any container with random-access iterators (because of `end()-1` and `i+1`), and types for which `operator>` is defined. To write that in assembly language, we should simplify things a bit. Let’s say “containers” are flat arrays and the `T` type is `int`.

    # void bubble(size_t nb, int items[]);
    _Z6bubblemPi:
    .LFB0:
            .cfi_startproc

            ## rdi nb
            ## rsi items[]

            mov rcx,rsi             # items[]
            xor rdx,rdx             # bool 'swapped'
            lea rdi,[rsi+rdi*4-4]   # last

    .bubble_while:
            cmp rcx,rdi
            jge .bubble_while_done
            mov eax,[rcx]
            mov r9d,[rcx+4]
            cmp eax,r9d
            jle .bubble_while_next
            mov [rcx],r9d
            mov [rcx+4],eax
            mov rdx,1

    .bubble_while_next:
            add rcx,4
            jmp .bubble_while

    .bubble_while_done:
            or rdx,rdx
            jz .bubble_done
            mov rcx,rsi
            xor rdx,rdx
            sub rdi,4               # --last (one int is 4 bytes)
            jnz .bubble_while

    .bubble_done:
            ret
            .cfi_endproc

This piece of assembly language does exactly what you expect it to do: passes that scan the array and swap items when they are out of order, pushing the largest one to the end of the array, and stopping whenever no swaps were performed.

This version is longer than would be necessary if we had memory-to-memory swap instructions (which we don’t, on x86), and we could make it a bit faster if we replaced the jumps with conditional moves. But that wouldn’t change much because we still scan the array two items at a time. But… what if we compared 4? 8?

The newer instruction sets allow just that! With AVX, the `xmm` registers are 128 bits wide and can hold 4 `int`s, the `ymm` registers are 256 bits wide (and hold 8)… and if you can do AVX-512, there are the 512-bit wide `zmm` registers! We can’t easily sort the values within a register, but we can compare them, pairwise, with another register. For example, we can compute the minimum of two registers:

    min(xmmi,xmmj) :=
      xmmi[0]=min(xmmi[0],xmmj[0])
      xmmi[1]=min(xmmi[1],xmmj[1])
      xmmi[2]=min(xmmi[2],xmmj[2])
      xmmi[3]=min(xmmi[3],xmmj[3])

We can also compute the maximum of two registers. If we replace the conditional swap by:

    a=min(t[i],t[i+1])
    b=max(t[i],t[i+1])
    t[i]=a;
    t[i+1]=b;

…then we can use that to swap, in order, two items. If we use the parallel AVX version, we’d do that with `t[i...i+3]` and `t[i+4...i+7]`. They wouldn’t be completely in order, but that increases order a lot.

In fact, if we use 4- (or 8-) int wide min/max, we end up bubble sorting the array as if it was 4 separate columns. After a few passes, all columns (0,4,8,…) (1,5,9,…) (2,6,10,…) (3,7,11,…) are sorted. That’s a Shell-sort-like kick start. We can then finish the job with our basic bubble sort, hoping that no element is too far from its final location.

    # void xmmbubble(size_t nb, int items[]);
    _Z9xmmbubblemPi:
    .LFB0:
            .cfi_startproc

            ## rdi nb
            ## rsi items[]

            mov r8,rsi
            mov r9,rdi
            and r9b,0xf8 # ~0x7
            shl r9,2     # *sizeof(int)
            add r9,r8    # last item

    .xmmbubble_while:
            lea r10,[r8+16]
            cmp r10,r9
            jge .xmmbubble_next
            movdqu xmm0,[r8]
            movdqu xmm1,[r8+16]
            vpminsd xmm2,xmm0,xmm1 # xmm2=min(xmm0,xmm1)
            vpmaxsd xmm3,xmm0,xmm1 # xmm3=max(xmm0,xmm1)
            movdqu [r8],xmm2
            movdqu [r8+16],xmm3
            add r8,16
            jmp .xmmbubble_while

    .xmmbubble_next:
            mov r8,rsi
            sub r9,16
            cmp r9,r8
            je .xmmbubble_done
            jmp .xmmbubble_while

    .xmmbubble_done:
            # finishes with ordinary
            # bubble sort
            call _Z6bubblemPi
            ret
            .cfi_endproc
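The same idea can be sketched in portable C++ by emulating the lanes with an inner loop. This is a scalar illustration under assumptions, not a translation of the assembly above: the function names are made up, the lane-wise min/max is written as a conditional swap (which has the same effect on each lane), and a ragged tail is simply left to the final ordinary bubble sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One "wide" pass over blocks of `lanes` ints: blocks k and k+1 are replaced
// by their lane-wise minimum and maximum. Returns true if anything moved.
bool lane_pass(std::vector<int>& t, std::size_t lanes, std::size_t blocks)
{
    bool swapped = false;
    for (std::size_t k = 0; k + 1 < blocks; ++k)
        for (std::size_t j = 0; j < lanes; ++j)
        {
            int& a = t[k * lanes + j];
            int& b = t[(k + 1) * lanes + j];
            if (a > b) { std::swap(a, b); swapped = true; }
        }
    return swapped;
}

// Ordinary bubble sort, used to finish the job.
void plain_bubble(std::vector<int>& t)
{
    if (t.empty()) return;
    std::size_t last = t.size() - 1;
    bool swapped;
    do
    {
        swapped = false;
        for (std::size_t i = 0; i < last; ++i)
            if (t[i] > t[i + 1]) { std::swap(t[i], t[i + 1]); swapped = true; }
        if (last > 0) --last;
    }
    while (swapped);
}

// Column-wise pre-pass followed by the ordinary bubble sort.
void lane_bubble_sort(std::vector<int>& t, std::size_t lanes = 4)
{
    std::size_t blocks = t.size() / lanes; // whole blocks only
    while (blocks > 1 && lane_pass(t, lanes, blocks))
        --blocks; // the last block now holds each column's maximum
    plain_bubble(t);
}
```

After the pre-pass, each of the `lanes` interleaved columns is sorted on its own, which is the Shell-sort-like state described above; the final `plain_bubble` only has to fix what is left between neighbouring columns.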

*

* *

That’s a lot of work. Do we get a good speed-up, then?

| Size × 16 | Naïve | XMM |
|---|---|---|
| 1000 | 0.28s | 0.025s |
| 10000 | 38s | 1.8s |
| 100000 | 3813s | 183s |

Surprisingly, we do. A lot! The Shell-sort-like first pass moves most of the values *around* the right position, so the classic bubble sort finishes sorting by only making a few passes (I would need to work out the details of the inversions, but I conjecture that since they are (number-of-columns) times smaller, the sort is (number-of-columns)-squared times faster!)

While discussing this algorithm in class, a student asked a very interesting question: what’s special about base 2? Couldn’t we use another base? Well, yes, yes we can.

Indeed, there’s nothing special about base 2, except maybe that it’s very simple, and it yields a very simple implementation. For example:

    unsigned expo_iter(unsigned x, unsigned p)
     {
      unsigned t=x, e=1;

      while (p)
       {
        if (p&1) e*=t;
        t*=t;
        p/=2;
       }

      return e;
     }

This simple procedure raises $x$ to the $p$th power in $O(\log_2 p)$ steps. And they’re (mostly) inexpensive: a shift right by one bit, a mask, and a multiply (which could be expensive if the objects are complex and we can’t rely on machine-sized multiplications).

But what about other bases? Say, base 7? or 3? or whatever other than 1?

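First, we must understand what the binary algorithm does. It exploits the fact that $x^{a+b}=x^a x^b$ in a very specific way: the exponent will be broken into powers of two, so that $x^{25}$ becomes $x^{16}x^{8}x^{1}$; and indeed the powers to use are given by the binary representation of 25, 11001 (which is $16+8+1$!). Now, let us see what the above algorithm does. The following table gives what the algorithm computes at each step (p is the exponent as tested at that step; e and t are the values once the step is done):

| Step | e | p | t |
|---|---|---|---|
| (init) | $1$ | 25 | $x$ |
| 1 | $x$ | 25 | $x^2$ |
| 2 | $x$ | 12 | $x^4$ |
| 3 | $x$ | 6 | $x^8$ |
| 4 | $x^9$ | 3 | $x^{16}$ |
| 5 | $x^{25}$ | 1 | $x^{32}$ |
| (done) | $x^{25}$ | 0 | — |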

So it is clear that the number of iterations is given by the position of the most significant bit in the exponent (assuming it’s an integer!), and we do at most two multiplies (and one shift) per round.

*

* *

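So how do we generalize this algorithm from base 2 to some base $b$? Certainly, the decomposition of the exponent in base $b$ is necessary. Let’s pick $b=3$ (because it’s not 2 and it’s also not too big) and $e=98$ (also, no special reason other than a good mix of digits and just long enough that we can figure something out of them). We will use the digits to multiply each group of powers of 3 so that the total is 98. Indeed, we have $98 = 1\cdot 3^4 + 0\cdot 3^3 + 1\cdot 3^2 + 2\cdot 3^1 + 2\cdot 3^0$, that is, $10122_3$. The algorithm then proceeds as follows (e is the exponent as tested at that step; m, p and t are the values once the step is done):

| Step | e | m | p | t |
|---|---|---|---|---|
| (init) | 98 | — | $1$ | $x$ |
| 1 | 98 | 2 | $x^2$ | $x^3$ |
| 2 | 32 | 2 | $x^8$ | $x^9$ |
| 3 | 10 | 1 | $x^{17}$ | $x^{27}$ |
| 4 | 3 | 0 | $x^{17}$ | $x^{81}$ |
| 5 | 1 | 1 | $x^{98}$ | $x^{81}$ |
| (done) | 0 | — | $x^{98}$ | — |

Now, let’s turn that into code! Here, *Mathematica* will help us deal with arbitrarily large numbers.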

    expo[x_, e_, b_: 2] :=
     Module[{tx = x, p = 1, te = e, m},
      While[te != 0,
       m = Mod[te, b];
       If[m != 0, p *= tx^m];
       te = Quotient[te, b];
       If[te != 0, tx = tx^b]; (* if product is expensive *)
       ];
      p
      ]
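For comparison with the base-2 C version, here is a sketch of the same procedure in C++ on 64-bit unsigned integers. The names `small_pow` and `expo_base` are made up for this example, and unlike the *Mathematica* version the results wrap around modulo $2^{64}$ when they get too large.

```cpp
#include <cstdint>

// Plain repeated multiplication, used for the small powers t^m and t^b.
std::uint64_t small_pow(std::uint64_t x, unsigned m)
{
    std::uint64_t r = 1;
    while (m--) r *= x;
    return r;
}

// Raises x to the e-th power by reading e one base-b digit at a time (b >= 2).
std::uint64_t expo_base(std::uint64_t x, std::uint64_t e, unsigned b = 2)
{
    std::uint64_t p = 1, t = x;
    while (e != 0)
    {
        const unsigned m = static_cast<unsigned>(e % b); // current digit
        if (m != 0) p *= small_pow(t, m);                // fold in t^m
        e /= b;
        if (e != 0) t = small_pow(t, b);                 // skip the last, useless power
    }
    return p;
}
```

For instance, `expo_base(2, 10, 3)` reads the digits of 10 in base 3 (1, 0, 1, least significant first) and multiplies $2^1$ by $2^9$ to get 1024. The inner `small_pow` calls are where the extra cost of a larger base shows up.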

*

* *

This algorithm trades off the number of iterations (from about $\log_2 e$ to about $\log_b e$ to raise $x$ to the $e$th power) for a more complex inner loop: div/mod by something other than 2 could be expensive, even if divmod is likely just one instruction!
