But let’s take the problem the other way around: what about definitions that give you the smallest integer, not for a given number of bits (because that’s trivial with the `int`xx`_t` types), but for the maximum value you need to represent?

There are plenty of reasons to use the smallest possible integer to store values with known bounds: smaller files, smaller data structures, more useful data in the same cache line. The C99 `<stdint.h>` isn’t much use here because it lacks metaprogramming or, more exactly, because the C preprocessor and old-style macros are too weak to provide the kind of metaprogramming we need here:

- A compile-time function that tells us how many bits are needed to represent a max value;
- A template that takes a number of bits and decides on the smallest integer accommodating it.

The first `constexpr` function gives the number of bits to represent a *maximum value*; that is, it doesn’t assume you need 5000 values (0-4999) but that you need to represent the value 5000 (0-5000). So if the maximum value is 4, you do need 3 bits (because 4 is 100_{2}); so it’s not quite log-base-2 (and don’t come whine in the comments).

```cpp
////////////////////////////////////////
#include <cstddef> // std::size_t

// c++14 and +
constexpr std::size_t bits_from_value(std::size_t n)
{
    if (n)
        return (n < 2) ? 1 : (1 + bits_from_value(n / 2));
    else
        return 0;
}
```

Now, let’s create a template that takes a number of bits and decides on the smallest integer that accommodates that number of bits:

```cpp
////////////////////////////////////////
#include <cstdint>

// should also do signed...
template <int x> struct __just_enough_uint; // incomplete type

template <> struct __just_enough_uint<64> { using type = uint64_t; };
template <> struct __just_enough_uint<32> { using type = uint32_t; };
template <> struct __just_enough_uint<16> { using type = uint16_t; };
template <> struct __just_enough_uint<8>  { using type = uint8_t; };

template <const int x>
using just_enough_uint =
    typename __just_enough_uint<(x > 32) ? 64 :
                                ((x > 16) ? 32 :
                                 ((x > 8) ? 16 : 8))>::type;

////////////////////////////////////////
template <typename T>
struct bits_from_type
{
    constexpr static size_t value = sizeof(T) * 8;
};
```

You may have to extend the above if your architecture has more possible types; but you should be fine with 8, 16, 32, and 64 bits. To create a variable of the `just_enough_uint` type, you use:

```cpp
just_enough_uint<bits_from_value(13)> z; // enough for max value 13 (0...13)
```
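Putting the two pieces together (the definitions are repeated from above so the snippet stands alone), a compile-time sanity check might look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

constexpr std::size_t bits_from_value(std::size_t n)
{
    if (n)
        return (n < 2) ? 1 : (1 + bits_from_value(n / 2));
    else
        return 0;
}

template <int x> struct __just_enough_uint; // incomplete type
template <> struct __just_enough_uint<64> { using type = uint64_t; };
template <> struct __just_enough_uint<32> { using type = uint32_t; };
template <> struct __just_enough_uint<16> { using type = uint16_t; };
template <> struct __just_enough_uint<8>  { using type = uint8_t; };

template <const int x>
using just_enough_uint =
    typename __just_enough_uint<(x > 32) ? 64 : ((x > 16) ? 32 : ((x > 8) ? 16 : 8))>::type;

// max value 13 needs 4 bits, which fits in a uint8_t...
static_assert(bits_from_value(13) == 4, "");
static_assert(std::is_same<just_enough_uint<bits_from_value(13)>, uint8_t>::value, "");

// ...while max value 5000 needs 13 bits, hence a uint16_t
static_assert(bits_from_value(5000) == 13, "");
static_assert(std::is_same<just_enough_uint<bits_from_value(5000)>, uint16_t>::value, "");
```

Compile with `-std=c++14` or later; if any of the `static_assert`s fail, the program doesn’t build.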

*

* *

This is but a small brick in a much wider scheme to save memory (or storage). Compact data structures have been a research interest for a while now, and I’ve done a few things on the subject before. However, a more algorithmic approach is needed: a couple of clever tricks, while helpful, aren’t a complete theory. More on this later.

So, the problem is, where do you get a lightboard on short notice? Well, you can build one. Let’s see how:

Basically, a lightboard is just a glass panel with legs; there are many ways you can build one. I opted for something very simple: a frame, transverse legs, and a 4′×6′ pane of glass:

What you’ll need:

- 5 2×4s, 8′ long,
- 6 quarter rounds, ½″, 8′ long,
- 4 heavy duty shelving brackets,
- 4 T-shaped “flat angle” brackets,
- 4 “flat corner” brackets,
- 4 4″ wood screws,
- a large number of 1″ (¼) wood screws,
- 4 lockable swivel wheels,
- some black paint,
- 20 ¼”×2½” rectangle felts.

It took a short afternoon to cut everything to size and assemble it:

The quarter rounds will hold the sheet of glass (as shown in the hand drawing). I also used felts between the glass and the quarter rounds (4 on the 6′ lengths, 3 on the 4′ lengths) to hold the glass gently, as I feared that pressing the wood directly against the glass and then screwing it to the frame might break the glass. The assembled frame looks like this:

I painted the inside and the front of the frame black, to avoid glare and reflections in the glass. With the glass, it looks like this, with yours truly:

Total assembly time is about a week: an afternoon for the frame, another day for the paint, a couple of days of drying and waiting for the glass to be delivered, and a few minutes assembling the last quarter rounds to hold the glass in place.

*

* *

Aside from the lightboard itself, you need to light the glass. Some use in-sheet lighting, but from my tests it seems only marginally useful. What works best are studio lights, placed in front of the glass and out of view (you can see the glare on the frame in the picture above). You can get some from your favorite online store for more or less $100 apiece.

The curtains behind the glass *and* in front (to prevent reflections of the surrounding room) were ordered online. They are 10′×12′ black muslin.

To write on the glass, basic whiteboard pens won’t work: they’re too transparent to be useful. You’ll need some thick, preferably fluorescent, pens, such as Expo Neons and the like. Some use liquid chalk pens, but I haven’t found any locally; I may order some later.

You’ll also need to figure out the type of video you want, the general brightness of the scene, etc. In any case, you’ll want to maximize readability, and what I found worked best (for me) is to disable the camera’s auto-exposure and use a fixed ISO and aperture. For my camera, ISO 400 and f/4 does the trick:

*

* *

For sound, a lavalier (or “lavaliere”, or tie-clip,…) type of clip-on microphone works well. I’m still experimenting with a microphone above me on a boom… I’m not sure yet which works best.

*

* *

I’ll do another entry where I detail the rest of the “studio” more carefully. Until then, I’ll keep experimenting!


A polynomial is an expression of the form

$p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$.

A naïve approach to evaluating a polynomial would be to compute each monomial $a_i x^i$ independently, at a cost of $i$ products each ($i-1$ for $x^i$, plus one for the coefficient $a_i$). When you sum over all $i$, you get $\frac{n(n+1)}{2}$ products (and $n$ additions) for a polynomial of degree $n$. That’s way too much.

In the early 19th century, Horner remarked that you could rewrite any polynomial as

$p(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x\,a_n)\cdots))$,

giving us $n$ products and $n$ additions. That’s much better. It also turns out that if there’s nothing special about the polynomial (no zero coefficients), this is optimal.

But Horner’s method (or Horner’s formula, depending on where you read about it) is inherently *sequential*, because all the products are nested. It’s not amenable to parallel processing, even in its simpler SIMD form.

However, while preparing lecture notes, I found that Estrin proposed a parallel algorithm to evaluate polynomials that minimizes both the wait time and the total number of products [1]. The scheme is shown here:

The original paper presents a method for polynomials with a degree of the form $2^k-1$, but you can easily adapt the splitting to an arbitrary degree. The first step is to split the polynomial into binomials, each of which is evaluated in parallel. You then combine those into new binomials, also evaluated in parallel, and so on, until you have only one term left: the answer.
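Before specializing to degree seven, here’s a generic sketch of the splitting for an arbitrary degree (this helper is mine, not from the original paper; coefficients are given lowest degree first):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Estrin-style evaluation, sketched for arbitrary degree. Each round
// combines adjacent coefficient pairs with the current power x^(2^k);
// the pairs are independent of one another, hence the parallelism.
long long estrin(std::vector<long long> c, long long x)
{
    if (c.empty()) return 0;
    long long xp = x; // x^(2^k) at round k
    while (c.size() > 1)
    {
        std::vector<long long> next;
        for (std::size_t i = 0; i + 1 < c.size(); i += 2)
            next.push_back(c[i + 1] * xp + c[i]); // independent pair
        if (c.size() % 2)
            next.push_back(c.back()); // odd leftover passes through
        c = std::move(next);
        xp *= xp; // square the power for the next round
    }
    return c[0];
}
```

For the degree-7 polynomial used below, `estrin({2,3,3,5,4,7,5,9}, x)` agrees with what Horner’s rule gives.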

What immediately comes to mind is a SIMD implementation where we use wide registers to do all products in parallel, but maybe just relying on instruction-level parallelism and the compiler is quite efficient for low degree polynomials. Let’s try with degree seven. Let’s say, with:

$p(x) = 9x^7 + 5x^6 + 7x^5 + 4x^4 + 5x^3 + 3x^2 + 3x + 2$,

because why not. Let’s test an implementation:

```cpp
////////////////////////////////////////
int pow_naif(int x, int n)
{
    int p = 1;
    for (int i = 0; i < n; i++, p *= x);
    return p;
}

////////////////////////////////////////
int eval_naif(int x)
{
    return 9*pow_naif(x,7)
          +5*pow_naif(x,6)
          +7*pow_naif(x,5)
          +4*pow_naif(x,4)
          +5*pow_naif(x,3)
          +3*pow_naif(x,2)
          +3*x
          +2;
}

/////////////////////////////////////////
int eval_horner(int x)
{
    return ((((((9*x+5)*x+7)*x+4)*x+5)*x+3)*x+3)*x+2;
}

////////////////////////////////////////
int eval_estrin(int x)
{
    int t0 = 3*x + 2;
    int t1 = 5*x + 3;
    int t2 = 7*x + 4;
    int t3 = 9*x + 5;
    x *= x;
    int t4 = t1*x + t0;
    int t5 = t3*x + t2;
    x *= x;
    return t5*x + t4;
}
```

We compile with all optimizations enabled, and evaluate the polynomial for $x$ from 0 to 1000000000. Times are:

| Method | Time (s) |
|--------|----------|
| Naïve  | 1.93 |
| Horner | 1.66 |
| Estrin | 1.41 |

So despite not being explicitly parallel, Estrin’s version performs significantly better because of instruction-level parallelism. Let’s look at the generated code:

```
0000000000000e00 <_Z11eval_estrini>:
 e00: 8d 04 fd 00 00 00 00  lea  eax,[rdi*8+0x0]
 e07: 89 f9                 mov  ecx,edi
 e09: 0f af cf              imul ecx,edi
 e0c: 8d 54 07 05           lea  edx,[rdi+rax*1+0x5]
 e10: 29 f8                 sub  eax,edi
 e12: 0f af d1              imul edx,ecx
 e15: 8d 44 02 04           lea  eax,[rdx+rax*1+0x4]
 e19: 89 ca                 mov  edx,ecx
 e1b: 0f af d1              imul edx,ecx
 e1e: 0f af c2              imul eax,edx
 e21: 8d 54 bf 03           lea  edx,[rdi+rdi*4+0x3]
 e25: 0f af d1              imul edx,ecx
 e28: 8d 4c 7f 02           lea  ecx,[rdi+rdi*2+0x2]
 e2c: 01 ca                 add  edx,ecx
 e2e: 01 d0                 add  eax,edx
 e30: c3                    ret
```

We see that the clever use of `lea` allows different pipelines to compute addresses independently, and that `lea` is also used to multiply by the coefficients. Such magic wouldn’t occur if the coefficients were much less cooperative (say 27, or something).

*

* *

What about an actual SIMD implementation? Well, I gave it a try, and my implementation has the same number of instructions as the sequential version generated by the compiler. It turns out that even if you can do a couple of multiplies in parallel, the butterfly-like structure asks you to shuffle the values around (using `pshufd`), and that negates any gain you get from parallelism (on some of my machines, it’s even slower!). Maybe there’s a better way of doing this. Questions for later.

[1] Gerald Estrin —

First, we have the best known of these approximations, the famous “Stirling formula”:

$n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1 + \frac{1}{12n} + \frac{1}{288n^2} - \frac{139}{51840n^3} - \frac{571}{2488320n^4} + \cdots\right)$,

where the terms on the right are known as the Stirling series (the numerators are given by A046968 and the denominators by A046969). If you evaluate the complete series, it’s truly equal to $n!$.

However, you may not quite want to evaluate an infinite series, so we find truncated versions:

$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$,

that we will call “Stirling” from now on; a “Stirling more” version could be

$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1+\frac{1}{12n}\right)$,

or even a “Stirling most”:

$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1+\frac{1}{12n}+\frac{1}{288n^2}\right)$.

The literature is fraught with approximations. For example, we find Gosper’s:

$n! \approx \sqrt{\left(2n+\frac{1}{3}\right)\pi}\left(\frac{n}{e}\right)^n$.

Everybody refers to [1] as the source of this approximation. However, while it can be a consequence of that paper, it’s *not* in it (also, its typography is a train wreck).

We have Burnside’s [2]:

$n! \approx \sqrt{2\pi}\left(\frac{n+\frac{1}{2}}{e}\right)^{n+\frac{1}{2}}$.

Then Mortici’s [3]:

.

There are plenty more, but let’s consider one more. Mohanty and Rummens’ [4]:

.

*

* *

How do those compare? Asymptotically, when $n$ is very large, the ratio of any of these approximations to $n!$ goes to 1. For smaller $n$, they also all kind-of-work OK:

From the figure above, we see that Gosper’s does very well, just a bit worse than “Stirling more”. Mohanty and Rummens’ does best, but it’s also quite a bit more complex than Gosper’s. What if we have a look at the digits that are output? The following shows the result (in *Mathematica*, which computes in “infinite precision”), rounded:

But what’s more telling is the ratio to the real value:

But we can also have a look at the number of correct leading digits:
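To reproduce a rough version of this comparison without *Mathematica*, here is my own pair of helpers; the two formulas are the truncated Stirling and Gosper forms as I reconstructed them, so treat them as assumptions rather than as code from the post:

```cpp
#include <cmath>

// My comparison helpers -- not from the original post.
const double PI = std::acos(-1.0);
const double E  = std::exp(1.0);

// truncated Stirling: sqrt(2 pi n) (n/e)^n
double stirling(double n) { return std::sqrt(2 * PI * n) * std::pow(n / E, n); }

// Gosper: sqrt((2n + 1/3) pi) (n/e)^n
double gosper(double n)   { return std::sqrt((2 * n + 1.0 / 3) * PI) * std::pow(n / E, n); }
```

For $n=5$ (exact value 120), the truncated Stirling gives about 118.02 while Gosper gives about 119.97, consistent with the ranking above.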

*

* *

So Gosper’s approximation is much better than just the truncated Stirling and compares to “Stirling more”, which is not that surprising because it’s very close to what you get by distributing the $1+\frac{1}{12n}$ correction into the square root; so it’s a good numerical trade-off. However, “Stirling more” and Mohanty & Rummens’ compare with a slight advantage to the latter.

[1] R. William Gosper, Jr. — *Decision Procedure for Indefinite Hypergeometric Summation* — Procs. Nat. Acad. Science USA, vol 75 no 1 (1978) p. 42–46.

[2] W. Burnside — *A Rapidly Convergent Series for Log N!* — Messenger Math., vol 46 no ? (1917) p. 157–159

[3] Cristinel Mortici — *An Ultimate Extremely Accurate Formula for Approximation of the Factorial Function* — Archiv der Mathematik, vol 93 (2009) p. 37–45

[4] S. Mohanty, F. H. A. Rummens — *Comment on “An Improved Analytical Approximation to n!”* — J. Chem. Physics, vol 80 (1984) p. 591.

Let’s see what hypotheses are useful, and how we can use them to get a good idea on the number of bits needed.

**For Sound**

I have shown how dB and bits are related, here and also here. Basically, adding one bit to a code adds about 6 dB to the resulting signal. Now, by definition, the threshold of hearing is set at 0 dB. This corresponds to the weakest sound you can distinguish from true silence. The threshold of pain (the point where you kind of expect your ears to start bleeding) is somewhere above 120 dB. Much louder sounds lead to actual hearing damage: explosions, rocket launches, etc. If we assume that we stay in the 0 to 120 dB range, the useful range for safe sound reproduction, at about 6 dB per bit, we need

$\frac{120~\text{dB}}{6~\text{dB/bit}} = 20~\text{bits}$.

So about 20 bits would be enough. If you consider 0 dB as the threshold of hearing, you might want to use 1 or 2 more bits to account for people with much finer hearing (as the loudness contour chart would suggest). Rounded to the next whole byte, that’s 24 bits, which is what the pros suggest you use.
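The arithmetic can be captured in a one-liner (my helper, not from the post; it uses the exact 20·log₁₀2 ≈ 6.02 dB per bit rather than the rounded 6):

```cpp
#include <cmath>

// Bits needed to span `range_db` decibels, at 20*log10(2) dB per bit.
int bits_for_db(double range_db)
{
    return (int)std::ceil(range_db / (20.0 * std::log10(2.0)));
}
```

`bits_for_db(120)` gives 20; adding the extra 12 dB or so for finer hearing, `bits_for_db(132)` gives 22, which rounds up to 24 bits at the byte boundary.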

**For Images**

That one asked for a bit more research to find good references. Some report that the total visual dynamic range is about 10 orders of magnitude (in appropriate luminosity units); others, like Fein and Szuts [1], report 16. Depending on the range, that’d yield

$\log_2 10^{10} \approx 33.2$ bits,

or

$\log_2 10^{16} \approx 53.2$ bits.

However, while the human eye *can* see luminosity over that whole range, it can’t do it *simultaneously*. The following figure (from Gonzalez & Woods [2]) shows that around a base value (the average scene luminosity), only a certain range can be perceived. That range seems to be only 4 or 5 orders of magnitude, so only

$\log_2 10^{5} \approx 16.6$ bits.

So if we consider the simultaneously perceivable range around some standard average-but-bright-enough luminosity, we might get away with 16 bits per color component (maybe less?).
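The same back-of-the-envelope computation, for a range given in orders of magnitude, reads as follows (again my helper, under the same hedges as the estimates above):

```cpp
#include <cmath>

// Bits needed for a dynamic range of `orders` orders of magnitude:
// ceil(log2(10^orders)) = ceil(orders * log2(10)).
int bits_for_orders(double orders)
{
    return (int)std::ceil(orders * std::log2(10.0));
}
```

`bits_for_orders(10)` gives 34 and `bits_for_orders(16)` gives 54 for the full range, while `bits_for_orders(5)` gives 17, close to the 16-bits-per-component ballpark.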

*

* *

The numbers we get are pretty much in line with what we find in audio and video. 24 bits is considered “professional” for audio (but not necessarily useful, depending on the quantity of noise in the original source). HDMI supports up to 48 bits per pixel (16 bits per component), while digital cameras often sport 10, 12, or 14 bits per component.

[1] Alan Fein, Ete Zoltan Szuts — *Photoreceptors: Their Role in Vision* — Cambridge University Press (1982)

[2] Rafael C. Gonzalez, Richard E. Woods — *Digital Image Processing* — 2nd ed., Prentice Hall (2002)

Except that it’s not quite true.

First, let’s consider (standard) vanilla bubble sort:

```cpp
#include <utility> // std::swap

template <template <typename...> typename C, typename... Ts>
void bubble_sort(C<Ts...> & coll)
{
    if (coll.size())
    {
        bool swapped;
        typename C<Ts...>::iterator last = coll.end() - 1;
        typename C<Ts...>::iterator i;
        do
        {
            swapped = false;
            i = coll.begin();
            while (i != last)
            {
                if (*i > *(i + 1))
                {
                    std::swap(*i, *(i + 1));
                    swapped = true;
                }
                ++i;
            }
            --last;
        } while (swapped);
    }
}
```

So nothing fancy here, except that it should work on any container with random-access iterators (it uses `end()-1` and `i+1`), and element types for which `operator>` is defined. To write that in assembly language, we should simplify things a bit. Let’s say “containers” are flat arrays and the type `T` is `int`.

```asm
# void bubble(size_t nb, int items[]);
_Z6bubblemPi:
.LFB0:
        .cfi_startproc
        ## rdi nb
        ## rsi items[]
        mov     rcx,rsi             # items[]
        xor     rdx,rdx             # bool 'swapped'
        lea     rdi,[rsi+rdi*4-4]   # last
.bubble_while:
        cmp     rcx,rdi
        jge     .bubble_while_done
        mov     eax,[rcx]
        mov     r9d,[rcx+4]
        cmp     eax,r9d
        jle     .bubble_while_next
        mov     [rcx],r9d
        mov     [rcx+4],eax
        mov     rdx,1
.bubble_while_next:
        add     rcx,4
        jmp     .bubble_while
.bubble_while_done:
        or      rdx,rdx
        jz      .bubble_done
        mov     rcx,rsi
        xor     rdx,rdx
        sub     rdi,4               # --last (one int)
        jmp     .bubble_while
.bubble_done:
        ret
        .cfi_endproc
```

This piece of assembly language does exactly what you expect it to: passes that scan the array and swap adjacent items when they are out of order, pushing the largest one to the end of the array, and stopping when a pass performs no swaps.

This version is longer than would be necessary if we had memory-to-memory swap instructions (which we don’t, on x86), and we could make it a bit faster by replacing the jumps with conditional moves. But that wouldn’t change much, because we still scan the array two items at a time. But… what if we compared 4? 8?

The newer instruction sets allow just that! With AVX, the `xmm` registers are 128 bits wide and can hold 4 `int`s, the `ymm` registers are 256 bits wide (and hold 8)… and if you can do AVX-512, there are the 512-bit-wide `zmm` registers! We can’t easily sort the values within a register, but we can compare them, pairwise, with another register. For example, we can compute the element-wise minimum of two registers:

```
min(xmmi,xmmj) :=
    xmmi[0]=min(xmmi[0],xmmj[0])
    xmmi[1]=min(xmmi[1],xmmj[1])
    xmmi[2]=min(xmmi[2],xmmj[2])
    xmmi[3]=min(xmmi[3],xmmj[3])
```

We can also compute the element-wise maximum of two registers. If we replace the conditional swap by:

```
a=min(t[i],t[i+1])
b=max(t[i],t[i+1])
t[i]=a;
t[i+1]=b;
```

…then we can use that to put two items in order. If we use the parallel AVX version, we’d do that with `t[i...i+3]` and `t[i+4...i+7]`. They wouldn’t be completely in order, but that increases order a lot.
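In scalar C++, the min/max conditional swap reads as follows (my sketch; `vpminsd`/`vpmaxsd` do the same thing on 4 or 8 `int`s at once):

```cpp
#include <algorithm>

// Conditional swap without an explicit branch: after the call, a <= b.
inline void minmax_swap(int &a, int &b)
{
    int lo = std::min(a, b);
    int hi = std::max(a, b);
    a = lo;
    b = hi;
}
```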

In fact, if we use 4- (or 8-) int wide min/max, we end up bubble-sorting the array as if it were 4 separate columns. After a few passes, all columns (0,4,8,…), (1,5,9,…), (2,6,10,…), (3,7,11,…) are sorted. That’s a Shell sort-like kick start. We can then finish the job with our basic bubble sort, hoping that no element is too far from its final location.

```asm
# void xmmbubble(size_t nb, int items[]);
_Z9xmmbubblemPi:
.LFB0:
        .cfi_startproc
        ## rdi nb
        ## rsi items[]
        mov     r8,rsi
        mov     r9,rdi
        and     r9b,0xf8            # ~0x7
        shl     r9,2                # *sizeof(int)
        add     r9,r8               # last item
.xmmbubble_while:
        lea     r10,[r8+16]
        cmp     r10,r9
        jge     .xmmbubble_next
        movdqu  xmm0,[r8]
        movdqu  xmm1,[r8+16]
        vpminsd xmm2,xmm0,xmm1      # xmm2=min(xmm0,xmm1)
        vpmaxsd xmm3,xmm0,xmm1      # xmm3=max(xmm0,xmm1)
        movdqu  [r8],xmm2
        movdqu  [r8+16],xmm3
        add     r8,16
        jmp     .xmmbubble_while
.xmmbubble_next:
        mov     r8,rsi
        sub     r9,16
        cmp     r9,r8
        je      .xmmbubble_done
        jmp     .xmmbubble_while
.xmmbubble_done:
        # finishes with ordinary
        # bubble sort
        call    _Z6bubblemPi
        ret
        .cfi_endproc
```

*

* *

That’s a lot of work. Do we get a good speed-up, then?

| Size × 16 | Naïve | XMM |
|-----------|-------|-----|
| 1000 | 0.28s | 0.025s |
| 10000 | 38s | 1.8s |
| 100000 | 3813s | 183s |

Surprisingly, we do. A lot! The Shell sort-like first pass moves most of the values close to their final position, so the classic bubble sort finishes the job in only a few passes (I would need to work out the details of the inversions, but I conjecture that since they are (number of columns) times fewer, the sort is (number of columns) squared times faster!).

While discussing this algorithm in class, a student asked a very interesting question: what’s special about base 2? Couldn’t we use another base? Well, yes, yes we can.

Indeed, there’s nothing special about base 2, except maybe that it’s very simple, and it yields a very simple implementation. For example:

```cpp
unsigned expo_iter(unsigned x, unsigned p)
{
    unsigned t = x, e = 1;
    while (p)
    {
        if (p & 1) e *= t;
        t *= t;
        p /= 2;
    }
    return e;
}
```

This simple procedure raises $x$ to the $p$th power in $\lfloor\log_2 p\rfloor+1$ steps. And the steps are (mostly) inexpensive: a shift right by one bit, a mask, and a multiply (which could be expensive if the objects are complex and we can’t rely on machine-sized multiplications).

But what about other bases? Say, base 7? or 3? or whatever other than 1?

First, we must understand what the binary algorithm does. It exploits the decomposition of the exponent in a very specific way: the exponent is broken into powers of two, so that $x^{25}$ becomes $x^{16} \cdot x^{8} \cdot x^{1}$; and indeed the powers to use are given by the binary representation of 25, 11001 (which is $16+8+1$!). Now, let us see what the above algorithm does. The following table gives what the algorithm computes at step $i$ (with $p$ read at the start of each step, $e$ and $t$ as they stand after it):

| Step | e | p | t |
|------|---|---|---|
| (init) | $1$ | 25 | $x$ |
| 1 | $x$ | 25 | $x^{2}$ |
| 2 | $x$ | 12 | $x^{4}$ |
| 3 | $x$ | 6 | $x^{8}$ |
| 4 | $x^{9}$ | 3 | $x^{16}$ |
| 5 | $x^{25}$ | 1 | $x^{32}$ |
| (done) | $x^{25}$ | 0 | — |

So it is clear that the number of iterations is given by the position of the most significant bit of the exponent (assuming it’s an integer!), and we do at most two multiplies (and one shift) per round.

*

* *

So how do we generalize this algorithm from base 2 to some base $b$? Certainly, the decomposition of the exponent in base $b$ is necessary. Let’s pick $b=3$ (because it’s not 2 and it’s also not too big) and an exponent of 98 (no special reason, other than a good mix of digits, and just long enough that we can figure something out of it). We will use the base-3 digits to multiply each group of powers of 3 so that the total is 98. Indeed, we have $98 = 10122_3$, that is, $x^{98} = x^{81} \cdot x^{9} \cdot (x^{3})^{2} \cdot x^{2}$. The algorithm proceeds as follows (with $p$ read at the start of each step):


| Step | e | m | p | t |
|------|---|---|---|---|
| (init) | $1$ | — | 98 | $x$ |
| 1 | $x^{2}$ | 2 | 98 | $x^{3}$ |
| 2 | $x^{8}$ | 2 | 32 | $x^{9}$ |
| 3 | $x^{17}$ | 1 | 10 | $x^{27}$ |
| 4 | $x^{17}$ | 0 | 3 | $x^{81}$ |
| 5 | $x^{98}$ | 1 | 1 | $x^{81}$ |
| (done) | $x^{98}$ | — | 0 | — |

Now, let’s turn that into code! Here, *Mathematica* will help us deal with arbitrarily large numbers.

```mathematica
expo[x_, e_, b_: 2] :=
 Module[{tx = x, p = 1, te = e, m},
  While[te != 0,
   m = Mod[te, b];
   If[m != 0, p *= tx^m];
   te = Quotient[te, b];
   If[te != 0, tx = tx^b]; (* if product is expensive *)
   ];
  p
  ]
```
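For machine-sized integers, a C++ translation could look like this (my sketch of the same algorithm; overflow is the caller’s problem):

```cpp
// Raise x to the e-th power using the base-b decomposition of e.
unsigned expo(unsigned x, unsigned e, unsigned b = 2)
{
    unsigned t = x, p = 1;
    while (e)
    {
        unsigned m = e % b;               // current base-b digit
        for (unsigned i = 0; i < m; i++)  // p *= t^m (m < b, so few products)
            p *= t;
        e /= b;
        if (e)                            // t = t^b, skipped when done
        {
            unsigned tb = t;
            for (unsigned i = 1; i < b; i++)
                t *= tb;
        }
    }
    return p;
}
```

With `b = 2` it behaves exactly like `expo_iter` above, and with `b = 3` it walks the base-3 digits as in the table.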

*

* *

This algorithm trades the number of iterations (from $\lfloor\log_2 e\rfloor+1$ down to $\lfloor\log_b e\rfloor+1$ to raise $x$ to the $e$th power) for a more complex inner loop: div/mod by something other than 2 could be expensive (even if divmod is likely just one instruction!).

I recently got an AMD Ryzen 9 3900x, and I wondered if I have the bug as well.

Let’s first write a routine to read the random numbers with `rdrand`. This instruction sets the carry flag to let us know if the value returned is random enough, so that we can retry if it isn’t. GCC/Gas/Masm assembly lets us easily interface with this instruction:

```cpp
#ifndef __module__rrand__
#define __module__rrand__

#include <cstdint>
#include <utility>

using rand_t = std::pair<uint64_t, uint64_t>;

rand_t rd_rand();

#endif // __module__rrand__
```

With an assembly file:

```asm
_Z7rd_randv:
.LFB82:
        .cfi_startproc
        xor     rdx,rdx
.retry:
        inc     rdx
        rdrand  rax
        jnc     .retry
        ## rax,rdx returns a std::pair<uint64_t,uint64_t>
        ret
        .cfi_endproc
```

This function returns the pair in `rax` and `rdx`, as the calling convention prescribes for a `std::pair<uint64_t,uint64_t>`: one register holds the random value, the other the number of tries (1 if it succeeds right away). A quick test shows us that it (seems to) work:

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    std::cout << std::hex << std::setfill('0');
    for (int i = 0; i < 100000000; i++)
    {
        rand_t t = rd_rand();
        std::cout << std::dec << t.second << '\t'
                  << std::hex << std::setw(16) << t.first
                  << std::endl;
    }
    return 0;
}
```

It prints:

```
Tries Values
1     848d44f6fabf8aa5
1     36b3c5953104ea25
1     e5312e8cfdba8d62
1     c55a698dcd5ec5fd
1     a7bf76560b22d56a
1     207c2fa54beea397
1     ae27f2a5a9263b83
1     a829dd4aabd41b8f
1     a8e3d747de951a02
1     f9781bda02545036
1     410cc2263d1c8001
1     f8ebf3fad61c1d6b
1     70a76df3f4759e4e
1     f5aabcfa42b4824d
1     fcdd1260c56027ec
```

Well, that’s not `fff...ff`, but is it random?

Changing the above code to count the number of times each bit is set to either 0 or 1:

```cpp
int main()
{
    uint64_t counts[64] = {0};
    for (int i = 0; i < 1000000; i++)
    {
        rand_t t = rd_rand();
        for (int i = 0; i < 64; i++)
            counts[i] += ((t.first >> i) & 1);
    }

    // display
    for (int i = 0; i < 64; i++)
        std::cout << counts[i] << std::endl;

    return 0;
}
```

We take those results and plot them (in gnumeric, for example):

It seems that my AMD R9 3900x doesn’t suffer from that defect.

*

* *

At first, I tried to use GCC inline assembly:

```cpp
rand_t rd_rand()
{
    uint64_t x, r;
    asm(".intel_syntax noprefix;"
        "    xor %1,%1;"
        ".retry%=:"
        "    inc %1;"
        "    rdrand %0;"
        "    jnc .retry%=;"
        ".att_syntax;"
        : "=R"(x), "=R"(r) // %0, %1
        :                  // no input
        :                  // auto clobber
        );
    return {x, r};
}
```

But the compiler kept optimizing the function away, with weird results. Sometimes working, sometimes not… depending on whether or not I used both members of the pair. GCC inline assembly always gave me trouble.


In the 6×7×6 palette example, we used the formula

$c = 42r + 6g + b$

to encode a 676-RGB triplet into a single value. The inverse was given by

$b = c \bmod 6$,

$g = \lfloor c/6 \rfloor \bmod 7$,

$r = \lfloor c/42 \rfloor$.

That may give the impression that we must encode/decode the entire triplet each time. Certainly, the inverse gives us the means to extract only one of the values—by using any of the three equations! But what if I wanted to rewrite only the blue component?

What about:

$c' = c - (c \bmod 6) + b'$?

Clearly, that subtracts the old blue ($c \bmod 6$) and replaces it by the new blue, $b'$. What if we wanted to rewrite the green? If we did

$c' = 42\left\lfloor \frac{c}{42} \right\rfloor + 6g' + (c \bmod 6)$,

that would indeed change only the green, since $42\lfloor c/42\rfloor$ is the red component only (that’s subtracting the green and blue, but we could have used $c-(c \bmod 42)$, an equivalent of `x=(x>>a)<<a`, to set the lower bits to zero), $6g'$ is the new green shifted into place, and we add back the blue.

Let’s look at the general case now. Let $n_0$, $n_1$, …, $n_{m-1}$ be the numbers of values the fields can take. Let also

$p_0 = 1$,

$p_k = \prod_{i=0}^{k-1} n_i$,

be the product of the numbers of values of the fields that precede the $k$th or, in other words, the number of combinations of all possible values of the fields that precede the $k$th.

To extract the $k$th field, we compute:

$f_k = \left\lfloor \frac{c}{p_k} \right\rfloor \bmod n_k$.

The division first shifts the desired value into the least significant position, and the modulo extracts it. That’s equivalent to `(x>>shift_bits)&mask_bits`, but using a potentially fractional number of bits for the shift, and a fractional number of bits for the mask.

To set a new value $f$ for the $k$th field, we compute:

$c' = \left(c - (c \bmod p_{k+1})\right) + f\cdot p_k + (c \bmod p_k)$.

The first part, $c - (c \bmod p_{k+1})$, contains all the fields “above” the $k$th, fields $k+1$ to $m-1$. The last part, $c \bmod p_k$, contains the value of all fields before the $k$th. Finally, $f\cdot p_k$ is the new value shifted into the $k$th place.

*

* *

Let’s see what the code to do this looks like. I’ll use initializer lists once more to pass the numbers of values $n_i$ to the function, mostly because it’s convenient, and you can make your own container-agnostic version from it quite easily.

The most difficult part is to compute the $p_k$: we basically compute the product of the $n_i$ up to the $k$th field. Extracting a code becomes:

```cpp
#include <initializer_list>

int get(int c, int f, const std::initializer_list<int> & n)
{
    int preds = 1, ff = 0;
    auto pn = n.begin();
    while (ff++ < f) preds *= *pn++;
    return (c / preds) % *pn;
}
```

Setting a new code asks for the computation of both $p_k$ and $p_{k+1}$. The code is:

```cpp
void set(int & c, int f, int v, const std::initializer_list<int> & n)
{
    int preds = 1, ff = 0, one_more;
    auto pn = n.begin();
    while (ff++ < f) preds *= *pn++;
    one_more = preds * *pn;

    c = (c - (c % one_more)) + (v * preds) + (c % preds);
}
```

In the code, `c` is the code for all fields, `f` is the field number, and `v` the new value. A much better version would use compile-time techniques, like tuple’s `get` and `set` functions to allow the compiler to optimize everything away.

Use would be something like this:

```cpp
// fields: 11,3,4,5,12 values.
const std::initializer_list<int> params{11,3,4,5,12};

int c = 3*11*3 + 2*11 + 7; // values: 7,2,3,0,0

std::cout << get(c,1,params) << std::endl;

set(c,2,1,params);

std::cout << get(c,2,params) << std::endl;
std::cout << get(c,3,params) << std::endl;
```

This outputs:

```
2
1
0
```

which is what’s expected.

*

* *

What are the expected savings? Well, very obviously, that depends on the number of values for each field. If they’re all powers of two, the method becomes a complicated way of doing classical shifts and masking; if they’re not, the savings can be important. Let’s take the field values in the example above.

With classical (integer number of bits) bit-fields, encoding five fields with 11, 3, 4, 5, and 12 possible values each would ask for 4, 2, 2, 3, and 4 bits (because, for example, $2^3 < 11 \le 2^4$, therefore 4 bits). That's a total of $4+2+2+3+4=15$ bits.

The largest possible value the (sub)bit-field can take is $11 \cdot 3 \cdot 4 \cdot 5 \cdot 12 = 7920$, that is, 7920 different combinations. We have $\log_2 7920 \approx 12.951$. Since the language doesn’t allow us to use 12.951 bits, the best we can do is to store the value in a 13-bit bit-field. That’s 2 bits less than the naïve encoding.
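The savings computation itself can be scripted (my helpers, not from the post; floating-point `log2`, so trust them only for products that fit comfortably in a `double`):

```cpp
#include <cmath>
#include <initializer_list>

// Bits used by classical per-field packing: sum of ceil(log2(n)).
int naive_bits(std::initializer_list<int> fields)
{
    int total = 0;
    for (int n : fields)
        total += (int)std::ceil(std::log2((double)n));
    return total;
}

// Bits used when packing the product of all field sizes.
int packed_bits(std::initializer_list<int> fields)
{
    double prod = 1;
    for (int n : fields)
        prod *= n;
    return (int)std::ceil(std::log2(prod));
}
```

With the fields above, `naive_bits({11,3,4,5,12})` gives 15 while `packed_bits({11,3,4,5,12})` gives 13.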

*

* *

While 2 bits doesn’t seem like much, that leaves us two more bits to pack into the same (sub)bit-field, and we can use them to store something else. It doesn’t have to be an in-memory (sub)bit-field: we could also use it to store stuff in a file, in some bit-stream where alignment isn’t that important.

^{1}I didn’t say arithmetic coding because that refers, of course, to a much more elaborate coding technique.

The web-safe palette divides the RGB color cube into 6×6×6 colors. Each color component, r, g, and b, varies from 0 to 5, levels that are expanded for 24-bit colors to 0x00, 0x33, 0x66, 0x99, 0xcc, 0xff (or 0, 51, 102, 153, 204, 255). If we encode the rgb components as we usually do, with a fixed number of bits per component, we’d need 9 bits. Indeed, since 2^{2}=4<6<8=2^{3}, we need 3 bits per component. The encoding would then be

`c=(r<<6)|(g<<3)|b;`

And we could decode using similar bit-oriented operations (using >> and & for masking). The problem with that is that we are using 9 bits even though 6×6×6=216, so clearly, we should need at most 8 bits!

Using 3 bits per component would be efficient if each component had 8 levels, but they have only six. We can’t shift by a fractional number of bits to make room for exactly 6 levels. Or can we? Well, let’s first redo the encoding:

`c=36*r+6*g+b;`

The inverse is

`b=c%6;`

`g=(c/6)%6;`

`r=c/36;`

So, if…

`x<<3==x*8==x*2 ^{3}`

then

`x*6==x*2^{log_{2}6}==x<<log_{2}6`.

Because equality works both ways, multiplying by 6 is the same as shifting by log_{2}6≈2.58 bits! Indeed, 3×log_{2}6≈7.75 bits, just as log_{2}216=log_{2}6^{3}=3×log_{2}6!

The 6×6×6 palette uses ≈97% of the 8 bits. Can we use the few extra fractions of bits to squeeze in more colors? You could think: well, if I have 216 colors, that leaves 40 codes, so I could use the codes from 216 to 255 to encode 40 new colors, and decode codes 0 to 215 as rgb triplets. Yes, that’s an idea. We could also use a 6×7×6 color cube, because 6×7×6=252, and that’s very nearly the whole range of 8 bits: we would use 7.98 bits out of 8.

So we will use 6 levels for red, 7 for green (because the eye is more sensitive to green), and 6 for blue. The palette now somewhat lost its (pure) grays because colors are now somewhat off the diagonal.

The code looks pretty much as before, but with new values:

`c=42*r+6*g+b;`

Here, we “shift” green by log_{2}6 bits to make room for blue, and red by log_{2}42=log_{2}7+log_{2}6 to make room for green and blue. Using normal shifts, you would have shifted green by enough position to accommodate the bits for blue, then red by enough to accommodate the bits from blue and green. We did the same here, except with fractional bit shifts in the guise of multiplications.

The inverse is

`b=c%6;`

`g=(c/6)%7;`

`r=c/42;`

Isn’t that neat?
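Wrapped into a pair of helpers (my wrappers around the formulas above, with hypothetical names):

```cpp
// Encode/decode for the 6x7x6 palette: r in 0..5, g in 0..6, b in 0..5.
int encode676(int r, int g, int b) { return 42 * r + 6 * g + b; }

void decode676(int c, int &r, int &g, int &b)
{
    b = c % 6;
    g = (c / 6) % 7;
    r = c / 42;
}
```

All 252 codes round-trip, and the largest, `encode676(5,6,5)`, is 251, which fits in 8 bits.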

*

* *

How do we reduce 0 to 255 on 0 to 6 (or to 7) and back? A very short program does just that:

```cpp
int pack(int rgb, int k)
{
    return (rgb * k) / 256;
}

int unpack(int v, int k)
{
    return (v * 255) / (k - 1);
}
```

The `pack` function maps the rgb (more exactly, r or g or b) component onto the interval 0≤x<1, then onto the interval 0≤y<k. The fact that we divide by 256 ensures that 255 doesn’t map to one, and therefore not to k: we want a value from 0 to k-1. `unpack` reverses the operation and scales back to 0 to 255. Note that the order of the operations isn’t as natural as it might have been, but it preserves precision, thanks to C/C++ integer arithmetic.

*

* *

The important point here isn’t that the 6×7×6 palette is somewhat better than the 6×6×6 web safe palette; nor even that we use 7.98 bits out of 8 instead of merely 7.75, but that shifting left and multiplying *are the same operation*—just as shifting right *is* dividing. That gives us the possibility of shifting left or right by log-values (some of which are integers).