First, we have the best known of these approximations, the famous “Stirling formula”:

$$n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \left(1+\frac{1}{12n}+\frac{1}{288n^2}-\frac{139}{51840n^3}-\frac{571}{2488320n^4}+\cdots\right),$$

where the terms on the right are known as the Stirling series (the numerators are given by A046968 and the denominators by A046969). If you evaluate the complete series, it is exactly equal to $n!$.

However, you may not quite want to evaluate an infinite series, and we find truncated versions:

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n,$$

that we will call “Stirling” from now on; a “Stirling more” version could be

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1+\frac{1}{12n}\right),$$

or even a “Stirling most”:

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1+\frac{1}{12n}+\frac{1}{288n^2}\right).$$

The literature is replete with approximations. For example, we find Gosper’s:

$$n! \approx \sqrt{\left(2n+\frac{1}{3}\right)\pi}\left(\frac{n}{e}\right)^n.$$

Everybody refers to [1] as the source of this approximation. However, while it can be derived as a consequence of that paper, it’s *not* actually in it (also, its typography is a train wreck).

We have Burnside’s [2]:

$$n! \approx \sqrt{2\pi}\left(\frac{n+\frac{1}{2}}{e}\right)^{n+\frac{1}{2}}.$$

Then Mortici’s [3]:

.

There are plenty more, but let’s consider one more. Mohanty and Rummens’ [4]:

.

*

* *

How do those compare? Asymptotically, when $n$ is very large, the ratio of any of these approximations to $n!$ goes to 1. For smaller $n$, they also all work reasonably well:

From the figure above, we see that Gosper’s does very well, just a bit worse than “Stirling more”. Mohanty and Rummens’ does best, but it’s also quite a bit more complex than Gosper’s. What if we have a look at the digits that are output? The following shows the result (in *Mathematica*, which computes in “infinite precision”), rounded:

What’s more telling, though, is the ratio to the real value:

But we can also have a look at the number of correct leading digits:

*

* *

So Gosper’s approximation is much better than the truncated Stirling and compares to “Stirling more”, which is not that surprising because it’s very close to what you get by distributing the $1+\frac{1}{12n}$ factor into the square root; so it’s a good numerical trade-off. However, “Stirling more” and Mohanty & Rummens’ compare with a slight advantage to the latter.

[1] R. William Gosper, Jr. — *Decision Procedure for Indefinite Hypergeometric Summation* — Proc. Nat. Acad. Sci. USA, vol. 75, no. 1 (1978), p. 42–46.

[2] W. Burnside — *A Rapidly Convergent Series for Log N!* — Messenger Math., vol. 46 (1917), p. 157–159.

[3] Cristinel Mortici — *An Ultimate Extremely Accurate Formula for Approximation of the Factorial Function* — Archiv der Mathematik, vol. 93 (2009), p. 37–45.

[4] S. Mohanty, F. H. A. Rummens — *Comment on “An Improved Analytical Approximation to n!”* — J. Chem. Physics, vol. 80 (1984), p. 591.

Let’s see what hypotheses are useful, and how we can use them to get a good idea of the number of bits needed.

**For Sound**

I have shown how dB and bits are related, here and also here. Basically, adding one bit to a code adds about 6 dB to the resulting signal. Now, by definition, the threshold of hearing is set at 0 dB. This corresponds to the weakest sound you can distinguish from true silence. The threshold of pain (the point where you kind of expect your ears to start bleeding) is somewhere above 120 dB. Much louder sounds lead to actual hearing damage—explosions, rocket launches, etc. If we assume that we stay in the 0 to 120 dB range, the useful range for safe sound reproduction, at about 6 dB per bit,

$$\frac{120~\text{dB}}{6~\text{dB/bit}} = 20~\text{bits}.$$

So about 20 bits would be enough. If you consider 0 dB as the threshold of hearing, you might want to use 1 or 2 more bits to account for people with much finer hearing (as the loudness contour chart would suggest). Rounding to the next byte, you get 24 bits, which is what pros suggest you use.

**For Images**

That one required a bit more research to find good references. Some report that the total visual dynamic range is about 10 orders of magnitude (in appropriate luminosity units); others, like Fein and Szuts [1], report 16. Depending on the range, that’d yield

$$\log_2 10^{10} \approx 33.2~\text{bits},$$

or

$$\log_2 10^{16} \approx 53.2~\text{bits}.$$

However, while the human eye *can* see luminosity on that range, it can’t do it *simultaneously*. The following figure (from Gonzalez & Woods [2]) shows that around a base value (the average scene luminosity), only a certain range can be perceived. That range seems to be only 4 or 5 orders of magnitude, so only

$$\log_2 10^{5} \approx 16.6~\text{bits}.$$

So if we consider the simultaneously perceivable range around some standard average-but-bright-enough luminosity, we might get away with 16 bits per color component (maybe less?).

*

* *

The numbers we get are pretty much in line with what we find in audio and video. 24 bits is considered “professional” (but not necessarily useful, depending on the quantity of noise in the original source) for audio. HDMI supports up to 48 bits per pixel (16 bits per component), while digital cameras often sport 10, 12, or 14 bits per component.

[1] Alan Fein, Ete Zoltan Szuts — *Photoreceptors: Their Role in Vision* — Cambridge University Press (1982)

[2] Rafael C. Gonzalez, Richard E. Woods — *Digital Image Processing* — 2nd ed., Prentice Hall (2002)

Except that it’s not quite true.

First, let’s consider (standard) vanilla bubble sort:

```cpp
template <template <typename...> typename C, typename... Ts>
void bubble_sort( C<Ts...> & coll )
 {
  if (coll.size())
   {
    bool swapped;
    typename C<Ts...>::iterator last=coll.end()-1;
    typename C<Ts...>::iterator i;
    do
     {
      swapped=false;
      i=coll.begin();
      while (i!=last)
       {
        if (*i>*(i+1))
         {
          std::swap(*i,*(i+1));
          swapped=true;
         }
        ++i;
       }
      --last;
     }
    while (swapped);
   }
 }
```

So nothing fancy here, except that it should work on any container with bidirectional iterators, and on types for which `operator>` is defined. To write that in assembly language, we should simplify things a bit. Let’s say “containers” are flat arrays and the element type is `int`.

```asm
# void bubble(size_t nb, int items[]);
_Z6bubblemPi:
.LFB0:
        .cfi_startproc
        ## rdi nb
        ## rsi items[]
        mov rcx,rsi           # items[]
        xor rdx,rdx           # bool 'swapped'
        lea rdi,[rsi+rdi*4-4] # last
.bubble_while:
        cmp rcx,rdi
        jge .bubble_while_done
        mov eax,[rcx]
        mov r9d,[rcx+4]
        cmp eax,r9d
        jle .bubble_while_next
        mov [rcx],r9d
        mov [rcx+4],eax
        mov rdx,1
.bubble_while_next:
        add rcx,4
        jmp .bubble_while
.bubble_while_done:
        or rdx,rdx
        jz .bubble_done
        mov rcx,rsi
        xor rdx,rdx
        dec rdi
        jnz .bubble_while
.bubble_done:
        ret
        .cfi_endproc
```

This piece of assembly language does exactly what you expect it to do: it makes passes that scan the array and swap items when they are out of order, pushing the largest one to the end of the array, and it stops whenever no swaps were performed.

This version is longer than would be necessary if we had memory-to-memory swap instructions (which we don’t, on x86), and we could make it a bit faster if we replaced the jumps with conditional moves. But that wouldn’t change much, because we still scan the array two items at a time. But… what if we compared 4? 8?

The newer instruction sets allow just that! With AVX, the `xmm` registers are 128 bits wide and can hold 4 `int`s, the `ymm` registers are 256 bits wide (and hold 8)… and if you can do AVX-512, there are the `zmm` 512-bit wide registers! We can’t easily sort the values within a register, but we can compare them, pairwise, with another register. For example, we can compute the minimum of two registers:

```
min(xmmi,xmmj) :=
    xmmi[0]=min(xmmi[0],xmmj[0])
    xmmi[1]=min(xmmi[1],xmmj[1])
    xmmi[2]=min(xmmi[2],xmmj[2])
    xmmi[3]=min(xmmi[3],xmmj[3])
```

We can also compute the maximum of two registers. If we replace the conditional swap by:

```
a=min(t[i],t[i+1])
b=max(t[i],t[i+1])
t[i]=a;
t[i+1]=b;
```

…then we can use that to swap, in order, two items. If we use the parallel AVX version, we’d do that with `t[i...i+3]` and `t[i+4...i+7]`. They wouldn’t be completely in order, but that increases order a lot.

In fact, if we use 4- (or 8-) int wide min/max, we end up bubble sorting the array as if it were 4 separate columns. After a few passes, all columns (0,4,8,…), (1,5,9,…), (2,6,10,…), (3,7,11,…) are sorted. That’s a Shell sort-like kick start. We can then finish the job with our basic bubble sort, hoping that no element is too far from its final location.

```asm
# void xmmbubble(size_t nb, int items[]);
_Z9xmmbubblemPi:
.LFB0:
        .cfi_startproc
        ## rdi nb
        ## rsi items[]
        mov r8,rsi
        mov r9,rdi
        and r9b,0xf8          # ~0x7
        shl r9,2              # *sizeof(int)
        add r9,r8             # last item
.xmmbubble_while:
        lea r10,[r8+16]
        cmp r10,r9
        jge .xmmbubble_next
        movdqu xmm0,[r8]
        movdqu xmm1,[r8+16]
        vpminsd xmm2,xmm0,xmm1 # xmm2=min(xmm0,xmm1)
        vpmaxsd xmm3,xmm0,xmm1 # xmm3=max(xmm0,xmm1)
        movdqu [r8],xmm2
        movdqu [r8+16],xmm3
        add r8,16
        jmp .xmmbubble_while
.xmmbubble_next:
        mov r8,rsi
        sub r9,16
        cmp r9,r8
        je .xmmbubble_done
        jmp .xmmbubble_while
.xmmbubble_done:
        # finishes with ordinary
        # bubble sort
        call _Z6bubblemPi
        ret
        .cfi_endproc
```

*

* *

That’s a lot of work. Do we get a good speed-up, then?

| Size × 16 | Naïve | XMM |
|---|---|---|
| 1000 | 0.28s | 0.025s |
| 10000 | 38s | 1.8s |
| 100000 | 3813s | 183s |

Surprisingly, we do. A lot! The Shell sort-like first pass moves most of the values *around* their final position, so the classic bubble sort finishes the job in only a few passes. (I would need to work out the details of the inversions, but I conjecture that since they are (number-of-columns) times fewer, the sort is (number-of-columns)-squared times faster!)

While discussing this algorithm in class, a student asked a very interesting question: what’s special about base 2? Couldn’t we use another base? Well, yes, yes we can.

Indeed, there’s nothing special about base 2, except maybe that it’s very simple, and it yields a very simple implementation. For example:

```cpp
unsigned expo_iter(unsigned x, unsigned p)
 {
  unsigned t=x, e=1;
  while (p)
   {
    if (p&1) e*=t;
    t*=t;
    p/=2;
   }
  return e;
 }
```

This simple procedure raises $x$ to the $p$th power in $O(\log_2 p)$ steps. And they’re (mostly) inexpensive: a shift right by one bit, a mask, and a multiply (which could be expensive if the objects are complex and we can’t rely on machine-sized multiplications).

But what about other bases? Say, base 7? or 3? or whatever other than 1?

First, we must understand what the binary algorithm does. It exploits the exponent in a very specific way: the exponent is broken into powers of two, so that $x^{25}$ becomes $x^{16}\cdot x^{8}\cdot x^{1}$; and indeed the powers to use are given by the binary representation of 25, 11001 (which is $16+8+1$!). Now, let us see what the above algorithm does. The following table gives what the algorithm computes at each step:

| Step | e | p | t |
|---|---|---|---|
| (init) | 1 | 25 | x |
| 1 | x | 25 | x^{2} |
| 2 | x | 12 | x^{4} |
| 3 | x | 6 | x^{8} |
| 4 | x^{9} | 3 | x^{16} |
| 5 | x^{25} | 1 | x^{32} |
| (done) | x^{25} | 0 | — |

So it is clear that the number of iterations is given by the position of the most significant bit of the exponent (assuming it’s an integer!), and we do at most two multiplies (and one shift) per round.

*

* *

So how do we generalize this algorithm from base 2 to some base $b$? Certainly, the decomposition of the exponent in base $b$ is necessary. Let’s pick $b=3$ (because it’s not 2 and it’s also not too big) and exponent 98 (also, no special reason other than a good mix of digits and just long enough we can figure something out of them). We will use the digits to multiply each group of powers of 3 so that the total is 98. Indeed, we have $98=(10122)_3=1\cdot 81+0\cdot 27+1\cdot 9+2\cdot 3+2\cdot 1$. Indeed:

| Step | e | m | p | t |
|---|---|---|---|---|
| (init) | 98 | — | 1 | x |
| 1 | 32 | 2 | x^{2} | x^{3} |
| 2 | 10 | 2 | x^{8} | x^{9} |
| 3 | 3 | 1 | x^{17} | x^{27} |
| 4 | 1 | 0 | x^{17} | x^{81} |
| 5 | 0 | 1 | x^{98} | — |

Now, let’s turn that into code! Here, *Mathematica* will help us deal with arbitrarily large numbers.

```mathematica
expo[x_, e_, b_: 2] :=
 Module[{tx = x, p = 1, te = e, m},
  While[te != 0,
   m = Mod[te, b];
   If[m != 0, p *= tx^m];
   te = Quotient[te, b];
   If[te != 0, tx = tx^b]; (* if product is expensive *)
  ];
  p
 ]
```

*

* *

This algorithm trades off the number of iterations (from $O(\log_2 p)$ to $O(\log_b p)$ to raise $x$ to the $p$th power) for a more complex inner loop: div/mod by something other than 2 could be expensive—even if divmod is likely just one instruction!

I recently got an AMD Ryzen 9 3900x, and I wondered if I have the bug as well.

Let’s first write a routine to read the random numbers with `rdrand`. This instruction sets the carry flag to let us know if the value returned is random enough, so that we can retry if it isn’t. GCC/Gas/Masm assembly lets us easily interface with this instruction:

```cpp
#ifndef __module__rrand__
#define __module__rrand__

#include <cstdint>
#include <utility>

using rand_t=std::pair<uint64_t,uint64_t>;

rand_t rd_rand();

#endif // __module__rrand__
```

With an assembly file:

```asm
_Z7rd_randv:
.LFB82:
        .cfi_startproc
        xor rdx,rdx
.retry:
        inc rdx
        rdrand rax
        jnc .retry
        ## rax,rdx returns a std::pair<uint64_t,uint64_t>
        ret
        .cfi_endproc
```

This function returns its two values in registers (`rax` and `rdx`), one with the random value, the other with the number of tries (1 if it succeeds right away). A quick test shows us that it (seems to) work:

```cpp
int main()
 {
  std::cout << std::hex << std::setfill('0');
  for (int i=0;i<100000000;i++)
   {
    rand_t t=rd_rand();
    std::cout << std::dec << t.second << '\t'
              << std::hex << std::setw(16) << t.first
              << std::endl;
   }
  return 0;
 }
```

It prints:

```
Tries  Values
1      848d44f6fabf8aa5
1      36b3c5953104ea25
1      e5312e8cfdba8d62
1      c55a698dcd5ec5fd
1      a7bf76560b22d56a
1      207c2fa54beea397
1      ae27f2a5a9263b83
1      a829dd4aabd41b8f
1      a8e3d747de951a02
1      f9781bda02545036
1      410cc2263d1c8001
1      f8ebf3fad61c1d6b
1      70a76df3f4759e4e
1      f5aabcfa42b4824d
1      fcdd1260c56027ec
```

Well, that’s not `fff...ff`, but is it random?

Changing the above code to count the number of times each bit is set to either 0 or 1:

```cpp
int main()
 {
  uint64_t counts[64]={0};
  for (int i=0;i<1000000;i++)
   {
    rand_t t=rd_rand();
    for (int j=0;j<64;j++)
     counts[j]+=((t.first>>j)&1);
   }
  // display
  for (int i=0;i<64;i++)
   std::cout << counts[i] << std::endl;
  return 0;
 }
```

We take those results and plot them (in Gnumeric, for example):

It seems that my AMD R9 3900x doesn’t suffer from that defect.

*

* *

At first, I tried to use GCC inline assembly:

```cpp
rand_t rd_rand()
 {
  uint64_t x,r;
  asm(".intel_syntax noprefix;"
      " xor %1,%1;"
      ".retry%=:"
      " inc %1;"
      " rdrand %0;"
      " jnc .retry%=;"
      ".att_syntax;"
      : "=R"(x),"=R"(r) // %0, %1
      : // no input
      : // auto clobber
     );
  return {x,r};
 }
```

But the compiler kept optimizing the function away, with weird results. Sometimes working, sometimes not… depending on whether or not I used both members of the pair. GCC inline assembly always gave me trouble.


In the 6×7×6 palette example, we have used the formula

$$c = 42r + 6g + b$$

to encode a 676-RGB triplet into a single value. The inverse was given by

$$b = c \bmod 6,$$

$$g = \lfloor c/6 \rfloor \bmod 7,$$

$$r = \lfloor c/42 \rfloor.$$

That may give the impression that we must encode/decode the entire triplet each time. Certainly, the inverse gives us the means to extract only one of the values—by using any of the three equations! But what if I wanted to rewrite only the blue component?

What about:

$$c' = c - (c \bmod 6) + b'~?$$

Clearly, that subtracts the old blue ($c \bmod 6$) and replaces it by the new blue, $b'$. What if we wanted to rewrite the green? If we did

$$c' = 42\lfloor c/42\rfloor + 6g' + (c \bmod 6),$$

that would indeed change only the green, since $42\lfloor c/42\rfloor$ is the red component only (the floored division subtracts the green and blue, but we could have used $c-(c \bmod 42)$, an equivalent of `x=(x>>a)<<a`, to set the lower bits to zero), and we add back the new green and the old blue.

Let’s look at the general case now. Let $n_1$, $n_2$, …, $n_m$ be the numbers of values the fields can take. Let also

$$p_1 = 1,$$

$$p_k = n_1 n_2 \cdots n_{k-1} = \prod_{i=1}^{k-1} n_i,$$

be the product of the number of values of the fields that precede the $k$th, or, in other words, the number of combinations of all possible values of the fields that precede the $k$th.

To extract the $k$th field, we compute:

$$f_k = \left\lfloor \frac{c}{p_k} \right\rfloor \bmod n_k.$$

The division shifts the desired value into the least significant bits, and the modulo extracts it. That’s equivalent to `(x>>shift_bits)&mask_bits`, but using a potentially fractional number of bits for the shift, and a fractional number of bits for the mask.

To set a new value $f$ for the $k$th field, we compute:

$$c' = \left(c - (c \bmod p_{k+1})\right) + f\cdot p_k + (c \bmod p_k).$$

The first part, $c - (c \bmod p_{k+1})$, contains all the fields “above” the $k$th. The last part, $c \bmod p_k$, contains the value of all fields before the $k$th. Finally, $f\cdot p_k$ is the new value shifted into the $k$th place.

*

* *

Let’s see what the code to do this looks like. I’ll use initializer lists once more to pass the $n_i$ to the function, mostly because it’s convenient, and you can do your own container-agnostic version from it quite easily.

The most difficult part is to compute the $p_k$: we basically compute the products of the $n_i$ up to the $k$th field. Extracting a code becomes:

```cpp
int get( int c, int f,
         const std::initializer_list<int> & n)
 {
  int preds=1,ff=0;
  auto pn=n.begin();
  while (ff++<f) preds*=*pn++;
  return (c/preds) % *pn;
 }
```

Setting a new code asks for the computation of $p_k$ and $p_{k+1}$. The code is:

```cpp
void set( int & c, int f, int v,
          const std::initializer_list<int> & n)
 {
  int preds=1,ff=0,one_more;
  auto pn=n.begin();
  while (ff++<f) preds*=*pn++;
  one_more=preds**pn;
  c=(c-(c % one_more)) + (v*preds) + (c % preds);
 }
```

In the code, `c` is the code for all fields, `f` is the field number, and `v` the new value. A much better version would use compile-time techniques, like tuple’s `get` and `set` functions to allow the compiler to optimize everything away.

Usage would be something like this:

```cpp
// fields: 11,3,4,5,12 values.
const std::initializer_list<int> params{11,3,4,5,12};

int c=3*11*3+2*11+7; // values: 7,2,3,0,0
std::cout << get(c,1,params) << std::endl;
set(c,2,1,params);
std::cout << get(c,2,params) << std::endl;
std::cout << get(c,3,params) << std::endl;
```

This outputs:

```
2
1
0
```

which is what’s expected.

*

* *

What are the expected savings? Well, very obviously, that depends on the number of values for each field. If they’re all powers of two, the method becomes a complicated way of doing classical shifts and masking; if they’re not, the savings can be important. Let’s take the field values in the example above.

In the classical (integer number of bits) bit-fields, encoding five fields with 11, 3, 4, 5, and 12 possible values each would ask for 4, 2, 2, 3, and 4 bits (because, for example, $2^3=8<11\le 16=2^4$, therefore 4 bits). That’s a total of 4+2+2+3+4=15 bits.

The largest possible value the (sub)bit-field can take is $11\cdot 3\cdot 4\cdot 5\cdot 12-1=7919$, that is, 7920 different combinations. We have $\log_2 7920 \approx 12.951$. Since the language doesn’t allow us to use 12.951 bits, the best we can do is to store the value in a 13-bit bit-field. That’s 2 bits less than the naïve encoding.

*

* *

While 2 bits doesn’t seem much, that leaves us two more bits to pack into the same (sub)bit-field, and we can use them to store something else. It doesn’t have to be an in-memory (sub)bit-field; we could also use it to store stuff in a file, in some bit-stream where alignment isn’t that important.

^{1}I didn’t say arithmetic coding because that refers, of course, to a much more elaborate coding technique.

The web safe palette divides the RGB color cube into 6×6×6 colors. Each color component, r, g, and b, varies from 0 to 5, levels that are expanded for 24-bit colors to 0x00, 0x33, 0x66, 0x99, 0xcc, 0xff (or 0, 51, 102, 153, 204, 255). If we’re to encode the rgb components as we usually do, by using a fixed number of bits per component, we’d need 9 bits. Indeed, since 2^{2}=4<6<8=2^{3}, we will need 3 bits per component. The encoding would then be

`c=(r<<6)|(g<<3)|b;`

And we could decode using similar bit-oriented operations (using >> and & for masking). The problem with that is that we are using 9 bits while 6×6×6=216, so clearly, 8 bits would be at most needed!

Using 3 bits per component would be efficient if each component had 8 levels, but they have only six. We can’t shift by log_{2}6≈2.58 bits for 6 levels. Or can we? Well, let’s first redo the encoding:

`c=36*r+6*g+b;`

The inverse is

`b=c%6;`

`g=(c/6)%6;`

`r=c/36;`

So, if…

`x<<3==x*8==x*2^{3}`

then

`x*6==x*2^{log_{2}6}==x<<log_{2}6`.

Because the equality works both ways, multiplying by 6 is the same as shifting by log_{2}6≈2.58 bits! Indeed, 3×log_{2}6≈7.75 bits, just as log_{2}216=log_{2}6^{3}=3×log_{2}6!

The 6×6×6 palette uses ≈97% of the 8 bits. Can we use the few extra fractions of bits to squeeze in more colors? You could think, well, if I have 216 colors, that leaves 40 other colors, and I could use the codes from 216 to 255 to encode these new colors, and decode as rgb triplets the codes 0 to 215. Yes, that’s an idea. We could also use a 6×7×6 color cube, because 6×7×6=252, and that’s very nearly the full range of 8 bits—we would use 7.98 bits out of 8.

So we will use 6 levels for red, 7 for green (because the eye is more sensitive to green), and 6 for blue. The palette now somewhat lost its (pure) grays because colors are now somewhat off the diagonal.

The code looks pretty much as before, but with new values:

`c=42*r+6*g+b;`

Here, we “shift” green by log_{2}6 bits to make room for blue, and red by log_{2}42=log_{2}7+log_{2}6 to make room for green and blue. Using normal shifts, you would have shifted green by enough position to accommodate the bits for blue, then red by enough to accommodate the bits from blue and green. We did the same here, except with fractional bit shifts in the guise of multiplications.

The inverse is

`b=c%6;`

`g=(c/6)%7;`

`r=c/42;`

Isn’t that neat?

*

* *

How do we map the 0 to 255 range onto 6 (or 7) levels and back? A very short program does just that:

```cpp
int pack(int rgb, int k)
 {
  return (rgb*k)/256;
 }

int unpack(int v, int k)
 {
  return (v*255)/(k-1);
 }
```

The `pack` function maps the r (or g or b) component onto the interval 0≤x<1, then onto the interval 0≤y<k. The fact we are dividing by 256 ensures that 255 doesn’t give one, and therefore k: we want a value from 0 to k-1. `unpack` reverses the operation and scales back to 0 to 255. Note that the order of the operations isn’t as natural as it might have been, but it preserves precision—thanks to C/C++ integer arithmetic.

*

* *

The important point here isn’t that the 6×7×6 palette is somewhat better than the 6×6×6 web safe palette; nor even that we use 7.98 bits out of 8 instead of merely 7.75, but that shifting left and multiplying *are the same operation*—just as shifting right *is* dividing. That gives us the possibility of shifting left or right by log-values (some of which are integers).

So let’s say we have a random variable to generate with few values (maybe corresponding to something like choices in a game) with different odds. Let’s take four of those choices, each with its own probability. Clearly, drawing a uniform random variable on $[0,1]$ isn’t sufficient, on its own, to choose one outcome with the desired probabilities. We have to cut the interval into four regions^{1}, one for each symbol, and each with a length equal to the corresponding probability. That’s actually quite easily done:

Now drawing a uniform random variable on $[0,1]$ will let us “fall” into one of the regions, thus choosing the corresponding value.

Since the regions have lengths equal to the probabilities, we are guaranteed that the uniform variable lands with the desired probabilities in each region. There! We have a technique to generate discrete variables with any kind of random distribution—even for those for which it would be hard to find a clever formula.

*

* *

OK, now, how do we translate that into code?

If we have a lot of values (and correspondingly a large number of probabilities), we would have to build a table of cumulative probabilities (the i^{th} entry would contain the sum of all probabilities up to the i^{th} one), and then use binary search (or better yet, interpolation search) to find where our uniformly chosen value lands.

However, if we have very few values, say, four or five, that’s a lot of work!

First, let’s setup our uniform random generator using C++11 `<random>` header:

```cpp
using random_type=uint64_t;
using generator_type=
  std::linear_congruential_engine
   <random_type,
    random_type{6364136223846793005}, // Knuth's
    random_type{1442695040888963407}, // Knuth's
    std::numeric_limits<random_type>::max()>;
```

Now, we must compute the cumulative (mass) function, and search it using binary search. We will do neither! Let’s rather do this:

```cpp
////////////////////////////////////////
//
// p contains the probability of each
// class, ended by zero. If the sum is
// less than one, a last, virtual choice
// is added as an "else", with correct
// probability.
//
size_t random_choice(generator_type & rand,
                     const std::initializer_list<float> & p)
 {
  float d=rand()/(float)std::numeric_limits<random_type>::max();
  auto z=p.begin();
  while (z!=p.end() && (d>*z))
   d-=*z++;
  return z-p.begin();
 }
```

The variable `d` is uniform on $[0,1]$ with `float` accuracy. Then, we walk the list. If `d` is greater than the current probability (not the sum of all previous probabilities, just the current one), we decrease `d` by the current probability, and examine the next one. If `d` is smaller than the current probability, we stop and have chosen the value. Let’s note that the above code doesn’t suppose that the `initializer_list` is normalized: if we reach beyond the end of the list, the “else” symbol is selected. That’s me being lazy and not wanting to worry that everything sums to exactly 1.

*

* *

Now, let’s test that code to see what happens:

```cpp
int main()
 {
  generator_type rand_gen(time(0));
  size_t nb_tries=10000000;
  size_t c[10]={0};
  for (size_t i=0;i<nb_tries;i++)
   c[random_choice(rand_gen,{0.1,0.1,0.2,0.3})]++;
  for (int i=0;i<5;i++)
   std::cout << c[i] << '\t'
             << c[i]/(float)nb_tries
             << std::endl;
 }
```

Prints:

```
1001059   0.100106
1001505   0.100151
2000932   0.200093
2998483   0.299848
2998021   0.299802
```

Showing that, in the long run, it does select the values with the right probabilities.

*

* *

One last remark on the cost of `initializer_list`. To my surprise, `initializer_list` shows no significant pessimization over the use of a simple, null-terminated array. One reason is that initializer lists may be implemented using only two pointers over a (statically allocated) array, and that doesn’t cost a lot. In any case, `initializer_list` gives some flexibility, but we’d probably also need a version taking an array (not a `std::vector`!) as input. It’s not very hard to implement.

^{1} They would not need to be contiguous, but it does simplify things a lot if they are.

The first solution that comes to mind is either a for-loop or a call to a library function like `memcpy`. The for-loop would give something like:

```cpp
char * dest = ...
char * src = ...

for (std::size_t i=0;i<5;i++)
 dest[i]=src[i];
```

If you’re lucky, the compiler understands the small copy and optimizes. More likely, it will merely unroll the loop and generate something like:

```
14d4: 0f b6 07      movzx eax,BYTE PTR [rdi]
14df: 88 04 24      mov BYTE PTR [rsp],al
14e2: 0f b6 47 01   movzx eax,BYTE PTR [rdi+0x1]
14e6: 88 44 24 01   mov BYTE PTR [rsp+0x1],al
14ea: 0f b6 47 02   movzx eax,BYTE PTR [rdi+0x2]
14ee: 88 44 24 02   mov BYTE PTR [rsp+0x2],al
14f2: 0f b6 47 03   movzx eax,BYTE PTR [rdi+0x3]
14f6: 88 44 24 03   mov BYTE PTR [rsp+0x3],al
14fa: 0f b6 47 04   movzx eax,BYTE PTR [rdi+0x4]
14fe: 88 44 24 04   mov BYTE PTR [rsp+0x4],al
```

Let’s see how we can fix that!

If the number of bytes to copy is unknown at compile time, there isn’t much we can do but rely on `memcpy`, which has quite good performance. I once spent an afternoon trying to get better performance than `memcpy` with the same kind of tricks and only managed a shadow of a ghost of a flea of a speedup. But if the number of bytes is known at compile time, we can use templates to force the compiler to generate better code!

Let’s start with something simple: copy the first byte, and if there’s more to copy, call (recursively) again:

```cpp
template<int n>
inline void copy(char *src, char *dest)
 {
  *dest=*src;
  copy<n-1>(src+1,dest+1);
 }

// base case!
template<>
inline void copy<0>(char *,char *) {}
```

The base case is needed to stop the recursion “at the bottom”. This unrolls the copy and basically produces the same code as the previous for-loop. To speed things up, we must copy in bigger—as big as possible—chunks. Let’s suppose that the largest basic type is `uint64_t` (while in fact it may be given by `max_align_t`, probably `long double`, if you have a 64-bit CPU). Then, while there are at least 64 bits left to copy, copy 64 bits, decrease the number of bytes to copy by 8, and do it again. Then repeat (once!) with 32 bits, then (once) with 16, and finally, the last byte.

```cpp
template <std::size_t size>
void copy(char *s, char *d)
 {
  if (size>=8) // 8 bytes, 64 bits!
   {
    *(std::uint64_t*)d=*(std::uint64_t*)s;
    copy<size-8>(s+8,d+8);
   }
  else if (size>=4) // 4, 32 bits!
   {
    *(std::uint32_t*)d=*(std::uint32_t*)s;
    copy<size-4>(s+4,d+4);
   }
  else if (size>=2) // 2, 16 bits!
   {
    *(std::uint16_t*)d=*(std::uint16_t*)s;
    copy<size-2>(s+2,d+2);
   }
  else
   *d=*s; // last char.
 }

// aha-a. base caseS.
template <> void copy<0>(char *, char *) {}
template <> void copy<(std::size_t)-1>(char *, char *) {}
template <> void copy<(std::size_t)-2>(char *, char *) {}
template <> void copy<(std::size_t)-3>(char *, char *) {}
template <> void copy<(std::size_t)-4>(char *, char *) {}
template <> void copy<(std::size_t)-5>(char *, char *) {}
template <> void copy<(std::size_t)-6>(char *, char *) {}
template <> void copy<(std::size_t)-7>(char *, char *) {}
```

Why so many base cases? Well, it seems the compiler can’t quite figure out that the recursive function is greedy and therefore tries to unroll all cases simultaneously; therefore, if `size` is zero, `size-4` may be 0, -1, -2, -3, … And since we have `size-8` in the template, there may be up to 8 base cases!

The generated code is now, for 5 bytes:

```
14d4: 8b 07         mov eax,DWORD PTR [rdi]
14de: 89 04 24      mov DWORD PTR [rsp],eax
14e1: 0f b6 47 04   movzx eax,BYTE PTR [rdi+0x4]
14e5: 88 44 24 04   mov BYTE PTR [rsp+0x4],al
```

*

* *

With the template, we went from 10 instructions to only 4, and also to simpler instructions. I do not know what the (real) speed difference is between a simple `mov` and a `movzx` (move with zero extend), but the instruction takes one fewer byte. Maybe it’s inconsequential.

The two main things I had to fix were header dependencies and missing files. The first is easily fixed: you just need to parse headers as well. The second isn’t too hard to fix either, depending on how you intend to find files.

Finding headers relies on two parameters passed to the script. The first one is a series of include paths (that you’d have anyway in a Makefile) and the second the source files (that you’d also have in a Makefile). Typically, your Makefile will have lines such as:

```make
SOURCES= \
    $(wildcard sources/*.cpp)

INCLUDES=-I. -I./includes/ -I./plugins -I/usr/include/boost/
```

and you invoke the script:

```make
depend:
	@./make-depend.sh "$(INCLUDES)" "$(SOURCES)" > .depend
```

Then, you can grep your way into the files:

```bash
headers=$( grep -e '^\ *#\ *include' $f \
           | sed 's/.*\(<\|"\)\(.*\)\(>\|"\).*/\1\2\3/' \
           | tr '\n' ' ' \
           | sed -e 's/[ ]*$//' \
         )
```

This new version finds both <includes> and “includes”, and keeps the delimiter. I will use that later on to check whether or not the file is missing.

To check if an include exists in your project, I check all the directories pointed to by the includes list, and if the script doesn’t find the file, it checks whether it’s included within quotes or within <>. If it is quoted, then it *must* be local, and the script should have found it: it outputs an error about that file not being found. If it is enclosed in <>, the script considers that if the name includes a dot (as in <thingie.hpp>) it’s part of the project and must be found; otherwise it assumes it’s some part of the standard library and doesn’t complain.

```bash
function exists
{
    local where=$1
    local dir=$2
    local name=$3
    local found=false

    dequoted_name=$(echo $name | sed 's/\(<\|>\|"\)//g')
    for w in ${where[@]}
    do
        local f=$w/$dir/$dequoted_name
        if [ -f "$f" ]
        then
            echo -n $(joli $f)' '
            found=true
        fi
    done

    # file not found?
    if [ $found = false ]
    then
        # check if it might be in the STL
        # by assuming that if there's a dot in
        # the name, it's user-defined (and if
        # it's in quotation, check anyway)
        #
        if [[ "$name" =~ .*'.'.* || \
              "$name" =~ \".*\" ]]
        then
            echo not found: $name 1>&2
        else
            :; # should do better checking
        fi
    fi
}
```

*

* *

There are still a number of things to fix. First, the script doesn’t deal with random spaces, for example `< spaces >`, but that shouldn’t be a problem.

*

* *

Click to deconflapulate:

```bash
#!/usr/bin/env bash

function joli
{
    # sed: the first symbol after s is the separator
    # replaces multiple consecutive ///// by /
    # then removes initial ./ at the beginning
    echo $1 | sed -e 's,/[/]*,/,g' | sed -e 's,^\./,,'
}

function exists
{
    local where=$1
    local dir=$2
    local name=$3
    local found=false

    dequoted_name=$(echo $name | sed 's/\(<\|>\|"\)//g')
    for w in ${where[@]}
    do
        local f=$w/$dir/$dequoted_name
        if [ -f "$f" ]
        then
            echo -n $(joli $f)' '
            found=true
        fi
    done

    # file not found?
    if [ $found = false ]
    then
        # check if it might be in the STL
        # by assuming that if there's a dot in
        # the name, it's user-defined (and if
        # it's in quotation, check anyway)
        #
        if [[ "$name" =~ .*'.'.* || \
              "$name" =~ \".*\" ]]
        then
            echo not found: $name 1>&2
        else
            :; # should do better checking
        fi
    fi
}

################################

includes=$(echo $1 | sed s/-I//g)
files=$2

all_headers=( )
for f in ${files[@]}
do
    # grep: finds lines that begin by #include
    # sed: extracts between < > or " " but keeps delimiters
    # tr: replaces \n by space
    # sed: removes trailing spaces
    echo -n ${f%.*}.o": "$f" "
    headers=$( grep -e '^\ *#\ *include' $f \
               | sed 's/.*\(<\|"\)\(.*\)\(>\|"\).*/\1\2\3/' \
               | tr '\n' ' ' \
               | sed -e 's/[ ]*$//' \
             )
    z=$( for h in ${headers[@]}
         do
             #echo "$h" 1>&2
             d=$(dirname $h | sed -e 's/^\.//' )
             b=$(basename $h)
             exists "${includes[@]}" "$d" "$b"
         done
       )
    echo $z # | sed -e 's/\ /\ \\\n/g'
    echo
    all_headers+=( $z )
done

headers=$(echo ${all_headers[@]} | tr ' ' '\n' | sort -u)
for h in ${headers[@]}
do
    echo $h":"
    echo
done

exit 0
```