While discussing this algorithm in class, a student asked a very interesting question: what’s special about base 2? Couldn’t we use another base? Well, yes, yes we can.

Indeed, there’s nothing special about base 2, except maybe that it’s very simple, and it yields a very simple implementation. For example:

unsigned expo_iter(unsigned x, unsigned p) { unsigned t=x, e=1; while (p) { if (p&1) e*=t; t*=t; p/=2; } return e; }

This simple procedure raises x to the p-th power in ⌊lg p⌋+1 steps. And the steps are (mostly) inexpensive: a shift right by one bit, a mask, and a multiply (which could be expensive if the objects are complex and we can't rely on machine-sized multiplications).

But what about other bases? Say, base 7? Or 3? Or any base other than 1?

First, we must understand what the binary algorithm does. It exploits the fact that $x^{a+b}=x^a x^b$ in a very specific way: the exponent is broken into powers of two, so that $x^{25}$ becomes $x^{16}x^{8}x^{1}$; and indeed the powers to use are given by the binary representation of 25, 11001 (which is 16+8+1!). Now, let us see what the above algorithm does. The following table gives what the algorithm computes, each row showing the state at the end of the step:

| Step | e | p | t |
|--------|------|----|------|
| (init) | 1 | 25 | x |
| 1 | x | 12 | x² |
| 2 | x | 6 | x⁴ |
| 3 | x | 3 | x⁸ |
| 4 | x⁹ | 1 | x¹⁶ |
| 5 | x²⁵ | 0 | x³² |
| (done) | x²⁵ | 0 | — |

So it is clear that the number of iterations is given by the position of the most significant bit of the exponent (assuming it's an integer!), and we do at most two multiplies (and one shift) per round.

*

* *

So how do we generalize this algorithm from base 2 to some base b? Certainly, the decomposition of the exponent in base b is necessary. Let's pick b=3 (because it's not 2 and it's also not too big) and exponent 98 (also, no special reason other than a good mix of digits, and just long enough that we can figure something out of them). We will use the digits to multiply each group of powers of 3 so that the total is 98. Indeed, we have 98 = 2·3⁰ + 2·3¹ + 1·3² + 0·3³ + 1·3⁴, so that x⁹⁸ = x²·(x³)²·(x⁹)¹·(x²⁷)⁰·(x⁸¹)¹. Indeed, each row showing the state at the end of the step:


| Step | e | m | p | t |
|--------|----|----|------|------|
| (init) | 98 | — | 1 | x |
| 1 | 32 | 2 | x² | x³ |
| 2 | 10 | 2 | x⁸ | x⁹ |
| 3 | 3 | 1 | x¹⁷ | x²⁷ |
| 4 | 1 | 0 | x¹⁷ | x⁸¹ |
| 5 | 0 | 1 | x⁹⁸ | x⁸¹ |
| (done) | 0 | — | x⁹⁸ | — |

Now, let's turn that into code! Here, *Mathematica* will help us deal with arbitrarily large numbers.

```mathematica
expo[x_, e_, b_: 2] :=
 Module[{tx = x, p = 1, te = e, m},
  While[te != 0,
   m = Mod[te, b];
   If[m != 0, p *= tx^m];
   te = Quotient[te, b];
   If[te != 0, tx = tx^b]; (* if product is expensive *)
   ];
  p
  ]
```

*

* *

This algorithm trades off the number of iterations (from about ⌊log₂ e⌋ down to about ⌊log_b e⌋ to raise to the e-th power) for a more complex inner loop: div/mod by something other than 2 could be expensive, even if divmod is likely just one instruction!

I recently got an AMD Ryzen 9 3900X, and I wondered if I have the bug as well.

Let's first write a routine to read random numbers with `rdrand`. This instruction sets the carry flag to let us know whether the value returned is random enough, so that we can retry if it isn't. GCC/Gas assembly lets us easily interface with this instruction. First, a small header:

```cpp
#ifndef __module__rrand__
#define __module__rrand__

#include <cstdint>
#include <utility>

using rand_t = std::pair<uint64_t, uint64_t>;

rand_t rd_rand();

#endif // __module__rrand__
```

With an assembly file:

```asm
_Z7rd_randv:
.LFB82:
        .cfi_startproc
        xor     rdx,rdx
.retry:
        inc     rdx
        rdrand  rax
        jnc     .retry
        ## rax,rdx returns a std::pair<uint64_t,uint64_t>
        ret
        .cfi_endproc
```

This function returns two values, one with the random value (in rax), the other with the number of tries (in rdx; 1 if it succeeds right away). A quick test shows us that it (seems to) work:

```cpp
#include <iomanip>
#include <iostream>
// plus the rd_rand() header from above

int main()
{
    std::cout << std::hex << std::setfill('0');
    for (int i = 0; i < 100000000; i++)
    {
        rand_t t = rd_rand();
        std::cout << std::dec << t.second << '\t'
                  << std::hex << std::setw(16) << t.first << std::endl;
    }
    return 0;
}
```

It prints:

```
Tries	Values
1	848d44f6fabf8aa5
1	36b3c5953104ea25
1	e5312e8cfdba8d62
1	c55a698dcd5ec5fd
1	a7bf76560b22d56a
1	207c2fa54beea397
1	ae27f2a5a9263b83
1	a829dd4aabd41b8f
1	a8e3d747de951a02
1	f9781bda02545036
1	410cc2263d1c8001
1	f8ebf3fad61c1d6b
1	70a76df3f4759e4e
1	f5aabcfa42b4824d
1	fcdd1260c56027ec
```

Well, that’s not `fff...ff`, but is it random?

Changing the above code to count the number of times each bit is set to either 0 or 1:

```cpp
#include <cstdint>
#include <iostream>
// plus the rd_rand() header from above

int main()
{
    uint64_t counts[64] = {0};
    for (int i = 0; i < 1000000; i++)
    {
        rand_t t = rd_rand();
        for (int j = 0; j < 64; j++)
            counts[j] += ((t.first >> j) & 1);
    }

    // display
    for (int j = 0; j < 64; j++)
        std::cout << counts[j] << std::endl;

    return 0;
}
```

We take those results and plot them (in Gnumeric, for example):

It seems that my AMD R9 3900x doesn’t suffer from that defect.

*

* *

At first, I tried to use GCC inline assembly:

```cpp
rand_t rd_rand()
{
    uint64_t x, r;
    asm(".intel_syntax noprefix;"
        " xor %1,%1;"
        ".retry%=:"
        " inc %1;"
        " rdrand %0;"
        " jnc .retry%=;"
        ".att_syntax;"
        : "=R"(x), "=R"(r) // %0, %1
        :                  // no input
        :                  // auto clobber
        );
    return {x, r};
}
```

But the compiler kept optimizing the function away, with weird results. Sometimes working, sometimes not… depending on whether or not I used both members of the pair. GCC inline assembly always gave me trouble.


In the 6×7×6 palette example, we have used the formula

`c=42*r+6*g+b;`

to encode a 6×7×6 RGB triplet into a single value. The inverse was given by

`b=c%6;`

`g=(c/6)%7;`

`r=c/42;`

That may give the impression that we must encode/decode the entire triplet each time. Certainly, the inverse gives us the means to extract only one of the values—by using any of the three equations! But what if I wanted to rewrite only the blue component?

What about:

`c=c-(c%6)+b;`

?

Clearly, that subtracts the old blue (`c%6`) and replaces it by the new blue, `b`. What if we wanted to rewrite the green? If we did

`c=42*(c/42)+6*g+(c%6);`

that would indeed change only the green, since `42*(c/42)` is the red component only (the division and multiplication subtract the green and blue; we could have used the equivalent of `x=(x>>a)<<a` to set the lower bits to zero), `6*g` is the new green shifted into place, and `c%6` adds back the blue.

Let's look at the general case now. Let $n_1$, $n_2$, …, $n_k$, … be the numbers of values the fields can take. Let also

$p_1 = 1$,

$p_k = n_1 n_2 \cdots n_{k-1}$,

be the product of the number of values of the fields that precede the $k$th, or, in other words, the number of combinations of all possible values of the fields that precede the $k$th.

To extract the $k$th field, we compute:

$f_k = \lfloor c/p_k \rfloor \bmod n_k$.

The division shifts the desired value into the least significant position, and the mod extracts it. That's equivalent to `(x>>shift_bits)&mask_bits`, but using a potentially fractional number of bits for the shift, and a fractional number of bits for the mask.

To set a new value $f$ for the $k$th field, we compute:

$c' = (c - (c \bmod p_{k+1})) + f\cdot p_k + (c \bmod p_k)$.

The first part, $c - (c \bmod p_{k+1})$, contains all the fields “above” the $k$th. The last part, $c \bmod p_k$, contains the values of all fields before the $k$th. Finally, $f\cdot p_k$ is the new value shifted into the $k$th place.

*

* *

Let's see what the code to do this looks like. I'll use initializer lists once more to pass the $n_i$ to the function, mostly because it's convenient, and you can do your own container-agnostic version from it quite easily.

The most difficult part is to compute the $p_k$: we basically compute the product of the $n_i$ up to, but excluding, the requested field. Extracting a code becomes:

```cpp
int get(int c, int f, const std::initializer_list<int> & n)
{
    int preds = 1, ff = 0;
    auto pn = n.begin();
    while (ff++ < f) preds *= *pn++;
    return (c / preds) % *pn;
}
```

Setting a new value asks for the computation of both $p_k$ and $p_{k+1}$. The code is:

```cpp
void set(int & c, int f, int v, const std::initializer_list<int> & n)
{
    int preds = 1, ff = 0, one_more;
    auto pn = n.begin();
    while (ff++ < f) preds *= *pn++;
    one_more = preds * *pn;
    c = (c - (c % one_more)) + (v * preds) + (c % preds);
}
```

In the code, `c` is the code for all fields, `f` is the field number, and `v` the new value. A much better version would use compile-time techniques, like `std::tuple`'s `get`, to let the compiler optimize everything away.

Use would be something like this:

```cpp
// fields: 11, 3, 4, 5, 12 values.
const std::initializer_list<int> params{11, 3, 4, 5, 12};

int c = 3*11*3 + 2*11 + 7; // values: 7, 2, 3, 0, 0

std::cout << get(c, 1, params) << std::endl;
set(c, 2, 1, params);
std::cout << get(c, 2, params) << std::endl;
std::cout << get(c, 3, params) << std::endl;
```

This outputs:

```
2
1
0
```

which is what’s expected.

*

* *

What are the expected savings? Well, very obviously, that depends on the number of values for each field. If they’re all powers of two, the method becomes a complicated way of doing classical shifts and masking; if they’re not, the savings can be important. Let’s take the field values in the example above.

In classical (integer number of bits) bit-fields, encoding five fields with 11, 3, 4, 5, and 12 possible values each would ask for 4, 2, 2, 3, and 4 bits (because, for example, 2³=8<11≤16=2⁴, therefore 4 bits). That's a total of 4+2+2+3+4=15 bits.

The number of combinations the (sub)bit-field can take is 11×3×4×5×12=7920. We have log₂ 7920 ≈ 12.951. Since the language doesn't allow us to use 12.951 bits, the best we can do is to store the value in a 13-bit bit-field. That's 2 bits less than the naive encoding.

*

* *

While 2 bits may not seem like much, that leaves us two more bits to pack into the same (sub)bit-field, and we can use them to store something else. It doesn't have to be an in-memory (sub)bit-field; we could also use it to store stuff in a file, in some bit-stream where alignment isn't that important.

^{1}I didn’t say arithmetic coding because that refers, of course, to a much more elaborate coding technique.

The web safe palette divides the RGB color cube into 6×6×6 colors. Each color component, r, g, and b, varies from 0 to 5, levels that are expanded for 24-bit colors to 0x00, 0x33, 0x66, 0x99, 0xcc, 0xff (or 0, 51, 102, 153, 204, 255). If we're to encode the rgb components as we usually do, by using a fixed number of bits per component, we'd need 9 bits. Indeed, since 2^{2}=4<6<8=2^{3}, we need 3 bits per component. The encoding would then be

`c=(r<<6)|(g<<3)|b;`

And we could decode using similar bit-oriented operations (using >> and & for masking). The problem with that is that we are using 9 bits while 6×6×6=216, so clearly, we need at most 8 bits!

Using 3 bits per component would be efficient if each component had 8 levels, but they have only six. We can't shift by log_{2}6 bits for 6 levels. Or can we? Well, let's first redo the encoding:

`c=36*r+6*g+b;`

The inverse is

`b=c%6;`

`g=(c/6)%6;`

`r=c/36;`

So, if…

`x<<3 == x*8 == x*2^3`

then

`x*6 == x*2^(log_{2}6) == x<<log_{2}6`.

Because the identity works both ways, multiplying by 6 is the same as shifting by log_{2}6≈2.58 bits! Indeed, 3×log_{2}6≈7.75 bits, just as log_{2}216=log_{2}6^{3}=3×log_{2}6!

The 6×6×6 palette uses ≈97% of the 8 bits. Can we use the few extra fractions of bits to squeeze in more colors? You could think: well, if I have 216 colors, that leaves 40 other colors, and I could use the codes from 216 to 255 to encode these new colors, and decode as rgb triplets only the codes 0 to 215. Yes, that's an idea. We could also use a 6×7×6 color cube, because 6×7×6=252, and that's very nearly all the range of 8 bits: we would use 7.98 bits out of 8.

So we will use 6 levels for red, 7 for green (because the eye is more sensitive to green), and 6 for blue. The palette now somewhat lost its (pure) grays because colors are now somewhat off the diagonal.

The code looks pretty much as before, but with new values:

`c=42*r+6*g+b;`

Here, we “shift” green by log_{2}6 bits to make room for blue, and red by log_{2}42=log_{2}7+log_{2}6 to make room for green and blue. Using normal shifts, you would have shifted green by enough positions to accommodate the bits for blue, then red by enough to accommodate the bits for blue and green. We did the same here, except with fractional bit shifts in the guise of multiplications.

The inverse is

`b=c%6;`

`g=(c/6)%7;`

`r=c/42;`

Isn’t that neat?

*

* *

How do we reduce 0 to 255 on 0 to 6 (or to 7) and back? A very short program does just that:

```cpp
int pack(int rgb, int k)
{
    return (rgb * k) / 256;
}

int unpack(int v, int k)
{
    return (v * 255) / (k - 1);
}
```

The `pack` function maps the rgb (more exactly, r or g or b) component onto the interval 0≤x<1, then onto the interval 0≤y<k. Dividing by 256 (rather than 255) ensures that 255 doesn't map to 1, and therefore to k: we want a value from 0 to k-1. `unpack` reverses the operation and scales back to 0 to 255. Note that the order of the operations isn't as natural as it might have been, but it preserves precision, thanks to C/C++ integer arithmetic.

*

* *

The important point here isn’t that the 6×7×6 palette is somewhat better than the 6×6×6 web safe palette; nor even that we use 7.98 bits out of 8 instead of merely 7.75, but that shifting left and multiplying *are the same operation*—just as shifting right *is* dividing. That gives us the possibility of shifting left or right by log-values (some of which are integers).

So let's say we have a random variable to generate with few values (maybe corresponding to something like choices in a game) with different odds. Let a, b, c, d be four of those choices, with probabilities $p_a$, $p_b$, $p_c$, and $p_d$. Clearly, drawing a uniform random variable on [0,1) isn't sufficient, on its own, to choose one outcome with the desired probabilities. We have to cut the interval into four regions^{1}, one for each symbol, each with a length equal to the corresponding probability. That's actually quite easily done:

Now drawing a uniform random variable on [0,1) will let us “fall” into one of the regions, thus choosing the corresponding value.

Since the regions have lengths equal to the probabilities, we are guaranteed that the uniform variable lands with the desired probabilities in each region. There! We have a technique to generate discrete variables with any kind of random distribution—even for those for which it would be hard to find a clever formula.

*

* *

OK, now, how do we translate that into code?

If we have a lot of values (and correspondingly a large number of probabilities), we would have to build a table of cumulative probabilities (the i^{th} entry would contain the sum of all probabilities up to the i^{th} one), and then use binary search (or better yet, interpolation search) to find where our uniformly chosen value lands.

However, if we have very few values, say, four or five, that’s a lot of work!

First, let’s setup our uniform random generator using C++11 `<random>` header:

```cpp
using random_type = uint64_t;
using generator_type =
    std::linear_congruential_engine
    <random_type,
     random_type{6364136223846793005}, // Knuth's
     random_type{1442695040888963407}, // Knuth's
     std::numeric_limits<random_type>::max()>;
```

Now, we must compute the cumulative (mass) function, and search it using binary search. We will do neither! Let’s rather do this:

```cpp
////////////////////////////////////////
//
// p contains the probability of each
// class, ended by zero. If the sum is
// less than one, a last, virtual choice
// is added as an "else", with correct
// probability.
//
size_t random_choice(generator_type & rand,
                     const std::initializer_list<float> & p)
{
    float d = rand() / (float)std::numeric_limits<random_type>::max();
    auto z = p.begin();
    while (z != p.end() && (d > *z))
        d -= *z++;
    return z - p.begin();
}
```

The variable `d` is uniform on [0,1) with `float` accuracy. Then, we walk the list. If `d` is greater than the current probability (not the sum of all previous probabilities, just the current one), we decrease `d` by the current probability and examine the next one. If `d` is smaller than the current probability, we stop: we have chosen the value. Let's note that the above code doesn't assume that the `initializer_list` is normalized: if we reach beyond the end of the list, the “else” symbol is selected. That's me being lazy and not wanting to worry that everything sums to exactly 1.

*

* *

Now, let’s test that code to see what happens:

```cpp
#include <ctime>
#include <iostream>
// plus the generator and random_choice definitions from above

int main()
{
    generator_type rand_gen(time(0));
    size_t nb_tries = 10000000;
    size_t c[10] = {0};

    for (size_t i = 0; i < nb_tries; i++)
        c[random_choice(rand_gen, {0.1, 0.1, 0.2, 0.3})]++;

    for (int i = 0; i < 5; i++)
        std::cout << c[i] << '\t'
                  << c[i] / (float)nb_tries << std::endl;
}
```

Prints:

```
1001059	0.100106
1001505	0.100151
2000932	0.200093
2998483	0.299848
2998021	0.299802
```

Showing that, in the long run, it does select the values with the right probabilities.

*

* *

One last remark on the cost of `initializer_list`. To my surprise, `initializer_list` shows no significant pessimization over a simple, null-terminated array. One reason is that an initializer list may be implemented as just two pointers over a (statically allocated) array, and that doesn't cost a lot. In any case, `initializer_list` gives some flexibility, but we'd probably also need a version taking an array (not `std::vector`!) as input. It's not very hard to implement.

^{1} They would not need to be contiguous, but it does simplify things a lot if they are.

The first solution that comes to mind is either a for-loop or a call to a library function like `memcpy`. The for-loop would give something like:

```cpp
char * dest = ...
char * src = ...

for (std::size_t i = 0; i < 5; i++)
    dest[i] = src[i];
```

If you’re lucky, the compiler understands the small copy and optimizes. More likely, it will merely unroll the loop and generate something like:

```
14d4: 0f b6 07       movzx eax,BYTE PTR [rdi]
14df: 88 04 24       mov   BYTE PTR [rsp],al
14e2: 0f b6 47 01    movzx eax,BYTE PTR [rdi+0x1]
14e6: 88 44 24 01    mov   BYTE PTR [rsp+0x1],al
14ea: 0f b6 47 02    movzx eax,BYTE PTR [rdi+0x2]
14ee: 88 44 24 02    mov   BYTE PTR [rsp+0x2],al
14f2: 0f b6 47 03    movzx eax,BYTE PTR [rdi+0x3]
14f6: 88 44 24 03    mov   BYTE PTR [rsp+0x3],al
14fa: 0f b6 47 04    movzx eax,BYTE PTR [rdi+0x4]
14fe: 88 44 24 04    mov   BYTE PTR [rsp+0x4],al
```

Let’s see how we can fix that!

If the number of bytes to copy is unknown at compile time, there isn't much we can do but rely on `memcpy`, which has quite good performance. I once spent an afternoon trying to get better performance than `memcpy` with the same kind of tricks and only managed a shadow of a ghost of a flea of a speedup. But if the number of bytes is known at compile time, we can use templates to force the compiler to generate better code!

Let’s start with something simple: copy the first byte, and if there’s more to copy, call (recursively) again:

```cpp
template <int n>
inline void copy(char *src, char *dest)
{
    *dest = *src;
    copy<n-1>(src + 1, dest + 1);
}

// base case!
template <>
inline void copy<0>(char *, char *) {}
```

The base case is needed to stop the recursion “at the bottom”. This unrolls the copy and basically produces the same code as the previous for-loop. To speed things up, we must copy in bigger—as big as possible—chunks. Let's suppose that the largest basic type is `uint64_t` (while in fact it may be given by `max_align_t`, probably `long double`, if you have a 64-bit CPU). Then, while there are 8 or more bytes left to copy, copy 64 bits, decrease the number of bytes to copy by 8, and do it again. Then repeat (once!) with 32 bits, then (once) with 16, and finally, copy the last byte.

```cpp
template <std::size_t size>
void copy(char *s, char *d)
{
    if (size >= 8) // 8 bytes, 64 bits!
    {
        *(std::uint64_t*)d = *(std::uint64_t*)s;
        copy<size-8>(s + 8, d + 8);
    }
    else if (size >= 4) // 4 bytes, 32 bits!
    {
        *(std::uint32_t*)d = *(std::uint32_t*)s;
        copy<size-4>(s + 4, d + 4);
    }
    else if (size >= 2) // 2 bytes, 16 bits!
    {
        *(std::uint16_t*)d = *(std::uint16_t*)s;
        copy<size-2>(s + 2, d + 2);
    }
    else
        *d = *s; // last char.
}

// aha-a. base caseS.
template <> void copy<0>(char *, char *) {}
template <> void copy<(std::size_t)-1>(char *, char *) {}
template <> void copy<(std::size_t)-2>(char *, char *) {}
template <> void copy<(std::size_t)-3>(char *, char *) {}
template <> void copy<(std::size_t)-4>(char *, char *) {}
template <> void copy<(std::size_t)-5>(char *, char *) {}
template <> void copy<(std::size_t)-6>(char *, char *) {}
template <> void copy<(std::size_t)-7>(char *, char *) {}
```

Why so many base cases? Well, the compiler can't quite figure out which branches are taken, so it instantiates all of them, and since `size` is unsigned, `size-8` wraps around for small sizes: `copy<1>` instantiates `copy<(std::size_t)-7>`, and so on. Since we subtract up to 8 in the template, there may be up to 8 base cases!

The generated code is now, for 5 bytes:

```
14d4: 8b 07          mov   eax,DWORD PTR [rdi]
14de: 89 04 24       mov   DWORD PTR [rsp],eax
14e1: 0f b6 47 04    movzx eax,BYTE PTR [rdi+0x4]
14e5: 88 44 24 04    mov   BYTE PTR [rsp+0x4],al
```

*

* *

With the template, we went from 10 instructions to only 4, and to simpler instructions. I do not know what the (real) speed difference is between a simple `mov` and a `movzx` (move with zero extend), but the plain `mov` takes one fewer byte. Maybe it's inconsequential.

The two main things I had to fix were header dependencies and missing files. The first is easily fixed: you just need to parse headers as well. The second isn't too hard to fix either, depending on how you intend to find files.

Finding headers relies on two parameters passed to the script. The first one is a series of include paths (that you’d have anyway in a Makefile) and the second the source files (that you’d also have in a Makefile). Typically, your Makefile will have lines such as:

```make
SOURCES= \
    $(wildcard sources/*.cpp)

INCLUDES=-I. -I./includes/ -I./plugins -I/usr/include/boost/
```

and you invoke the script:

```make
depend:
	@./make-depend.sh "$(INCLUDES)" "$(SOURCES)" > .depend
```

Then, you can grep your way into the files:

```bash
headers=$( grep -e '^\ *#\ *include' $f \
           | sed 's/.*\(<\|"\)\(.*\)\(>\|"\).*/\1\2\3/' \
           | tr '\n' ' ' \
           | sed -e 's/[ ]*$//' \
         )
```

This new version finds both <includes> and “includes”, and keeps the delimiters. I will use them later on to check whether or not the file is missing.

To check whether an include exists in your project, the script checks all the directories pointed to by the includes list; if it doesn't find the file, it checks whether the include is written with quotes or with <>. If it is quoted, then it *must* be local, and the script should have found it: it outputs an error about that file not being found. If it is enclosed in <>, the script considers that if the name includes a dot (as in <thingie.hpp>), it's part of the project and must be found; otherwise, it assumes it's some part of the standard library and doesn't complain.

```bash
function exists
{
    local where=$1
    local dir=$2
    local name=$3
    local found=false

    dequoted_name=$(echo $name | sed 's/\(<\|>\|"\)//g')
    for w in ${where[@]}
    do
        local f=$w/$dir/$dequoted_name
        if [ -f "$f" ]
        then
            echo -n $(joli $f)' '
            found=true
        fi
    done

    # file not found?
    if [ $found = false ]
    then
        # check if it might be in the STL
        # by assuming that if there's a dot in
        # the name, it's user-defined (and if
        # it's in quotation, check anyway)
        #
        if [[ "$name" =~ .*'.'.* || \
              "$name" =~ \".*\" ]]
        then
            echo not found: $name 1>&2
        else
            :; # should do better checking
        fi
    fi
}
```

*

* *

There are still a number of things to fix. First, the script doesn't deal with stray spaces, for example `< spaces >`, but that shouldn't be a problem.

*

* *

Click to deconflapulate:

```bash
#!/usr/bin/env bash

function joli
{
    # sed: the first symbol after s is the separator
    # replaces multiple consecutive ///// by /
    # then removes the initial ./ at the beginning
    echo $1 | sed -e 's,/[/]*,/,g' | sed -e 's,^\./,,'
}

function exists
{
    local where=$1
    local dir=$2
    local name=$3
    local found=false

    dequoted_name=$(echo $name | sed 's/\(<\|>\|"\)//g')
    for w in ${where[@]}
    do
        local f=$w/$dir/$dequoted_name
        if [ -f "$f" ]
        then
            echo -n $(joli $f)' '
            found=true
        fi
    done

    # file not found?
    if [ $found = false ]
    then
        # check if it might be in the STL
        # by assuming that if there's a dot in
        # the name, it's user-defined (and if
        # it's in quotation, check anyway)
        #
        if [[ "$name" =~ .*'.'.* || \
              "$name" =~ \".*\" ]]
        then
            echo not found: $name 1>&2
        else
            :; # should do better checking
        fi
    fi
}

#################################

includes=$(echo $1 | sed s/-I//g)
files=$2

all_headers=( )
for f in ${files[@]}
do
    # grep: finds lines that begin with #include
    # sed: extracts between < > or " " but keeps delimiters
    # tr: replaces \n by space
    # sed: removes trailing spaces
    echo -n ${f%.*}.o": "$f" "
    headers=$( grep -e '^\ *#\ *include' $f \
               | sed 's/.*\(<\|"\)\(.*\)\(>\|"\).*/\1\2\3/' \
               | tr '\n' ' ' \
               | sed -e 's/[ ]*$//' \
             )
    z=$(
        for h in ${headers[@]}
        do
            #echo "$h" 1>&2
            d=$(dirname $h | sed -e 's/^\.//' )
            b=$(basename $h)
            exists "${includes[@]}" "$d" "$b"
        done
       )
    echo $z # | sed -e 's/\ /\ \\\n/g'
    echo
    all_headers+=( $z )
done

headers=$(echo ${all_headers[@]} | tr ' ' '\n' | sort -u)
for h in ${headers[@]}
do
    echo $h":"
    echo
done

exit 0
```

Well, that's not bad, but for now, I must reconsider the time I invested, and will continue to invest, in this blog. Posting once a week, even with the summer hiatuses, is becoming a strenuous pace, especially since I also have (many) other things to do. So I was tempted to just end it and say, well, after ten years, it's much better than your average two-post blog, let's call it quits. But the thing is, I enjoy writing about the things I do, the math I work out, and other ideas. So I will likely stop for this month, then be back with once-a-month posting.

‘Till then, it’s summer. Let’s enjoy it.


The Rosenberg-Strong function [1,2] maps the pairs (i,j) in the following way:

If we look at it closely, we see that numbers aren’t placed on diagonals, but in “shells” (technically, gnomons). The original idea was to use this as a storage allocation method, so that each new “shell” was somehow minimal (I am not convinced how this is meant to make sense in block-based storage).

How do we construct the function? Well, first thing is to figure how to generate the shells:

That's easily found: the shell number is simply max(i,j)! We then notice a pattern linking the number of the shell and its first value: shell s begins at s².

Now to compute the numbers within the gnomon, let's say i indexes rows and j columns. On the arm where i=max(i,j), we just add j; on the arm where j=max(i,j), we additionally add the distance max(i,j)-i. In both cases, the amount added to max(i,j)² is j+max(i,j)-i. The inverse is almost simpler than the function itself.

To find the inverse, we notice that the shell containing some number n is b=⌊√n⌋. Then b² is the base of the shell, and r=n-b² is the “rest”, the position of n on the gnomon. If r<b (the gnomon with base b² has 2b+1 numbers on it), it's on the horizontal section and the pair is (b,r); otherwise it's on the vertical part, and the pair is (2b-r,b).

The Mathematica code is:

```mathematica
s[i_, j_] := Max[i, j]^2 + j + (Max[i, j] - i)

is[n_] := Module[{b = Floor[Sqrt[n]], r},
  r = n - b^2;
  If[r < b,
   {b, r},
   {2 b - r, b}
   ]
  ]
```

*

* *

The scan order imposed by the Rosenberg-Strong function is:

We see that at each gnomon's end, we jump across the map to the other side. What we would want, rather, is for it always to wrap to a close neighbor.

Fortunately, getting a boustrophedonic variant isn’t hard at all! If the shell is even, we wind it one way; if it’s odd, we wind it the other:

We, indeed, get:

The Mathematica code is:

```mathematica
bs[i_, j_] := If[EvenQ[Max[i, j]^2], s[i, j], s[j, i]]

ibs[n_] := Module[{t = is[n]},
  If[EvenQ[Max[t]^2], t, Reverse[t]]
  ]
```

*

* *

What if we're interested in more than two numbers? We could, of course, devise some complicated function for 3, or more; but we can use the following relations to accomplish just that with the function s alone:

⟨i,j⟩ = s(i,j),

⟨i,j,k⟩ = s(s(i,j),k),

⟨i,j,k,l⟩ = s(s(s(i,j),k),l),

etc.

[1] Arnold L. Rosenberg — *Allocating Storage for Extendible Arrays* — J. ACM, vol 21, no 4 (1974), p. 652–670

[2] Arnold L. Rosenberg, H. R. Strong — *Addressing Arrays by Shells* — IBM Technical Disclosure Bulletin, vol 14, no 10 (1972), p. 3026–3028