If we sort the list (or array), the cost is O(n log n) if we use a comparison sort (or O(n) if we use an address-transformation sort like the radix sort). If we use selection, that’s O(n) on average, but, just as with Quicksort, we still have an O(n²) worst case. Heuristics can be much faster, but do not guarantee an exact result.

So far, I’ve only considered the general case where the range of allowable values is much greater than the length of the list. But if the range is small, say 0-255, and the list is long, certainly we can do better.

Indeed! Remember, there’s a thing called counting sort, where you count the number of times each value appears, then just fill the destination with the right number of copies of each. This is O(n + k), since we need O(n) steps to scan the list and update counts (since the range is limited, we can use a direct index, and the update operation (++) is constant-time), then O(k) steps to scan the k different counts and make fill-copies.
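As a refresher, counting sort under these assumptions might look like this (a sketch with my own naming, assuming non-negative values below `range`):

```cpp
#include <vector>
#include <cstddef>

// Counting sort for values known to lie in [0, range).
// O(n) to tally, O(n + range) to rebuild the output.
std::vector<int> counting_sort(const std::vector<int> & v, std::size_t range)
{
    std::vector<std::size_t> counts(range, 0);
    for (int x : v) counts[x]++;                   // direct-index tally

    std::vector<int> out;
    out.reserve(v.size());
    for (std::size_t i = 0; i < range; i++)
        out.insert(out.end(), counts[i], (int)i);  // fill-copy counts[i] copies of i
    return out;
}
```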

Turns out that this is pretty much what we need to do if we’re interested in the median, except for the last step, the fill-copy. If we scan the list and update the counts (for the k different values), the sum of the counts is the length of the list (since each item of the list ends up in one of the slots). We just have to scan these counts until we’ve seen half of the items in the original list. This is O(n + k).

In code, that’d look something like this:

template <int range, typename T>
T counting_median(const std::vector<T> & v)
{
    std::vector<size_t> counts(range, 0);
    for (const T & vv : v) counts[vv]++;

    size_t sum = 0;
    size_t half = (v.size() + 1) / 2;
    for (size_t i = 0; i < counts.size(); i++)
        if (sum + counts[i] >= half)
            return (T)i;
        else
            sum += counts[i];

    return v[0]; // prevents "warning: control reaches end of non-void function"
}

The first part of the function counts the number of instances of the values in the range. This implementation supposes that everything is non-negative. We could use a `std::map`, but the complexity would become O(n log k), with k the number of distinct values. The second part adds each count until the sum reaches or exceeds half the number of items. Then it returns the index where it stopped counting.
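For what it’s worth, the `std::map` variant mentioned above would handle negative and sparse values too. A sketch (function name is mine):

```cpp
#include <map>
#include <vector>
#include <cstddef>

// Counting median with a std::map: handles negative and sparse values,
// at the cost of O(log k) per update, k = number of distinct values.
template <typename T>
T map_counting_median(const std::vector<T> & v)
{
    std::map<T, std::size_t> counts;          // value -> occurrences, kept sorted by key
    for (const T & x : v) counts[x]++;

    std::size_t sum = 0, half = (v.size() + 1) / 2;
    for (const auto & p : counts) {
        sum += p.second;
        if (sum >= half) return p.first;      // reached half the items
    }
    return v.front();                         // unreachable for non-empty input
}
```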

*

* *

How much faster does that version run compared to the others? The other implementations are:

- `std::sort`, that …sorts the whole thing.
- `selection`, that we’ve seen before.
- `std::nth_element`, that basically does `selection`, but somewhat better than my implementation.
- The median heap, an heuristic.
- The counting median.

To compare fairly, the same vector of length 10000, filled with random values in the 0-999 range, is run through all alternatives. The experiment is repeated 1000 times. This gives the following results:

The counting median is therefore much faster than the other, more general, methods. Of course, we get a speed-up because we exploit a special case, where the number of different values is small compared to the length of the list. But hey, why not?

Filed under: algorithms, C-plus-plus Tagged: heuristics, median, QuickSort, selection, sort

int fonction (int random_spacing)^M{ ^M int niaiseuses; for (int i=0;i<random_spacing; i++){ { { std::cout << bleh << std::endl; }} } }

There’s a bit of everything. Random spacing. Traces of conversions from one OS to another, braces at the end of line. Of course, they lose points, but that doesn’t make the code any easier to read. In a previous installment, I proposed something to rebuild the whitespaces only. Now, let’s see how we can repair as many defects as possible with an Emacs function.

Let’s start at the beginning: a list of the things to repair:

- OS-related conversion. Linux/*nixes end lines in \n, Windows in \r\n. Other platforms may use something else. Let’s not concern ourselves with the ZX80.
- Replace longs series of (white)spaces by only one space.
- Deal with braces at the end of lines.
- Reindent everything else using the defined style.

The first two items can be combined. Since transforming \r\n into \n only requires to remove \r, we can bundle series of (white)spaces and \r for replacement. I’m not a regex ninja: I came up with this:

; replaces multiple spaces and stray ^M
(while (re-search-forward "[[:space:]\|?\r]+" nil t)
  (replace-match " " nil nil))

Trailing braces are a bit more complicated. They may, or may not, be preceded by spaces and *followed* by spaces. This time, the regex is a bit more complicated:

; remove fiendish { at end of (non-empty) line
(while (re-search-forward "\\([^[:space:]{?\n]+\\)\\([[:space:]]*\\)\\({\\)\\([[:space:]]*$\\)" nil t)
  (replace-match "\\1\n{" nil nil))

It matches three parts. Something that is not whitespaces, followed by something that is whitespaces, the brace {, then whitespaces to the end of line. OK, that makes four. The only one we’re interested in not replacing is the first (the \\1 argument in replace). Everything else, most of it whitespaces, is replaced by newline, { , newline.

Now, the buffer should be in a rather messy state, possibly with trailing whitespaces and destroyed indentation. Calls to `whitespace-cleanup` and `indent-region` should finish the job.

Putting all that together:

(defun cleanup-whole-buffer()
  "Removes ^M, tabs, and reindent whole buffer"
  (interactive)
  (save-excursion
    (undo-boundary)
    (beginning-of-buffer)
    ; replaces multiple spaces and stray ^M
    (while (re-search-forward "[[:space:]\|?\r]+" nil t)
      (replace-match " " nil nil))
    (beginning-of-buffer)
    ; remove fiendish { at end of (non-empty) line
    (while (re-search-forward "\\([^[:space:]{?\n]+\\)\\([[:space:]]*\\)\\({\\)\\([[:space:]]*$\\)" nil t)
      (replace-match "\\1\n{" nil nil))
    (beginning-of-buffer)
    (whitespace-cleanup)
    (indent-region (point-min) (point-max) nil)))

A few explanations on the other stuff we haven’t discussed yet. The `save-excursion` primitive saves the cursor position so that when the function ends, we are still where we called it from. The `undo-boundary` makes sure that we won’t need a series of undos to undo the cleanup. `beginning-of-buffer` moves the cursor… to the beginning of the buffer.

Applying it to the above code snippet, we end up with:

int fonction (int random_spacing)
{
  int niaiseuses;
  for (int i=0;i<random_spacing; i++)
    {
      {
        {
          std::cout << bleh << std::endl;
        }}
    }
}

There are still a number of issues. For example, `i++` has still an extraneous space before it, and we still have two closing braces on the same line. Maybe we should fix that sometime.

Filed under: emacs, hacks Tagged: braces, elisp, n!, newline, whitespace, \r, \r\n

But how do we generate such sequences? Well, there are many ways to do so. Some more amusing than others, some more structured than others. One of the early examples, Halton sequences (c. 1964), is particularly well behaved: it generates 0, 0.5, then 0.25 and 0.75, then 0.125, 0.375, 0.625, and 0.875, etc. It does so with a rather simple binary trick.

Halton observed that if you count in binary from 0 to infinity, reverse the bits, and make the result fractional, you’ll end up filling the interval [0,1):

n | binary | reversed | fraction |
0 | 0 | 0 | 0 |
1 | 1 | .1 | 0.5 |
2 | 10 | .01 | 0.25 |
3 | 11 | .11 | 0.75 |
4 | 100 | .001 | 0.125 |
5 | 101 | .101 | 0.625 |
6 | 110 | .011 | 0.375 |
7 | 111 | .111 | 0.875 |

Or, if we look at iterations graphically (up to 7 bits):

*

* *

The first element is then the bit-reversing function. It is a bit different from the usual version, because bit 0 doesn’t end up in bit 31, but rather in a position that depends on the magnitude of the number. Well, fortunately, there’s an easy way to compute this (albeit not that efficiently):

template <typename T>
T p_halton_reverse(T x)
{
    T t=0;
    while (x)
    {
        t<<=1;
        t|=(x&1);
        x>>=1;
    }
    return t;
}

The next part is to compute the next power of two. There are quite a few ways to do so, but in fact, we can maintain it incrementally as we generate the counting sequence. The limit is initially 1. When the counter reaches it, we double it. This entails a check every iteration, but it’s likely less expensive (and, anyway, a lot simpler) than recomputing the next power of two from scratch every time.

So the method is then:

- Next power of two is initially 1, counter is 0.
- Increment counter, adjust the next power of two.
- Output counter scaled by the next power of two.
- Goto 2, unless tired.
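The steps above can be sketched as a standalone loop (function names are mine; the full iterator-based version appears later in the post):

```cpp
#include <vector>

// Bit-reverse x within its own magnitude: bit 0 ends up in the top used bit.
unsigned halton_reverse(unsigned x)
{
    unsigned t = 0;
    while (x) { t <<= 1; t |= (x & 1); x >>= 1; }
    return t;
}

// Generate the first n base-2 Halton values, maintaining the
// next power of two incrementally instead of recomputing it.
std::vector<double> halton(unsigned n)
{
    std::vector<double> out;
    unsigned next = 1;                 // next power of two, initially 1
    for (unsigned i = 0; i < n; i++) {
        if (i == next) next <<= 1;     // counter reached the limit: double it
        out.push_back(halton_reverse(i) / (double)next);
    }
    return out;
}
```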

*

* *

Now, to make it fun and practical, it should be implemented as a range, or iterator, so that we can write something like:

for (auto t : halton_sequence(128)) std::cout << t << std::endl;

Fortunately, C++11-style range-based loops don’t require much from an object: `begin()` and `end()` functions that return an iterator object that knows how to do `!=`, `operator*()` (indirection), and `operator++`.

The “host object” must know how to do `begin()` and `end()`. It must contain the two functions:

const_iterator begin() const { return const_iterator(0,max_val); }
const_iterator end() const { return const_iterator(max_val,max_val); }

Where `const_iterator(start_value,end_value)` is the iterator object itself. It doesn’t do much: it takes a `start_value` from which it will start to count, and an `end_value` that limits the count. In my implementation, the iterator is equivalent to `for (T i=start_value; i<end_value;i++)`. It may not be as general as possible, but having `<` rather than `<=` deals with a number of problems such as overflow and repeated values.

The iterator itself is very simple, as it contains only three important functions (or, more precisely, operators):

bool operator!=(const const_iterator & other) { return current_val!=other.current_val; }

F operator*() const { return p_halton_reverse(current_val)/(F)next; }

const const_iterator & operator++() { if (++current_val==next) next<<=1; return *this; }

The `operator!=` is needed so that the loop can compare to `end()`. The indirection operator is what will give you the value through the iterator. Finally, `operator++` advances the iterator to the next value.

*

* *

To make things fun, or at least practical, we could make this template code. The template for the host object should have two arguments, one for the type of integer used for the counter, and one for the float type used for the generated sequence. We could, as the STL often does, hide a couple of special cases behind typedef/using aliases. In any case, the class shouldn’t be hard to use:

for (auto t : halton_sequence<int,double>(16)) std::cout << t << std::endl;

*

* *

The complete PoC:

#include <iostream>
#include <bitset>
#include <cstdint>
#include <limits>

template <typename T=int, typename F=float>
class halton_sequence
{
protected:
    const T max_val;

    static T p_halton_reverse(T x)
    {
        T t=0;
        while (x) { t<<=1; t|=(x&1); x>>=1; }
        return t;
    }

    static T p_next(T x)
    {
        T i=1;
        while (i<x) i<<=1;
        return i;
    }

public:
    class const_iterator
    {
    protected:
        const T max_val;
        T current_val, next;

    public:
        bool operator!=(const const_iterator & other) { return current_val!=other.current_val; }
        F operator*() const { return p_halton_reverse(current_val)/(F)next; }
        const const_iterator & operator++() { if (++current_val==next) next<<=1; return *this; }

        const_iterator(const T & s, const T & m) : max_val(m), current_val(s), next(p_next(s)) {}
        const_iterator(const const_iterator & other) : max_val(other.max_val), current_val(other.current_val), next(other.next) {}
        ~const_iterator()=default;
    };

    const_iterator begin() const { return const_iterator(0,max_val); }
    const_iterator end() const { return const_iterator(max_val,max_val); }

    halton_sequence(const T & max_=std::numeric_limits<T>::max()) : max_val(max_) {}
    ~halton_sequence()=default;
};

int main()
{
    for (auto t : halton_sequence<int,double>(16))
        std::cout << t << std::endl;
    return 0;
}

Filed under: algorithms, bit twiddling, Mathematics Tagged: Halton, Integration, low-discrepancy, Monte Carlo, Monte Carlo Integration, quasi-random


Well, it depends.

Make uses “rules” to build object files and executables. A typical rule is of the form `target: dependencies`, or, for example:

plugins/filter.o: plugins/filter.cpp includes/wstream.hpp plugins/filter.hpp
includes/wstream.hpp:
plugins/filter.hpp:

These rules state that the file `plugins/filter.o` depends on `plugins/filter.cpp` and the (presumably) headers `includes/wstream.hpp` and `plugins/filter.hpp`. They also say that the headers do not depend on further files.

So, how do we get the rules out of the source code? There are many short answers.

- If you’re using gcc/g++, you can use its capabilities to output the dependencies and create a number of make-compatible dependency files. The procedure is described here.
- You can use a tool like makedepend that, however, doesn’t seem that standard. It will generate make-compatible files.

Or you can regex the header files out of your source files. This, however, isn’t as straightforward as it sounds. To build the correct rules for make, you have to not only detect the included headers but also find where the corresponding files are. This requires you to know the include path of your program, so that if there’s an include somewhere in your source code, it expands to a path explicit enough to feed to make.

If we already have a list of the .cpp files (let’s say you get it from the Makefile), you just need to grep for lines that have includes. Since you can’t guarantee you’ll have the exact standard syntax (you’ll always have some …creative type that will write `# include <thingie.hpp> // trololol`), the grep regular expression must be somewhat flexible. Once you’ve got one of these lines, you extract what you find between < and >. We join everything into a single list by replacing newlines with spaces, and by removing trailing spaces. That would look something like:

# grep: finds lines that begin by #include
# sed: extracts between < >
# tr: replaces \n by space
# sed: removes trailing spaces
headers=$( grep -e '^\ *#\ *include' $f \
           | sed 's/.*<\(.*\)>.*/\1/' \
           | tr '\n' ' ' \
           | sed -e 's/[ ]*$//' \
         )

Once we have filenames, we must “qualify” them so that `<filter.hpp>` is expanded to `plugins/filter.hpp` in the rule. At first I thought of using `find` (another command I love to hate), but it doesn’t play well with filenames such as `plugins/filter.hpp`, because the `-name` argument must be a basename without dirname. Further, if you have an include path such as `-I. -I/some/where -I./yet/somewhere/else`, you must probe each of these to see where `plugins/filter.hpp` is. This gives something like:

function exists
{
    local where=$1
    local dir=$2
    local name=$3
    for w in ${where[@]}
    do
        local f=$w/$dir/$name
        if [ -f "$f" ]
        then
            echo -n $(joli $f)' '
        fi
    done
}

where `where` is the list of paths from the include path, `dir` is “`plugins`” for the file `plugins/filter.cpp` but is empty if the file isn’t specified with a path, and `name` the basename of the header file. `joli` is a function that replaces runs of multiple slashes by a single slash (something that happens when you concat dirs and filenames).

Lastly follows some pretty-printing that complies to make rule syntax.

*

* *

The changes in the makefile itself are not that important. Indeed:

OBJS=$(SOURCES:.cpp=.o)

-include .depend

depend:
	@./make-depend.sh "$(INCLUDES)" "$(SOURCES)" > .depend

$(NAME): $(OBJS)
	g++ $(OBJS) $(LDFLAGS) -o $(NAME) $(LIBS)

the `$(OBJS)` variable creates the default rules for the object files, which are generated from `$(SOURCES)`, a variable that should contain the list of your source files (the .cpp). The variable `$(INCLUDES)` is the usual list of include directories, of the form `-I. -I./includes/`, etc.

Invoking `make depend` will call the script and create a `.depend` file containing the rules. The `-include .depend` imports the rules into the makefile.

*

* *

There’s a few things that still need work. Some are out of my control. For example, make thinks that an extensionless file (one that doesn’t end in .something) is a target, so that if you have a .cpp file including an extensionless file, it will try to figure out a rule to build it as an executable. Other are things that could be tweaked somehow, like computing the very minimal rules, that seem incomplete, but still build the executable correctly. Not sure it’s very useful, though.

*

* *

Click for the script’s full code

#!/usr/bin/env bash

function joli
{
    # sed: the first symbol after s is the separator
    # replaces multiple consecutive ///// by /
    # then removes initial ./ at the beginning
    echo $1 | sed -e 's,/[/]*,/,g' | sed -e 's,^\./,,'
}

function exists
{
    local where=$1
    local dir=$2
    local name=$3
    for w in ${where[@]}
    do
        local f=$w/$dir/$name
        if [ -f "$f" ]
        then
            echo -n $(joli $f)' '
        fi
    done
}

################################
includes=$(echo $1 | sed s/-I//g)
files=$2

all_headers=( )
for f in ${files[@]}
do
    # grep: finds lines that begin by #include
    # sed: extracts between < >
    # tr: replaces \n by space
    # sed: removes trailing spaces
    echo -n ${f%.*}.o": "$f" "
    headers=$( grep -e '^\ *#\ *include' $f \
               | sed 's/.*<\(.*\)>.*/\1/' \
               | tr '\n' ' ' \
               | sed -e 's/[ ]*$//' \
             )
    z=$( for h in ${headers[@]}
         do
             d=$(dirname $h | sed -e 's/^\.//' )
             b=$(basename $h)
             exists "${includes[@]}" "$d" "$b"
         done )
    echo $z # | sed -e 's/\ /\ \\\n/g'
    echo
    all_headers+=( $z )
done

headers=$(echo ${all_headers[@]} | tr ' ' '\n' | sort -u)
for h in ${headers[@]}
do
    echo $h":"
    echo
done
exit 0

Filed under: Bash (Shell), hacks Tagged: bash, depend, gnu make, grep, make, Makedepend, Makefile, sed, tr

OK, first, let’s model the trapezoidal filter. Let’s suppose, as a simplifying assumption, that the track is entirely under the head (that is, the head is longer than the track is wide, and is never so tilted that it’s entirely in the track). Or, in other words, we have this situation:

We suppose, initially, that we don’t really have control over the head’s azimuth. It may be mostly correct, but it may be wildly off. If it’s wildly off, the resulting projection of the head (in gray) gives the trapezoidal filter (orange):

Of course, we *want* the slant to be exactly zero, the head being perfectly vertical |. We see that if we vary the angle continuously, the window also changes quite gradually:

In the figure, the quarter circle indicates that the width of the head remains constant (something that might not be that clear in the diagram because 1) it’s hard in general to compare rectangle widths when the rectangles have different orientations and 2) *Mathematica* makes it difficult to maintain a correct aspect ratio in figures). But as you see, the center of the circle is the lower-right corner of the head against the track, and the left side “rolls” on the circle in the right place, at exactly a head’s width from the other corner. So we’re sure the model computes the right projection.

Let’s suppose the azimuth is off by 5°, which is, methinks, rather large. Let’s see what happens. First, the filters looks like this:

In frequency domain, we have:

In green, the box filter, corresponding to a perfectly aligned head, in red, the head tilted 5°. Both are shifted relative to one another so that we can see something. In frequency space, both peaks align. What are we seeing? Superficially, we might think they look the same, but with a closer look, we see that even if the three main packs are of about the same amplitude, the red squiggle goes to zero much faster as we get further from the main peak. That means it has a lower high-frequency response than the green one. So, people claiming azimuth is important are right. It is. But how much is it, then?

Let’s have a look at a slant of 20°. That’s probably not as bad as it could get, but at this point I’d think your cassette deck is pretty much scrap.

The frequency analysis of the filter makes it even clearer: the red squiggle is *much* more compact than the green one, and therefore it responds even less to high frequencies.

So that settles the question about azimuth.

*

* *

But, how much should we worry about it? Well, it depends on how much that thing gets out of whack when you use the deck. If it’s a few degrees, say, 2 or less, that’s basically within the noise of the device. At 1° we get basically indistinguishable frequency responses. Lo!:

I can’t say right now that it doesn’t matter at all. First, we should compare the differences in dB knowing that a typical noise floor for those things is about -60 to -72dB (or in the range of 10 to 12 bits). If the difference in alignment causes a frequency shift of less than -60dB, then it’s harmless. Second, the width of the gap plays a lot in this. As its width goes to zero, *any* angle will cause a trapezoid window. The tolerable angle is therefore function of the gap width and the gap length. Will need more experimenting.

Filed under: signal processing, Uncategorized Tagged: azimuth, box filter, Compact cassettes, dB, Fourier transform, Frequency, Frequency Domain, tilt, Time Domain

First, let’s understand how the signal is recorded on the compact cassette. The device uses a pulse-amplitude modulation encoding, or, simply put, an encoding scheme where the value of the signal is directly encoded by the intensity of the magnetism on the tape. Strong positive magnetism yields a large positive value, strong negative magnetism yields a “large” negative value, and the waveform is encoded using variation in strength of the magnetisation on tape. In the figure, red is negative, green positive, and saturation represents intensity.

To give some frequency resolution, the recording/reading head is narrow, but spans the whole width of the track (on a single-track tape, that would be the width of the tape, but compact cassettes are recorded as four distinct tracks, two for each side). The device reads/writes the magnetism under the head gap. The tape advances continuously, rather than by discrete increments, acting as an interpolation between samples and preventing the introduction of high-frequency noise components.

The precision, in terms of frequencies, of the recording is limited by two factors: the width of the gap and the speed of the tape under the head. We have:

f_max = v/(2g), where v is the speed of the tape and g the width of the gap.

In tape cassettes, the constants align so that f_max is about 15 kHz (or maybe 16 kHz). That’s if all goes well.
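As a sanity check of the formula (the function name is mine; the standard compact-cassette speed is 4.76 cm/s, and a gap width of 1.5 µm is an assumed, plausible value, as real heads vary):

```cpp
// Upper bound on the recoverable frequency: f_max = v / (2g),
// where v is the tape speed (m/s) and g the head gap width (m).
double tape_f_max(double v, double g)
{
    return v / (2.0 * g);
}
```

With v = 0.0476 m/s and g = 1.5 µm (assumed), this gives roughly 15.9 kHz, consistent with the 15-16 kHz figure above.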

Since these tape drives contain a lot of moving parts, especially in auto-reverse models, it may happen that the head moves. There are four types of displacement (in no particular order):

- Height. The track should be centered on its reading head. If the head is too low, it may even overlap the track below, leading to crosstalk (you pick up, in one channel, sound from another channel) in addition to a weakened signal (if you read only 80% of the signal, it will be 20% weaker).
- Wrap. The head gap should be at 90° with the tape. Not sure exactly what the effects would be; probably a weakened signal.
- Zenith. The head must touch the tape evenly. If some parts of the head are further away from the tape, then the signal is weakened. Possibly induces wear on the tape too.
- Azimuth. The head gap should be perpendicular to the direction of the tape. Effects can include phase shift between channels (as the head is slanted, one track is read earlier (or later) than the other) and high frequency loss.

Plus any random effects like varying tape speed, noise from the environment, crud on the head, etc. This video explains it very well.

Let’s concentrate on azimuth for now.

In the best case, tape flows perpendicular to the head gap:

The head is either completely within a recording cell or across at most two cells. The value read is simply a weighted average of the two, with proportions determined by the position of the head. This corresponds to a rectangular window whose width is proportional to the width of the head gap. The box filter in the time domain yields a sinc in the frequency domain. Applying a convolution with the time-domain box filter is like multiplying the frequencies by the sinc function.

If the head is slanted, then it looks something like this:

If the head is really slanted, then the read head may overlap more than one cell, and the corresponding filter isn’t a box anymore, it’s a trapezoid. If the height is correct, then it’s an isosceles trapezoid. Something like this:

Which is starting to look like a smoothing filter, something that is really bad news for the frequency response of the device.

How much tilt do we need to notice the effect in the sound? What are the spectral properties of the trapezoid filter?

*To be continued next week*

Filed under: signal processing Tagged: box filter, Compact cassettes, filter, magnetic tape, mexican hat, oldies, sinc, trapezoid

So, why not test it?

A quick implementation of the thing would be something like

template <typename T>
T ror(const T & a, size_t n)
{
    return (a>>n) | (a<<(std::numeric_limits<T>::digits-n));
}

template <typename T>
class cheap_prng
{
private:
    T count, seed;

public:
    T operator()()
    {
        seed^=count++;
        seed=ror(seed,1);
        seed+=count;
        return seed;
    }

    cheap_prng(const T & s) : count(0),seed(s) {}
    ~cheap_prng()=default;
};

The supposedly interesting part is that the random-generating function is especially inexpensive. The assembly for the function itself, assuming a 32-bit type, excluding loads and write-backs, boils down to:

;mov eax,<seed>
;mov edx,<count>
xor eax,edx
inc edx
ror eax,1
add eax,edx
;mov <count>,edx
;mov <seed>,eax
ret

How does it fare? Using `T=unsigned short` and 10000 times 65536 draws, we get:

Which is kind of uniform. It would probably pass the Chi Square test. Even if you draw it, say, modulo 7, it looks good enough. For `T=unsigned char`, modulo 7 yields:

0 370108
1 368697
2 369940
3 370286
4 360392
5 361169
6 359408
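To quantify “kind of uniform”, one could compute Pearson’s chi-squared statistic over bucket counts like those above, and compare it against the critical value for (buckets − 1) degrees of freedom (a sketch; the function name is mine):

```cpp
#include <vector>
#include <cstddef>

// Pearson's chi-squared statistic for equiprobable buckets:
// sum over buckets of (observed - expected)^2 / expected.
double chi_squared(const std::vector<std::size_t> & counts)
{
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    double expected = (double)total / counts.size();

    double chi2 = 0;
    for (std::size_t c : counts) {
        double d = (double)c - expected;
        chi2 += d * d / expected;
    }
    return chi2;
}
```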

*

* *

It’s clearly not that strong, however, but to generate (noisy) textures, for example, it seems to be good enough:

Filed under: algorithms, Mathematics, programming Tagged: notebook, PRNG, pseudorandom, simple

The first derivation I gave then was focused on the noise, where the noise maximal amplitude was proportional to the amplitude represented by the last bit of the (encoded) signal. Let’s now derive it from the most significant bit of the signal to its least significant.

Let us suppose that the first bit represents half the maximum signal amplitude A, that is, A/2. Then the second one should represent a quarter, A/4, and the third an eighth, A/8, etc. For b bits, then, the maximum amplitude is

s = A/2 + A/4 + ⋯ + A/2^b = A(1 − 2^(-b)),

Or, alternatively,

s = A − A·2^(-b),

which means that the rest of the amplitude—noise—is

n = A·2^(-b).

Recall, the signal-to-noise ratio, the SNR, is given by the ratio of the powers of the signal and the noise, that is:

SNR = s²/n².

Substituting s and n for their expressions parameterized by A, we get that

SNR = (A(1 − 2^(-b)))² / (A·2^(-b))²,

where the A’s cancel out. We can rewrite the ratio as

SNR = ((1 − 2^(-b)) / 2^(-b))² = (2^b − 1)².

To get decibels out of the expression of the power, we write

P_dB = 10 log10 P,

or, in our case,

SNR_dB = 10 log10 (2^b − 1)² = 20 log10 (2^b − 1).

Let us see what this expression yields.

20 log10 (2^b − 1) = 20 log10 (2^b(1 − 2^(-b))) = 20·b·log10 2 + 20 log10 (1 − 2^(-b)).

Since 20 log10 2 ≈ 6.02, and 20 log10 (1 − 2^(-b)) goes rapidly to zero, we can write

SNR_dB ≈ 6.02·b.

*

* *

So how fast does the last part, 20 log10 (1 − 2^(-b)), go to zero? Intuitively, we should agree that as b grows, 1 − 2^(-b) gets closer and closer to one, and that, therefore, its logarithm goes to zero. But how fast? A quick plot reveals the following picture:

which indeed goes rapidly to zero. When we combine the two results, we see that the full expression and the approximation quite agree as soon as b is somewhat large:
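The agreement is easy to check numerically (a small sketch of the two expressions above; function names are mine):

```cpp
#include <cmath>

// Exact SNR in dB for b bits: 20 log10(2^b - 1).
double snr_exact(int b)
{
    return 20.0 * std::log10(std::pow(2.0, b) - 1.0);
}

// The approximation 6.02 b, obtained by dropping the vanishing term.
double snr_approx(int b)
{
    return 20.0 * std::log10(2.0) * b;
}
```

Already at b = 8 the two differ by well under 0.1 dB; at b = 16, both give about 96.3 dB.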

*

* *

So when a device tells you it has a SNR of 60 dB (like a 1980s cassette deck), what it really tells you is that it gives you about 10 bits of resolution. CDs, with 16 bits, give you approximately 96 dB; 24 bits bring you into the realm of 144 dB. Well, good luck finding a device that will indeed guarantee that kind of SNR.

Filed under: data compression, Mathematics Tagged: CD, dB, decibels, log, noise, PSNR, Signal-to-Noise, Signal-to-Noise Ratio, SNR

But the Burrows-Wheeler transform isn’t the only possible one. There are a few other techniques to generate (reversible) permutations of the input.

If we don’t care really about CPU time, we *could* try each and every one the possible and distinct permutations^{1} and pick the shortest-encoded one. Storing the permutation index (the number of the permutation) will require bits. Although this guarantees that we find the best possible ordering, it is computationally unfeasible, except for very small block lengths.

So we must find something else, something much simpler. Something that shuffles the contents in a parametric (and reversible) way, something that displays each symbol in a block exactly once. One possible way to do that is to use a probe, like we have in hash tables, with linear or quadratic probing. Quadratic probing is cumbersome because you need all kinds of conditions on the table size, and linear probing is too simple.

But if we extend linear probing from a simple +1 step to something like

p_{i+1} = (p_i + k) mod m,

where k is relatively prime to the table size m, we are sure that all positions in the table will be visited exactly once as i varies from 0 to m − 1.
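That full-coverage property is easy to check empirically (a sketch with my own naming):

```cpp
#include <vector>
#include <cstddef>

// Visit positions 0, k, 2k, ... (mod m). Returns true iff all m positions
// are hit exactly once, which holds precisely when gcd(k, m) == 1.
bool visits_all(std::size_t m, std::size_t k)
{
    std::vector<bool> seen(m, false);
    std::size_t p = 0;
    for (std::size_t i = 0; i < m; i++, p = (p + k) % m) {
        if (seen[p]) return false;  // revisited: k and m share a factor
        seen[p] = true;
    }
    return true;
}
```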

For our reordering, we do not know beforehand which step will be best. Fortunately for us, it only needs to be relatively prime to m to be acceptable. Then, we merely try them one by one, which is expensive, but not all that much, since there will be fewer than m candidates. To pick the best one, we merely need a proxy for the compression ratio, say, a function that counts repetitions or something like that—you would need something more complex if your next step of encoding is more sophisticated.

The search loop would therefore be something like:

////////////////////////////////////////
solution select(const std::string & src)
{
    const size_t l=src.size();
    size_t best_step=0,best_score=0;
    std::string best_remix;

    for (size_t step=1; step<l; step++)
        if (gcd(step,l)==1)
        {
            std::string this_remix=remix(src,step);
            size_t this_score=score(this_remix);
            if (this_score>best_score)
            {
                best_step=step;
                best_score=this_score;
                best_remix=this_remix;
            }
        }

    // should output best step also
    return {best_remix,{best_step,best_score}};
}

where `gcd` computes the greatest common divisor, `remix` shuffles the buffer (here a string for display purposes) and `score` computes a proxy to the compression ratio.

*

* *

So let’s try this with actual text:

// Shakespeare: Julius Caesar, Cassius, Act 1, scene 2
const std::string cassius="Men at some time are masters of their fates: The fault, dear Brutus, is not in our stars, But in ourselves, that we are underlings.";

// Shakespeare: Twelfth Night, Malvolio, Act 2, scene 5
const std::string malvolio="Some are born great, some achieve greatness, and some have greatness thrust upon 'em.";

// https://en.wikiquote.org/wiki/Spinosaurus
const std::string horner="If we base the ferocious factor on the length of the animal, there was nothing that ever lived on this planet that could match this creature.";

The score function,

size_t score(const std::string & src)
{
    size_t score=0;
    char last=0;
    for (char c: src)
    {
        score+=(last==c);
        last=c;
    }
    return score;
}

merely counts repetitions. The number of repetitions in the original texts is, in order, 0, 2, and 0. So, not that many useful repetitions. Now, after “optimization”, the program finds

step = 18
score= 22 0.167939
Mrr,nBtii f,avnor ruua mae ts els,,r oeuse ts eee ouhnmta redmsTurrraataii .ait ltf stluse:Boo n fdttagehuissee ht setsernnw

Men at some time are masters of their fates: The fault, dear Brutus, is not in our stars, But in ourselves, that we are underlings.

step = 36
score= 12 0.141176
Seumgone t ve,nesa euctgooaserrdmghse vsthasphnrmmtt en .rro ba'e ,arsoieeeen aa s

Some are born great, some achieve greatness, and some have greatness thrust upon 'em.

step = 14
score= 18 0.12766
I tgm eta .ecnis hhehaenatntcrtflawao tu h tateseeetdemasuhhr en eaottegvadrbi hnillc cnftilpu eooo h oswr ,trsci erhloei hffotanvhtt

If we base the ferocious factor on the length of the animal, there was nothing that ever lived on this planet that could match this creature.

It found a reordering with 22 repetitions for the first, 12 for the second, and 18 for the last. While that doesn’t seem like much, it might still give the next compression stage a chance of getting a few % more compression. Who knows.

*

* *

It is unclear how much extra compression we can get from such a scheme. On the plus side, “decompression” (as shown in the `demix` function in the full source below) is extremely simple, and the coding overhead of storing the step in the compressed block is proportional to log₂ n, which is not necessarily negligible but isn’t excessive either. Maybe there’s something exploitable in the distribution of the steps, further than being drawn from the numbers relatively prime to n? Questions for later.

*

* *

The full sourcecode. Click to decollapsulate.

#include <string>
#include <vector>
#include <iostream>

// Shakespeare: Julius Caesar, Cassius, Act 1, scene 2
const std::string cassius="Men at some time are masters of their fates: The fault, dear Brutus, is not in our stars, But in ourselves, that we are underlings.";

// Shakespeare: Twelfth Night, Malvolio, Act 2, scene 5
const std::string malvolio="Some are born great, some achieve greatness, and some have greatness thrust upon 'em.";

// https://en.wikiquote.org/wiki/Spinosaurus
const std::string horner="If we base the ferocious factor on the length of the animal, there was nothing that ever lived on this planet that could match this creature.";

typedef std::pair<size_t,size_t> step_score;
typedef std::pair<std::string, step_score> solution;

////////////////////////////////////////
size_t gcd(size_t a, size_t b)
{
    while (b)
    {
        unsigned t=b;
        b=a % b;
        a=t;
    }
    return a;
}

////////////////////////////////////////
size_t score(const std::string & src)
{
    size_t score=0;
    char last=0;
    for (char c: src)
    {
        score+=(last==c);
        last=c;
    }
    return score;
}

////////////////////////////////////////
std::string remix(const std::string & src, size_t step)
{
    const size_t l=src.size();
    std::string temp(l,0); // reserve space
    for (size_t s=0,d=0;d<l;d++,s=(s+step)%l)
        temp[d]=src[s];
    return temp;
}

////////////////////////////////////////
std::string demix(const std::string & src, size_t step)
{
    const size_t l=src.size();
    std::string temp(l,0); // reserve space
    for (size_t s=0,d=0;s<l;s++,d=(d+step)%l)
        temp[d]=src[s];
    return temp;
}

////////////////////////////////////////
solution select(const std::string & src)
{
    const size_t l=src.size();
    size_t best_step=0,best_score=0;
    std::string best_remix;

    for (size_t step=1; step<l; step++)
        if (gcd(step,l)==1)
        {
            std::string this_remix=remix(src,step);
            size_t this_score=score(this_remix);
            if (this_score>best_score)
            {
                best_step=step;
                best_score=this_score;
                best_remix=this_remix;
            }
        }

    // should output best step also
    return {best_remix,{best_step,best_score}};
}

////////////////////////////////////////
int main()
{
    std::vector<solution> solutions
    {
        select(cassius),
        select(malvolio),
        select(horner)
    };

    for (const std::string & s: std::vector<std::string>{cassius,malvolio,horner})
        std::cout << "original score=" << score(s) << std::endl;

    for (const auto & s: solutions)
        std::cout << "step = " << s.second.first << std::endl
                  << "score= " << s.second.second << " " << s.second.second/(float)s.first.size() << std::endl
                  << s.first << std::endl << std::endl
                  << demix(s.first,s.second.first) << std::endl << std::endl << std::endl;

    return 0;
}

^{1}In fact, the number of distinct permutations is

n! / (n_1! n_2! ⋯ n_k!), where n_i is the number of occurrences of the i-th distinct symbol,

as given by the multinomial coefficients.

Filed under: algorithms, C-plus-plus, data compression, hacks Tagged: Burrows, Burrows Wheeler Transform, compression, hash table, Linear probing, Quadratic probing, Transform, Wheeler