## New Block Order

The idea of reorganizing data before compression isn’t new. Almost twenty five years ago, Burrows and Wheeler proposed a block sorting transform to reorder data to help compression. The idea here is to try to create repetitions that can be exploited by a second compression engine.

But the Burrows-Wheeler transform isn’t the only possible one. There are a few other techniques to generate (reversible) permutations of the input.

If we don’t care really about CPU time, we could try each and every one the $O(n!)$ possible and distinct permutations1 and pick the shortest-encoded one. Storing the permutation index (the number of the permutation) will require $O(n \log_2 n)$ bits. Although this guarantees that we find the best possible ordering, it is computationally unfeasible, except for very small block lengths.

So we must find something else, something much simpler. Something that shuffles the contents in a parametric (and reversible) way, something that displays each symbol in a block exactly once. One possible way to do that is to use a probe, like we have in hash tables, with linear or quadratic probing. Quadratic probing is cumbersome because you need all kinds of conditions on the table size, and linear probing is too simple.

But if we extend linear probing from a simple +1 step so something like

$h(k,t)= k+ t c (\mod n)$,

where $c$ is relatively prime to the table size $n$, we are sure that all positions in the table will be visited exactly once when varying $t$ from 0 to $n-1$.

For our reordering, we do not know beforehand which step $c$ will be the best. Fortunately for us, we only need it to be relatively prime to $n$ to be acceptable. Then, we merely try them one by one, something expensive, but not all that much since there will be less than $n$ candidates. To pick the best one, we merely need a proxy to the compression ratio, say, a function that counts repetitions or something like that—you would need something more complex if your next step of encoding is more sophisticated.

The search loop would therefore be something like:

////////////////////////////////////////
solution select(const std::string & src)
{
const size_t l=src.size();
size_t best_step=0,best_score=0;
std::string best_remix;

for (size_t step=1; step<l; step++)
if (gcd(step,l)==1)
{
std::string this_remix=remix(src,step);

size_t this_score=score(this_remix);
if (this_score>best_score)
{
best_step=step;
best_score=this_score;
best_remix=this_remix;
}
}

// should output best step also
return {best_remix,{best_step,best_score}};
}


where gcd computes the greatest common divisor, remix shuffles the buffer (here a string for display purposes) and score computes a proxy to the compression ratio.

*
* *

So let’s try this with actual text:

// Shakespeare: Julius Ceasar, Cassisus, Act 1, scene 2
const std::string cassius="Men at some time are masters of their fates: The fault, dear Brutus, is not in our stars, But in ourselves, that we are underlings.";

// Shakespare: Twelfth Night, Malvolio, Act 2, scene 5
const std::string malvolio="Some are born great, some achieve greatness, and some have greatness thrust upon 'em.";

// https://en.wikiquote.org/wiki/Spinosaurus
const std::string horner="If we base the ferocious factor on the length of the animal, there was nothing that ever lived on this planet that could match this creature.";


The score function,

size_t score(const std::string & src)
{
size_t score=0;
char last=0;
for (char c: src)
{
score+=(last==c);
last=c;
}

return score;
}


merely count repetitions. The number of repetitions in the original texts is, in order, 0, 2, and 0. So, not that many useful repetitions. Now, after “optimization” the program finds

step = 18
score= 22 0.167939
Mrr,nBtii f,avnor ruua mae  ts els,,r oeuse  ts   eee  ouhnmta redmsTurrraataii .ait   ltf stluse:Boo n fdttagehuissee ht setsernnw

Men at some time are masters of their fates: The fault, dear Brutus, is not in our stars, But in ourselves, that we are underlings.

step = 36
score= 12 0.141176
Seumgone t ve,nesa  euctgooaserrdmghse   vsthasphnrmmtt en .rro ba'e ,arsoieeeen aa s

Some are born great, some achieve greatness, and some have greatness thrust upon 'em.

step = 14
score= 18 0.12766
I tgm eta .ecnis  hhehaenatntcrtflawao tu     h tateseeetdemasuhhr en eaottegvadrbi  hnillc cnftilpu eooo h  oswr  ,trsci erhloei hffotanvhtt

If we base the ferocious factor on the length of the animal, there was nothing that ever lived on this planet that could match this creature.


It found a reordering with 22 repetitions for the first, 12 for the second and 18 for the last. While that doesn’t seem like much, that might still give the next stage compression a chance of getting a few % more compression. Who knows.

*
* *

It is unclear how much extra compression we can get from such as scheme. On the plus side, “decompression” (as shown in the demix function in the full source below) is extremely simple and the coding overhead of storing the step in the compressed block is proportional to $\log_2 n$, which is not necessarily negligible but isn’t excessive either. Maybe there’s something exploitable in the distribution of the steps, further than being drawn from relatively prime numbers to $n$? Questions for later.

*
* *

The full sourcecode. Click to decollapsulate.

#include <string>
#include <vector>
#include <iostream>

// Shakespeare: Julius Ceasar, Cassisus, Act 1, scene 2
const std::string cassius="Men at some time are masters of their fates: The fault, dear Brutus, is not in our stars, But in ourselves, that we are underlings.";

// Shakespare: Twelfth Night, Malvolio, Act 2, scene 5
const std::string malvolio="Some are born great, some achieve greatness, and some have greatness thrust upon 'em.";

// https://en.wikiquote.org/wiki/Spinosaurus
const std::string horner="If we base the ferocious factor on the length of the animal, there was nothing that ever lived on this planet that could match this creature.";

typedef std::pair<size_t,size_t> step_score;
typedef std::pair<std::string, step_score> solution;

////////////////////////////////////////
size_t gcd(size_t a, size_t b)
{
while (b)
{
unsigned t=b;
b=a % b;
a=t;
}
return a;
}

////////////////////////////////////////
size_t score(const std::string & src)
{
size_t score=0;
char last=0;
for (char c: src)
{
score+=(last==c);
last=c;
}

return score;
}

////////////////////////////////////////
std::string remix(const std::string & src, size_t step)
{
const size_t l=src.size();
std::string temp(l,0); // reserve space

for (size_t s=0,d=0;d<l;d++,s=(s+step)%l)
temp[d]=src[s];

return temp;
}

////////////////////////////////////////
std::string demix(const std::string & src, size_t step)
{
const size_t l=src.size();
std::string temp(l,0); // reserve space

for (size_t s=0,d=0;s<l;s++,d=(d+step)%l)
temp[d]=src[s];

return temp;
}

////////////////////////////////////////
solution select(const std::string & src)
{
const size_t l=src.size();
size_t best_step=0,best_score=0;
std::string best_remix;

for (size_t step=1; step<l; step++)
if (gcd(step,l)==1)
{
std::string this_remix=remix(src,step);

size_t this_score=score(this_remix);
if (this_score>best_score)
{
best_step=step;
best_score=this_score;
best_remix=this_remix;
}
}

// should output best step also
return {best_remix,{best_step,best_score}};
}

////////////////////////////////////////
int main()
{
std::vector<solution>
solutions
{
select(cassius),
select(malvolio),
select(horner)
};

for (std::string & s: std::vector<std::string>{cassius,malvolio,horner})
std::cout
<< "original score=" << score(s) << std::endl
;

for (const auto & s: solutions)
std::cout
<< "step = " << s.second.first << std::endl
<< "score= "
<< s.second.second
<< " " << s.second.second/(float)s.first.size()
<< std::endl
<< s.first << std::endl
<< std::endl
<< demix(s.first,s.second.first) << std::endl
<< std::endl
<< std::endl
;

return 0;
}


1In fact, the number of distinct permutations is

$\displaystyle \binom{n}{n_1 n_2 \ldots n_k}=\frac{n!}{n_1! n_2! \ldots n_k!}$,

as given by the multinomial coefficients.