Ad Hoc Compression Methods: Digram Coding

Ad hoc compression algorithms always sound somewhat flimsy, but they still have their usefulness whenever the data you have to compress exhibits very special properties or is extremely short. Universal compression algorithms need a somewhat lengthy input to adapt sufficiently to yield good compression. Moreover, you may not have the computing power (CPU speed, RAM and program space) to implement full-blown compressors. In this case also, an ad hoc algorithm may be useful.

Digram (bigram, digraph, bigraph) coding is one of those ad hoc algorithms. It compresses text (or any source, really) by finding pairs of symbols that appear often and assigning them the code values left unused by the source, if any.

Let us see in detail what digram compression is, and how effective it really is.

The basic idea of digram coding comes from noticing that text usually uses only a small fraction of the possible character set. French text coded in ISO-8859-1 may use around 110 distinct symbols (for, you see, they have diacritics) while an English text may use only 80 or so. That is, if your characters are coded on 8 bits, more than 140 code values remain unused. Digram coding assigns to each of the unused codes a digram, that is, a pair of symbols.
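As an illustration (the helper below is mine, not part of the codec itself), counting the distinct symbols of a text tells you exactly how many code values are left over for digrams:

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// How many distinct 8-bit symbols does the text actually use?
// Each of the (256 - used) leftover code values can be reassigned
// to a digram.
std::size_t count_used_symbols(const std::string &text)
{
    std::bitset<256> seen;
    for (unsigned char c : text)
        seen.set(c);
    return seen.count();
}
```

Run on a typical English text, it reports around 80 distinct symbols, leaving some 170 codes free for digrams.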

The text to be compressed is scanned one character at a time. If the current character and the next are found in the digram list, the corresponding digram code is emitted, and we advance two characters in the text. If not, the code for the single character is emitted, and we advance one character. If the text were matched entirely by digrams, the algorithm would yield a 2:1 compression ratio. Realistically, however, the expected compression ratio lies between 1.4:1 and 1.9:1; only very rarely would you see 2:1.

However, to make compression more efficient, only the most frequent digrams are assigned codes. To do so, you must first scan the text and compute the digram frequencies. Once this is done, you keep only the most frequent to form the digram dictionary.

Computing the digrams optimally is an NP-hard problem, since you’d have to consider all possible parsings of the text under all possible digram dictionaries. However, it is possible to compute a very good approximation of the optimal dictionary using simply a weighted digram list. For each digram you find in the text (i.e., sliding one character forward at a time), you count the number of occurrences. You keep only the most frequent; if you have, say, 90 distinct symbols, you’d keep the 256-90=166 most frequent digrams.
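The weighted digram list can be sketched as follows; the function name and interface are mine and only approximate what the actual digram_dictionary does:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Slide one character forward at a time, tally every overlapping
// pair, then keep the how_many most frequent digrams (how_many being
// the number of unused code values).
std::vector<uint16_t> top_digrams(const std::string &text, std::size_t how_many)
{
    std::vector<std::size_t> count(65536, 0);
    for (std::size_t i = 0; i + 1 < text.size(); i++)
        count[(uint16_t)(((unsigned char)text[i] << 8)
                         | (unsigned char)text[i + 1])]++;

    std::vector<uint16_t> digrams;
    for (std::size_t d = 0; d < 65536; d++)
        if (count[d])
            digrams.push_back((uint16_t)d);

    // sort by descending frequency, truncate to the free code count
    std::sort(digrams.begin(), digrams.end(),
              [&](uint16_t a, uint16_t b) { return count[a] > count[b]; });
    if (digrams.size() > how_many)
        digrams.resize(how_many);
    return digrams;
}
```

Note that counting overlapping pairs is only an approximation: once a digram is coded, the pair straddling it can no longer occur in the parse, which is exactly why the optimal dictionary is so hard to compute.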

A digram codec is readily implemented, and the coder side is somewhat more intensive (though not by much) than the decoder side. The coder has to scan the text at least once to gather the statistics needed to compute the digrams, then a second time to compress.

As the digram_dictionary class holds all the code to manage the digram dictionary (about 60 lines or so), the coder part is given by:

for (int i=0;i<file_size;i++)
 {
  uint8_t code;
  if (i<file_size-1)
   {
    // pack the current character and the next into a 16-bit digram
    uint16_t bigram=(file_bytes[i]<<8)+file_bytes[i+1];

    if (dictionary.has(bigram))
     {
      code=dictionary.code(bigram);
      i++; // the digram consumed two characters
     }
    else
     code=dictionary.code(file_bytes[i]);
   }
  else
   // last character: no digram possible
   code=dictionary.code(file_bytes[i]);

  out.write((char*)&code,sizeof(uint8_t));
 }

Of course, I also wrote a command-line wrapper so that I could call the
digram codec on arbitrary files. It yields the results:


-rw-r--r--  1 steven steven 32702515 2004-03-30 00:00 11800-8.txt
-rw-r--r--  1 steven steven 22540745 2009-01-17 18:39 11800-8.txt.dig
-rw-r--r--  1 steven steven  7284401 2004-03-30 00:00 11800-8.txt.bz2
-rw-r--r--  1 steven steven  3050208 2004-06-13 18:45 12606.txt
-rw-r--r--  1 steven steven  1695083 2009-01-17 18:38 12606.txt.dig
-rw-r--r--  1 steven steven   780463 2004-06-13 18:45 12606.txt.bz2
-rw-r--r--  1 steven steven    31298 2004-11-03 15:03 13938-8.txt
-rw-r--r--  1 steven steven    18786 2009-01-17 16:50 13938-8.txt.dig
-rw-r--r--  1 steven steven    10965 2004-11-03 15:03 13938-8.txt.bz2
-rw-r--r--  1 steven steven  1722578 2004-11-04 13:22 13952-8.txt
-rw-r--r--  1 steven steven   982666 2009-01-17 17:01 13952-8.txt.dig
-rw-r--r--  1 steven steven   442766 2004-11-04 13:22 13952-8.txt.bz2
-rw-r--r--  1 steven steven    21582 2004-07-16 17:31 7901.txt
-rw-r--r--  1 steven steven    12912 2009-01-17 17:14 7901.txt.dig
-rw-r--r--  1 steven steven     6699 2004-07-16 17:31 7901.txt.bz2

All files are from Project Gutenberg:

  • 11800-8.txt. US Copyright Renewals, 1950-1977. Coded using ISO-8859-1. Compressed 1.45:1
  • 12606.txt. The Great Speeches and Orations of Daniel Webster. Coded in plain US ASCII. Compressed 1.8:1
  • 13938-8.txt. Discours prodigieux et espouventable de trois Espagnols et une Espagnolle. Coded using ISO-8859-1. Compressed 1.67:1
  • 13952-8.txt. Vingt Ans Après. Coded in ISO-8859-1. Compressed 1.75:1
  • 7901.txt. American Historical and Literary Antiquities, part 1. Coded in plain US ASCII. Compressed 1.67:1

Very obviously, digram coding does not fare very well compared to bzip2. In the above results, one sees that digram coding gives a compression ratio somewhere between 1.5:1 and 1.8:1, but never more. Indeed, in rare cases it may get closer to 2:1, but that’s somewhat of an edge case. Bzip2, of course, just beats digram coding senseless. On 11800-8, digram coding’s worst file, bzip2 yields an impressive 4.49:1 compression ratio, roughly three times as much.

However, one mustn’t lose perspective. Bzip2 is a big, sophisticated piece of code, using many layers of algorithms (a Burrows-Wheeler transform, followed by a move-to-front transform, finally followed by Huffman coding). Digram coding codes digrams and outputs fixed-length codes, with a complete decoder that can probably be implemented in 30 lines of C. Indeed:

#include <iostream>
#include <fstream>

#include "digram-dictionary"

int main(int argc, const char * argv[])
{
 if (argc==2)
  {
   std::ifstream in(argv[1],std::ios::binary);

   if (in)
    {
     digram_dictionary dictionary;

     dictionary.load(in);

     while (in)
      {
       uint8_t code;

       if (in.read((char*)&code,sizeof(uint8_t)))
        {
         if (dictionary.short_code(code))
          // the code maps back to a single character
          std::cout << (char)dictionary.decode_short_code(code);
         else
          {
           // the code maps back to a pair of characters
           uint16_t bigram=dictionary.decode_long_code(code);
           std::cout << (char)(bigram>>8)
                     << (char)(bigram & 0xff);
          }
        }
      }
    }
   else
    {
     std::cerr << "could not open: " << argv[1] << std::endl;
     return EXIT_FAILURE;
    }
  }
 else
  {
   std::cerr << "usage: " << argv[0] << " infile.big";
   return EXIT_FAILURE;
  }
}

So, accordingly, it could probably be reimplemented using more C-esque data structures and made somewhat shorter. I would guess that decompressing digram-coded data from memory would yield somewhat shorter code, entirely amenable to an assembly language implementation, should you happen to use a micro-controller of some sort.

You can boost digram coding compression by noticing that some of the original symbols are used, but not very often at all. Say you take the rarest 10% of symbols and remove them from the basic symbol set. You take one code as an escape (as we discussed earlier) to allow the introduction of the removed symbols, and you reassign their codes to digrams. In previous work, I showed that the optimal reassignment may lead to gains of up to 7%, which is significant given the relatively low compression ratios one can get with digram coding.
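The escape mechanism can be sketched like this; the escape value, the rare-symbol test, and the function name are all assumptions of mine for illustration:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One code value is sacrificed as an escape; a rare symbol is emitted
// as ESCAPE followed by the raw byte (two codes instead of one), which
// costs little overall because the symbol is, by construction, rare.
const uint8_t ESCAPE = 0xFF; // illustrative choice of escape code

std::vector<uint8_t> emit_symbol(uint8_t symbol,
                                 const std::array<bool,256> &is_rare)
{
    if (is_rare[symbol])
        return { ESCAPE, symbol };  // escaped: its old code now holds a digram
    return { symbol };              // common symbol keeps its own code
}
```

The trade-off is clear from the sketch: every escaped occurrence costs one extra byte, so the scheme only pays off when the freed codes are reassigned to digrams that occur far more often than the evicted symbols.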
