Fractional Bits (Part II)

04/03/2014

In a previous post, we started discussing the subject of encoding values using fractional bits. This is quite counter-intuitive, as we usually think that encoding requires whole bits; furthermore, we tend to think of bits as being essentially atomic—unsplittable.

However, bits can be fractioned, helping us devise efficient encodings, something not unlike a primitive form of arithmetic coding, for a variety of data. Let’s consider a simple case that illustrates the principle: encoding the date.
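
To see the principle in a nutshell, here is a toy sketch (my own composite, not necessarily the example the full post develops): packing a date-and-time into a single mixed-radix integer. Stored field by field, month, day, hour, minute and second need 4+5+5+6+6=26 bits, but \lg 12+\lg 31+\lg 24+\lg 60+\lg 60\approx 24.94 bits, and indeed the product 12\times 31\times 24\times 60\times 60=32\,140\,800 fits in 25 bits: the fractional bits wasted by each field add up to a whole bit saved.

```cpp
#include <cassert>
#include <cstdint>

// Toy illustration: pack month (1..12), day (1..31), hour, minute and second
// into a single mixed-radix integer. Field by field this takes 4+5+5+6+6 = 26
// bits, but 12*31*24*60*60 = 32 140 800 fits in 25 bits.
uint32_t pack(unsigned month, unsigned day, unsigned hour, unsigned min, unsigned sec)
{
    assert(1 <= month && month <= 12 && 1 <= day && day <= 31);
    assert(hour < 24 && min < 60 && sec < 60);
    return ((((month - 1) * 31 + (day - 1)) * 24 + hour) * 60 + min) * 60 + sec;
}

void unpack(uint32_t code, unsigned &month, unsigned &day,
            unsigned &hour, unsigned &min, unsigned &sec)
{
    sec   = code % 60; code /= 60;
    min   = code % 60; code /= 60;
    hour  = code % 24; code /= 24;
    day   = code % 31 + 1; code /= 31;
    month = code + 1;                    // what remains is the month
}
```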


Universal Coding (part III)

18/02/2014

In Universal Coding (part II), we explored the idea of variable-length codes that encode integers of any magnitude in O(\lg n+\lg\lg n) bits, but the encoding requires (somewhat) complex arithmetic, and we concluded that sometimes it’s better to use a simple code that sacrifices some coding efficiency in exchange for machine efficiency. Let’s now look at one of those, the MIDI VLQ, and see how we can make it a bit better.
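
To fix ideas, here is a minimal sketch of the plain MIDI VLQ encoder, the starting point before any improvement: the value is split into 7-bit groups, emitted most significant group first, and every byte except the last has its high bit set as a continuation flag.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the classic MIDI VLQ (variable-length quantity) encoder.
std::vector<uint8_t> vlq_encode(uint32_t n)
{
    std::vector<uint8_t> out;
    uint8_t groups[5];                                // at most 5 groups of 7 bits
    int i = 0;
    do { groups[i++] = n & 0x7F; n >>= 7; } while (n);
    while (i-- > 1) out.push_back(groups[i] | 0x80);  // continuation bytes
    out.push_back(groups[0]);                         // last byte, high bit clear
    return out;
}
```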


Short Pointers

04/02/2014

One good thing with 64-bit addresses is that you can, in principle, use essentially as much memory as you want—the address space certainly exceeds today’s computers’ capabilities. One bad thing, especially when you create lots of objects and need plenty of pointers, is that 64-bit pointers are big. They use 8 bytes of memory. One or two pointers aren’t a problem, of course, but what if your data structure is a sparse graph, each node being mostly pointers, and you need to create a very large graph?

One solution is to use stretch codes, as I proposed a while ago, trading off precision of addressing for shorter pointers. However, unless you rewrite the memory allocator, the technique won’t play well with the default new. Another solution is to store just the number of bits (rounded up to whole bytes) necessary to hold an address. Can we do this? If so, how?
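
To sketch the second idea (my illustration, not the post’s actual solution): on today’s x86-64 systems, user-space addresses fit in the low 48 bits, so 6 bytes are enough to hold a pointer. The little struct below packs and unpacks a pointer into 6 bytes, assuming little-endian storage and addresses whose upper 16 bits are zero.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch: a 6-byte "short pointer" (assumes little-endian storage
// and user-space addresses that fit in 48 bits with zero upper bits).
struct short_ptr
{
    uint8_t bytes[6];                               // 6 instead of 8 bytes

    void set(void *p)
    {
        uint64_t v = reinterpret_cast<uint64_t>(p);
        std::memcpy(bytes, &v, 6);                  // keep the low 6 bytes
    }

    void *get() const
    {
        uint64_t v = 0;
        std::memcpy(&v, bytes, 6);                  // upper 2 bytes remain zero
        return reinterpret_cast<void *>(v);
    }
};
```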


Universal Coding (Part II)

03/12/2013

Last time, we presented universal codes and two simple codes. This week, let’s revisit the topic and present Elias codes, which are much better universal codes than the Chaitin and modified Chaitin codes.

We will use the idea of recursion introduced before, but we will push the idea further, getting codes with lengths O(\lg n + \lg\lg n + \lg\lg\lg n + \cdots), which are pretty much as good as it gets.
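
As a preview, here is a rough sketch (my reading of the classic construction, not necessarily how the full post presents it) of the Elias \omega code, a recursive code with this kind of length: the binary representation of n is preceded, recursively, by the binary representation of its own length minus one, and a final 0 bit terminates the code.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Sketch of the Elias omega code, returned as a string of '0'/'1' characters
// for readability. Valid for n >= 1.
std::string elias_omega(uint64_t n)
{
    std::string code = "0";                        // terminating zero
    while (n > 1)
    {
        std::string bits;                          // binary representation of n
        for (uint64_t m = n; m; m >>= 1) bits += char('0' + (m & 1));
        std::reverse(bits.begin(), bits.end());
        code = bits + code;                        // prepend the group
        n = bits.size() - 1;                       // then encode its length - 1
    }
    return code;
}
```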


Universal Coding (Part I)

08/10/2013

It’s been a while since I last talked about data compression. Let us return to the topic this week with universal codes. Universal codes are entropy-like codes but, unlike, say, Huffman codes, they presuppose very little about the distribution of the integers to encode.

We can use a universal code optimally for a random variable X if

  • all integers may show up: P(X=n)\neq 0, \forall n \in \mathbb{N},
  • and if, beyond some n_0, the distribution is non-increasing, that is, for n_0<n<m, we have P(X=n)\geqslant P(X=m).

We will show how to build a universal code in a moment, but let us first state the condition that makes a code universal when the probability distribution satisfies the above requirements. A code with length function L_u(n) (the length of the code for the integer n) is k-universal if, and only if

\displaystyle\lim_{n\to\infty}\frac{\sum_{i=1}^n P(X=i)L_u(i)}{\sum_{i=1}^n P(X=i)\log_2 i}=k

That is, the expectation of the length (in bits) of the code is proportional to the logarithm of the number being coded.
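
For example (anticipating a little on the following parts), consider a code whose length grows like a constant times \lg n: the Elias \gamma code uses L_\gamma(n)=2\lfloor\lg n\rfloor+1 bits, so L_\gamma(n)\approx 2\log_2 n for large n, and the ratio above tends to (roughly) 2, making \gamma, in this sense, a 2-universal code.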

Let’s now see what a universal code looks like.


Float16

12/02/2013

The possible strategies for data compression fall into two main categories: lossless and lossy compression. Lossless compression means that you retrieve exactly what went in after compression, while lossy means that some information was destroyed to get better compression, meaning that you do not retrieve the original data, but only a reasonable reconstruction (for various definitions of “reasonable”).

Destroying information is usually performed using transforms and quantization. Transforms map the original data onto a space where the unimportant variations are easily identified, and on which quantization can be applied without affecting the original signal too much. For quantization, the first approach is to simply reduce precision, somehow “rounding” the values onto a smaller set of admissible values. For decimal numbers, this operation is rounding (or truncating) to the nth digit (with n smaller than the original precision). A much better approach is to minimize an explicit error function, choosing the smaller set of values in a way that minimizes the expected error (or the maximum error, depending on how you formulate your problem).
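
To make the “reduce precision” idea concrete, here is a rough, truncating conversion from a 32-bit float (1 sign, 8 exponent, 23 mantissa bits) to a 16-bit half float (1 sign, 5 exponent, 10 mantissa bits). It is only a sketch of mine, not the post’s code: it does no rounding and does not handle subnormals, infinities, or NaNs properly.

```cpp
#include <cstdint>
#include <cstring>

// Rough, truncating float32 -> float16 conversion (illustrative sketch only).
uint16_t float_to_half(float f)
{
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);                    // grab the raw bits
    uint16_t sign = (x >> 16) & 0x8000;               // sign bit moves to bit 15
    int      exp  = int((x >> 23) & 0xFF) - 127 + 15; // re-bias exponent (8 -> 5 bits)
    uint16_t mant = (x >> 13) & 0x03FF;               // keep the top 10 mantissa bits
    if (exp <= 0)  return sign;                       // underflow: flush to (signed) zero
    if (exp >= 31) return sign | 0x7C00;              // overflow: clamp to infinity
    return sign | uint16_t(exp << 10) | mant;
}
```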


Flac vs Lame

21/02/2012

In a previous post I discussed lossless audio encoding and presented a Bash script using flac to rip and process CD audio files. I also commented on how the psycho-acoustic model used in an MP3 encoder dominates encoding as the bit-rate increases, but without offering much quantitative evidence. In this post, I present some.


Looking At Flac Compression Ratios

07/02/2012

In a previous post, I told you about a short script to rip and encode CDs using Flac, and I discussed a bit about how LPC works. In this post, let us have a look at how efficient Flac is.

Let us take a quantitative approach. Since I have a great number of songs, we can use statistics to get a good idea of what kind of compression to expect.


(Random Musings) On Floats and Encodings

31/01/2012

The float and double floating-point data types have been present for a long time in the C (and C++) standard. While neither the C nor the C++ standard enforces it, virtually all implementations comply with IEEE 754—or try very hard to. In fact, as of today I do not know of an implementation that does something very different. But IEEE 754-type floats are aging. GPUs started to add extensions such as short floats, for evident reasons. Should we start considering adding new types on both ends of the spectrum?

The next step up, the quadruple-precision float, is already part of the standard but, as far as I know, not implemented anywhere. Intel x86 does have something in between: its 80-bit internal float format, the so-called extended precision. But it’s not really standard, as it is not sanctioned by the IEEE standards, and, surprisingly enough, it is not well supported by the instruction set. It’s sometimes exposed through the long double C type. But, anyway, what’s in a floating-point number?
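
As a small appetizer for that last question, the sketch below (assuming IEEE 754 binary64 doubles, which is what essentially every current implementation uses) pulls a double apart into its three fields: 1 sign bit, 11 exponent bits with a bias of 1023, and 52 mantissa bits.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Illustrative sketch: decompose an IEEE 754 binary64 double into its
// sign, exponent and mantissa fields.
int main()
{
    double d = -6.25;                                     // -1.5625 * 2^2
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);                  // grab the raw bit pattern

    unsigned sign     = unsigned(bits >> 63);             // 1 bit
    unsigned exponent = unsigned((bits >> 52) & 0x7FF);   // 11 bits, biased by 1023
    uint64_t mantissa = bits & ((uint64_t(1) << 52) - 1); // 52 fraction bits

    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=%013llx\n",
                sign, exponent, int(exponent) - 1023,
                (unsigned long long)mantissa);
    return 0;
}
```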


Lossless Coding of CD Audio

17/01/2012

Once upon a time, I discussed how to pick bit-rate for MP3, while considering re-ripping all my CDs. But if I’m to re-rip everything, I might as well do it one last time and use lossless compression.

In this post, we’ll discuss the simple script I cooked up to do just that, a bit about how Flac works, and how it compares to MP3.
