The art of lossy compression consists almost entirely in choosing a representation for your data in which you can easily figure out what data to destroy. That is, if you’re compressing sound, you must transform the waveform into a domain where a psychoacoustic model can guide decimation, that is, the deletion of the part of the information that won’t be heard missing. If instead you’re compressing images, which will be our topic today, you need a good psychovisual model to help you choose what information to destroy in your image.
The model can be very sophisticated—and therefore computationally expensive—or merely a good approximation of what’s going on in the observers’ eyes (and mind).
But let’s start at the beginning: representing color.
You probably already know that it suffices to use a few primary colors to reproduce, either by additive or subtractive mixing, a wide variety of colors. If the primaries are well chosen, and available in enough different densities, you can represent a satisfactory gamut of colors. The wider the gamut, the more life-like the reproduced colors are.
Most, if not all, computer screens (or TVs, for that matter) display images using three primaries: Red, Green, and Blue, or RGB. These aren’t just any red, green, or blue: they’re specific, calibrated shades of red, green, and blue. Colors are recreated by a linear combination of the primaries, that is, by adding different quantities of each primary color. For example, if the coefficients are between 0 and 1 inclusive, (0.5,0.3,0.1) would represent some kind of ugly brown, with 0.5 red, 0.3 green, and 0.1 blue. The triplet (0,0.1,0.8) gives a color not unlike a dark royal blue.
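To make the linear combination concrete, here’s a small sketch that scales such normalized triplets to the 8-bit channel values a typical display expects (the helper name `to_8bit` is mine, just for illustration):

```python
# Mixing RGB primaries: each color is a weighted sum of red, green,
# and blue, with coefficients in [0, 1]. On an 8-bit display each
# coefficient is scaled to the 0..255 range.
def to_8bit(rgb):
    """Scale a normalized (r, g, b) triplet to 8-bit channel values."""
    return tuple(round(255 * c) for c in rgb)

ugly_brown = (0.5, 0.3, 0.1)
royal_blue = (0.0, 0.1, 0.8)

print(to_8bit(ugly_brown))  # (128, 76, 26)
print(to_8bit(royal_blue))  # (0, 26, 204)
```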
All the RGB triplets form a colorspace which, in this case, takes the form of a cube:
While convenient (we really should say “hardware compatible”), the RGB representation does not take advantage of how the human visual system works. Humans (with normal vision at least) are very sensitive to variations in brightness but not all that much to variations in hues or saturation. That is, we can mess up hue or saturation quite a bit more than brightness before we notice that there’s something wrong with the image. But these concepts aren’t easily expressed in the RGB model. We need a colorspace where dimensions correspond to brightness, hue, and saturation, or close enough.
The HSL model (for Hue-Saturation-Lightness) is a bit better for this. This model represents colors as a cone, with the main axis corresponding to lightness. The two other dimensions are the hue, basically an angle on the color wheel, and the saturation, the distance from the central axis. The farther you are from the main axis, the more “colorful” the color is. At the center, on the central axis, there’s no color, only brightness… white on top, black at the bottom, and shades of gray in between. Also, in this model, the colorspace shrinks as we approach black. It is not possible for a color to be very “colorful” and very dark at the same time. You have to pick one. If the color is dark, it loses its “colorfulness”. The same goes with extreme brightness. Colors at the top (or base, the flat end) of the cone are either very saturated but not very bright, or very bright and not very saturated. In much the same way as with dark colors, colors cannot be very colorful and bright at the same time.
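Python’s standard library happens to ship a close relative of this model in the `colorsys` module (it calls it HLS and returns hue, lightness, saturation, each as a fraction in [0, 1]), which lets us poke at the representation without writing the conversion ourselves:

```python
import colorsys

# rgb_to_hls returns (hue, lightness, saturation), each in [0, 1];
# the hue is a fraction of a full turn around the color wheel.
r, g, b = 0.5, 0.3, 0.1   # the "ugly brown" from earlier

h, l, s = colorsys.rgb_to_hls(r, g, b)
print(h, l, s)  # a hue in the orange region, medium-dark, fairly saturated

# Round-tripping back to RGB recovers the original triplet.
r2, g2, b2 = colorsys.hls_to_rgb(h, l, s)
print(r2, g2, b2)
```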
For high-speed image processing, however, HSL is inconvenient. The conversion routines from HSL to RGB (necessary since the display hardware understands only RGB) or from RGB to HSL are relatively complex. One thing we can do is to cheat and use an approximation to HSL, one that’s easy to compute. One idea, exploited by image or video standards such as JPEG or H.264, is to use a linear transformation of the RGB color space. Specifically, we rotate/stretch the RGB cube so that it stands on a corner, with black at the bottom and white at the top, perfectly aligned with the Y axis. We get the YCrCb color space:
As in HSL, the main axis corresponds to black, white, and shades of gray. In this representation, the main axis is Y, the luminance. The two other axes are Cr (the “red difference”) and Cb (the “blue difference”) that, combined, correspond (rather loosely) to hue and saturation. This will allow us to manipulate the colorspace as if it were HSL, but with much cheaper operations.
To transform from RGB to YCrCb, we apply the following transformation matrix:
and from YCrCb to RGB:
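In code, the pair of transforms is just two matrix-vector products. The sketch below uses the full-range BT.601 coefficients that JPEG uses; the exact matrix above may differ slightly depending on the standard. Channels are assumed normalized to [0, 1], and Cb/Cr come out in [-0.5, 0.5]:

```python
# Forward transform: RGB -> YCbCr (full-range BT.601 coefficients,
# as used by JPEG; other standards use slightly different numbers).
def rgb_to_ycbcr(r, g, b):
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

# Inverse transform: YCbCr -> RGB.
def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402    * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772    * cb
    return r, g, b
```

Note that white, (1, 1, 1), maps to full luminance and zero chroma, which is exactly the “cube standing on its corner” picture: the gray diagonal of the RGB cube becomes the Y axis.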
The transform matrices seem to require computations using floating point numbers, but since the precision isn’t very high, we can get away with using 1024ths, or some similar precision that is easily handled using integer multiplies and shifts. About 15 years ago, while working with the team on DjVu, I proposed to use a color space such that the inverse is computable using only (small) shifts and adds to speed up image decoding. At that time, computers were a lot slower than today, and it seemed like a good idea to save on colorspace conversion time.
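Here is the fixed-point trick sketched for the luma channel only, with coefficients scaled by 1024 (a 10-bit shift). This is an illustration of the idea, not the actual DjVu colorspace:

```python
# Fixed-point sketch: coefficients rounded to 1024ths, so the
# conversion needs only integer multiplies, adds, and shifts.
SHIFT = 10                      # 1/1024 precision
Y_R, Y_G, Y_B = 306, 601, 117   # round(1024 * c) for 0.299, 0.587, 0.114

def luma_fixed(r, g, b):
    """8-bit luma from 8-bit RGB channels, integer arithmetic only."""
    # The added (1 << (SHIFT - 1)) rounds to nearest instead of truncating.
    return (Y_R * r + Y_G * g + Y_B * b + (1 << (SHIFT - 1))) >> SHIFT

print(luma_fixed(255, 255, 255))  # 255: the three coefficients sum to 1024
```

Since 306 + 601 + 117 = 1024 exactly, white maps to exactly 255, and the result stays within one step of the floating point value for any 8-bit input.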
Next week, we’ll have a look at how to use all this for image compression.