I’m always playing around with data compression ideas, and I recently came across a particular problem that inspired this week’s entry about implicit coding.
The idea of implicit coding is to exploit some constraint in the data that lets you reconstruct one (or more) value(s) unambiguously from the others. The textbook example of implicit coding is to use a constraint such as a known sum. For example, If you have a right triangle you necessarily have that
Yes, that’s good ol’ Pythagoras’ theorem, possibly the simplest, yet not completely trivial, example of a constrained equation. Knowing that , , and are linked through this strong constraint will let you determine any one of the three from the other two. Indeed, is determined as above, we also have , and . Therefore, you could store a right triangle using only and .
Well, almost. Squaring, in this case, is a lossy operation. Think of why a second before continue reading.
Exactly! Squaring discard signs. If we drop , it is uniquely determined from and . However, and may need orientation and this can be avoided if we chose to drop . What if is always one? That’s a triangle inside the unit circle, with
and now, we could discard , and store . But we have an ambiguity:
Which was it? Positive? Negative? Now the equation with is insufficiently constrained to allow lossless implicit coding. Is there a solution? Well, yes. We could store the sign bit by itself, but that doesn’t play well with an in-memory data structure. If we store it with something like a bit-field, that’s wasting a lot of memory for just one bit. But we might be able to hide it in .
The simplest solution is to shove the bit at the end of . If is an integer, (x<<1)|s (where s is the sign bit of ) might do the trick. For a float, well, that’s trickier. If you can afford a relative error on , you can just overwrite the least significant bit of float x with the sign bit of .
Implicit coding isn’t quite the same thing as rematerialization, where a value is recomputed from scratch rather than loaded from memory when needed, but is related in a way. Rematerialization is a compiler optimization that balances the cost of recomputing a value with the cost of keeping it in a register or memory location. Reconstructing implicit values can be seen as a form of rematerialization, but it can be done either in memory or on demand.
So, all this to say, if you have a value that can be recomputed losslessly (and with negligible cost) from the other values in a data set, simply recompute it. Do not store it. It would be interesting to test, quite quantitatively, whether or not actually trading off storage for computation is beneficial to run-time of programs, and if so, to what extent, and what are the conditions necessary and sufficient for speed-ups in addition to smaller storage. Of course, there’s a rather hard to predict interaction between the complexity of the computation needed, the size of the cache, and the speed of reading from the cache or main memory.