## Enumerating 32 bits Floats

November 5, 2013

This week, let’s go back to (low level) programming, with IEEE floats. To unit test a function of float, it does not sound unreasonable to just enumerate them all. But how do we do that efficiently? Clearly f++ will not get us there.

Nor will the machine-epsilon (the std::numeric_limits::epsilon()) because this value works fine around 1, but as the value diverges from 1, the epsilon basically becomes useless. We would either need a magnitude-dependent epsilon (which the standard library does not provide) or a way of enumerating explicitly the floats in increasing or decreasing order (something also not provided by the standard library). Well, let’s see how we can do that

## Float16

February 12, 2013

The possible strategies for data compression fall into two main categories: lossless and lossy compression. Lossless compression means that you retrieve exactly what went in after compression, while lossy means that some information was destroyed to get better compression, meaning that you do not retrieve the original data, but only a reasonable reconstruction (for various definitions of “reasonable”).

Destroying information is usually performed using transforms and quantization. Transforms map the original data onto a space were the unimportant variations are easily identified, and on which quantization can be applied without affecting the original signal too much. For quantization, the first approach is to simply reduce precision, somehow “rounding” the values onto a smaller set of admissible values. For decimal numbers, this operation is rounding (or truncating) to the $n$th digit (with $n$ smaller than the original precision). A much better approach is to minimize an explicit error function, choosing the smaller set of values in a way that minimizes the expected error (or maximum error, depending on how you formulate your problem).