## Float16

The possible strategies for data compression fall into two main categories: lossless and lossy compression. Lossless compression means that you retrieve exactly what went in after compression, while lossy means that some information was destroyed to get better compression, meaning that you do not retrieve the original data, but only a reasonable reconstruction (for various definitions of “reasonable”).

Destroying information is usually performed using transforms and quantization. Transforms map the original data onto a space where the unimportant variations are easily identified, and on which quantization can be applied without affecting the original signal too much. For quantization, the first approach is to simply reduce precision, somehow “rounding” the values onto a smaller set of admissible values. For decimal numbers, this operation is rounding (or truncating) to the $n$th digit (with $n$ smaller than the original precision). A much better approach is to minimize an explicit error function, choosing the smaller set of values in a way that minimizes the expected error (or the maximum error, depending on how you formulate the problem).

(For lossless compression, the general approach is to use a transform, or a predictor, to predict the next value in the (possibly transformed) sequence, and encode only the difference to the prediction using an entropy code adapted to the distribution of the residuals. That is, the $t$th residual is $r_t=x_t-\hat{x}_t$, where $\hat{x}_t$ is the prediction, and we code $r_t$. Since $\hat{x}_t$ depends on the original/decompressed $x_{t-1}$, $x_{t-2}$, …, the decoder computes the same $\hat{x}_t$ as the encoder, and therefore the decoded $r_t$ allows the reconstruction of the original $x_t$, since $x_t=\hat{x}_t+r_t$, and everybody’s happy about it.)

So, back to rounding. How do you quantize floats in general? As mentioned before, running an algorithm such as K-Means and optimizing for a given bit rate is the way to go, but for all kinds of (not necessarily good) reasons, you may want something faster, or something that does not need a look-up table—for example, quantizing 64-bit floats to 32-bit floats, or 32-bit floats to 16-bit floats. This is especially attractive since 16-bit floats are supported natively by some graphics cards.

A half-float, as they are also known, is structured pretty much like a 32-bit float. It has one sign bit, 5 bits for the exponent with a bias of 15 (instead of 8 bits and 127, respectively), and 10 bits of mantissa (instead of 23) with a virtual leading 1 bit. This gives $\pm{}6.10\times{}10^{-5}$ as the smallest representable normal number and $\pm{}65504$ as the largest possible number.

The C definition of a half-float is as follows:

```cpp
#include <cstdint>

typedef union
{
    // float16 v; // no native 16-bit float type to put here
    struct
    {
        // type determines alignment!
        uint16_t m:10;
        uint16_t e:5;
        uint16_t s:1;
    } bits;
} float16_s;
```


If you do not already have a function to convert 32-bit floats to 16-bit floats, it would look a lot like this:

```cpp
#include <algorithm> // std::min, std::max

typedef union
{
    float v;
    struct
    {
        uint32_t m:23;
        uint32_t e:8;
        uint32_t s:1;
    } bits;
} float32_s;

// ...

void float32to16(float x, float16_s *f16)
{
    float32_s f32 = {x}; // C99-style union initialization

    // to 16
    f16->bits.s = f32.bits.s;
    f16->bits.e = std::max(-15, std::min(16, (int)(f32.bits.e - 127))) + 15;
    f16->bits.m = f32.bits.m >> 13; // truncate the mantissa to 10 bits
}
```


The max(...,min(...)) is necessary to clip the exponents of large values (either negative or positive) to what 16-bit floats can represent (and it does so rather badly; why?).

To retrieve 32-bit floats from 16-bit floats, the code looks like:

```cpp
float float16to32(float16_s f16)
{
    // back to 32
    float32_s f32;
    f32.bits.s = f16.bits.s;
    if (f16.bits.e == 0 && f16.bits.m == 0)
    {
        // special-case zero, which would otherwise decode as 2^-15
        f32.bits.e = 0;
        f32.bits.m = 0;
    }
    else
    {
        f32.bits.e = (f16.bits.e - 15) + 127; // safe in this direction
        f32.bits.m = ((uint32_t)f16.bits.m) << 13;
    }
    return f32.v;
}
```


* * *

When is such quantization applicable? Well, for example, when you compress meshes for 3D animation. Right off the bat, before any other compression techniques are applied, you get a 2:1 compression ratio, possibly without impact on the animation quality (the precision is about $1/2^{10}$). After the precision reduction, you can still apply a prediction-plus-residual-coding scheme to get better compression.

We may get back to this later on, to be continued…

### 4 Responses to Float16

1. Results We have designed a general-purpose compression and decompression scheme for 32-bit floating-point data on graphics hardware. It both outperforms an existing 16-bit compressor adapted to handle 32-bit data and is able to compress general data. We have shown this capability by presenting promising compression rates for geometry data (vertex positions, normals, texture coordinates, etc.) for real-world applications. Average rates for color, depth, and geometry data are 1.5x, 7.9x, and 2.9x, respectively. Furthermore, we have proposed two novel techniques applicable to any hardware compression scheme: dynamic bucket selection and the use of a Fibonacci encoder. These proposals increased compression ratios by averages of 1.25x and 1.06x, with maximum improvements of 2.4x and 1.7x, respectively. Note that these are not just compression rates; they also take quantized storage into account. So, these results should not be viewed as a single tile seeing an improvement of 1.25x (for example), but as several tiles remaining unchanged and several others improving by 2x. Lastly, we have shown that extra savings are available by using range reduction on variable-precision data. The additional savings will depend on the specific application, but are expected to be between 5% and 20%, for overall color, depth, and geometry compression rates of 1.9x, 10.7x, and 3.6x, respectively.

• Do you have a technical report or published paper that gives the details of your methods?

2. The question is old and has already been answered, but I figured it would be worth mentioning an open-source C library that can create 16-bit IEEE-compliant half-precision floats and has a class that acts pretty much identically to the built-in float type, but with 16 bits instead of 32. It is the “half” class of the OpenEXR library. The code is under a permissive BSD-style license. I don’t believe it has any dependencies outside of the standard library.

• Thanks. I knew of a few libraries that do that, especially ones tied to GPU-specific applications. But, in general, you are right: it’s preferable to use a well-tested implementation than to write your own from scratch.

The current post, though, only presents what a half (float16) is, and I may use them later on, in future posts.