The `float` and `double` floating-point data types have been present for a long time in the C (and C++) standards. While neither the C nor the C++ standard enforces it, virtually all implementations comply with IEEE 754, or at least try very hard to. In fact, I do not know, as of today, of an implementation that uses something very different. But the IEEE 754-type floats are aging. GPUs started to add extensions such as short floats for evident reasons. Should we start considering adding new types on both ends of the spectrum?

The next step up, the quadruple-precision float, is already part of the standard but, as far as I know, not implemented anywhere. Intel x86 does have something in between: an internal 80-bit float format, the so-called extended precision. It is not really standard, as it is not sanctioned by the IEEE standards, and, generally speaking, and surprisingly enough, it is not all that well supported by the instruction set either. It is sometimes exposed as the C `long double` type. But, anyway, what's in a floating point number?

Let us take the 32-bit `float` as an example. There's one bit assigned to the sign (therefore it's not a two's complement representation), 8 bits for the exponent, representing exponents from -126 to 127 for normal numbers (stored with a bias of +127; the two extreme stored values are reserved for zeros, subnormals, infinities, and NaNs), and 23 bits for the mantissa, the "precision bits" (the most significant bit, always 1 for normal numbers, is left implicit, which allows one more low-weight bit to be shoved in). That is, it looks like:

[ s | e e e e e e e e | m m m … m m ]  (1 + 8 + 23 bits)

and the value is given by

$(-1)^s \, 2^{e-127} \, (1.m_1 m_2 \cdots m_{23})_2$

where $s$ is the sign bit, $e$ the exponent as stored in the `float`, and $m_1, m_2, \ldots, m_{23}$ are the bits of the mantissa, from left to right in the above picture.

The `double` uses one sign bit, 11 exponent bits, and 52 (+1 virtual) bits for precision, for a total of 64 bits. The `quadruple` (which is not a C nor a C++ keyword, unfortunately) uses one sign bit, 15 exponent bits, and 112 (+1 virtual) bits for precision. Its precision is about 34 significant digits. A hypothetical `octuple` would eat up 256 bits, 32 bytes, probably with something like 24 or 32 bits for the exponent, leaving about 224 bits of precision (that's about 67 significant digits, counting 225 bits because of the virtual leading bit). That's significantly more precise than either `float` or `double`, but it would also require significantly bigger FPUs to deal with these formats!

Generalizing these formats to smaller floats is also straightforward, but we now face a serious lack of precision. The half float, on 16 bits, uses one sign bit, 5 exponent bits (with a bias of 15), and 10 (+1 virtual) precision bits. The range is evidently smaller (from about $2^{-14} \approx 6.1 \times 10^{-5}$ for the smallest normal number to 65504) but the steps are rather large (all things considered) at a relative precision of $2^{-10} \approx 0.001$. That's about 3 significant digits. A byte-size floating point number doesn't seem to make all that much sense.

The half-float format makes sense only if you're interested in saving storage space, or gaining computation speed, or minimizing hardware, at the expense of precision—you can't have all three. But could we do better with 16 bits than this? Well, since 16 bits aren't that much, the first thing that comes to mind is that we could just *learn* a 64k-entry table filled with higher-precision floats optimized for the precision of the computation at hand. This is in fact a quantization problem where the objective function is not directly the mean squared error to the prototypes (as it would be for scalar quantization optimized over a probability distribution) but the squared error between the quantized computation and the distribution of the "real" result using high-precision arithmetic everywhere. We now have a look-up for each value, but the table may fit in cache, and we could use it for rather large computations (ones that do not fit in cache).

*

* *

The idea generalizes to other problems. In instruction encoding, one usually decides how to allocate bits within an instruction word to encode the operations to be performed, the types of the operands (immediate, register, memory reference), and the operands themselves. In simple CPUs, like the Z80 (no, I'm not *that* old; I know of it because I used to do a lot of embedded programming), the instructions are really crude and the opcode always fits on one byte.

The instruction set is hierarchical, with groups of prefixes for similar instructions, probably to ease decoding, but it's not necessary to do it like this. If you're going to have microcode anyway, the instruction encoding itself is essentially irrelevant: it becomes merely a pointer into the microcode table, and you can assign pretty much any instruction you want to each of your 8-bit (or 16-bit) opcodes, without worrying about having a correct prefix structure or enough bits to encode all possible/interesting registers. The look-up happens inside the CPU, at a very deep level in the micro-architecture, and so is not really concerned with the speed discrepancy between the CPU and the memory.

If you want it to be cute, you can force a couple of special instructions to have a specific opcode (say, `0x00` puts the CPU in a HALT state (and no, not necessarily halt and catch fire), `0x01` triggers an interrupt, etc.). You could even think of reconfiguring the instruction set and microcode on the fly. Maybe a per-process setting?

*

* *

Significantly extending the precision of the usual float formats, say by moving to 128 or 256 bits, poses many problems. While it is undeniable that we *will* need those eventually for scientific computing, it implies profound changes in FPU architectures. First, we will need wider data buses. That, we already partly have, with GPUs and newer CPUs. A 32-byte long float is smaller than a cache line, so on that side, that's not too bad. But what about the hardware that actually computes operations between big floats?

Well, first, I think we all agree on the fact that if we are going forward with these formats, we will not accept painfully slow implementations. This means that the hardware will expand significantly. I would suspect quadratically in the number of input bits in general, though it is still open for debate how low the circuit complexity *can* get [1].

*

* *

The idea that we should learn representations rather than design them by hand is central to data compression and machine learning (which, in fact, can be unified at a rather deep level). In both cases, we want to extract information from the data with as little a priori knowledge as possible and let the algorithms find a representation (or, more accurately, a *re*-representation) of the data so that there are exploitable structures for further analysis stages. The criteria for "exploitable" may vary. In data compression, you want a representation for which you can find a short code (i.e., for which the entropy is minimized); in machine learning, it may be that you want to minimize prediction error. Finding a short code and formulating an accurate prediction are not exactly the same thing, but they are very similar.

[1] Amos R. Omondi — *Computer Arithmetic Systems: Algorithms, Architecture, and Implementation* — Prentice Hall, 1994
