## Failed Experiment

Experiments do not always work as planned. Sometimes you may invest a lot of time into a (sub)project only to get no, or only moderately interesting results. Such a (moderately) failed experiment is the topic of this week’s blog post.

Some time ago I wrote a CSV exporter for an application I was writing and, amongst the fields I needed to export, were floating point values. The application was developed under Visual Studio 2005 and I really didn’t like how VS2005’s printf function handled the formats for floats. To export values losslessly, that is, you could read back exactly what you wrote to file, I decided to use the "%e" format specifier for printf. Turned out that it was neither lossless nor minimal!

If you use "%f" as a format specifier, printf uses a readable default format that is friendly only to moderate values, that is, neither too small nor too big. If the value printed is too big or too small, you lose precision or you get an exaggeratedly long representation. Consider:

```printf("%f\n",1e-20f);
printf("%f\n",3.14e20f);
```

you get the output:

```0.000000
314000008403897810944.000000
```

In the first case, you loose all information, in the second you have an extra long representation. However, if you use the engineering notation specifier "%e", you get the more sensible (but not always shorter) output:

```9.99999968e-021
-3.14000008e+020
```

With GCC, you get e+20 instead of e+020 but you still get extra non-significant + and zeroes, and, in bonus, weird rounding artifacts. Indeed, 3.14e20 is a valid and sufficient representation for the same number. Visually, we could reduce the printf format to "%e", but while it would be visually more pleasing, rescanning output values will not always yield the original machine-precision float. In some cases it does, but not always, and that’s bad.

So if "%e" isn’t sufficient (it defaults to 6 digits after the point), how many digits do we need? A float in IEEE 754 representation as one sign bit, 8 bits for exponent and 23 bits for the mantissa. Since it also has a virtual most significant bit (which is 1), the mantissa can be seen as 24 bits, with a leading 1. This leaves us with $\log_{10} 2^{24} = 24 \log_{10} 2 \approx 7.225$ digits, which makes ".7e" a bit short, as quick testing shows, and makes ".8e" format necessary.

So the first goal here is to save floats in text losslessly but we’d also like them to be as short as possible. Not so much for data compaction that for human reading; 3.14e20 is still better than 3.1400000e+020. So, naively, I set to write “clean up” code:

```////////////////////////////////////////
//
//  Rules:
//
//     floats are of the form:
//
//     x.yyyyyyyyye±zzz
//
//     Removes sign after e if it's a +
//     Removes all non significant zeroes.
//     Packs the resulting representation
//
static void fd_strip(char s[])
{
// find e
char * e = strchr(s,'e');

if (e)
{
char * d = e+1;

// remove (redundant) unary +
if (*d=='+') *d='_';
d++;

// strip leading zeroes in exponent
while (*d && (*d=='0')) *d++='_';

if (*d==0) // the exponent is all zeroes?
*e='_'; // remove e as well

// rewind and remove non-significant zeroes
e--;
while ((e>s) && (*e=='0')) *e--='_';

// go forward and pack over _
for (d=e; *e; e++)
if (*e!='_') *d++=*e;
*d=0;
}
else
{
// some kind of nan or denormal,
// don't touch! (like ±inf)
}
}

////////////////////////////////////////
//
// simplistic itoa-style conversion for
// floats (buffer should be at least 16
// char long) (0.12345678e±12 + \0)
//
void ftoa(float f, char buffer [])
{
snprintf(buffer,16,"%.8e",f); // redundant but safer
fd_strip(buffer);
}
```

As I have said in a previous post, if your function’s domain is small enough, you should use exhaustive testing rather than manual testing based on a priori knowledge you have. Floats are a pest to enumerate in strict order (because, for example, the machine-specific epsilon works only for numbers close to 1, and does nothing for 1e20) so I build a (machine-specific) bit-field:

```typedef union
{
float value;  // The float's value
unsigned int int_value;

struct
{
unsigned int mantissa:23;
unsigned int exponent:8;
unsigned int sign:1;
} sFloatBits; // internal float structure

} tFloatStruct;
```

that allows me to control every part of a float. looping through signs, exponents, and mantissas will allow us to generate all possible floats, including infinities, denormals, and NaNs.

The main loop looks like:

```
tFloatStruct fx;
for (sign=0;sign<2;sign++)
{
fx.sFloatBits.sign = sign;

for (exponent=0; exponent<256; exponent++)
{
fx.sFloatBits.exponent=exponent;

for (mantissa=0; mantissa< (1u<<24); mantissa++)
{
float new_fx, diff;
char ftoa_buffer[40]; // some room for system-specific behavior1
char sprintf_buffer[40]; // ?

fx.sFloatBits.mantissa=mantissa;

if (isnan(fx.value))
{
// we don't really care for
// NaNs, but we should check
// that they decode all right?
//
// but nan==nan is always false!
}
else
{
// once in a while
if ((mantissa & 0xffffu)==0)
{
printf("\r%1x:%02x:%06x %-.8e",sign,exponent,mantissa, fx.value);
fflush(stdout); // prevents display bugs
}

how_many++;

ftoa(fx.value,ftoa_buffer);
sprintf(sprintf_buffer,"%.8e",fx.value);

// gather stats on length
//
ftoa_length+=strlen(ftoa_buffer);
sprintf_length+=strlen(sprintf_buffer);

// check if invertible
//
new_fx = (float)atof( ftoa_buffer );

if (new_fx!=fx.value)
{
diff = (new_fx - fx.value);
printf("\n%e %s %e %e\n", fx.value, ftoa_buffer, new_fx, diff);
}
}
} // for mantissa
} // for exp
} // for sign

printf("\n\n");
printf(" average length for %%-.8e = %f\n", sprintf_length/(float)how_many);
printf(" average length for ftoa  = %f\n", ftoa_length/(float)how_many);
printf(" saved: %f\n",(sprintf_length-ftoa_length)/(float)how_many);
```

So, we run this and verify that 1) all floats are transcoded losslessly and 2) the ftoa is much shorter than printf‘s. Or is it?

*
* *

After a few hours of running the program (it takes a little more than 6 hours on my computer at home), the results are somewhat disappointing. First, it does transcode all floats correctly. But it doesn’t shorten the representation significantly.

Using GCC (and ICC), you get that the average representation length out of ".8e" without tweaking is 14.5 digits (including signs and exponents). Using ftoa (and fd_strip), the representation is shortened to 13.53 digits on average, an average saving of 0.96, which is far from fantastic.

With visual studio, the savings are a bit better, but clearly not supercalifragilisticexpialidocious either: from an average of 15.5 digits, it reduces to 13.6, an average saving of 1.9 digit.

With doubles, the results are quite similar. On GCC (and ICC) you start with an average length of 23.2, and of 22.5 after “simplification”. For double, you have to use ".16e" to get losslessness.

*
* *

So, it turns out that it was a lot of work for nothingnot much. On the bright side, we figured out that 7 digits aren’t always enough to save floats in text and get them back losslessly; while the documentation says it should be only seven. Maybe it’s a platform-specific artifact, maybe not; anyway, it’s much better to save numbers losslessly than saving them so that they are merely pretty to look at.

Acknowledgements

I would like to thank my friend Mathieu for running the experiments (and help debug) on Visual Studio Express 2008. The help is greatly appreciated, as I didn’t have a Windows+Visual Studio machine at hand at the time of writing.

### 8 Responses to Failed Experiment

1. Chris Chiesa says:

I haven’t read this entire thing — I’m looking at it at work, and it’s long, so I’m saving it on my “to be read later” list — but I do have one thing to say after looking at just your first few examples.

It’s not clear that you will EVER be able to ‘read back into memory exactly what you wrote out,” because the exact conversion between binary and decimal fractions may be nonterminating; 1/5, for example, can be represented exactly in decimal (0.2) but not in binary (0.0011 0011 0011 …). Other examples work oppositely, having finite representations in binary but not in decimal.

Second, the “weird rounding errors” you cite in your first example (3.14e20) are an unavoidable consequence of the usual, “standard,” binary representations of floating-point numbers, i.e. in a mantissa-and-exponent format like IEEE et al. The real numbers are an infinite set, while the numbers representable in a computer are a finite set. If you have an M-bit mantissa, only 2eM discrete values can be represented between 2eN and 2e(N+1), no matter how large N is. Thus, there are the same number of values available between e.g. 2e2 = 4 and 2e3 = 8 as there are between 2e20 = 1,048,576 and 2e21 =2,097,152. Clearly, the exactly-representable values will be 2e18 times as far apart in the latter case as in the first, which leaves room for a vast number of values that CAN’T be represented exactly. So your “weird rounding error” isn’t exactly a “rounding” error, it’s more of a “quantizing” error — when you code “float a = 3.14e20,” the value actually stored in the variable is only the nearest REPRESENTABLE value to 3.14e20 — and that’s what you get, in decimal, when you print it. Clearly the nearest available value to 3.14e20 is different from the exact value by 8,403,897,810,944 — over 8 trillion. That’s not very exact, in my book.

The only real way to “get out what you put in,” and later “get back in what you put out,” is to do all your math using numeric strings. There are packages for doing that. Look up “arbitrary precision arithmetic.”

• Steven Pigeon says:

The weird rounding errors I was thinking about/referring to aren’t this at all (though you make a valid point).

On the x86 platform, the FPU registers are 80 bits long; depending on how the compiler generates code, you may get unpredictable rounding errors—or let’s just say error. What happens is that to get faster code, the compiler will try to keep all computation into the registers, writing them back (and reducing their precision back to 64 or 32 bits) only when absolutely necessary. If you run a program on a platform (CPU, compiler) that enforces strict IEEE 754 compliance, you get the (IEEE) expected result. If you leave the compiler to optimize freely, result may vary.

But what I was thinking of was rather the printf conversion routines themselves. I remember (while I can’t materially demonstrate it right now) that visual studio 2003 gave different rounds than gcc, and gcc seemed more accurate back then.

While we can debate, as you point out, what exactly is a rounding error in a given context, I would still expect the standard C library to output predictable representations of floating point numbers, which still isn’t quite the case (c.f., e+020 vs e+20 vs e20)

2. Barrie says:

Not all floating point numbers are representable in a finite number of decimal digits. If you haven’t read it, check out “What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg.

To be lossless, one approach is to serialize floats as either a device-dependent hex blob or, for device independence, by serializing each of the sFloatBits’s fields as integers in decimal or hex.

– Barrie

• Steven Pigeon says:

You’re mistaking arbitrary ‘real number’ for ‘floating point number’ here. First, the only subset of numbers floats can represent are rational numbers; and represent with precision a very small number of them; the only thing you have to/can control is precision-limited rounding to make sure the ascii version of the number decodes back to the original value, with the limitations of rounding (and possibly implementation-specific defects)

Of course, you can’t expect to recover 1/3 exactly (or any fraction that has a periodic binary expansion) but you can make sure through explicit control of rounding that you recover the best-effort representation of 1/3 (or any other rational number allowable by your float) and that, all in all, you load back the same number you saved. The number we meant to save is, of course, irrelevant.

And yes, I have read the paper by Goldberg. And the books by Acton (before somebody suggest that I read those too). And a couple others :p

The binary blob approach is unfortunately not human-readable (which was one of my requirements back then) and may not play well with others—making importing in a spreadsheet impossible, for example. However, there’s a low risk of machine specificity (as in ISO/IEC 9898-1999 Annex F imposes IEC 60559 (a.k.a. IEEE 754-1985) as float point representation) and you could write quite portable code. Using existing code is much better, though.

3. Ken says:

Have you seen C99’s hexadecimal floats? I’ve adapted your test loop to run all floats through the hex double formatter %a, and as you might expect given that it uses powers of 2 rather than 10, all of the formatted values mapped back to the original values (neglecting nans). I haven’t tested doubles, but gcc’s documentation suggests this is true for those as well.

Anyway, the test code: http://c.pastebin.com/gvW1sdA4
As it is C99, I don’t know if Visual Studio will accept this even if you do pretend it’s a C++ source.

```// This is C99 code. In GCC, compile with --std=c99
#include <stdio.h>
#include <math.h>

typedef union
{
float value;  // The float's value
unsigned int int_value;

struct
{
unsigned int mantissa:23;
unsigned int exponent:8;
unsigned int sign:1;
} sFloatBits; // internal float structure

} tFloatStruct;

int main() {
tFloatStruct fx;
unsigned long count=0, bad=0;
for (int sign=0;sign<2;sign++)
{
fx.sFloatBits.sign = sign;

for (int exponent=0; exponent<256; exponent++)
{
fx.sFloatBits.exponent=exponent;

for (int mantissa=0; mantissa< (1u<<24); mantissa++)
{
float new_fx, diff;
char sprintf_buffer[128];

fx.sFloatBits.mantissa=mantissa;
count++;

if (isnan(fx.value))
{
// we don't really care for
// NaNs, but we should check
// that they decode all right?
//
// but nan==nan is always false!
}
else
{
sprintf(sprintf_buffer,"%a",fx.value);

// check if invertible
//
sscanf(sprintf_buffer,"%a", &new_fx);

if (new_fx!=fx.value)
{
diff = (new_fx - fx.value);
printf("BAD: %g != %a by ~%g\n", fx.value, new_fx, diff);
}
}
} // for mantissa
printf("%a\n", fx.value);
} // for exp
} // for sign
printf("%lu bad out of %lu\n", bad, count);

}
```
• Steven Pigeon says:

That’s actually a pretty nifty way of exporting losslessly floats.

I don’t know how far Visual Studio is with supporting C99; I know I still hear friends complaining that it’s missing simple headers such as <stdint.h>. I know that in some cases, you can use third party header files. For %a, I would need to check.

To prevent the paste from expiring, I took the liberty of adding it to your post.

4. pete says:

I’m catching up here.

IEEE floats have an “implied bit” (http://en.wikipedia.org/wiki/Single_precision_floating-point_format), so it’s 23+1 bits = 7.2 digits, hence the need for 8 digits.

But note that even at eight digits, the decimal number is only an approximation: if a rational number described by a floating point number needs n “binary digits” to the right of the “binary point” to represent, then it will also need n decimal digits to the right of the decimal point; for example 1/16 = 0.0001xb = 0.0625, both have four “decimals” and this continues. [I have a more formal, algebraic proof, but informally 0.1xb is 0.5, and if you divide a decimal ending in 5 by 2, then–because its odd–you’re guranteed a digit to the right which is itself 5.] The upshot is the exact single-precision number used to represet the decimal 0.2 takes 24 digits to be written in decimal. Floats are evil.

I looked at the C-standard and can’t see portable way to make gurantees about rounding, beyond %a. Although there are ways to prevent the use of 80-bit numbers. (And in 64-bit mode you should be using the SSE regs, not the x87, so those kind of artifacts will be a thing of a past.)

• Steven Pigeon says:

That’s 24 bits precision counting the virtual most significant bit. That’s $\log_{10} 2^{24} = 24 \log_{10} 2 \approx 7.225$ digits, therefore 8. Good catch.