Safer Float Types

This week, we follow up on last week’s post, where we looked at type-safe integer constants, with floating point constants, that is, constants of type float and double.

In ISO/IEC 14882:2003 3.9.1 § 8, we learn that C++ knows three floating point types (which, together with the integral types, form the “arithmetic” types), namely float, double (which provides at least as much precision as float) and long double (which in turn provides at least as much precision as double). The same paragraph states that “The value representation of floating-point types is implementation-defined.” Therefore, you cannot assume they’re nice IEEE 754 floating point numbers. They can be just about anything.

Now, in part 2.13.3 § 1 of ISO/IEC 14882:2003, the standard says that a constant without a suffix has type double. If suffixed by f (or F), it has type float; if suffixed by l (or L), it has type long double. Therefore, the mess we had with integers (such as the automatic successive promotion of type from int to long int) doesn’t recur with floating point values. If a constant is specified with a higher precision than the receiving type, the value is quietly rounded to the receiving type. That is:

const float pi=3.14159265358979323846;

often doesn’t even result in a warning. The constant is simply demoted to float, typically to the nearest value float can represent. If we know the type is an IEEE 754 float (which the standard does not guarantee), we could round (not merely truncate) the constant to 7 digits past the point and suffix it with an f, thus 3.1415927f would be the correct constant (the truncated 3.1415926f lands on the next float down, which is farther from pi). But since we can’t be sure what exactly float is as a data type, we might as well use the maximum precision we can, and let the compiler generate the rounded version as it sees fit.

The header <float.h> (and its C++ counterpart, <cfloat>) contains a number of definitions to help the programmer use floats, but they’re not very useful. In C++, one should favor the use of the template std::numeric_limits<T>, which provides information on the type T, which can be any of the basic numerical C++ types. An expression such as std::numeric_limits<double>::max() returns the maximum finite value representable by double. Using this template, you can also determine whether or not the representation is signed, floating point, IEEE 754/IEC 559 compliant, etc.
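As a rough illustration of what std::numeric_limits can tell you, here is a small sketch. The function describe is a hypothetical helper written for this post, not a standard facility; every member it reads (radix, digits, digits10, is_signed, is_iec559, max()) is a standard member of std::numeric_limits.

```cpp
#include <cfloat>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: builds a one-line summary of what
// std::numeric_limits reports about the numeric type T.
template <typename T>
std::string describe(const char* name)
{
    typedef std::numeric_limits<T> lim;
    std::ostringstream out;
    out << name
        << ": radix=" << lim::radix              // base of the representation
        << " mantissa_digits=" << lim::digits    // digits in that base
        << " decimal_digits=" << lim::digits10   // decimal digits that survive a round trip
        << " signed=" << lim::is_signed
        << " iec559=" << lim::is_iec559          // true for IEEE 754 / IEC 559 types
        << " max=" << lim::max();                // largest finite value
    return out.str();
}
```

Calling describe<float>("float"), describe<double>("double") and describe<long double>("long double") and printing the results gives a quick picture of what the current platform actually provides. The values agree with the macros from <cfloat>; std::numeric_limits<float>::max() is FLT_MAX, for example.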

Because the standard doesn’t enforce any particular implementation for float, double, and long double, it cannot provide typedefs (or #defines, although they are thoroughly evil) such as float32_t because it may be that floats aren’t 32 bits at all. Maybe they’re 20. Or 40. Yes, I do agree with you that it is silly, since basically every implementation I know of uses IEEE 754 floats (except for CUDA, which also implements 16-bit IEEE 754-like half floats, a type of minifloat).

You could roll your own <stdfloat.h> header providing definitions similar to those of <stdint.h>, but you would have to make sure via the Makefile (or the autotools) that the header is generated correctly for the current architecture. Looking for a weekend project? Anyone?
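To give an idea of what such a header might contain, here is a minimal sketch that checks its assumptions at compile time instead of relying on the build system. Everything in it is hypothetical: the namespace stdfloat_sketch, the is_ieee helper and the typedef names are invented for illustration, and the static_assert lines assume a C++11 or later compiler. It only covers the common case where float and double are IEEE 754 single and double precision; on any other platform it refuses to compile rather than define misleading names.

```cpp
#include <climits>
#include <limits>

namespace stdfloat_sketch {

// Hypothetical compile-time test: is T an IEC 559 (IEEE 754) type that
// occupies exactly `bits` bits of storage and has `mantissa` significand
// digits (counting the implicit leading bit)?
template <typename T, int bits, int mantissa>
struct is_ieee {
    static const bool value =
        std::numeric_limits<T>::is_iec559 &&
        static_cast<int>(sizeof(T) * CHAR_BIT) == bits &&
        std::numeric_limits<T>::digits == mantissa;
};

// Stop the build if the assumptions do not hold on this platform.
static_assert(is_ieee<float, 32, 24>::value,
              "float is not an IEEE 754 32-bit type here");
static_assert(is_ieee<double, 64, 53>::value,
              "double is not an IEEE 754 64-bit type here");

// Only reached when the checks above pass.
typedef float  float32_t;
typedef double float64_t;

} // namespace stdfloat_sketch
```

Keeping the names inside a namespace avoids clashing with platform headers that may already define similar typedefs. A more complete version would pick among float, double and long double for each width rather than simply failing.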
