## Whatever sums your floats

While flipping the pages of a “Win this interview” book—just being curious, not looking for a new job—I saw this seemingly simple question: how would you compute the sum of a series of floats contained in a array? The book proceeded with the simple, obvious answer. But… is it that obvious?

IEEE 754 floats, the most likely implementation of the floats on your computer, aren’t as straightforward as one might think. First, they use a “scientific notation” representation of numbers, with a sign bit, a certain number of bits for the exponent, and some more bits for the mantissa, the “precision bits”. However, since they’re finite-precision numbers, they’re bound to exhibit all kind of error propagation when used in computations.

Even the simple question of the sum of a series, say a vector, of floats isn’t quite that simple. Indeed, we may be tempted to answer the question (how to compute the sum of a series of floats) with something like this:

float simple_sum(const std::vector<float> & v)
{
return std::accumulate(v.begin(),v.end(),0.0f);
}


with std::accumulate being defined in <algorithm>. It does what you expect: it sets a temporary to zero, then adds one by one the items of the collection to this temporary, and returns it when it’s done. This code is simple, container and type agnostic. It should be fine, except it isn’t.

To understand what’s wrong with this simple code, we must first understand how float addition works. As I mentioned earlier, a float is expressed as

$f= (-1)^s \times 2^{x-127} \times m$,

where $s$ is the sign bit, $x$ the exponent (stored as 8 bits) and $m$ the mantissa, that keeps only the 23 most significant bits of the number (well, 24, because the most significant bit being always 1 needn’t be stored).

When you add two numbers, if the two exponents are about the same, then the tow mantissas more or less align and the addition works out correctly. Neglecting exponents, and keeping only the 23 most significant bits:

   11000010110000001100000 ...
+      1111000001010000111000 ...
=  11010001110001011100111 ....


This isn’t very different of what we do when we compute long sums on paper; the only difference is that the least significant bits that are missing are taken as zeroes.

However, if the exponents differ somewhat more than 23, we find ourselves in a this situation:

   11000010110000001100000 ...
+                                 1111000001010000111000 ...
=  11000010110000001100000 ...


and the sum is not the sum of the two numbers, but only the greatest of the two! Now, imagine that in your collection, you have one extremely large number and the rest of the series is composed of smallish numbers. The naive sum isn’t the sum at all: it more or less computes the maximum of the series!

How can we work our way of this? The short answer is “we can’t”. The longer answer is that we must find an order to perform the sum in which only numbers of comparable magnitudes are added together. Sorting them (in increasing order) doesn’t seem like a bad start. Let’s see.

First, let’s create a troublesome array:

#include <cstddef> // size_t
#include <vector>
#include <random>

// for some weird reason, generators and
// distributions do not inherit from common
// ancestors; template is pretty much the
// way to do this.
template <typename D,typename G>
void fill(std::vector<float> & v, size_t nb,
D & distribution,
G & generator)
{
v.resize(nb);
for (auto & f : v )
f=distribution(generator);
}

....

int main()
{
std::default_random_engine generator;
std::exponential_distribution<float> exponential(0.125f);

std::vector<float> v;
fill(v,1000000,exponential,generator);
...
}


This fills the std::vector with exponentially distributed values with $\lambda=0.125$, what is, with an average of $\lambda^{-1}=8$, but with some very large values. This should provide a “troublesome” series.

We’ve already presented the code for the naive sum. Let’s show it again:

float simple_sum(const std::vector<float> & v)
{
return std::accumulate(v.begin(),v.end(),0.0f);
}


This, as mentioned earlier, just do the naive sum. But we now that if the next number to add is too small, it won’t be accounted for. So let’s pursue this idea we had earlier: sort them first then add them:

float sorting_sum(const std::vector<float> & v)
{
std::vector<float> t(v);
std::sort(t.begin(),t.end());
return simple_sum(t);
}


So let’s generate 1000000 values and compare the “as-is” sum and the “sorted sum”. Let’s repeat the experiment five times to see what variation there is. We might get something like:

 as-is sorted 7995938 7996406.5 8009019.5 8009517.5 8001316 8001631.5 8009463.5 8009750.5 7995282 7995591

As expected, the sum is close to the average times the number of samples, and, indeed, the sum is always close to 8000000. But they differ! All of them!

*
* *

So merely sorting isn’t enough. Hmm… What else could we do? What about adding, pair by pair, iteratively, until only one value remains? What if we do a “reduction” like this:

This should, if the values are evenly distributed, end up adding numbers of comparable magnitude and help precision. A straightforward implementation might look something like this.

float pair_sum(const std::vector<float> & v)
{
std::vector<float> t(v);

size_t l=t.size();
do
{
size_t d,s;
for (d=0,s=0;s+1<l;s+=2,d++)
t[d]=t[s]+t[s+1];

// deal with left-overs, if any
if (s<l)
t[d++]=t[s];

l=d;
}
while (l>1);

return t[0];
}


And, while we’re at it, let’s make a version that sorts (increasing order) the numbers before adding them pair by pair:

float sorted_pair_sum(const std::vector<float> & v)
{
std::vector<float> t(v);
std::sort(t.begin(),t.end());
return pair_sum(t);
}


*
* *

Let’s compare the results:

 as-is sorted pair sorted pair 7995938 7996406.5 7996777.5 7996777 8009019.5 8009517.5 8009789 8009789 8001316 8001631.5 8001917 8001917 8009463.5 8009750.5 8010206 8010206 7995282 7995591 7995851 7995852

If we examine the results, we see that pair-wise sums (either with or without sorting) are larger than the naive sums. This seems to indicate that more numbers are taken into account. But is the pair-by-pair sum really better? Hard to say. To compare correctly, we would need an arbitrary-precision library, but we can think it does a better job. With sorting, it come close to an Huffman-like algorithm where only the two nearest (value-wise) floats are added together. Maybe there are much better strategies. What do you think?

*
* *

The complete code: click to deflapulate.

#include <cstddef> // size_t
#include <vector>
#include <algorithm> // std::sort, std::accumulate
#include <iostream>
#include <iomanip>
#include <random> // for std::distributions...

// for some weird reason, generators and
// distributions do not inherit from common
// ancestors; template is pretty much the
// way to do this.
template <typename D,typename G>
void fill(std::vector<float> & v, size_t nb,
D & distribution,
G & generator)
{
v.resize(nb);
for (auto & f : v )
f=distribution(generator);
}

float simple_sum(const std::vector<float> & v)
{
return std::accumulate(v.begin(),v.end(),0.0f);
}

float sorting_sum(const std::vector<float> & v)
{
std::vector<float> t(v);
std::sort(t.begin(),t.end());
return simple_sum(t);
}

float pair_sum(const std::vector<float> & v)
{
std::vector<float> t(v);

size_t l=t.size();
do
{
size_t d,s;
for (d=0,s=0;s+1<l;s+=2,d++)
t[d]=t[s]+t[s+1];

// deal with left-overs, if any
if (s<l)
t[d++]=t[s];

l=d;
}
while (l>1);

return t[0];
}

float sorted_pair_sum(const std::vector<float> & v)
{
std::vector<float> t(v);
std::sort(t.begin(),t.end());
return pair_sum(t);
}

int main()
{
std::default_random_engine generator;
//std::uniform_real_distribution<float> uniform(0,1.0f);
std::exponential_distribution<float> exponential(0.125f);

std::cout << std::setprecision(10);

for (int i=0;i<20;i++)
{
std::vector<float> v;
fill(v,1000000,exponential,generator);

std::cout
<< std::setw(10) << simple_sum(v) << " "
<< std::setw(10) << sorting_sum(v) << " "
<< std::setw(10) << pair_sum(v) << " "
<< std::setw(10) << sorted_pair_sum(v) << std::endl
;
}
return 0;
}


### 5 Responses to Whatever sums your floats

1. M says:

In any situation I’ve ever encountered this, the following would have been fully acceptable.

float simple_sum(const std::vector & v)
{
return std::accumulate(v.begin(),v.end(),0.0);
}

2. Martin Capoušek says:

Just curious… how good (in this case) would the result be if we would use “double” as accumulator ?

3. Martin Capoušek says:

…or famous “Kahan summation” algo?
https://en.wikipedia.org/wiki/Kahan_summation_algorithm

• The problem would be reduced with double, but wouldn’t go away.

I didn’t know about Kahan’s summation algorithm (so, thanks for pointing that one out!). Ultimately, it seems to suffer from the same basic bug, as numbers must till be in similar ranges for it to work properly (it seem to asssume that only some digits are lost, and not all).

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

4. […] while ago, Martin Capoušek drew my attention on Kahan’s summation algorithm and ask how it compared to the other methods […]