Medians (Part IV)

I’ve visited the median problem a couple of times already. The median is the value in the middle of a sorted list of values. If the list is already sorted, that’s not much of a problem. If it isn’t, then we must use some efficient way to do it.

If we sort the list (or array), the cost is O(n \log n) if we use a comparison sort (or O(k n) if we use an address-transformation sort like the radix sort). If we use selection, that’s O(n) on average, but we still have, just as Quicksort, a O(n^2) worst case. Heuristics can be much faster, but do not guaranty an exact result.

So far, I’ve only considered the general case where the range of allowable values is much greater than the length of the list. But if the range is small, say 0-255, and the list is long, certainly we can do better.

Indeed! Remember, there’s a thing called counting sort, were you count the number of times each value appears then just fills the destination with the right number of copies for each. This is \Theta(n) since we need n steps to scan the list and update counts (since the range is limited, we can use a direct index and the update operation (++) is constant-time), then n+r steps to scan the r different counts and make n fill-copies.

Turns out that this is pretty much what we need to do if we’re interested in the median, except for the last step, the fill-copy. If we scan the list and update the r counts (for r different values), the sum of the counts is the length of the list (since each item of the list ends up in one of the r slots). We just have to scan these counts until we’ve seen half of the items in the original list. This is \Theta(r).

In code, that’d look something like this:


template <int range, typename T>
T counting_median(const std::vector<T> & v)
{
  std::vector<size_t> counts(range,0);
  for (const T & vv: v)
   counts[vv]++;

  size_t sum=0;
  size_t half=(v.size()+1)/2;
  for (size_t i=0;i<counts.size();i++)
   if (sum+counts[i]>half)
    return i;
   else
    sum+=counts[i];

  return v[0]; // prevents "warning: control reaches end of non-void function"
}

The first part of the function counts the number of instances of the values in the range. This implementation supposes that everything is non-negative. We could use a std::map, but the complexity would become O(n \log r). The second part adds each count until the sum reaches or exceeds half the number of items. Then it returns the index where it stopped counting.

*
* *

How much faster does that version run compared to the others? The other implementations are:

  • std::sort, that …sorts the whole thing.
  • selection, that we’ve seen before
  • std::nth_element, that basically do selection, but somewhat better than my implementation.
  • The median heap, an heuristic.
  • The counting median.

To compare fairly, the same vector of length 10000 filled with random values on the 0-999 range is ran through all alternatives. The experiment is repeated 1000 times. This gives the following results:

The counting median is therefore much faster than the other, more general, methods. Of course, we get a speed-up because we exploit a special case, where the number of different values is small compared to the length of the list. But hey, why not?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: