Medians (Part III)

So in the two previous parts of this series, we have looked at the selection algorithm and at sorting networks for determining efficiently the (sample) median of a series of values.

In this last installment of the series, I consider an efficient (but approximate) algorithm based on heaps to compute the median.

A heap is an efficient tree-like data structure used to maintain, for example, priority queues. The basic invariant of a max-heap (where we are interested in knowing the largest value; there’s also a min-heap where we want to know the smallest value) is that, unless a leaf, an internal node contain a key that is larger than both its children’s keys. If this invariant is respected through all of the heap, then the root contains the maximum value contained in the heap (and, respectively for a min-heap, the minimum value). A max-heap would look something like this:

The best part is that you can make any array a heap in linear time. But, how does that help us for the median? Well, we could think of a med-heap, where, at every node (that is not a leaf), the invariant is that the node has a key that is the median amongst itself and its two children! A (very) basic med-heap class would look something like:

template <typename T>
void cmpexchg(T & a, T & b) 
 { 
  using std::swap; // ADL safe
  if (a>b) swap(a,b); 
 }

template <typename T>
const T & median3( T & a, T & b, T & c )
 {
  cmpexchg(a,c);
  cmpexchg(a,b);
  cmpexchg(b,c);
  return b;
 }

template <typename T>
class med_heap
 {
 private:

  std::vector<T> & heap;

  int left_child(int i) { return 2*i+1; }
  int right_child(int i) { return 2*i+2; }
 
  void heapify()
   {
    for (int current=heap.size()/2-1;current>-1;current--)
     {
      int left=left_child(current);
      int right=right_child(current);

      // check if it has two children
      // (otherwise give up)
      //
      if ( (left < heap.size()) && 
           (right < heap.size()))
       {
        T & a = heap[current];
        T & b = heap[left];
        T & c = heap[right];

        const T & med = median3(a,b,c);

        if (b==med)
         std::swap(a,b);
        else
         if (c==med)
          std::swap(a,c);
         else
          ; // a is already the median
       }
      else
       ; // has only one child, so
         // already "median"
     }
   }

 public:

  size_t size() const { return heap.size(); }

  // lets you peek at the next
  const T & median() const
   {
    return heap[0];
   }


  med_heap(std::vector<T> & v)
   : heap(v)
   {
    heapify();
   }
 };

Note that the class does not copy the vector, it references an already existing one (this not only avoids computing the time for allocation and copy, it is also fair to select, as it also only uses a reference to a vector).

So here the magic happens in heapify(). Using the addressing described in a previous post, it becomes simple to scan the array from the middle backwards to the beginning and enforce at each step the invariant. This takes “small” linear time, because at each entry, we only need to examine three values.

The problem is that, the median of median is not the median of the whole data, and that there will be imprecision in exchange of speed. That may be a good trade-off, as we will see.

*
* *

OK, the real contenders so far are select and the med-heap. We will compare to the stl::sort algorithm since is a bona fide comparison as 1) simple to use and 2) a likely solution for someone not wanting to reinvent the wheel (or not knowing about selection). The following shows times to find the median in an array of 1000 entries (with 1000 instances of the problem, the same for all three methods), with values on 0…9999:

From the box plots, we see that the stl::sort fares worse (in fact a lot worse) than the two other alternatives. The med-heap is also significantly faster than select, 2-3× faster.

What about accuracy? Looking at the distribution of the errors, we see that 50% of the times, the relative error is within ±5%, and 95% of the time it is less than ±20%. This is distribution seems to be rather indifferent to the range of the values, for example, with a range of 0…255, you get more or less the same results.

*
* *

In essence, therefore, you trade off accuracy (±5% error 50% of the time) for a rather interesting speed-up (2-3×) over an exact algorithm such as select, which is an interesting results on its own. Now, whether or not a med-heap is adequate for your needs is another story. You could argue that for some applications, it is necessary to have the exact algorithm, and you could make the case where, for another application, the ±5% error 50% of the time is unimportant or unnoticeable.

*
* *

Here, we only have a rather sketchy implementation of a med-heap that provides only the fun part as far as knowing the median is concerned, but we could just as easily as with a min- or max-heap, provide the necessary functions to pop the median, insert, and remove, values from the med-heap. In fact, it would be exactly the same code as with a min- or max-heap, but for the median instead of min or max.

*
* *

Full Test Code here.

*
* *

This is the 300th post.

*
* *

The STL function nth_element seems to be doing a good job at select (and a better one than me), but is still much slower than the med-heap:

7 Responses to Medians (Part III)

  1. Keegan says:

    You could just use the nth_element function.

    nth_element(v.begin(), v.begin() + (v.size() / 2), v.end());
    cout << v[v.size() / 2] << endl;

    • Steven Pigeon says:

      I have added the nth_element results, see at the end of the post (for some reason wordpress doesn’t feel like letting me insert the image in the post)

      So the global results remain the same; nth_element uses a variation on selection (better implementation than my own; obviously). Also subject to variance in performance, nth_element seems about ×2 slower than med-heap, about 20% faster than my version of select. sort is still not an option. Conclusion: if you are to compute the exact median, use nth_element, otherwise use a med-heap.

  2. twolfe18 says:

    the error in your median statistic will depend on the underlying probability distribution of values. you should try different distributions and see how it affects your error.

    • Steven Pigeon says:

      Yes. I do mention they’re uniformly drawn. I suspect that the error goes down quite a lot when the distribution is peaky (a.k.a. leptokurtic). I am sure we can arrive at a fairly good estimation of the expected error using statistical/combinatorial analysis.

  3. Gregory says:

    Steven,

    Both two links at the very beginning points to the same part III page instead of part I and part II

    • Steven Pigeon says:

      Thanks. Fixed.

      I think workpress eats links once in a while if they’re not exactly in the right canonical format.

  4. […] visited the median problem a couple of times already. The median is the value in the middle of a sorted list of values. If the list is already sorted, […]

Leave a comment