Medians (Part I)

In a previous installment, about filtering noise, we discussed how to use a moving average to even out irregularities in a sequence. But averaging over a window is tricky: first, the window size must make sense in terms of the speed at which the signal changes (in samples); second, the average is overly sensitive to outliers.

One way to limit the influence of outliers when denoising is to use the median. However, computing the median is usually trickier than computing the average, or mean, and this first post (in a series of three, over the next few weeks) discusses how to compute the median efficiently using the selection algorithm.

*
* *

Before continuing, let us say a word on why the average is sensitive to outliers. There’s of course the intuitive explanation: if you have three values, say, 1, 2, and 10, then the average is \approx 4.3, which is rather far from the majority of the samples, which are grouped around 1 and 2. A less intuitive, but also rather straightforward, reason is that the sample average \bar{x} is the solution to the least-squares problem

\bar{x}=\arg \min_{x} \sum_{i} (x_i-x)^2

that is, it is the value that minimizes the squared error between itself and all the sample values. If there’s an exceedingly large x_i, it will pull the average towards itself quite a lot.

The median, say x_{m}, is, on the other hand, the solution to the equation

x_m=\arg \min_{x} \sum_{i} |x_i-x|

that is, the value that minimizes the sum of absolute differences. An overly large x_j will not pull on the solution at all if we use the sample median, which is defined as the middle x_j when all the x_i are sorted. Basically, the median (mostly) ignores outliers, which is something we might want for filtering.

*
* *

Since we’re interested in the sample median, we must find a way of finding the x_j that occupies the middle rank. Usually, the median is defined as the value at position \lceil n/2 \rceil if n is odd, and n/2 if n is even (or sometimes the average of the two samples lying at the center is used, but that’s a minor detail for what we will discuss).

A first idea would be to sort the list, in O(n \lg n) using a robust sorting algorithm such as Quicksort, or even in linear time if one can use Radix Sort. But Radix Sort is a “big” linear-time algorithm because its cost depends on the size of the machine-specific integers you’re using as keys. And wouldn’t completely sorting the list be overkill anyway, since we’re not interested in sorting all the values, just in getting the middle one, once the list is (sufficiently) sorted?

So how do we sort an array “just enough” to extract the median value?

Well, it turns out that Quicksort will lend us quite a hand here. Remember, Quicksort sorts quickly by dividing the array into two sub-arrays, the one on the left containing all values smaller than the pivot (a value picked within the array), and the one on the right containing values larger than or equal to the pivot. If the pivot is chosen wisely, both sub-arrays should be roughly equal in length. Recursing on each sub-array does the same thing again, all the way down to arrays so small they contain only one element. At that point, when recursion has visited all possible sub-arrays, the array is sorted and the algorithm terminates.

So what if we only sorted one side, the side that might contain the median? The idea is to use the Quicksort partition algorithm (and there is more than one, but that’s more or less irrelevant right now) and look at the two resulting sub-arrays. Is the median position, k=\lceil n/2\rceil, somewhere in the left or the right sub-array? If it’s in the left, we reapply the partition only on the left sub-array; if it’s in the right, we reapply it only on the right sub-array. Again, we obtain two sub-sub-arrays, and we check whether k is in the left sub-sub-array or the right sub-sub-array. We continue splitting until we create a sub-sub-…-sub-array of size one, at position k: it contains the median value!

A minimal C++ implementation would probably look something like this:

////////////////////////////////////////
//
// Implements the generic selection
// algorithm loosely based on the
// QuickSort partition algorithm
// (more or less any of the QuickSort
// partition algorithms will do just
// fine)
//
// (it may be due to Wirth, from his
// book algorithms + data structures
// = programs)
//
// NOTE: array v gets modified in the
// process!
//
template <typename T>
const T & select( std::vector<T> & v,
                  int select_index)
  {
   // check if lo_index <= select_index <= hi_index
   // ...eventually
   //
   int lo_index=0;
   int hi_index=v.size()-1;

   while (hi_index>lo_index)
    {
      // classic mid-point
      // pivot selection (greatly
      // reduces the chances of n^2
      // behavior if v is
      // nearly sorted, as it would
      // be after a few calls to select)
      //
      const T pivot = v[lo_index+(hi_index-lo_index)/2];

     int x=lo_index;
     int y=hi_index;

     // basic pivot/partition
     // algorithm
     //
     while (x<=y)
      {
       while (v[x]<pivot) x++;
       while (v[y]>pivot) y--;
       if (x<=y)
        {
         std::swap(v[x],v[y]);
         x++;
         y--;
        }
      }

     // check on which side of
     // the partition lands the
     // select_index, so that we
     // iterate only on the half
     // that contains the select_index
     //
     if (select_index >= x)
      lo_index=x;
     else
      hi_index=x-1;
    }

    // when the loop exits, lo_index==hi_index
    // and v[select_index] holds the value of
    // rank select_index
   return v[select_index];
  }

You would invoke the above using, say

std::vector<T> v;

//...fill v....

T med = select(v, (int)v.size()/2);

for some type T.

*
* *

The above method has an expected run-time of O(n) but, just like Quicksort, it can run as high as O(n^2), which is quite bad. Choosing the pivot in the middle of the current sub-array reduces this risk quite a lot (and it would take quite an evil array to push it back to O(n^2) behavior). Furthermore, each time select is called on the array, it gets more and more sorted, making the choice of the pivot better and better over time.

It also swaps elements multiple times (but only a number of times expected to be linear in n) and, depending on the size of the array and the partition method (as I said, there are many variants; the one above is essentially Hoare’s, but there is also Lomuto’s method, discussed in Cormen et al.‘s Introduction to Algorithms), it might not be cache-friendly at all.
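For comparison, here is a minimal sketch of Lomuto’s partition scheme (the function and variable names are mine): it pivots on the last element of the range and returns the pivot’s final, sorted position.

```cpp
#include <utility>
#include <vector>

// Lomuto's partition: pivots on v[hi] and returns the
// position where the pivot ends up, with everything
// smaller to its left and everything else to its right
int lomuto_partition(std::vector<int> & v, int lo, int hi)
 {
  const int pivot = v[hi];
  int i = lo;
  for (int j = lo; j < hi; j++)
   if (v[j] < pivot)
    std::swap(v[i++], v[j]);
  std::swap(v[i], v[hi]);  // put the pivot in place
  return i;
 }
```

Selection with this partition proceeds just as before: if the returned position equals k, we are done; otherwise we iterate only on the side that contains k.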

In the next post, we will examine other sort-like algorithms to obtain the median, maybe at lower cost.

To Be Continued…
