Building a Tree from a List in Linear Time (II)

Quite a while ago, I proposed a linear time algorithm to construct trees from sorted lists. The algorithm relied on the segregation of data and internal nodes. This meant that for a list of $n$ data items, $2n-1$ nodes were allocated (but only $n$ contained data; the $n-1$ others just contained pointers.

While segregating structure and data makes sense in some cases (say, the index resides in memory but the leaves/data reside on disk), I found the solution somewhat unsatisfactory (but not unacceptable). So I gave the problem a little more thinking and I arrived at an algorithm that produces a tree with optimal average depth, with data in every node, in linear time and using at most $\Theta(\lg n)$ extra memory.

The hard part to figure out with this algorithm is that you must create the tree starting at the root, do an in-order scan of the generated tree, but still scan the list from left to right. Turns out, however, that it’s not as complicated as it sounds.

The root (if the list is sorted) will contain list element number $\lceil n/2 \rceil$ (C++: (n+1)/2), thus creating two sub-lists: one with elements from 0 to (n+1)/2-1, and the other with elements (n+1)/2+1 to n-1 (we assume that indexing starts at zero). The root of the left sub-list will be at the middle: again rounding, excluding the root, and creating two sub-lists. We split each list recursively until we have a degenerate list of one item; where we stop.

The best way of convincing oneself that the algorithm works properly is to implement it. Here I will abstract the copying from the list (or array-list, which would be more convenient). The minimal code would look like this:

////////////////////////////////////////
typedef int (*pivot_function_t)(int,int);

////////////////////////////////////////
class tree_node
{
public:
int x;
tree_node *left, *right;

tree_node( tree_node *_left, int _x, tree_node *_right)
: x(_x),
left(_left),
right(_right) {}
};

////////////////////////////////////////
template <pivot_function_t pivot>
tree_node * make_tree(int l, int h, int & src)
{
if (l==h)
// degenerate list, a leaf!
return new tree_node(nullptr,l,nullptr);
else
{
int p=pivot(l,h);

return
new tree_node( (p-1>=l) ? make_tree<pivot>(l,p-1,src) : nullptr,
p, // here we would copy from list or array[src++]
(p+1<=h) ? make_tree<pivot>(p+1,h,src) : nullptr);
}
}


where the pivot function returns an integer that tells were to split the list. Possible functions:

//////////////////////////////
int pivot_round(int l, int h) { return (l+h+1)/2; }

//////////////////////////////
int pivot_random(int l, int h) { return l+(std::rand() % (h-l+1)); }

//////////////////////////////
int pivot_fibonacci(int l, int h)
{
static const int fibo[]=
{ 1,1,2,3,5,8,13,21,34,55,89,144,
233,377,610,987,1597,2584,4181,
6765,10946,17711,28657,46368,
75025,121393,196418,317811,};

int d=h-l;
int i=0;
while (fibo[i]<d) i++;

return l+fibo[i-1];
}


The pivot_round computes what we want. The others are given for comparison, with pivot_random being maybe indicative of a tree constructed from random insertions over time—so it might not be as stupid as it seems at first glance. The pivot_fibonacci is rather fanciful here, but, eh, why not.

To benchmark the quality of the tree, we will use the average node depth. First, we must find the optimal average depth for a tree with $n \sim 2^m+k$ nodes. With a bit of algebra, we find that the average depth $\bar{d}$ is given by:

$\displaystyle \bar{d}=\frac{2^m(m-1)+1+(k+1)(m+1)}{2^m+k}$.

(Don’t worry, I have an upcoming post that explains how to derive this expression.) With a formula for $\bar{d}$, we can launch experiments creating trees with arbitrary number of nodes and compare performance in terms of average depth vs the optimal average depth.

With $n$ running from 1 to 1023 (a tree with zero nodes isn’t very interesting!), we obtain the following graph:

The black (optimal) and green (center, or rounded split) overlap perfectly because the center split yields optimal average depth. The Fibonacci split isn’t as bad as I expected, but it yields trees with deeper branches that strictly needed, but not much (maybe 1.6 more?). The random pivot, shown in red, fluctuates wildly, and so is displayed with a moving average (with a window of 100). It does a lot worse than the other methods, but, again, not as much as I would have thought at first.

*
* *

Let us now have a look at what shape the trees actually are. If you know the average depth is optimal you kind of expect the trees to look all nifty like this one:

(The numbers in the nodes represent the index of the element in the original sorted list.) That’s the base case tree with $n\sim 2^m-1$ (and that indeed we find with the center split: all trees shown here are svg made from the GraphViz representation of the trees generated by the code above). What if we have an “inconvenient” $n$? Say $n=12$? Well, it’s not what you expect:

but the average depth is still optimal.

*
* *

So, that’s the $\Theta(\lg n)$ extra memory I was speaking of at the beginning? Simply the stack. Unlike other recursive algorithms such as Quicksort where you can split the current list in two very uneven sub-lists, picking the pivot in the center creates the most even sub-lists possible, and that, in turn, guarantees that the recursion depth is $\Theta(\lg n)$ (or, “exactly $\lg n$“). There’s a hidden constant (for whatever the storage of one stack frame actually is), but the “big O” notation lets us get away with it.

*
* *

And there, you have a linear time (each node/list item is visited exactly once) with $\Theta(\lg n)$ extra storage.

One Response to Building a Tree from a List in Linear Time (II)

1. Reblogged this on justanotherhumanoid.