## Building a Balanced Tree From a List in Linear Time

The usual way of forming a search tree from a list is to scan the list and insert each of its element, one by one, into the tree, leading to a(n expected) run-time of $O(n \lg n)$.

However, if the list is sorted (in ascending order, say) and the tree is not one of the self-balancing varieties, insertion is $O(n^2)$, because the “tree” created by the successive insertions of sorted key is in fact a degenerate tree, a list. So, what if the list is already sorted and don’t really want to have a self-balancing tree? Well, it turns out that you can build a(n almost perfectly) balanced tree in $O(n)$.

Let us make the simplifying assumption that we can have two types of nodes in the tree: leaves, that contains the actual data, and internal nodes (or just nodes for the remainder of this post) that holds only a key.

The first strategy that comes to mind, using this assumption, is to use a method reminiscent of how Huffman Codes are constructed. That is, we scan the original list from left to right: we take two nodes (if there are at least two left), remove them from the list, and insert back in their place a new node containing the key of the second list item (remember, the list is supposed to be sorted in ascending order), with the two original nodes (or leaves) as children. That is, the operation goes something like this:

(Notice the metaphor: leaves are green, internal nodes brown.) We simply re-scan the list until we have a single remaining node, which will be the root of the tree. The procedure is simple enough, indeed. Let us work out a full example with 5 leaves:

Then we do one pass of merges:

Then another:

…and finally:

What do you notice? The three isn’t all that balanced. In fact, it’s really unbalanced. What went wrong? Well, if for Huffman Codes, this algorithm is optimal, for search trees, it is clearly not. For one thing, Huffman’s method wants to build trees with average path-lengths as close as possible to the source’s entropy; here our goal is to have path-lengths as equal as possible.

Fortunately, we can “repair” the algorithm quite easily, and it’s another code that will come to our aid: Phase-In Codes, which have average code lengths close to $\lg n$ (as I show here).

The key observation is that if the length of the list $n$ is a power of two, then the list can be successively merged together by the above algorithm and yield a tree that has exactly $\lg n$ depth. Therefore, if we somehow modify the list to have exactly $2^k$ items in it, we’ll be able to have our optimal $O(\lg n)$ depth. Phase-In Codes provides the mean to do so. If

$n = 2^k+b$

with the largest $k\in\mathbb{N}$ and with $b\in\mathbb{Z^*}$ (thus imposing the uniqueness of the solution, with $0\leqslant{b}<{2^k}$), we can merge the first $2b$ nodes, and it will result in a list of $2^k$ nodes ($b$ of which are now internal nodes). Once we have merged the first $2b$ nodes, we stop the iteration there and restart from the beginning of the list, which is now of length $2^k$, and will yield the most-equal depth tree we wanted.

Again, with $n=5$: $n=2^k+b=2^2+1$, so $b=1$, we take the two first nodes

and merge them:

and voilà, we proceed with the successive merges:

and then

and we have a tree of almost equal depth, with average depth of $^{12}/_{5}=2.4$ (while the actual $\lg 5 \approx 2.32$) rather than $^{13}/_{5} \approx 2.6$, a considerable improvement, even for this trivial tree.

*
* *

While it is not the customary way of representing a tree in memory, the segregation of leaves and internal nodes may be justified in the context where the leaves and the tree itself live in different memory locations. For example, if the leaves are rather large compared to just the key, they may have to reside in external memory—that is, the disk—while you will still like to keep the index, the tree structure itself, in memory.

One could also argue that this scheme is cache-friendlier as one can use some kind of compact tree storage to maintain the keys and the structure of the tree in a small amount of memory, and with keys that are nearby in the tree being laid-out nearby in memory, thus reducing cache misses quite a lot.

In all cases, we still have reduced the complexity of building the initial tree from $O(n \lg n)$ to essentially $O(n)$,

### 6 Responses to Building a Balanced Tree From a List in Linear Time

1. Jimmi says:

I think if you have a list of sorted items, you can probably create an AVL-tree in O(N), which is actually self-balancing. The trick is an O(1) method that picks the root item from a list of length N, then you can recoursively create the left and right subtrees too.

• Steven Pigeon says:

If you have an array-list, you indeed have your $O(1)$ access to the pivot and the problem simplifies quite a lot. I suppose we could do the same with a skip-list. If you have an ordinary linked-list, you cannot do much better without using extra memory… basically building only the top levels of the skip-list. The resulting algorithm would be still something of $O(n \lg k)$, where $k$ is the size of a chunk that is using only regular-linked list (the structure would allow access to $O(n/k)$ entry points in the list in at worst $O(n/k)$.

2. Jimmi says:

Well, if the subtree-building function not only builds subtrees, but also tells the last link it used to do that, and you can calculate the pivot index in O(1), then you can still do it in O(N) with linked lists: the recursive function should process the left subtree first, then extract pivot, then process right subtree. With this approach extracting pivot element also takes O(1), and the whole algorithm runs in O(N) for linked lists too.

3. […] a while ago, I proposed a linear time algorithm to construct trees from sorted lists. The algorithm relied on the segregation of data and internal […]

4. Very interesting research! Just one concern. At least, we have to know the length of source list before processing. In most cases, we can calculate tho length in O(n) time (this will be the first pass). So, keeping this fact in mind, instead of calculating the length of the list we can convert it into the array and then end up with very simple and straightforward algorithm (with random access to an array). Anyway, this is still an interesting approach for the functional settings, where we don’t have arrays.

• Yes, I explored that type of addressing in a previous post. However, if that scheme is interesting for heaps (where the ordering invariant is different), it is not that useful for a tree that wil have items added or deleted, because 1) the array will have to be resized and 2) insertion will need large portion of the array being copied.