However, if the list is sorted (in ascending order, say) and the tree is not one of the self-balancing varieties, insertion is , because the “tree” created by the successive insertions of sorted key is in fact a degenerate tree, a list. So, what if the list is already sorted and don’t really want to have a self-balancing tree? Well, it turns out that you can build a(n almost perfectly) balanced tree in .
Let us make the simplifying assumption that we can have two types of nodes in the tree: leaves, that contains the actual data, and internal nodes (or just nodes for the remainder of this post) that holds only a key.
The first strategy that comes to mind, using this assumption, is to use a method reminiscent of how Huffman Codes are constructed. That is, we scan the original list from left to right: we take two nodes (if there are at least two left), remove them from the list, and insert back in their place a new node containing the key of the second list item (remember, the list is supposed to be sorted in ascending order), with the two original nodes (or leaves) as children. That is, the operation goes something like this:
(Notice the metaphor: leaves are green, internal nodes brown.) We simply re-scan the list until we have a single remaining node, which will be the root of the tree. The procedure is simple enough, indeed. Let us work out a full example with 5 leaves:
Then we do one pass of merges:
What do you notice? The three isn’t all that balanced. In fact, it’s really unbalanced. What went wrong? Well, if for Huffman Codes, this algorithm is optimal, for search trees, it is clearly not. For one thing, Huffman’s method wants to build trees with average path-lengths as close as possible to the source’s entropy; here our goal is to have path-lengths as equal as possible.
The key observation is that if the length of the list is a power of two, then the list can be successively merged together by the above algorithm and yield a tree that has exactly depth. Therefore, if we somehow modify the list to have exactly items in it, we’ll be able to have our optimal depth. Phase-In Codes provides the mean to do so. If
with the largest and with (thus imposing the uniqueness of the solution, with ), we can merge the first nodes, and it will result in a list of nodes ( of which are now internal nodes). Once we have merged the first nodes, we stop the iteration there and restart from the beginning of the list, which is now of length , and will yield the most-equal depth tree we wanted.
Again, with : , so , we take the two first nodes
and merge them:
and voilà, we proceed with the successive merges:
and we have a tree of almost equal depth, with average depth of (while the actual ) rather than , a considerable improvement, even for this trivial tree.
While it is not the customary way of representing a tree in memory, the segregation of leaves and internal nodes may be justified in the context where the leaves and the tree itself live in different memory locations. For example, if the leaves are rather large compared to just the key, they may have to reside in external memory—that is, the disk—while you will still like to keep the index, the tree structure itself, in memory.
One could also argue that this scheme is cache-friendlier as one can use some kind of compact tree storage to maintain the keys and the structure of the tree in a small amount of memory, and with keys that are nearby in the tree being laid-out nearby in memory, thus reducing cache misses quite a lot.
In all cases, we still have reduced the complexity of building the initial tree from to essentially ,