Binary Trees are optimal… except when they’re not.

The best case depth for a search tree is O(\log_b n), if b is the arity (or branching) of the tree. Intuitively, we know that if we increase b, the depth goes down, but is that all there is to it? And if not, how do we chose the optimal branching b?

measuring-tree

While it’s true that an optimal search tree will have depth O(\log_b n), we can’t just increase b indefinitely, because we also increase the number of comparisons we must do in each node. A b-ary tree will have (in the worst case) b-1 keys, and for each, we must do comparisons for lower-than and equal (the right-most child will be searched without comparison, it will be the “else” of all the comparisons). We must first understand how these comparisons affect average search time.

If we use a naïve comparison scheme, each keys ask for two (potentially expensive) comparisons. One to test if the sought value, v is smaller than a key k; one to test if they’re equal. If neither tests is true, we move on to the next key. If none of the tests are true, it is an “else” case and we go down the rightmost branch. That scheme does way more work than necessary and would ask for 2(b-1) comparison per node (because there are b-1 keys per node).

The main problem with this straightforward approach is that comparisons can be very expensive. While comparing two integral types (int and the like) is often thought of as “free”, comparing string is expensive. So you clearly do not want to test two strings twice, once for < and once for =. The much better approach is to use a three-way comparison, also known as the “spaceship operator”.

The spaceship operator, <=> is C++20’s version of C’s old timey qsort comparison function (it is in fact much better because it also automagically provides all comparison operators). Basically, a<=>b can return <0 if a<b, 0 if they’re equal, and >0 if a>b. We can therefore store the result of the expensive comparison, then do inexpensive comparisons for the result. That reduces the number of expensive comparisons to b-1 per node.

The search complexity, counting the number of comparisons in the worst case for the best-case tree is

\displaystyle (b-1)\log_b n=(b-1)\frac{\ln n}{\ln b}=\frac{b-1}{ln b}\ln n,

a strictly increasing function in b>1 (as we can treat \ln n as a constant, since we’re not optimizing for n):

search-cost

We therefore conclude that b=2 is the optimal solution.

Except when it isn’t

We have neglected, or rather abstracted, something important: the cost of accessing the node. In main memory, it’s very convenient to suppose that reading a node is free, that is, incurs no cost. That’s of course false, because even reading from cache incurs a delay; fortunately very small. It is even more false if we read the node from disk (yes, even NVMe). A classical spinning rust disk reading a block will cause a few millisecond delay, that may be really, really expensive relative to the comparison of two keys (once they’re) in memory.

The new cost function to optimize will be

\displaystyle ((b-1)c_1+c_2)\log_b n=\frac{(b-1)c_1+c2}{\ln b}\ln n,

with c_1 the comparison cost, and c_2, the cost of accessing the node. We can set c_1 to 1 and make c_2 relative to c_1. Let’s say something like c_2=100. Now, luckily for us, this function is convex in b, so we can solve for the optimal b given c_1 and c_2. To do so, we take the derivative and find where it is zero, that is, solve

\displaystyle \frac{\partial}{\partial b}\frac{(b-1)c_1+c_2}{\ln b}=\frac{b(\ln b-1)c_1+c_1-c_2}{b(\ln b)^2}=0

for b. Since the denominator is strictly increasing in b>1, we must solve for a zero numerator. However, there’s a slight complication:

\displaystyle b(\ln b-1)=\frac{c_2}{c_1}-1

isn’t algebraic. We must use a numerical solution to the Lambert W Function. It gives us, with u=\frac{c_2}{c_1}-1,

\displaystyle b^*=\frac{u}{LambertW(\frac{u}{e})}.

The following graph shows the function’s surface with c_1=5 and c_2=200, one axes being b, block size and the other being n, the number of items in the tree. The green plane shows the optimal b found with the Lambert W function.

search-cost-with-blocks

*
* *

To conclude, binary trees are optimal when we ignore the cost of accessing a node, but they aren’t when it becomes costly to access a node. When we access the nodes on disk, with a high cost, it becomes interesting to bundles many keys in a node, and we gave the optimal solution. However, the problem is often solved quite more directly. We just fit as many keys as possible in an I/O block. Disks operates on small blocks of (typically) 512 bytes, but file systems operate in somewhat larger, but fixed-sized blocks, something like 4 or 8k. A good strategy is therefore to fit as many keys as possible in that block, since even if the number of comparisons is large, it will still be faster than bringing that block into main memory.

13 Responses to Binary Trees are optimal… except when they’re not.

  1. ajksdflksadf says:

    Hello,

    It seems like you can do way better than 2(b-1) comparisons even if you insist on comparing to every element. Can you fully specify what the search should return? In particular, if many elements of the tree compare equal to the query, which should be returned?

    Assuming that it’s ok to find any of the elements that compare equal to the query, if you have xs[1] <= needle (equivalently, !(neelde < xs[1])) and xs[2] <= needle and xs[1] <= xs[2] (this is sort of a given), you don't need to check for xs[1] == needle. You could afford to check only the greatest element of the node that is <= query for equality.

    • I think that multiplicities (many items with the same keys) are a somewhat different thing. I would probably manage that by having a node with the key and a list of objects or a counter. If they are distinguishable, then it means the key used for the tree is partial?

      But otherwise, I think that would work: it would be (b-1) per node, and O((b-1)lg n) in the average case (not just worst case) even for perfect trees. But it would be a bigger O(lg n). We should work out the specifics, but it’s probably something like (multiplicites*lg n).

  2. thethiefmaster says:

    Your article is on reddit (I don’t think posted by you) https://www.reddit.com/r/cpp/comments/op79gh/binary_trees_are_optimal_except_when_theyre_not/

    I made some comments there.

    • Then I’ll comment here:

      2 a) SIMD compares are cute for byte, words or 64 bits integers but do not work as well for abstract type T with operator . Think of just strings.

      3 a) disk blocks are 512 bytes, typical IO block IS 512 bytes. But *filesystems* groups them and manage larger block. That’s what actually written.

      • thethiefmaster says:

        For 3: we’ve moved to 4k disk blocks a number of years back. All SSDs use 4k pages, and “AF” (Advanced Format) hard-disks do too. 512 byte sectors is legacy.

        • …or not. On my box, with nvme storage:

          /home/steven> sudo blockdev –getsize64 /dev/nvme0n1p3
          65861058560
          /home/steven> sudo blockdev –getsize /dev/nvme0n1p3
          128634880

          The first gives total size, the second, the number of sectors. Ratio: 512.

          But anyway, the point was mainly that basic device IO block and filesystem blocks not not necessarily match, and usually, FS blocks are larger.

          • thethiefmaster says:

            Compatibility causes funny things – –getsize is in 512 byte sectors even if that’s not accurate for the device! Try –getpbsz (get physical sector size)

            • 512 ¯\_(ツ)_/¯

              • thethiefmaster says:

                Huh.

                The other options that can sometimes hint at it are –getiomin and –getioopt

                But… maybe you do have a 512 byte sector SSD ¯\_(ツ)_/¯


                • > sudo nvme list
                  [sudo] password for steven:
                  Node SN Model Namespace Usage Format FW Rev
                  ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
                  /dev/nvme0n1 1928823000012856535F Force MP600 1 2.00 TB / 2.00 TB 512 B + 0 B EGFM11.1

                  yup.

                  • thethiefmaster says:

                    Just noticed that’s a Force MP600 – we have those at work. This is what Windows fsutil reports for it for us:

                    PS C:\windows\system32> fsutil fsinfo sectorInfo d:
                    LogicalBytesPerSector : 512
                    PhysicalBytesPerSectorForAtomicity : 4096
                    PhysicalBytesPerSectorForPerformance : 4096
                    FileSystemEffectivePhysicalBytesPerSectorForAtomicity : 4096
                    Device Alignment : Aligned (0x000)
                    Partition alignment on device : Aligned (0x000)
                    No Seek Penalty
                    Trim Supported
                    Not DAX capable
                    Not Thinly-Provisioned

                    4k physical sectors.

                    Something seems broken with the data reported by your system. The MP600 emulates 512 byte sectors (like most modern 4k drives) for compatibility with systems that don’t understand 4k sectors, but it ends up turning 512 byte sector writes into a 4k read+modify operation so it’s less than ideal.

  3. Moonlit says:

    Can you provide clarification on ” A good strategy is therefore to fit as many keys as possible in that block”. I’m guessing you mean, fit the top-branches onto one block until you can fit the remaining sub-tree onto one or two.

    • If accessing a single key cause you to read from disk, you might as well have this expensive read read a bunch of keys. An no, not necessarily just top branches, but at all level of the tree. I guess there are many ways to go about it (say, have a bunch of nodes in the same block or nodes with great many branches (and therefore great many keys))

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: