Programmers aren’t always the very rational beings they please themselves to believe. Very often, we close our eyes and make decisions based on what we think we know, and on what we have been told by more or less reliable sources. Choosing red-black trees over AVL trees because they are faster, for example, while not being able to quite justify in detail why it must be so. Programming on the basis of this kind of decision is what I call cargo cult programming.
Originally, I wanted to talk about red-black vs. AVL trees and how they compare, but instead I’ll talk about the STL std::map, which is implemented using red-black trees in G++ 4.2, and std::unordered_map, a hash-table-based container introduced in TR1.
The TR1 std::unordered_map is a map that does not maintain any particular order between the keys it contains; it is therefore implemented as a hash table. The std::map is implemented, at least in G++ 4.2, as a red-black tree. Self-balancing trees change their shape with each insertion in order to keep most leaves at an equal depth, ensuring O(log n) access time. The difference between an AVL tree and a red-black tree lies in how the tree is rebalanced: the red-black tree performs about half as many rebalancing operations as an AVL tree on insertion, making it somewhat faster; how much faster remains to be quantified.
So I ran a few simple experiments to compare std::unordered_map and std::map. First, I got hold of the Zingarelli word list, containing some 585,000 Italian words (I can’t find a working URL, but I got the list a few years ago on a Scrabble-related page). I ran three tests: insertion, successful search, and failed search. The list is broken into two parts: one containing 90% of the words, to be inserted, while the remaining 10% of the words were kept for the unsuccessful search test.
The insertion test consisted of inserting all of Zingarelli’s list into both data structures in the most simplistic way: scanning the list sequentially and inserting items one by one.
Insertion times are about the same, so we cannot conclude anything special here, except maybe that insertion time is largely dominated by memory allocation and copying. For the successful search (10,000 tries in each case):
Map look-up is immensely faster with std::unordered_map, a ratio of about 8:1! Failed searches exhibit the same behavior. For ~60,000 failed searches:
We see the same kind of differences here again, but the failed searches are much slower than the successful searches.
When we repeat the experiment with integers (with the same kind of numbers: 500,000 integers, of which 10% are randomly chosen for the failed searches), we get essentially the same picture, but with a massive speedup, as strings are rather costly to copy. Indeed, for the insertion:
Now we see the algorithmic difference between the two containers, as the time spent allocating and copying strings is eliminated. For successful and failed searches, we get (again for 10,000 look-ups):
The std::unordered_map therefore seems to be much faster than std::map. What have we lost in order to gain this speed? First, a lot of memory, as hash tables must retain a certain sparseness to sport their average constant-time look-ups; second, the ordering of keys. Listing the items stored in a hash map won’t produce an ordered list, but rather a randomized version of it. std::map, on the other hand, allows lexicographic enumeration of its contents.
So you’re reading this and are thinking, mmppfh, of course, it’s a hash table, you nitwit. Well, yes. Maybe so.
But here’s what prompted me to do the test. Red-black trees offer O(log n) access. But that assumes (quite wrongly) that comparing keys can be performed in constant time. This may be true for simple keys, such as machine-size integers (known in C and C++ as int), but not so for more complex data, like strings and other structures. For strings, for example, comparison may still be very fast because the cost of comparing two strings depends only on their longest common prefix; if two long strings differ in the first few characters, comparison terminates rapidly and the cost is moderate. If, on the other hand, the strings share a very long prefix, then the comparison algorithm must scan both strings until the end is reached or a difference is found; this can take very long. So, let p be the average common prefix length. The expected search time is now O(p log n), which can grow large if p is large.
For a hash table look-up, you must first compute the hash key, which can at best be done in time linear in the key length. In our first case, this is the average string length; let this average length be l, so the cost of computing the hash is proportional to l. The number of probes is a small constant, c, that depends on the sparseness of the table and the number of items, n, it contains. For each of those probes, a string comparison is performed at cost O(p), leading to an expected complexity of O(l + cp), assuming that secondary hashing is constant time.
Now, it is not clear when p log n < l + cp. Dividing by p leads to log n < l/p + c. So, in some conditions, it may be possible for the tree to appear faster than the hash table.
On average, however, the hash map should win, because we expect log n > l/p + c.
A cargo cult is described as:
A cargo cult is a type of religious practice that may appear in primitive tribal societies in the wake of interaction with technologically advanced, non-native cultures. The cults are focused on obtaining the material wealth of the advanced culture through magical thinking, religious rituals and practices, believing that the wealth was intended for them by their deities and ancestors.
Except for the part about ancestor spirits, but especially when considering magical thinking, cargo culting applies very often to how programmers write code and make decisions about data structures and algorithms. Choosing a red-black tree over an AVL tree, or over a splay tree, because we think that it is somehow always better, sometimes because Ancestor X (a more or less reliable source or authority, such as a more experienced programmer, a teacher, or some Internet dude) said so, is a form of cargo cult in which the programmer does not use rationality to its full extent to make a decision.
When choosing data structures, one must be fully aware of their dual cost. The first cost is run time: theoretical complexity and actual implementation-dependent run times may be quite different. The second cost is memory usage: memory is large on modern systems, but not infinite. A particularly wasteful method that offers constant-time access to the data may use as much as, say, ten times the memory used by a method that gives you O(log n) access. If for a small data set this 10× memory usage poses no particular problem, it may be quite different with a largish data set. It’s all a very delicate balancing act between run-time performance and scalability.
It’s very hard to trust one’s gut feelings about a data structure and the data put into it. Combined with the access patterns, the data structure may yield very counter-intuitive performance. Rather than giving in to magical-thinking cargo cult programming, you should always take a little time to validate your assumptions and hypotheses about the data, the data structure, and the access patterns, as data structure behavior is clearly not independent of the data and the access patterns.
Consider this very simple example. We have a simple binary tree and a list of strings. Binary trees offer O(log n) insertion and access times, so a binary tree is, a priori, a good choice. Now, the list of strings is composed of words, but it so happens it’s already sorted. If we insert the strings in the order they appear in the list, we get a degenerate binary tree that is, in fact, a list! Indeed: insertions are always performed at the far right of the tree, causing it to degenerate into something we could call a vine, and all operations degenerate to linear time. If we randomize the list and insert the words into the tree in that randomized order, we get a ragged tree, but one about equally deep everywhere, leading to the expected O(log n) access and insertion time.
The sorted list / binary tree example is an over-simplistic one, I admit, but it gets to the point: the programmer used a data structure cargo-culted to offer O(log n) access time, but due to his incomplete comprehension of the data (the sorted list) and of the access pattern (inserting items sequentially from the list), the result was disastrous.
But the thing is, we all do that to a certain extent!