One topic that came up rather often recently, especially in connection with Python, is data structure complexity, and what kind of performance its operations should offer. However, it’s not as simple as it sounds. First, the operations the algorithm needs to perform dictate, to a great extent, which container, or abstract data structure, should be favored.
But sometimes, these data structures lend themselves to alternate implementations that have different run-time and memory complexities. Maybe we should have a look.
For the reader not familiar with the “Big O” notation, it may be preferable to have a cursory look at what Wikipedia has to say.
Before discussing implementations, let’s have a look at the following tables, where $n$ is the number of items in the structure and $i$ is an index within the container:

| | First | Seek at $i$ | Find | Last |
|---|---|---|---|---|
| List | $O(1)$ | $O(i)$ | $O(n)$ | $O(1)$ |
| Vector | $O(1)$ | $O(1)$ | $O(n)$, $O(\lg n)$ if sorted | $O(1)$ |
| Tree Map | — | $O(\lg n)$ | $O(\lg n)$ | — |
| Hash Map | — | — | $O(1)$ | — |
| Queue | $O(1)$ | — | — | — |
| Priority Queue | $O(1)$ | — | — | — |

| | Prepend | Insert | Insert at $i$ | Append |
|---|---|---|---|---|
| List | $O(1)$ | — | $O(i)$ | $O(1)$ |
| Vector | $O(n)$ | — | $O(n-i)$ | $O(1)$ amortized |
| Tree Map | — | $O(\lg n)$ | — | — |
| Hash Map | — | $O(1)$ | — | — |
| Queue | — | — | — | $O(1)$ |
| Priority Queue | — | $O(\lg n)$ | — | — |

| | Delete 1st | Delete | Delete at $i$ | Delete Last |
|---|---|---|---|---|
| List | $O(1)$ | $O(n)$ | $O(i)$ | $O(1)$ |
| Vector | $O(n)$ | $O(n)$ | $O(n-i)$ | $O(1)$ |
| Tree Map | — | $O(\lg n)$ | — | — |
| Hash Map | — | $O(1)$ | — | — |
| Queue | $O(1)$ | — | — | — |
| Priority Queue | $O(\lg n)$ | — | — | — |

Where the em-dash (—) indicates an operation that’s not necessarily supported or meaningful for this container. That doesn’t mean it can’t be implemented, just that it’s not very useful for the container and how it’s used.
Let us take those one by one:
- Lists. Lists are typically implemented as linked lists. If the structure holds pointers to the first and last elements, accessing them is $O(1)$. Finding an item in a list that has no particular ordering is essentially $O(n)$, since we have to scan the list, $n$ items long, to find the node that holds the sought value. Inserting means either adding at the first or last position, which can be performed in $O(1)$, or at the $i$th position, which costs $O(i)$, since one has to start from the beginning of the list and skip the first $i$ items. Insertion (or deletion) itself, once the location is determined, is $O(1)$ (if we neglect the complexity of allocating/freeing a list node). Lists can also be implemented as vectors, in which case they are array lists.
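As a sketch of the above (class and method names are mine, not any particular library’s), a doubly linked list with head and tail pointers might look like this:

```python
class Node:
    """A doubly linked list node."""
    __slots__ = ("value", "prev", "next")

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class LinkedList:
    def __init__(self):
        self.first = None   # pointer to the head: O(1) access
        self.last = None    # pointer to the tail: O(1) access
        self.size = 0

    def append(self, value):
        # O(1): we hold a tail pointer, no scan needed.
        node = Node(value)
        if self.last is None:
            self.first = self.last = node
        else:
            node.prev = self.last
            self.last.next = node
            self.last = node
        self.size += 1

    def find(self, value):
        # O(n): linear scan from the head.
        node = self.first
        while node is not None and node.value != value:
            node = node.next
        return node

    def insert_after(self, node, value):
        # O(1) once the location is known: just relink pointers.
        new = Node(value)
        new.prev, new.next = node, node.next
        if node.next is None:
            self.last = new
        else:
            node.next.prev = new
        node.next = new
        self.size += 1
```

Note how `insert_after` does constant work: the $O(i)$ cost of inserting at position $i$ comes entirely from walking the list to find the spot.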
- Vectors. Vectors are simply arrays: regions of contiguous memory locations. They allow fast, $O(1)$, access to any of the items, but insertions and deletions are costly. If you insert an item, you must first make sure the container is capable of holding $n+1$ items. Most efficient implementations allocate more items than are actually in use, to avoid reallocating (and copying) the array too often, and therefore offer amortized constant-time append and linear-time insertion/deletion. If you append and there are unused items left, the cost of insertion is $O(1)$ because you only need to copy one item. Otherwise, the cost is $O(n)$ because you need to reallocate memory and copy the $n$ existing values, plus the cost of the reallocation itself.
If you insert somewhere else but have unused items available, insertion at position $i$ is $O(n-i)$ because you need to move the last $n-i$ items to make room at position $i$. So even if you don’t reallocate the buffer, it’s a linear-time operation in the worst case. Deleting is different only in that you may, or may not, reallocate the array to free memory. As for lists, finding a value requires scanning and comparing every item, an $O(n)$ operation, unless the vector is sorted, in which case binary search gives $O(\lg n)$ rather than $O(n)$.
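A minimal sketch of such a growable vector, assuming a simple capacity-doubling policy (real implementations, such as CPython’s list, use more refined growth factors):

```python
class Vector:
    """A growable array sketch with capacity doubling (illustrative)."""

    def __init__(self):
        self._capacity = 4
        self._size = 0
        self._items = [None] * self._capacity

    def _grow(self):
        # O(n): reallocate a larger buffer and copy everything over.
        self._capacity *= 2
        new_items = [None] * self._capacity
        new_items[:self._size] = self._items[:self._size]
        self._items = new_items

    def append(self, value):
        # Amortized O(1): occasional O(n) grows, cheap otherwise.
        if self._size == self._capacity:
            self._grow()
        self._items[self._size] = value
        self._size += 1

    def insert(self, i, value):
        # O(n - i): shift the tail one slot right to make room.
        if self._size == self._capacity:
            self._grow()
        for j in range(self._size, i, -1):
            self._items[j] = self._items[j - 1]
        self._items[i] = value
        self._size += 1

    def __getitem__(self, i):
        # O(1): direct indexing into contiguous storage.
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._items[i]

    def __len__(self):
        return self._size
```

The doubling is what makes append amortized $O(1)$: each $O(n)$ reallocation pays for the $n$ cheap appends that follow it.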
- Tree or Tree Maps. Trees organize nodes in such a way that it is usually possible to find/insert/delete a specific key and its associated data in essentially $O(\lg n)$ operations. However, trees do not easily lend themselves to seemingly simple operations such as finding the $i$th item in the collection, although a minor modification to the basic approach (keeping subtree sizes in each node) makes it possible in $O(\lg n)$.
Basic trees, however, do not guarantee $O(\lg n)$ operations in all cases, merely most of the time. In some cases, the tree can degenerate into a list, and all operations cost as much as they would on a list, $O(n)$. That’s why there are a number of variants (such as AVL, red/black, or B-Trees) that ensure that all leaves in the tree are at the most equal depth possible, keeping operations as close to $O(\lg n)$ as possible.
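A quick way to see this degeneration is to insert keys in sorted order into a naive, unbalanced binary search tree and measure the resulting height. This is an illustrative sketch, not a production tree:

```python
import random


class BST:
    """A naive, unbalanced binary search tree (illustrative only)."""

    def __init__(self):
        self.root = None

    def insert(self, key):
        if self.root is None:
            self.root = [key, None, None]        # [key, left, right]
            return
        node = self.root
        while True:
            child = 1 if key < node[0] else 2    # go left or right
            if node[child] is None:
                node[child] = [key, None, None]
                return
            node = node[child]

    def height(self):
        # Length (in nodes) of the longest root-to-leaf path.
        def h(node):
            if node is None:
                return 0
            return 1 + max(h(node[1]), h(node[2]))
        return h(self.root)


sorted_tree, shuffled_tree = BST(), BST()
keys = list(range(127))
for k in keys:
    sorted_tree.insert(k)        # sorted input: every node goes right

random.seed(0)
random.shuffle(keys)
for k in keys:
    shuffled_tree.insert(k)      # random input: roughly balanced

assert sorted_tree.height() == 127   # fully degenerate: a list
assert shuffled_tree.height() < 50   # much shallower in practice
```

Self-balancing variants like AVL or red/black trees do extra relinking on insertion precisely to avoid the first case.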
- Hash Map. A hash table associates a key with a pseudo-random (yet deterministic) location in an array that is much larger than would be strictly needed for the number of items. The hash function takes the key and computes a random-looking address for it. Of course, it always gives the same random-looking address for the same key, and so it is possible to find the data associated with the key again. The hash function is not necessarily cheap, but on average you will examine only a constant number of table locations before finding your data, for an expected $O(1)$ cost. The problem is that it’s not possible to enumerate the table’s contents in lexicographical order: every item is pseudo-randomly dispersed in the table.
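A toy chained hash table shows the mechanism (Python’s built-in dict is a far more refined design, so treat this purely as a sketch):

```python
class HashMap:
    """A minimal chained hash table (illustrative sketch)."""

    def __init__(self, slots=64):
        # Many more slots than items keeps chains short on average.
        self._buckets = [[] for _ in range(slots)]

    def _bucket(self, key):
        # hash() yields a deterministic, random-looking integer for
        # the key; the modulo folds it into a table index.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        # Expected O(1): the chain we scan is short on average.
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing key
                return
        bucket.append((key, value))

    def get(self, key):
        # Expected O(1), for the same reason.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```

Note that iterating over `_buckets` would visit items in hash order, not key order, which is exactly the enumeration problem mentioned above.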
- Queue. A queue, or FIFO, is a data structure metaphor for a waiting line that can be implemented as a list or a circular buffer if length-limited. This data structure is meant to allow efficient appending (queuing) and efficient access to the head of the queue. That is, it’s easy to put an item in line, and easy to get/remove the first item in line, but other operations are not very meaningful in general.
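In Python, `collections.deque` provides exactly this pair of cheap operations:

```python
from collections import deque

queue = deque()

# Enqueue: join the end of the line, O(1).
queue.append("alice")
queue.append("bob")
queue.append("carol")

# Dequeue: serve the head of the line, O(1).
assert queue.popleft() == "alice"
assert queue.popleft() == "bob"
```

Anything beyond these two ends (say, removing someone from the middle of the line) falls outside what the structure is optimized for.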
- Priority Queue. Priority queues are very different from queues in that the first item is not determined by the order in which items were added to the priority queue but by its priority. The internal data representation, usually heap-like, will make sure that the first item has the highest (or lowest) key in the collection but does not specify how the remaining items are organized in the data structure. A heap’s contents can look quite random despite being internally structured in such a way that inserting and deleting items is still $O(\lg n)$. Heaps can also be represented as a compact tree, which is a reasonable representation.
Inserting a value in the priority queue means that the new value is placed somewhere such that, overall, the priority properties (for example, as dictated by the underlying heap) are preserved. You don’t really know where exactly it gets inserted, only that the first item has the highest (or lowest) priority. Searching, seeking, etc., therefore have little meaning, and these operations are delegated to the underlying data structure, typically, as we said, a heap.
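Python’s `heapq` module implements such a heap on top of a plain list; note how only the first element is predictable:

```python
import heapq

# heapq maintains a min-heap inside an ordinary Python list: the
# smallest item is always tasks[0], but the rest of the list can
# look jumbled even though the heap property holds throughout.
tasks = []
heapq.heappush(tasks, (3, "write report"))   # O(lg n) per push
heapq.heappush(tasks, (1, "fix outage"))
heapq.heappush(tasks, (2, "review patch"))

assert tasks[0] == (1, "fix outage")               # peek: O(1)
assert heapq.heappop(tasks) == (1, "fix outage")   # pop: O(lg n)
assert heapq.heappop(tasks) == (2, "review patch")
```

Finding an arbitrary item in `tasks` would still require a linear scan; the heap only ever promises where the extreme element is.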
It is hard to give, in two and a half pages, a good appreciation of how important choosing the right data structure for your algorithm is, or to explain all the subtleties of these data structures.
First, most of these can be implemented in many, many different ways. For example, lists can be implemented as array lists, and array lists can be implemented as ropes; trees can be woven with a linked list to allow constant time “seek next” and “seek previous” operations, etc.
Second, understanding your needs (or more exactly, your algorithms’ needs) will help you select the right data structures. A structure doesn’t have to be efficient for all operations, just efficient for the operations you’ll perform most of the time. For example, finding a specific item in a priority queue is $O(n)$, but if you do not search for one too often, a priority queue will remain a good data structure, and the occasional search will most probably not outweigh the benefits of maintaining order in $O(\lg n)$.
It is hard to give, in two and a half pages, a good explanation of the internals of data structures; that’s why there are complete monographs written on the subject, and why I linked to many pages from Wikipedia. Usually, however, you won’t have to worry about implementation details, because the standard library (or even the language itself) will provide the data structures listed above and many, many more. Furthermore, it is likely that their implementation is much better than what we could come up with in only a few hours of work.
Still, I think it’s important to read books such as Advanced Data Structures to get more than just a vague idea of how data structures actually work. It is also important, at least for the performance-minded coder, to know and understand what implementations your library or language offers. I have been bitten by Python quite a few times because I somewhat naïvely assumed that a certain data structure would have the same implementation and semantics as the obvious C++ equivalent; it turned out that neither the implementation nor the equivalence was obvious.
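One example of the kind of mismatch I mean (my illustration, not necessarily the case that bit the author): Python’s list is an array list, like C++’s std::vector, not a linked list like std::list, so removing the front element shifts everything left unless you reach for `collections.deque`:

```python
from collections import deque

items = list(range(5))
fifo = deque(range(5))

# Both expose "remove from the front", but the costs differ:
items.pop(0)     # O(n): every remaining element shifts one slot left
fifo.popleft()   # O(1): the deque keeps a pointer to both ends

assert items == [1, 2, 3, 4]
assert list(fifo) == [1, 2, 3, 4]
```

The two calls return the same value here; the difference only shows up as running time once $n$ gets large.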