Sorting Linked Lists (part I)

June 16, 2009

The sorting algorithms we were taught in class were typically simplified versions that assume the data resides in memory, in contiguous locations, also known as an array. What we often have, however, are linked lists. Generalizing an algorithm like QuickSort to lists is in fact not very hard if we use a modification of the basic algorithm.
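
As a quick illustration of the idea, here is a sketch of my own in C++ (not code from the post itself): QuickSort adapts nicely to lists once the in-place partition is replaced by a split into three sub-lists, which are sorted and spliced back together.

    // QuickSort on a singly linked list: split around a pivot into
    // smaller/equal/larger sub-lists, sort the outer two recursively,
    // and concatenate. The node type and names are illustrative.
    struct node
    {
        int   key;
        node *next;
    };

    static void push_front(node *&list, node *n)   // prepend n to a sub-list
    {
        n->next = list;
        list    = n;
    }

    static node *concat(node *a, node *b)          // append list b after list a
    {
        if (!a) return b;
        node *t = a;
        while (t->next) t = t->next;
        t->next = b;
        return a;
    }

    node *quicksort(node *list)
    {
        if (!list || !list->next) return list;     // 0 or 1 node: already sorted

        node *pivot = list;                        // first node as pivot
        node *smaller = nullptr, *equal = nullptr, *larger = nullptr;

        for (node *p = list; p; )                  // distribute into three sub-lists
        {
            node *next = p->next;
            if      (p->key < pivot->key) push_front(smaller, p);
            else if (p->key > pivot->key) push_front(larger,  p);
            else                          push_front(equal,   p);
            p = next;
        }

        return concat(quicksort(smaller), concat(equal, quicksort(larger)));
    }

Taking the first node as pivot keeps the sketch short, but it degenerates to quadratic time on already (or reverse) sorted lists; a random pivot is the usual remedy.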

For lists with numerical keys (or values), there is a simpler algorithm that scans the list in only one direction, so that singly linked lists are a perfectly cromulent data structure: the radix sort.
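
Again, a sketch of my own rather than the post's code: a least-significant-digit radix sort needs only forward traversals, one per key byte, which is exactly why a singly linked list suffices. The node here carries an unsigned 32-bit key.

    #include <cstdint>

    struct node
    {
        uint32_t key;
        node    *next;
    };

    node *radix_sort(node *list)
    {
        for (int shift = 0; shift < 32; shift += 8)   // 4 passes, one byte each
        {
            node *head[256] = {};                     // bucket heads
            node *tail[256] = {};                     // bucket tails

            // distribute the nodes into 256 buckets according to the
            // current byte, preserving their relative order (stability)
            for (node *p = list; p; )
            {
                node *next = p->next;
                unsigned b = (p->key >> shift) & 0xff;
                p->next = nullptr;
                if (tail[b]) tail[b]->next = p; else head[b] = p;
                tail[b] = p;
                p = next;
            }

            // relink the buckets, in order, into a single list
            list = nullptr;
            node *last = nullptr;
            for (int b = 0; b < 256; ++b)
                if (head[b])
                {
                    if (last) last->next = head[b]; else list = head[b];
                    last = tail[b];
                }
        }
        return list;
    }

Each pass is stable, so the ordering established by earlier, less significant passes is preserved by the later ones, which is what makes the whole thing work.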


The True Cost of Calls

December 30, 2008

The cost of virtual functions is often invoked as a reason for C++’s poor performance compared to other languages, especially C. This is an enduring myth that, like most myths, has always bugged me. C++ myths are propagated by individuals who did not know C++ very well, tried it one weekend in 1996, used a bad compiler, knew nothing about optimization switches, and peremptorily declared C++ fundamentally broken. Well, I must agree that C++ compilers in the mid-90s weren’t all that hot, but in the last fifteen years, a lot has been done. Compilers are now rather good at generating efficient C++ code.

However, the cost of calls, whether or not they are virtual, is not dominated by the call itself (getting the address to jump to and jumping) but by everything else surrounding the call, such as the stack setup and argument passing. Let us debunk that myth by looking at what types of calls are available in C and C++, how they translate to machine code, and how much faster or slower they are relative to each other.
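
To make the comparison concrete, here is a tiny illustrative example of my own (not the post's benchmark) showing the three usual flavors of calls in C++; in all three cases the surrounding work of passing arguments and setting up the stack is the same, and only the way the target address is obtained differs.

    struct base
    {
        virtual int f(int x) const { return x + 1; }
        virtual ~base() = default;
    };

    struct derived : base
    {
        int f(int x) const override { return x + 2; }
    };

    int plain(int x) { return x + 3; }

    int main()
    {
        derived d;
        const base &b = d;
        int (*fp)(int) = plain;

        int a = plain(1); // direct call: target known at compile (or link) time
        int c = fp(2);    // indirect call: target fetched from a pointer
        int e = b.f(3);   // virtual call: target fetched from the object's vtable

        return a + c + e; // use the results so the calls are not trivially discarded
    }

(An optimizing compiler may well devirtualize or inline all three calls in a toy example like this one; the point is only to show the different call forms.)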


The 10 (classes of) Algorithms Every Programmer Must Know About

December 23, 2008

In Tunnels of Doom!, I wrote that the disjoint sets algorithm is one of the very few algorithms every programmer should know. That got me thinking. Should? What about must? If everyone must know about disjoint sets, what other algorithms must every programmer know about?

I made a “top ten” list of algorithms and data structures every programmer must know about.
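
Since disjoint sets started the whole reflection, here is a minimal C++ sketch of the structure, with path compression and union by rank; this is my own illustrative code, not taken from either post.

    #include <cstddef>
    #include <utility>
    #include <vector>

    class disjoint_sets
    {
        std::vector<std::size_t> parent, rank_;

    public:
        explicit disjoint_sets(std::size_t n) : parent(n), rank_(n, 0)
        {
            for (std::size_t i = 0; i < n; ++i) parent[i] = i;  // each element starts alone
        }

        // finds the representative of x's set, compressing the path as it goes
        std::size_t find(std::size_t x)
        {
            if (parent[x] != x) parent[x] = find(parent[x]);
            return parent[x];
        }

        // merges the sets containing a and b, attaching the shallower
        // tree under the deeper one
        void merge(std::size_t a, std::size_t b)
        {
            a = find(a); b = find(b);
            if (a == b) return;
            if (rank_[a] < rank_[b]) std::swap(a, b);
            parent[b] = a;
            if (rank_[a] == rank_[b]) ++rank_[a];
        }
    };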


Serialization, Binary Encodings, and Sync Markers

October 7, 2008

Serialization, the process by which run-time objects are saved from memory to persistent storage (typically disk) or sent across the network, requires the objects to be encoded in some efficient, preferably machine-independent, encoding.

One could consider XML or JSON, which are both viable options whenever simplicity or human-readability is required, or if every serialized object has a natural text-only representation. JSON, for example, provides for only a limited number of basic data types: numbers, strings, booleans, arrays, and objects. What if you have a binary object? The standard approach with text encodings is to use Base64, but this results in a 33% data expansion, since every 3 source bytes become 4 encoded bytes. Base64 uses a-z, A-Z, 0-9, +, /, and = as encoding symbols, entirely avoiding commas (,), quotes (both single and double), braces, parentheses, newlines, and other symbols likely to interfere with the host encoding, whether XML, JSON, or CSV.
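
To make the 3-bytes-in, 4-symbols-out expansion concrete, here is a minimal Base64 encoder sketch of my own (standard alphabet, no line wrapping), not code from the post.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    std::string base64_encode(const std::vector<std::uint8_t> &data)
    {
        static const char alphabet[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

        std::string out;
        std::size_t i = 0;

        // every full group of 3 input bytes becomes 4 output symbols
        for (; i + 3 <= data.size(); i += 3)
        {
            std::uint32_t triple = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
            out += alphabet[(triple >> 18) & 0x3f];
            out += alphabet[(triple >> 12) & 0x3f];
            out += alphabet[(triple >>  6) & 0x3f];
            out += alphabet[ triple        & 0x3f];
        }

        // a trailing group of 1 or 2 bytes is padded with '='
        if (i < data.size())
        {
            std::uint32_t triple = data[i] << 16;
            if (i + 1 < data.size()) triple |= data[i + 1] << 8;

            out += alphabet[(triple >> 18) & 0x3f];
            out += alphabet[(triple >> 12) & 0x3f];
            out += (i + 1 < data.size()) ? alphabet[(triple >> 6) & 0x3f] : '=';
            out += '=';
        }
        return out;
    }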

What if you do not want (or cannot afford) the bloatification of your data incurred by XML or JSON, and are not using a language with built-in binary serialization? Evidently, you will roll your own binary encoding for your data. But to do so, one has to provide not only the serialization mechanisms for the basic data types (including, one would guess, the infamous “binary blob”) but also a higher-level serialization syntax that provides for structure and, especially, error resilience.
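
As a hedged sketch of what such a higher-level syntax might look like (my own assumptions here, not the encoding discussed in the post), each record can start with a sync marker so that a reader can skip garbage and resynchronize after a corrupted record, followed by the payload length and the payload itself.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    const std::uint32_t sync_marker = 0xfeedface;  // illustrative magic value

    // appends one framed record to the output stream; a real encoder would
    // also fix the byte order to stay machine-independent
    void write_record(std::vector<std::uint8_t> &out,
                      const std::uint8_t *payload, std::uint32_t length)
    {
        std::uint8_t header[8];
        std::memcpy(header,     &sync_marker, 4);  // resynchronization point
        std::memcpy(header + 4, &length,      4);  // payload length
        out.insert(out.end(), header, header + 8);
        out.insert(out.end(), payload, payload + length);
    }

A reader that encounters a corrupted record can then scan forward for the next occurrence of the marker and resume decoding there, losing only the damaged record.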


UEID: Unique Enough IDs

September 30, 2008

Generating unique, unambiguous IDs for data is something we do often, but we do not always know what level of uniqueness is really needed. In some cases, we want to be really sure that two instances of the same ID identify two copies of the same object or data. In other cases, we only want to be reasonably sure. In yet other cases, we just assume that collisions (two different objects yielding the same ID) are very unlikely and, if need be, we can proceed to further testing to establish equality.

There are many ways of generating IDs, each with a different level of confidence in uniqueness and different applications.
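
As one illustrative possibility (not necessarily the UEID scheme discussed in the post), a 64-bit FNV-1a hash of an object’s serialized bytes already makes a decent “unique enough” ID for modest data sets; collisions remain possible, so the objects themselves can still be compared when it really matters.

    #include <cstddef>
    #include <cstdint>

    // 64-bit FNV-1a hash of a block of bytes, used here as a cheap,
    // "unique enough" identifier
    std::uint64_t ueid(const void *data, std::size_t length)
    {
        const std::uint8_t *bytes = static_cast<const std::uint8_t *>(data);
        std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
        for (std::size_t i = 0; i < length; ++i)
        {
            h ^= bytes[i];
            h *= 1099511628211ull;                   // FNV prime
        }
        return h;
    }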


Stretch Codes

September 9, 2008

About ten years ago, I was working on a project that needed to log lots of information and the database format that was chosen then was Microsoft Access. The decision seemed reasonable since the data would be exported to other applications and the people who would process the data lived in a Microsoft-centric world, using Excel and Access (and VBA) on a daily basis. However, we soon ran into a major problem: Access does not allow a single file to be larger than 2GB.

After sifting through the basically random error messages that had nothing to do with the real problem, we traced the bug to the file reaching its maximum size. “Damn, that’s retarded!” I remember thinking. “This is the year 2000! If we don’t have flying cars, can we at least have databases larger than 2GB!?” It’s not as if 2GB were a file size set far off into the future, as exabyte hard drives are today. There were 20-something GB drives available back then, so the limitation made no sense whatsoever to me, and still doesn’t. After the initial shock, I got thinking about why there was such a limitation, and what bizarre design decision led to it.
