## UEID: Unique Enough IDs

September 30, 2008

Generating unique, unambiguous, IDs for data is something we do often, but we do not always know what level of uniqueness is really needed. In some cases, we want to be really sure that two instances of the same ID identify two copies of the same object or data. In other cases, we only want to be reasonably sure. In other cases, yet, we just assume that collisions—two different objects yielding the same ID—are very unlikely, and, if the need be, we can proceed to further testing to establish equality.

There are many ways of generating IDs, each with varying levels of confidence on uniqueness and differing applications.

## Length of Phase-in Codes

September 23, 2008

In a previous post, I presented the phase-in codes. I stated that they were very efficient given the distribution is uniform. I also stated that, given the distribution is uniform, they never do worse than $\approx 0.086$ bits worse than the theoretical lower bound of $\lg N$, where $N$ is the number of distinct symbols to code and $\lg N$ is a convenient shorthand for $\log_2$. In this post, I derive that result.

## The Right Tool

September 14, 2008

On his blog, Occasionally Sane, my friend Gnuvince tries to make a case for Python as a first programming language. He presents Python with a series of bullets, each giving a reason why Python is a good choice. While I think it may be a good choice pedagogically, it may not be a good pragmatic choice when someone is learning a programming language to find a job, as one of his reader commented.

But I think that betting on a sole language to base a programming career on is about as silly as someone mastering only the hammer looking for a construction job. His limited skills might not permit him to find a job, and even if he finds one, he may be confined to menial, unchallenging tasks. To push the tool analogy further, another workman that masters a wide array of tools will probably be more solicited and appreciated in his work-group, and be lead to more and more complex tasks, and more responsibilities within his group. Programming is very much the same: sometimes, it just takes something else than a hammer, and the right guy for the job is the guy that has the largest toolbox—the one with the Torx screwdriver set.

## Stretch Codes

September 9, 2008

About ten years ago, I was working on a project that needed to log lots of information and the database format that was chosen then was Microsoft Access. The decision seemed reasonable since the data would be exported to other applications and the people who would process the data lived in a Microsoft-centric world, using Excel and Access (and VBA) on a daily basis. However, we soon ran into a major problem: Access does not allow a single file to be larger than 2GB.

After sifting through the basically random error messages that had nothing to do with the real problem, we isolated the bug as being reaching the maximum file size. “Damn that’s retarded!” I remember thinking. “This is the year 2000! If we don’t have flying cars, can we at least have databases larger than 2GB!?“. It’s not like 2GB was a file size set far off into the future as are exabytes hard-drives today. There were 20-something GB drives available back then, so the limitation made no sense whatsoever to me—and still doesn’t. After the initial shock, I got thinking about why there was such a limitation, what bizarre design decision lead to it.

## Phase-in Codes

September 2, 2008

Once in a while, we have to encode a series of values to memory or file, of which we know either very little about, distribution-wise, or are very close to being drawn from an uniform random variable: basically, they’re all equally likely to be drawn. This precludes the use of Huffman (or like) codes because the distribution is flat and Huffman code will afford no real compression despite the added complexity.

We do not want to use a fixed number of bits either because the number of possible values might be very far from the next power of two. For example say I have 11 different values. I could encode them on 4 bits each, but I would waste more than half a bit by doing so. Half a bit wasted for every code. Fortunately, there’s an easy, fast encoding that can help us achieve high efficiency: phase-in codes.