Tunnels of Doom!

16/12/2008

Æons ago, that is, in September 1982, I got my first true computer, the Texas Instruments TI-99/4A. A truly incredible machine, especially considering the era. Its CPU, the TMS 9900, a true 16-bit processor, ran at 3 MHz. Even the popular TRS-80 Color Computer, Radio Shack’s 6809-based computer, ran at a mere 0.895 MHz, or, if the gods were good to you, you could tweak the clock divisor and crank it up to 1.79 MHz.

In addition to being my first programmable computer, the TI-99/4A sported nice graphics and a relatively advanced sound chip—for the era—and a number of games were developed for it: Hunt the Wumpus, Parsec, and others. But the one game that really got me interested in programming is Kevin Kenney’s Tunnels of Doom, one of the very first graphical dungeon crawler games. While I am not nostalgic by nature, I returned to that game about a year ago by an unusual route. I was looking into procedural content generation, especially map generation, in relation to data compression, and I remembered that the old game’s dungeons were generated randomly, yet were very well structured. I decided to take a look into it, and see if I could generate maps like his.



Deriving the 1 bit = 6 dB rule of thumb

09/12/2008

This week, a more mathematical topic. Some time ago, we—friends and I—were discussing the fidelity of various signals, and how many bits were needed for an optimal digitization of a signal, given known characteristics such as spectrum and signal-to-noise ratio.

Indeed, at some point, adding bits only adds more precision to represent the noise in the signal. There’s a rule of thumb that says that for every bit you add, you can represent a signal with \approx 6 dB more signal-to-noise ratio. Let me show you how to derive such a result.
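
Before diving in, here is the short version of where the derivation lands, a standard sketch assuming a uniform quantizer and a full-scale sinusoidal test signal. With quantization step \Delta, the quantization error is roughly uniform over [-\Delta/2, \Delta/2], so the noise power is \Delta^2/12. An n-bit full-scale sinusoid has amplitude A = 2^{n-1}\Delta and power A^2/2, hence

\mathrm{SNR} = 10 \log_{10} \frac{A^2/2}{\Delta^2/12} = 10 \log_{10}\left(\frac{3}{2}\, 2^{2n}\right) \approx 6.02\, n + 1.76 \text{ dB},

and each additional bit contributes 20 \log_{10} 2 \approx 6.02 dB.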



Safer Integer Constants (Part II)

02/12/2008

In C and C++, the behavior of integer promotions is not the same whether we use signed or unsigned integers. In bit-twiddling hacks, it is not immediately apparent that this is a problem, one that may lead to unexpected results whenever code is ported to a different architecture. This is a recurring topic whenever I discuss portability with friends. Very often, they argue that if it works on Windows and on 32-bit x86 Linux, that’s pretty much all there is to it.

Of course, I couldn’t disagree more. I have learned the hard way.
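
Here is the kind of thing that bites; a minimal sketch of my own (not necessarily the post’s example) showing the usual arithmetic conversions at work:

#include <stdio.h>

int main(void)
{
    unsigned int u = 1;
    int i = -1;

    /* i is converted to unsigned int for the comparison,
       so -1 becomes UINT_MAX and the test is false */
    if (i < u)
        printf("-1 < 1u, as naively expected\n");
    else
        printf("-1 >= 1u: the usual arithmetic conversions at work\n");

    return 0;
}

This particular example prints the second message everywhere, but with unsigned short instead of unsigned int the outcome would depend on the platform’s integer widths, which is exactly the portability trap.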



Safer Float Types

25/11/2008

This week, we follow up on last week’s post, where we looked at type-safe integer constants, with floating-point constants, that is, float and double.
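
As a taste of why the type of a floating-point constant matters, here is a small sketch of my own (not necessarily the post’s example):

#include <stdio.h>

int main(void)
{
    /* 0.1 is a double constant, 0.1f a float constant;
       neither is exactly 1/10, and they round differently */
    float f = 0.1f;

    if (f == 0.1)  /* f is promoted to double for the comparison */
        printf("equal\n");
    else
        printf("not equal: (double)0.1f differs from 0.1\n");

    return 0;
}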



Safer Integer Constants

18/11/2008

Consider the following short C (or C++) program:

const int thingy = 123123123123; /* does not fit in a 32-bit int */

Depending on your compiler, the above code may succeed, fail, produce a warning, or be accepted quietly with an implementation-defined result—typically silent truncation—which is extremely bad.
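
The post looks at safer alternatives; as a preview of the general idea, here is one classic compile-time guard, a sketch of my own rather than the post’s mechanism, which refuses to compile when a constant does not fit in an int:

#include <limits.h>

/* evaluates to (int)(x) if x fits in an int; otherwise the
   negative array size forces a compile-time error */
#define SAFE_INT_CONST(x) \
    (sizeof(char[((x) >= INT_MIN && (x) <= INT_MAX) ? 1 : -1]) \
     ? (int)(x) : 0)

const int ok = SAFE_INT_CONST(123123123LL);       /* compiles */
/* const int bad = SAFE_INT_CONST(123123123123LL);   does not */

In modern C (C11) or C++, a static assertion does the same job more cleanly.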



The LP64 model and the AMD64 instruction set

28/10/2008

Remember the old days when you had five or six “memory models” to choose from when compiling your C program? Memory models allowed you to choose from a mix of short (16-bit) and long (32-bit) offsets and pointers for data and code. The tiny model, if I recall correctly, made sure that everything—code, data, and stack—would fit snugly in a single 16-bit segment.

With the advent of 64-bit computing on the x86 platform with the AMD64 instruction set, we find ourselves in a somewhat similar position. While the tiny memory model disappeared (phew!), we still have to choose between different compilation models, although this time they do not support mixed offset/pointer sizes. The new models, such as LP32 or ILP64, specify the sizes of int, long, and pointers. Linux on AMD64 uses the LP64 model, where int is 32 bits, and long and pointers are 64 bits.
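
A quick way to see which model your compiler uses is to print the sizes directly (a trivial check, not from the post):

#include <stdio.h>

int main(void)
{
    /* LP64 (e.g., Linux on AMD64) prints 4, 8, 8; ILP32 prints 4, 4, 4 */
    printf("int:     %zu bytes\n", sizeof(int));
    printf("long:    %zu bytes\n", sizeof(long));
    printf("pointer: %zu bytes\n", sizeof(void *));

    return 0;
}

Under LP64, code that assumes sizeof(int) == sizeof(long), or that a pointer fits in an int, silently breaks.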

Using 64-bit pointers costs a bit more memory for the pointers themselves, but it also opens new possibilities: more than 4GB allocated to a single process, and the capability of memory-mapping files that exceed 4GB in size. 64-bit arithmetic also helps some applications, such as cryptographic software, run twice as fast in some cases. The AMD64 mode also doubles the number of SSE registers available, potentially enabling significant performance gains in video codecs and other multimedia applications.

However, one might ask what the impact of using LP64 is on a typical C program. Is LP64 the best compilation model for AMD64? Will you get a speedup from replacing int (or int32_t) by long (or int64_t) in your code?



Spiking Makefiles with BASH

14/10/2008

The thing with complex projects is that they very often require complex build scripts. The build script for a given project can be a mix of Bash, Perl, and Make scripts. It is often convenient to write a script that ensures that all the project’s dependencies are installed and of the right version, or that builds them if necessary.

We also often need much simpler things to be done, like generating a build number, from within the makefile itself. If you use makefiles, you know just how fun they are to hack, and you probably do the same as I do: minimally modify the same makefile you wrote back in 1995 (or “borrowed” from somewhere else). In many cases, that makes perfect sense: it does just what it has to do. In this week’s post, I show how to interface (although minimally) Bash with Make.
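
As a flavor of what this looks like, here is a minimal sketch, with a build-number scheme of my own invention, that points Make at Bash and calls shell snippets from within the makefile (recipe lines start with a tab):

# run recipes (and $(shell ...)) under Bash rather than the default /bin/sh
SHELL := /bin/bash

# a hypothetical build-number scheme: read a counter file, default to 0
BUILD := $(shell cat .build 2>/dev/null || echo 0)

all:
	@echo "this is build $(BUILD)"
	@[[ -n "$(BUILD)" ]] && echo $$(( $(BUILD) + 1 )) > .build

The [[ ... ]] test is a Bash construct; it is precisely the kind of line that fails silently if you forget to override SHELL.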



Serialization, Binary Encodings, and Sync Markers

07/10/2008

Serialization, the process by which run-time objects are saved from memory to persistent storage (typically disk) or sent across the network, requires the objects to be encoded in some efficient, preferably machine-independent, encoding.

One could consider XML or JSON, which are both viable options whenever simplicity or human-readability is required, or if every serialized object has a natural text-only representation. JSON, for example, provides only a limited number of basic data types: numbers, strings, booleans, arrays, objects, and null. What if you have a binary object? The standard approach with text encodings is to use Base64, but this results in a 33% data expansion, since every 3 source bytes become 4 encoded bytes (the 3 bytes “Man”, for example, encode as the 4 characters “TWFu”). Base64 uses a-z, A-Z, 0-9, +, /, and = as encoding symbols, entirely avoiding commas, quotes (both single and double), braces, parentheses, newlines, and other symbols likely to interfere with the host encoding, whether XML, JSON, or CSV.

What if you do not want (or cannot afford) the bloatification of your data incurred by XML or JSON, and are not using a language with built-in binary serialization? Evidently, you will roll your own binary encoding for your data. But to do so, one has to provide not only the serialization mechanisms for the basic data types (including, one would guess, the infamous “binary blob”) but also a higher-level serialization syntax that provides for structure and—especially—error resilience.
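
To give an idea of the lower layers, here is a minimal sketch of record framing with a sync marker; the marker value and layout are hypothetical choices of mine, not the post’s format. Each record is a 4-byte marker, an explicitly little-endian 32-bit length, then the payload, so a reader that loses sync can scan forward for the next marker:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SYNC_MARKER "\xde\xad\xbe\xef"  /* hypothetical 4-byte marker */

/* writes one framed record: marker, little-endian length, payload */
static int write_record(FILE *f, const void *data, uint32_t len)
{
    uint8_t hdr[8];

    memcpy(hdr, SYNC_MARKER, 4);

    /* serialize the length byte by byte so the file
       is the same regardless of host endianness */
    hdr[4] =  len        & 0xff;
    hdr[5] = (len >>  8) & 0xff;
    hdr[6] = (len >> 16) & 0xff;
    hdr[7] = (len >> 24) & 0xff;

    return fwrite(hdr, 1, 8, f) == 8
        && fwrite(data, 1, len, f) == len;
}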



UEID: Unique Enough IDs

30/09/2008

Generating unique, unambiguous IDs for data is something we do often, but we do not always know what level of uniqueness is really needed. In some cases, we want to be really sure that two instances of the same ID identify two copies of the same object or data. In other cases, we only want to be reasonably sure. In still other cases, we just assume that collisions—two different objects yielding the same ID—are very unlikely and, if need be, we can proceed to further testing to establish equality.

There are many ways of generating IDs, each with a different level of confidence in uniqueness and suited to different applications.
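
One common approach in the “reasonably sure” family is to hash the object’s serialized bytes; here is a sketch using 64-bit FNV-1a, my own choice of hash rather than necessarily one from the post:

#include <stddef.h>
#include <stdint.h>

/* a "unique enough" ID: the 64-bit FNV-1a hash of an object's
   serialized bytes; equal IDs only suggest equal objects, since
   collisions are unlikely but possible */
static uint64_t ueid(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */

    for (size_t i = 0; i < len; i++)
    {
        h ^= p[i];
        h *= 1099511628211ULL;  /* FNV prime */
    }

    return h;
}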



Length of Phase-in Codes

23/09/2008

In a previous post, I presented the phase-in codes. I stated that they are very efficient when the distribution is uniform; indeed, they never do worse than \approx 0.086 bits above the theoretical lower bound of \lg N, where N is the number of distinct symbols to code and \lg is a convenient shorthand for \log_2. In this post, I derive that result.
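
As a reminder of the construction (a sketch; the full post carries out the derivation): with 2^{k-1} < N \le 2^k and k = \lceil \lg N \rceil, phase-in codes assign (k-1)-bit codewords to 2^k - N of the symbols and k-bit codewords to the remaining 2N - 2^k. Under a uniform distribution, the expected code length is therefore

\bar{L} = \frac{(2^k - N)(k-1) + (2N - 2^k)k}{N} = k + 1 - \frac{2^k}{N},

and maximizing the redundancy \bar{L} - \lg N over N is what yields the \approx 0.086 bits figure.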
