The LP64 model and the AMD64 instruction set

28/10/2008

Remember the old days when you had five or six “memory models” to choose from when compiling your C program? Memory models allowed you to choose from a mix of short (16-bit) and long (32-bit) offsets and pointers for data and code. The tiny model, if I recall correctly, made sure that everything—code, data and stack—would fit snugly in a single 16-bit segment.

With the advent of 64-bit computing on the x86 platform with the AMD64 instruction set, we find ourselves in a somewhat similar position. While the tiny memory model disappeared (phew!), we still have to choose between different compilation models, although this time they do not support mixed offset/pointer sizes. The new models, such as LLP64 or ILP64, specify the sizes of int, long, and pointers. Linux on AMD64 uses the LP64 model, where int is 32 bits, and long and pointers are 64 bits.

Using 64-bit pointers costs a bit more memory for the pointers themselves, but it also opens new possibilities: more than 4GB of address space for a single process, and the capability of memory-mapping files that exceed 4GB in size. 64-bit arithmetic also helps some applications, such as cryptographic software, run twice as fast in some cases. AMD64 mode also doubles the number of SSE registers available, enabling potentially significant performance gains in video codecs and other multimedia applications.

However, one might ask what the impact of using LP64 is on a typical C program. Is LP64 the best compilation model for AMD64? Will you get a speedup from replacing int (or int32_t) with long (or int64_t) in your code?

Read the rest of this entry »


Everyday Origami

21/10/2008

Ever found yourself with a CD or DVD without a sleeve to protect it? In this post, I present a fun and simple origami solution to the sleeveless DVD problem. While origami is often associated with sophistication and lots of spare time, it can serve in our daily lives, sometimes in surprising ways.

Origami (from the Japanese 折り紙, literally folding (oru) paper (kami)) is a notoriously difficult art form in which pieces of paper are folded—avoiding cuts whenever possible—into various shapes of animals or other objects, an art sometimes pushed to incredible levels.

Read the rest of this entry »


Spiking Makefiles with BASH

14/10/2008

The thing with complex projects is that they very often require complex build scripts. The build script for a given project can be a mix of Bash, Perl, and Make scripts. It is often convenient to write a script that checks that all of the project’s dependencies are installed and of the right version, or that builds them if necessary.

We often also need much simpler things to be done, like generating a build number, from within the makefile itself. If you use makefiles, you know just how fun they are to hack, and you probably do the same as I do: minimally modify the same makefile you wrote back in 1995 (or “borrowed” from somewhere else). In many cases, that makes perfect sense: it does just what it has to do. In this week’s post, I show how to call Bash from Make (if only minimally).
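As a small taste of the idea, here is a sketch of what calling Bash from a makefile can look like; the variable names and the build-number scheme are made up for illustration:

```make
# Force make to hand recipes and $(shell ...) calls to bash.
SHELL := /bin/bash

# Ask bash for a build number: the commit count if we are in a
# git repository, else a timestamp. (Illustrative only.)
BUILD_NUMBER := $(shell git rev-list --count HEAD 2>/dev/null || date +%Y%m%d%H%M)

CFLAGS += -DBUILD_NUMBER=$(BUILD_NUMBER)

all:
	@echo "building #$(BUILD_NUMBER)"
```

The `:=` assignment matters here: it runs the shell command once, when the makefile is read, instead of re-running it every time the variable is expanded.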

Read the rest of this entry »


First Impression Fail (part II)

12/10/2008

So I downloaded the current ISO for Solaris, but it still won’t install correctly in VMware. The install starts, but stalls at the stage where it decompresses the windowing system. And it stalls in a very peculiar way. The VM doesn’t freeze, as the terminal remains responsive, but CPU usage drops to zero as if it were waiting for something to happen. After googling for tips and trying a few of them, Solaris still stalls at the same step of the installation process. Is it some interaction between VMware and Solaris? I don’t know.

A few years ago, being unable to install an OS like that would have driven me crazy. Because at that time, I was an apologist: making excuses for bad software. If it doesn’t boot, it’s surely because I did something wrong. The fact that the install procedure requires many manual steps (most of them improperly documented) that depend in some weird way on the current hardware is normal, is it not? Or is it?

Read the rest of this entry »


First Impression Fail

07/10/2008

It is said that first impressions last forever. Well, I sure hope these first impressions won’t keep me from using Solaris in the future. So, yes, I decided to try out Solaris, and I downloaded the VMware virtual machine images from Sun. First, there’s the usual annoying questionnaire about who you are, what you do, and why you want to try Solaris. Everything short of my income and investment strategy.

So I download the file with what seems to be the biggest version number, but it turns out that’s only the second half of the virtual machine. And it’s a stupid tar bomb. OK, I reread the page, and there are no clear, visible indications that there are two parts to the tarball and that I should download both. Never mind that, bandwidth’s cheap.

Read the rest of this entry »


Serialization, Binary Encodings, and Sync Markers

07/10/2008

Serialization, the process by which run-time objects are saved from memory to persistent storage (typically disk) or sent across the network, requires the objects to be encoded in some efficient, preferably machine-independent, encoding.

One could consider XML or JSON, which are both viable options whenever simplicity or human-readability is required, or if every serialized object has a natural text-only representation. JSON, for example, provides only a limited number of basic data types: numbers, strings, booleans, null, arrays, and objects. What if you have a binary object? The standard approach with text encodings is to use Base64, but this results in a 33% data expansion, since every 3 source bytes become 4 encoded bytes. Base64 uses a–z, A–Z, 0–9, +, /, and = as encoding symbols, entirely avoiding commas, quotes (both single and double), braces, parentheses, newlines, and other symbols likely to interfere with the host encoding, whether XML, JSON, or CSV.
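To make the 3-to-4 expansion concrete, here is a minimal Base64 encoder sketch (no line wrapping, and no attention paid to any particular RFC’s edge cases):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encodes n bytes from src; the caller frees the returned string.
   Every 3 input bytes become 4 output symbols; '=' pads the tail. */
char *base64_encode(const unsigned char *src, size_t n)
{
    size_t out_len = 4 * ((n + 2) / 3);  /* always a multiple of 4 */
    char *dst = malloc(out_len + 1), *p = dst;
    if (!dst) return NULL;

    for (size_t i = 0; i < n; i += 3) {
        /* Pack up to 3 bytes into a 24-bit group... */
        unsigned v = src[i] << 16;
        if (i + 1 < n) v |= src[i + 1] << 8;
        if (i + 2 < n) v |= src[i + 2];
        /* ...then emit it as 4 six-bit symbols, padding with '='. */
        *p++ = b64[(v >> 18) & 63];
        *p++ = b64[(v >> 12) & 63];
        *p++ = (i + 1 < n) ? b64[(v >> 6) & 63] : '=';
        *p++ = (i + 2 < n) ? b64[v & 63] : '=';
    }
    *p = '\0';
    return dst;
}
```

Encoding the 3 bytes "Man" yields the 4 symbols "TWFu": exactly the 4/3 expansion, before even counting padding and line breaks.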

What if you do not want (or cannot afford) the bloatification of your data incurred by XML or JSON, and are not using a language with built-in binary serialization? Evidently, you will roll your own binary encoding for your data. But to do so, you have to provide not only the serialization mechanisms for the basic data types (including, one would guess, the infamous “binary blob”) but also a higher-level serialization syntax that provides for structure and—especially—error resilience.
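One common higher-level layout, shown here as a sketch rather than any particular format, frames each record with a sync marker, a length, and a checksum, so that a reader can scan forward and resynchronize after a corrupted record:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative record framing: the fixed sync marker lets a reader
   locate the next record boundary after corruption, the length
   delimits the payload, and the checksum validates it. */
#define SYNC_MARKER 0xDEADBEEFu

struct record_header {
    uint32_t sync;     /* always SYNC_MARKER           */
    uint32_t length;   /* payload size in bytes        */
    uint32_t checksum; /* e.g., CRC-32 of the payload  */
};

/* Writes one framed record; returns 0 on success, -1 on error. */
int write_record(FILE *f, const void *payload, uint32_t length,
                 uint32_t checksum)
{
    struct record_header h = { SYNC_MARKER, length, checksum };
    if (fwrite(&h, sizeof h, 1, f) != 1) return -1;
    if (length && fwrite(payload, length, 1, f) != 1) return -1;
    return 0;
}
```

Note that this sketch writes the header fields in host byte order; a real on-disk or on-wire format would fix the endianness (and probably the struct padding) explicitly.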

Read the rest of this entry »


UEID: Unique Enough IDs

30/09/2008

Generating unique, unambiguous IDs for data is something we do often, but we do not always know what level of uniqueness is really needed. In some cases, we want to be really sure that two instances of the same ID identify two copies of the same object or data. In other cases, we only want to be reasonably sure. In still other cases, we just assume that collisions—two different objects yielding the same ID—are very unlikely and, if need be, we can proceed to further testing to establish equality.

There are many ways of generating IDs, each with varying levels of confidence in uniqueness and different applications.
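To put a number on “very unlikely,” the usual tool is the birthday bound. As a sketch: if k IDs are drawn uniformly at random from a space of N possible values, the probability of at least one collision is

```latex
P(\text{collision}) \approx 1 - e^{-k^2/(2N)} \approx \frac{k^2}{2N} \quad (k \ll \sqrt{N}).
```

For example, with 128-bit IDs (N = 2^{128}) and a billion objects (k \approx 2^{30}), this gives roughly 2^{60}/2^{129} = 2^{-69}, a probability small enough to ignore for most purposes.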

Read the rest of this entry »


Length of Phase-in Codes

23/09/2008

In a previous post, I presented phase-in codes. I stated that they are very efficient when the distribution is uniform. I also stated that, given a uniform distribution, they are never more than \approx 0.086 bits worse than the theoretical lower bound of \lg N, where N is the number of distinct symbols to code and \lg is a convenient shorthand for \log_2. In this post, I derive that result.
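As a preview, the computation runs along these lines (a sketch consistent with the stated bound). For N symbols, let k = \lfloor \lg N \rfloor; phase-in codes assign 2^{k+1} - N codewords of length k and 2(N - 2^k) codewords of length k+1. Writing x = N/2^k \in [1, 2], the expected excess over \lg N under a uniform distribution is

```latex
R(x) = k + \frac{2(N - 2^k)}{N} - \lg N = 2 - \frac{2}{x} - \lg x .
```

Setting R'(x) = 2/x^2 - 1/(x \ln 2) = 0 gives x = 2\ln 2 \approx 1.386, where R(2\ln 2) = 1 - 1/\ln 2 - \lg(\ln 2) \approx 0.0861 bits, the worst case.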

Read the rest of this entry »


The Right Tool

14/09/2008

On his blog, Occasionally Sane, my friend Gnuvince tries to make a case for Python as a first programming language. He presents Python with a series of bullets, each giving a reason why Python is a good choice. While I think it may be a good choice pedagogically, it may not be a good pragmatic choice when someone is learning a programming language to find a job, as one of his readers commented.

But I think that betting a programming career on a single language is about as silly as someone who masters only the hammer looking for a construction job. His limited skills might not permit him to find a job, and even if he finds one, he may be confined to menial, unchallenging tasks. To push the tool analogy further, another workman who masters a wide array of tools will probably be more solicited and appreciated in his work-group, and be led to more and more complex tasks and more responsibilities within his group. Programming is very much the same: sometimes it just takes something other than a hammer, and the right guy for the job is the guy with the largest toolbox—the one with the Torx screwdriver set.

Image from Wikimedia

Read the rest of this entry »


Stretch Codes

09/09/2008

About ten years ago, I was working on a project that needed to log lots of information and the database format that was chosen then was Microsoft Access. The decision seemed reasonable since the data would be exported to other applications and the people who would process the data lived in a Microsoft-centric world, using Excel and Access (and VBA) on a daily basis. However, we soon ran into a major problem: Access does not allow a single file to be larger than 2GB.

After sifting through the basically random error messages that had nothing to do with the real problem, we traced the bug to hitting the maximum file size. “Damn, that’s retarded!” I remember thinking. “This is the year 2000! If we don’t have flying cars, can we at least have databases larger than 2GB!?” It’s not like 2GB was a file size set far off in the future, as exabyte hard drives are today. There were 20-something GB drives available back then, so the limitation made no sense whatsoever to me—and still doesn’t. After the initial shock, I got thinking about why there was such a limitation, what bizarre design decision led to it.

Read the rest of this entry »