Much Ado About Nothing


A rather long time ago, I wrote a blog entry on branchless equivalents of simple functions such as sex, abs, min, max. The Sign EXtension instruction propagates the sign bit into the upper bits, and is typically used in the promotion of, say, a 16-bit signed value into a 32-bit variable.

But this time, I needed something a bit different: I only wanted the sign-extended part. Could I do much better than last time? Turns out, the compiler has a mind of its own.
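To make the distinction concrete, here is a minimal sketch in C. It is not the code from the entry itself: the names sign_extend and sign_part are hypothetical, and it assumes that "the sign-extended part" means the word that is all zeros for a non-negative value and all ones for a negative one.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of x to a full 32-bit value, without
   branches. Hypothetical helper; assumes 1 <= bits <= 32 and a
   two's-complement target for the final conversion to int32_t. */
int32_t sign_extend(uint32_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    uint32_t sign = UINT32_C(1) << (bits - 1); /* position of the sign bit */
    uint32_t mask = (sign << 1) - 1;           /* low `bits` bits set (wraps to all ones for 32) */
    x &= mask;                                 /* discard anything above the field */
    return (int32_t)((x ^ sign) - sign);       /* classic xor-then-subtract trick */
}

/* Only the sign-extended part: 0 if the sign bit is clear,
   0xFFFFFFFF if it is set. Hypothetical helper. */
uint32_t sign_part(uint32_t x, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    uint32_t bit = (x >> (bits - 1)) & UINT32_C(1); /* isolate the sign bit */
    return UINT32_C(0) - bit;                       /* 0 - 1 wraps to all ones */
}
```

With a 16-bit field, sign_extend(0x8000, 16) yields -32768 and sign_part(0x8000, 16) yields 0xFFFFFFFF. What the compiler actually emits for either function is a separate question.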

Read the rest of this entry »

Slow down, Keep It Cool


In a previous post, I discussed how to set the default power policy with Linux (Ubuntu) by detecting the battery/power status: if you’re plugged in, set it to on-demand; if you’re running from the battery, set it to powersave. This is rather crude, but proved effective.

But CPUs that support SpeedStep (or similar) usually support a rather long list of possible speed settings. For example, my i7 supports about 15 different speeds, and “powersave” selects the slowest of all, 1.60GHz (on my laptop, that would be 800MHz). Maybe we could leave the policy to on-demand, but cap the maximum speed to something a bit lower than maximum?
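As a rough illustration of the idea, the sketch below picks a cap from a list of available speeds and writes it to the Linux cpufreq sysfs file scaling_max_freq. It assumes a kernel and driver that expose that file, frequencies expressed in kHz, and root privileges for the write. The function names and the percentage-based rule are hypothetical, not taken from the entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Return the highest available frequency (kHz) that does not exceed
   `percent` % of the fastest one; fall back to the slowest frequency
   if none qualifies. Hypothetical selection rule. */
unsigned long pick_cap_khz(const unsigned long *freqs, size_t n, unsigned percent)
{
    assert(n > 0 && percent <= 100);
    unsigned long fastest = freqs[0], slowest = freqs[0];
    for (size_t i = 1; i < n; i++) {
        if (freqs[i] > fastest) fastest = freqs[i];
        if (freqs[i] < slowest) slowest = freqs[i];
    }
    unsigned long limit = fastest / 100 * percent; /* divide first to avoid overflow */
    unsigned long best = slowest;
    for (size_t i = 0; i < n; i++)
        if (freqs[i] <= limit && freqs[i] > best) best = freqs[i];
    return best;
}

/* Write the cap for one CPU. Returns 0 on success, -1 on failure
   (for example when not root, or when the sysfs file is absent). */
int set_max_freq_khz(unsigned cpu, unsigned long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_max_freq", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    int ok = fprintf(f, "%lu\n", khz) > 0;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

With a made-up table of 1600000, 2000000, 2400000 and 2933000 kHz, pick_cap_khz(freqs, 4, 80) selects 2000000. The governor can stay on on-demand while the cap limits how fast it is allowed to go.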

Read the rest of this entry »



There are plenty of web sites and museums dedicated to the computers of yore. While most of them now seem quaint, and delightfully obsolete, there are probably a lot of lessons we could re-learn and apply today, with our modern computers.

If you have followed my blog for some time, you know that I am concerned with efficient computation and representation of just about everything, applied to workstations, servers, and embedded systems. I do think that retro-computing (computing using old computers or the techniques of old computers) has a lot to teach us, and not only from a historical perspective.

Read the rest of this entry »

The Perfect Instruction Set


The x86 architecture is ageing, but rather than being re-invented, it has only seen incremental extensions (especially for operating system instructions and SIMD) over the last decade or so. Before getting to the i7 core, we saw a long series of evolutions, not revolutions. It all started with the 8086 (and its somewhat weaker sibling, the 8088), which was first conceived as an evolutionary extension to the 8085, which was itself binary compatible with the 8080. The Intel 8080’s lineage brings us to the 8008, a microprocessor with 8 bits of data and 14 bits of address. Fortunately, the 8008 isn’t a double 4004. The successors of the 8086 include (but the list surely isn’t exhaustive) the 80186, the 80286, the 80386 (the first in the series to include decent memory protection for multitasking), then the long series of 486, various models of Pentium, Core 2 and i7.

So, just like you can trace the human lineage to apes, then to monkeys, and eventually to rodent-like mammals, the x86 has a long history. It is still far from perfect, and its basic weakness, in my opinion, is that it still uses the 1974 8080 accumulator-based instruction set architecture. Despite a number of very smart architectural improvements (out-of-order execution, branch prediction, SIMD), it still suffers from the same problems the 8085 did: the instruction set is not orthogonal, it contains many obsolete complex instructions that aren’t used all that much anymore (such as the BCD helpers), and everything has to be backwards compatible, meaning that every new generation is still more of the same, only (mostly) faster.

But what would be the perfect instruction set? In [1], the typical instruction set is composed of seven facets (to which I add an eighth):

Read the rest of this entry »

Suggested Readings: Computer Architecture: A Quantitative Approach


John L. Hennessy, David A. Patterson — Computer Architecture: A Quantitative Approach — 4th ed., Morgan Kaufmann, , 704 pp. ISBN 0-12-370490-1

Computer Architecture: A Quantitative Approach is probably the most up-to-date and comprehensive introductory text for computer architecture, covering a broad spectrum of topics from micro-instructions to multi-core parallelism. This book is different—from the aging Advanced Computer Architecture: Parallelism, Scalability, Programmability by Kai Hwang (1992, now out of print) for example—in that it takes a quantitative approach, motivating most statements by hard numbers, simulations and benchmarks.

Read the rest of this entry »

Bundling Memory Accesses (Part I)


There’s always a question whether having “more bits” in a CPU will help. Is 64 bits better than 16? If so, how? Is it only that you have bigger integers to count further? Or maybe more accessible memory? Well, quite obviously, being able to address a larger memory or perform arithmetic on larger numbers is quite useful because, well, 640KB isn’t all that much, and counting on 16 bits doesn’t get you that far.

AMD Phenom

But there are other advantages to using the widest registers available for computation. Often, algorithms that scan the memory using only small chunks—like bytes or words—can be sped up quite a bit using bundled reads/writes. Let us see how.
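As a small illustration of what a bundled read can look like, the sketch below computes the XOR of all bytes in a buffer twice: once a byte at a time, and once eight bytes at a time through a 64-bit accumulator. The example and its function names are hypothetical, not the ones from the entry. Any speed-up is an assumption to verify on your own machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference version: one byte per memory read. */
uint8_t xor_bytes(const uint8_t *buf, size_t n)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < n; i++) acc ^= buf[i];
    return acc;
}

/* Bundled version: eight bytes per memory read. */
uint8_t xor_bundled(const uint8_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    uint64_t acc = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w); /* alignment-safe load of 8 bytes */
        acc ^= w;                      /* XORs eight byte lanes at once */
    }
    /* Fold the eight byte lanes of the accumulator into the low byte. */
    acc ^= acc >> 32;
    acc ^= acc >> 16;
    acc ^= acc >> 8;
    uint8_t result = (uint8_t)acc;
    for (; i < n; i++) result ^= buf[i]; /* leftover tail, byte by byte */
    return result;
}
```

Because XOR is commutative, the result does not depend on byte order, so the same code works on little-endian and big-endian machines. The memcpy is there to avoid unaligned-access and aliasing problems; optimizing compilers commonly reduce it to a single load.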

Read the rest of this entry »

Affinities and ulimit


The Bash ulimit built-in can be used to probe and set the current user limits. Such limits include the amount of memory a process may use or the maximum number of open files a user can have. While ulimit is generally understood to affect a whole session, it can be used to change the limits of a group of processes using, for example, a sub-shell.

However, the ulimit command is quirky (it expects a particular order for parameters and not all may be set on the same command line) and does not seem to be ageing all that well. For one thing, one cannot set the affinity of processes, which would indirectly control how many and which cores one can use in a multi-core machine.
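For comparison, the limits that ulimit manipulates are also reachable from a program through the POSIX getrlimit and setrlimit calls. The sketch below lowers the soft limit on open files for the calling process. The function name cap_open_files is hypothetical, and this is one possible way to do it, not the approach of the entry. Affinity is not among the resources this interface covers.

```c
#include <assert.h>
#include <sys/resource.h>

/* Lower the soft limit on open file descriptors for the calling process.
   Children created afterwards inherit it. Returns 0 on success, -1 on
   failure. Hypothetical helper. */
int cap_open_files(rlim_t max_files)
{
    assert(max_files > 0);
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return -1;
    /* The soft limit may never exceed the hard limit, so clamp to it. */
    if (lim.rlim_max != RLIM_INFINITY && max_files > lim.rlim_max)
        max_files = lim.rlim_max;
    lim.rlim_cur = max_files;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Calling cap_open_files in a child process after fork() and before exec confines the change to that child and its descendants, much like running ulimit inside a sub-shell.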

Read the rest of this entry »

Powers of Ten (so to speak)


I am not sure if you are old enough to remember the 1977 IBM movie Powers of Ten (trippy version, without narration) [also at the IMDB and Wikipedia], but that’s a movie that sure put things in perspective. Thinking in terms of powers of ten helps me sort things out when I am considering a design problem. Thinking of the scale of a problem in terms of physical scale is a good way to assess its true importance for a project. Sometimes the problem is the one to solve; sometimes it is not. That a problem is fun, enticing, or challenging does not mean it has to be solved optimally right away: in the correct context, considering its true scale, it may not be as important as first thought.


Maybe comparing problems’ scales to powers of ten in the physical realm helps you understand where to put your efforts. So here are the different scales and what I think they should contain:

Read the rest of this entry »

More Blinking Lights (and a digression)


In Blinking Lights, I told you how I feel that the exterior of the modern computer, except for its screen, is boring. When I look at my Antec case, I see a large, silent black box, which, by its very definition, is uninteresting at best. Something like a rock that slowly dissipates heat.

However, Bill Buzbee built a computer that has an interesting exterior, and a much more interesting interior: the Magic-1. The Magic-1 is a computer running at 4.something MHz, and is in the same computational power range as the original 8086 4.77 MHz IBM PC, except with a more advanced instruction set.

The Magic-1 Computer

Read the rest of this entry »

Suggested Reading: The Race for a New Game Machine


David Shippy, Mickie Phipps — The Race for a New Game Machine — Citadel Press, 2009, 256 pp. ISBN 978-080653101-4

This book, strongly reminiscent of Tracy Kidder’s Pulitzer-winning The Soul of a New Machine, relates the history of the development of the Cell, Xenon, and Broadway processors, the hearts of Sony’s PS3, Microsoft’s Xbox 360, and Nintendo’s Wii game machines, respectively.

Read the rest of this entry »