Yet More __builtins


So last week we saw how to use some of GCC’s built-ins, this week, let’s have a look at how we can create our own, if need be. Say because you need to have access to some instruction and that GCC does not offer the corresponding built-in.


To do so, we’ll use a bit of the C preprocessor and GCC’s inline assembly extension.

Read the rest of this entry »

The Perfect Instruction Set


The x86 architecture is ageing, but rather than looking for re-invention, it only saw incremental extensions (especially for operating system instructions and SIMD) over the last decade or so. Before getting to the i7 core, we saw a long series of evolutions—not revolutions. It all started with the 8086 (and its somewhat weaker sibling, the 8088), which was first conceived as an evolutionary extension to the 8085, which was itself binary compatible with the 8080. The Intel 8080’s lineage brings us to the 8008, a 8 bits of data, 14 bits of address micro-processor. Fortunately, the 8008 isn’t a double 4004. The successors of the 8086 include (but the list surely isn’t exhaustive) the 80186, the 80286, the 80386, first in the series to include decent memory protection for multitasking, then the long series of 486, various models of Pentium, Core 2 and i7.

So, just like you can trace the human lineage to apes, then to monkeys, and eventually to rodent-like mammals, the x86 has a long history and is still far from being perfect, and its basic weakness, in my opinion, is that it still use the 1974 8080 accumulator-based instruction set architecture. And despite a number of very smart architectural improvements (out of order execution, branch prediction, SIMD), it still suffers from the same problems the 8085 did: the instruction set is not orthogonal, it contains many obsolete complex instructions that aren’t used all that much anymore (such as the BCD helpers), and that everything has to be backwards compatible, meaning that every new generation still is more of the same, only (mostly) faster.

But what would be the perfect instruction set? In [1], the typical instruction set is composed of seven facets (to which I add an eighth):

Read the rest of this entry »

The CFM-00


Parallel computing is the next paradigm shift, everybody knows this, but not everyone is taking the proper action to face it adequately. One thing to do is to read on the subject and force oneself to code using threads and various degrees of parallelism; and that’s pretty easy now that a quad core machine doesn’t cost all that much. But the next step, distributed computing, necessitates, well, more than one machine, and if you have different levels of memories and communication channels, all the better.

So out of a bunch of old x86 PCs, I’ve decided to build my own portable mini-cluster with 8 nodes. Nothing all that impressive, but still plenty of fun to build.

Read the rest of this entry »

Faster than Bresenham’s Algorithm?


There’s always a number of graphics primitives you will be using in all kinds of projects. If you’re lucky, you’re working on a “real” platform that comes with a large number of graphics libraries that offer abstractions to the graphics primitives of the OS and, ultimately, to its hardware. However, it’s always good to know what’s involved in a particular graphics primitives and how to recode it yourself should you be in need to do so, either because you do not have a library to help you, or because it would contaminate your project badly to include a library only to draw, say, a line, within an image buffer.


Lines are something we do a lot. Perfectly horizontal or vertical lines have very simple algorithms. Arbitrary lines are a bit more complicated, that is, to get just right. This week, let us take a look at a few algorithms to draw lines. First, we’ll discuss a naïve algorithm using floating point. We’ll also have a look at Bresenham’s algorithm that uses only integer arithmetic. Finally, we’ll show that we can do better than Bresenham if we used fixed point arithmetic.

Read the rest of this entry »

The LP64 model and the AMD64 instruction set


Remember the old days where you had five or six “memory models” to choose from when compiling your C program? Memory models allowed you to chose from a mix of short (16 bits) and long (32 bits) offsets and pointers for data and code. The tiny model, if I recall correctly, made sure that everything—code, data and stack—would fit snugly in a single 16 bits segment.

With the advent of 64 bits computing on the x86 platform with the AMD64 instruction set, we find ourselves in a somewhat similar position. While the tiny memory model disappeared (phew!), we still have to chose between different compilation models although this time they do not support mixed offset/pointer sizes. The new models, such as LP32 or ILP64, specify what are the data sizes of int, long and pointers. Linux on AMD64 uses the LP64 model, where int is 32 bits, and long and pointers are 64 bits.

Using 64 bits pointers uses a bit more memory for the pointer itself, but it also opens new possibilities: more than 4GB allocated to a single process, the capability of using virtual memory mapping for files that exceed 4GB in size. 64 bits arithmetic also helps some applications, such as cryptographic software, to run twice as fast in some cases. The AMD64 mode doubles the number of SSE registers available enabling, potentially, significant performance enhancement into video codecs and other multimedia applications.

However one might ask himself what’s the impact of using LP64 for a typical C program. Is LP64 the best memory compilation model for AMD64? Will you get a speedup from replacing int (or int32_t) by long (or int64_t) in your code?

Read the rest of this entry »