There’s always a number of graphics primitives you will be using in all kinds of projects. If you’re lucky, you’re working on a “real” platform that comes with a large number of graphics libraries that offer abstractions to the graphics primitives of the OS and, ultimately, to its hardware. However, it’s always good to know what’s involved in a particular graphics primitives and how to recode it yourself should you be in need to do so, either because you do not have a library to help you, or because it would contaminate your project badly to include a library only to draw, say, a line, within an image buffer.

Lines are something we do a lot. Perfectly horizontal or vertical lines have very simple algorithms. Arbitrary lines are a bit more complicated, that is, to get *just right*. This week, let us take a look at a few algorithms to draw lines. First, we’ll discuss a naïve algorithm using floating point. We’ll also have a look at Bresenham’s algorithm that uses only integer arithmetic. Finally, we’ll show that we can do better than Bresenham if we used *fixed* point arithmetic.

Assuming that , the slope between points and is given by:

Using this, the is given by

Which expands to

Implementing this with a somewhat naïve approach, that is, using a slope as a float and rounding off to get the integer value, you’d get something like:

m=(y2-y1)/(float)(x2-x1); float y=y1; int x; for (x=x1;x<=x2;x++,y+=m) putpixel(x,y,color); // implicit rounding // of y to int

In this piece of code, we suppose that so that the line is drawn contiguously. To manage all possible , you only have two cases to check. If , the line is more vertical than horizontal, and you iterate on the rather than the , and you put instead. Since it’s not much more difficult than that, we will continue our demonstration considering only the more horizontal lines, that is, with slopes . This being said, the above code is simple, but involves the use of floating points.

A priori, you may think that on modern processors, that’s not much of a problem and you could be right. However, converting a float to an int is a costly operation. On my 32 bits box (which doesn’t have SSE), using `gcc -fno-inline-functions -O3 -S -masm=intel lines.c`, we see that the corresponding code inside the loop is given by:

.L6: fnstcw WORD PTR [%ebp-14] fld DWORD PTR [%ebp-20] mov DWORD PTR [%esp], %ebx add %ebx, 1 mov DWORD PTR [%esp+8], %edi movzx %eax, WORD PTR [%ebp-14] mov %ah, 12 mov WORD PTR [%ebp-16], %ax fldcw WORD PTR [%ebp-16] fistp DWORD PTR [%esp+4] fldcw WORD PTR [%ebp-14] call put_pixel cmp %esi, %ebx fld DWORD PTR [%ebp-20] fadd DWORD PTR [%ebp-24] fstp DWORD PTR [%ebp-20] jge .L6 .L7:

The compiler generated a rather long series of instructions because it tries to enforce 32-bits precision (because the internal floating point registers aren’t 32 bits but 80!) as well as dealing with the conversion of float values to integer. Would the CPU have been capable of SSE, the compiler might have generated code using the `cvttss2si` instruction. This instruction that converts a scalar single precision float into an integer. Since this instruction uses the XMM registers, the code is greatly simplified because there is no need to worry about precision; floats are stored as 32 bits in XMM packed registers. All that to say that all this seems like a lot of code just to perform `y+=m` and truncating.

Jack Bresenham, in 1965, came up with an algorithm using only integer arithmetic to draw a line of arbitrary slope [1]. Bresenham algorithm will loop on the s again, but rather than directly estimating , it will iteratively update a moving point, say , in the following way. Knowing and , will the point be above or under ? If it is under, the line passes closer to than so the next point is . If, on the contrary, it passes closer to , will be the next point. To avoid drift, the algorithm keeps track of the rounding errors as well.

The derivation is quite convoluted and relies differential equations, implicit forms, and you can find a (still somewhat sketchy) demonstration demonstration here. You can find the complete demonstration in [2].

Again, without loss of generality, we will consider only slopes . Other symmetries can be dealt quite easily. The C code is given by:

int dx=x2-x1; int dy=y2-y1; int d=2*dy-dx; int e=2*dy; int ne=2*(dy-dx); int x=x1; int y=y1; for (x=x1;x<=x2;x++) { put_pixel(x,y,color); if (d<=0) d+=e; else { d+=ne; y++; // sign? } }

Which generates the following machine code:

.L21: add %edi, DWORD PTR [%ebp-16] add %esi, 1 add %ebx, 1 cmp DWORD PTR [%ebp+16], %ebx jl .L24 .L20: mov %eax, DWORD PTR [%ebp+24] mov DWORD PTR [%esp+4], %esi mov DWORD PTR [%esp], %ebx mov DWORD PTR [%esp+8], %eax call put_pixel test %edi, %edi jg .L21 add %edi, DWORD PTR [%ebp-20] add %ebx, 1 cmp DWORD PTR [%ebp+16], %ebx jge .L20 .L24: ## exits function

This code is simple but has conditional jumps within the main loop —for a total of three if you count the loop itself. On processors with bad, or no, branch prediction, or deep pipelines, the code yields sub-optimal performance. Every time a branch is mis-predicted, there’s a definite performance penalty. To avoid branching, we could try to transform Bresenham’s algorithm to a branchless algorithm (that is, except for the main loop that cycles from to ).

One possibility is to use the machine specific integer types to do all the work for us. Using the standard C library `<stddint.h>` header that provides types such as `int16_t`, we can create a fixed point number, one that is friendly to the machine’s natural register sizes. On 32 bits x86, those are 16 and 32 bits registers. On a 64 bits machine, we could still use 32 and 16 bits (and I’ve already explained why).

So let us define a fixed point number as the union of a 32 bits int and of two 16 bits integers that will hold, respectively, the fractional part and the integer part of the number. Note that on x86 computers, the memory layout is little endian, so one would have to reverse the declaration of `lo` and `hi` on a big endian machine. The declaration could be something like:

typedef union { int32_t i; struct { int16_t lo; // endian-specific! int16_t hi; }; } fixed_point;

The only thing we need, is to compute the slope as a fixed point number rather than a floating point number. This allows us to write:

fixed_point y; int x; int32_t m=((int32_t)(y2-y1)<<16)/(x2-x1); // slope as fixed point y.i=y1<<16; // convert to fixed point for (x=x1;x<x2;x++,y.i+=m) put_pixel(x,y.hi,color);

Only the integer part of the fixed point is used to call `put_pixel`. The whole fixed point number, however, is used to incrementally add the slope, `m`, to `y`. The trick here is that the integer `.i` that holds the fixed point number can be operated upon as is, and, for addition, adding two 32 bits integers that hold each a fixed point number yields a correct fixed point number. This allows the compiler to generate, compile-time, much simpler code that what one would have thought:

.L16: lea %esi, [%edi+%esi] .L13: mov %eax, DWORD PTR [%ebp+24] mov DWORD PTR [%esp], %ebx add %ebx, 1 mov DWORD PTR [%esp+8], %eax mov %eax, %esi sar %eax, 16 mov DWORD PTR [%esp+4], %eax call put_pixel cmp DWORD PTR [%ebp+16], %ebx jge .L16

We can add rounding to make sure the pixels drawn are, as with Bresenham’s algorithm, always as close as possible to the “true” line from to . The main C loops becomes:

fixed_point f; int x; int32_t m=((int32_t)(y2-y1)<<16)/(x2-x1); f.i=y1<<16; for (x=x1;x<=x2;x++,f.i+=m) { fixed_point g=f; g.i+=32767; put_pixel(x,g.hi,color); }

and generates the following:

.L16: lea %esi, [%edi+%esi] .L13: mov %edx, DWORD PTR [%ebp+24] lea %eax, [%esi+32767] sar %eax, 16 mov DWORD PTR [%esp], %ebx add %ebx, 1 mov DWORD PTR [%esp+4], %eax mov DWORD PTR [%esp+8], %edx call put_pixel cmp DWORD PTR [%ebp+16], %ebx jge .L16

Which doesn’t differ much from the other version.

*

* *

Now it’s time to check the resulting speeds of the implementations. I have included results for a 32 bits machine, an Athlon XP 2000+ (1.66GHz) and a 64 bits machine, a AMD64 Athlon 4000+ (2.25GHz). The first machine is incapable of SSE, while the second is limited to SSE2. In both cases, I used similar compilation options:

`gcc -fno-inline-functions -O3 -m`X` lines.c -o lines`

with X being either 32 or 64 (bits) depending on the machine. The test was to draw a line with different offsets but the same length, about 200 pixels long. The buffer was prefetched in cache to avoid spurious fluctuations. In 32 bits mode, we get:

The naïve algorithm in float averages 4.81 µs, Bresenham’s algorithm averages at 1.84 µs, my fixed point variation at 1.74 µs. The Fixed point implementation runs about 5% faster. Which isn’t all that much, but still better than Bresenham’s; and *much* better than the naïve version using mixed floats and integers.

The situation is similar on the 64 bits machine, but the advantage of fixed point vanishes. Both methods takes very similar times: fixed point averages at 0.85 µs and Bresenham 0.84 µs, a difference of about 1%. However, the naïve implementation is still very far behind, at 2.96 µs.

*

* *

Although the testing is far from exhaustive, the results indicates that the fixed point algorithm is very efficient. I would guess that as you go down to simpler processors, the more efficient it will be. First, it has fewer branches than Bresenham’s algorithm. Second, it is quite possible that a clever use of the instruction set and registers yield code that is much more efficient than what gcc generates—although, I admit, it seems quite efficient at times.

[1] Jack E. Bresenham —

*Algorithm for computer control of a digital plotter*— IBM Systems Journal, Vol. 4, No.1, January 1965, pp. 25–30 (pdf, subscription needed)

[2] James D. Foley, Andries van Dam, Steven K. Feiner, John F. Hughes — *Computer Graphics: Principles and Practice* — Addison-Wesley, 1995.

Nice. Might come in handy some day. Thanks.

thanks,

helpfully

[...] of doing a ‘free’ scaling is to use a trick similar to what I used for an alternative line algorithm. That is, we’ll use the language to have a shift free [...]

[...] course there are better ways of drawing a line between two points. Interpolation may be used to fill a continuum or only to get [...]