Faster than Bresenham’s Algorithm?

There’s always a number of graphics primitives you will be using in all kinds of projects. If you’re lucky, you’re working on a “real” platform that comes with a large number of graphics libraries that offer abstractions to the graphics primitives of the OS and, ultimately, to its hardware. However, it’s always good to know what’s involved in a particular graphics primitives and how to recode it yourself should you be in need to do so, either because you do not have a library to help you, or because it would contaminate your project badly to include a library only to draw, say, a line, within an image buffer.

math06-detail

Lines are something we do a lot. Perfectly horizontal or vertical lines have very simple algorithms. Arbitrary lines are a bit more complicated, that is, to get just right. This week, let us take a look at a few algorithms to draw lines. First, we’ll discuss a naïve algorithm using floating point. We’ll also have a look at Bresenham’s algorithm that uses only integer arithmetic. Finally, we’ll show that we can do better than Bresenham if we used fixed point arithmetic.

Assuming that x_2 > x_1, the slope between points (x_1,y_1) and (x_2,y_2) is given by:

m=\frac{y_2-y_1}{x_2-x_1}

Using this, the y(x) is given by

y(x)=y_1+ m x

Which expands to

\displaystyle y(x)=y_1 + x\frac{y_2-y_1}{x_2-x_1}


Implementing this with a somewhat naïve approach, that is, using a slope m as a float and rounding off to get the integer value, you’d get something like:


m=(y2-y1)/(float)(x2-x1);
float y=y1;
int x;
for (x=x1;x<=x2;x++,y+=m)
  putpixel(x,y,color); // implicit rounding
                       // of y to int

In this piece of code, we suppose that -1\leqslant{}m\leqslant{}1 so that the line is drawn contiguously. To manage all possible m, you only have two cases to check. If |y_2-y_1| \geqslant |x_2-x_1|, the line is more vertical than horizontal, and you iterate on the y rather than the x, and you put m= (x_2-x_1)/(y_2-y_1) instead. Since it’s not much more difficult than that, we will continue our demonstration considering only the more horizontal lines, that is, with slopes -1\leqslant{}m\leqslant{}1. This being said, the above code is simple, but involves the use of floating points.

A priori, you may think that on modern processors, that’s not much of a problem and you could be right. However, converting a float to an int is a costly operation. On my 32 bits box (which doesn’t have SSE), using gcc -fno-inline-functions -O3 -S -masm=intel lines.c, we see that the corresponding code inside the loop is given by:

.L6:
	fnstcw	WORD PTR [%ebp-14]
	fld	DWORD PTR [%ebp-20]
	mov	DWORD PTR [%esp], %ebx
	add	%ebx, 1
	mov	DWORD PTR [%esp+8], %edi
	movzx	%eax, WORD PTR [%ebp-14]
	mov	%ah, 12
	mov	WORD PTR [%ebp-16], %ax
	fldcw	WORD PTR [%ebp-16]
	fistp	DWORD PTR [%esp+4]
	fldcw	WORD PTR [%ebp-14]
	call	put_pixel
	cmp	%esi, %ebx
	fld	DWORD PTR [%ebp-20]
	fadd	DWORD PTR [%ebp-24]
	fstp	DWORD PTR [%ebp-20]
	jge	.L6
.L7:

The compiler generated a rather long series of instructions because it tries to enforce 32-bits precision (because the internal floating point registers aren’t 32 bits but 80!) as well as dealing with the conversion of float values to integer. Would the CPU have been capable of SSE, the compiler might have generated code using the cvttss2si instruction. This instruction that converts a scalar single precision float into an integer. Since this instruction uses the XMM registers, the code is greatly simplified because there is no need to worry about precision; floats are stored as 32 bits in XMM packed registers. All that to say that all this seems like a lot of code just to perform y+=m and truncating.


Jack Bresenham, in 1965, came up with an algorithm using only integer arithmetic to draw a line of arbitrary slope [1]. Bresenham algorithm will loop on the xs again, but rather than directly estimating y_1+m x, it will iteratively update a moving point, say (x,y), in the following way. Knowing (x,y) and m, will the point (x+1,y+m) be above or under (x+1,y+\frac{1}{2})? If it is under, the line passes closer to (x+1,y) than (x+1,y+1) so the next point is (x+1,y). If, on the contrary, it passes closer to (x+1,y+1), (x+1,y+1) will be the next point. To avoid drift, the algorithm keeps track of the rounding errors as well.

The derivation is quite convoluted and relies differential equations, implicit forms, and you can find a (still somewhat sketchy) demonstration demonstration here. You can find the complete demonstration in [2].

Again, without loss of generality, we will consider only slopes 0\leqslant{}m\leqslant{}1. Other symmetries can be dealt quite easily. The C code is given by:

  int dx=x2-x1;
  int dy=y2-y1;
  int d=2*dy-dx;
  int e=2*dy;
  int ne=2*(dy-dx);
  int x=x1;
  int y=y1;
  
  for (x=x1;x<=x2;x++)
   {
    put_pixel(x,y,color);
    if (d<=0)
      d+=e;
    else
     {
      d+=ne;
      y++; // sign?
     }
   }

Which generates the following machine code:

.L21:
	add	%edi, DWORD PTR [%ebp-16]
	add	%esi, 1
	add	%ebx, 1
	cmp	DWORD PTR [%ebp+16], %ebx
	jl	.L24
.L20:
	mov	%eax, DWORD PTR [%ebp+24]
	mov	DWORD PTR [%esp+4], %esi
	mov	DWORD PTR [%esp], %ebx
	mov	DWORD PTR [%esp+8], %eax
	call	put_pixel
	test	%edi, %edi
	jg	.L21
	add	%edi, DWORD PTR [%ebp-20]
	add	%ebx, 1
	cmp	DWORD PTR [%ebp+16], %ebx
	jge	.L20
.L24:
        ## exits function

This code is simple but has conditional jumps within the main loop —for a total of three if you count the loop itself. On processors with bad, or no, branch prediction, or deep pipelines, the code yields sub-optimal performance. Every time a branch is mis-predicted, there’s a definite performance penalty. To avoid branching, we could try to transform Bresenham’s algorithm to a branchless algorithm (that is, except for the main loop that cycles from x_1 to x_2).

One possibility is to use the machine specific integer types to do all the work for us. Using the standard C library <stddint.h> header that provides types such as int16_t, we can create a fixed point number, one that is friendly to the machine’s natural register sizes. On 32 bits x86, those are 16 and 32 bits registers. On a 64 bits machine, we could still use 32 and 16 bits (and I’ve already explained why).

So let us define a fixed point number as the union of a 32 bits int and of two 16 bits integers that will hold, respectively, the fractional part and the integer part of the number. Note that on x86 computers, the memory layout is little endian, so one would have to reverse the declaration of lo and hi on a big endian machine. The declaration could be something like:

 typedef 
  union 
   {
    int32_t i;
    struct
     {
      int16_t lo; // endian-specific!
      int16_t hi;
     };
   } fixed_point;

The only thing we need, is to compute the slope m as a fixed point number rather than a floating point number. This allows us to write:

  fixed_point y;
  int x; 
  int32_t m=((int32_t)(y2-y1)<<16)/(x2-x1); // slope as fixed point
  y.i=y1<<16; // convert to fixed point

  for (x=x1;x<x2;x++,y.i+=m)
    put_pixel(x,y.hi,color);

Only the integer part of the fixed point is used to call put_pixel. The whole fixed point number, however, is used to incrementally add the slope, m, to y. The trick here is that the integer .i that holds the fixed point number can be operated upon as is, and, for addition, adding two 32 bits integers that hold each a fixed point number yields a correct fixed point number. This allows the compiler to generate, compile-time, much simpler code that what one would have thought:

.L16:
	lea	%esi, [%edi+%esi]
.L13:
	mov	%eax, DWORD PTR [%ebp+24]
	mov	DWORD PTR [%esp], %ebx
	add	%ebx, 1
	mov	DWORD PTR [%esp+8], %eax
	mov	%eax, %esi
	sar	%eax, 16
	mov	DWORD PTR [%esp+4], %eax
	call	put_pixel
	cmp	DWORD PTR [%ebp+16], %ebx
	jge	.L16

We can add rounding to make sure the pixels drawn are, as with Bresenham’s algorithm, always as close as possible to the “true” line from (x_1,y_1) to (x_2,y_2). The main C loops becomes:

  fixed_point f;
  int x; 
  int32_t m=((int32_t)(y2-y1)<<16)/(x2-x1);

  f.i=y1<<16;
  for (x=x1;x<=x2;x++,f.i+=m)
   {
    fixed_point g=f;
    g.i+=32767;
    put_pixel(x,g.hi,color);
   }

and generates the following:

.L16:
	lea	%esi, [%edi+%esi]
.L13:
	mov	%edx, DWORD PTR [%ebp+24]
	lea	%eax, [%esi+32767]
	sar	%eax, 16
	mov	DWORD PTR [%esp], %ebx
	add	%ebx, 1
	mov	DWORD PTR [%esp+4], %eax
	mov	DWORD PTR [%esp+8], %edx
	call	put_pixel
	cmp	DWORD PTR [%ebp+16], %ebx
	jge	.L16

Which doesn’t differ much from the other version.

*
* *

Now it’s time to check the resulting speeds of the implementations. I have included results for a 32 bits machine, an Athlon XP 2000+ (1.66GHz) and a 64 bits machine, a AMD64 Athlon 4000+ (2.25GHz). The first machine is incapable of SSE, while the second is limited to SSE2. In both cases, I used similar compilation options:

gcc -fno-inline-functions -O3 -mX lines.c -o lines

with X being either 32 or 64 (bits) depending on the machine. The test was to draw a line with different offsets but the same length, about 200 pixels long. The buffer was prefetched in cache to avoid spurious fluctuations. In 32 bits mode, we get:

Timing results, 32 bits Athlon XP 2000+

Timing results, 32 bits Athlon XP 2000+

The naïve algorithm in float averages 4.81 µs, Bresenham’s algorithm averages at 1.84 µs, my fixed point variation at 1.74 µs. The Fixed point implementation runs about 5% faster. Which isn’t all that much, but still better than Bresenham’s; and much better than the naïve version using mixed floats and integers.

The situation is similar on the 64 bits machine, but the advantage of fixed point vanishes. Both methods takes very similar times: fixed point averages at 0.85 µs and Bresenham 0.84 µs, a difference of about 1%. However, the naïve implementation is still very far behind, at 2.96 µs.

Timing results, 64 bits AMD64 4000+

Timing results, 64 bits AMD64 4000+

*
* *

Although the testing is far from exhaustive, the results indicates that the fixed point algorithm is very efficient. I would guess that as you go down to simpler processors, the more efficient it will be. First, it has fewer branches than Bresenham’s algorithm. Second, it is quite possible that a clever use of the instruction set and registers yield code that is much more efficient than what gcc generates—although, I admit, it seems quite efficient at times.



[1] Jack E. Bresenham — Algorithm for computer control of a digital plotter — IBM Systems Journal, Vol. 4, No.1, January 1965, pp. 25–30 (pdf, subscription needed)

[2] James D. Foley, Andries van Dam, Steven K. Feiner, John F. Hughes — Computer Graphics: Principles and Practice — Addison-Wesley, 1995.

4 Responses to Faster than Bresenham’s Algorithm?

  1. scaryreasoner says:

    Nice. Might come in handy some day. Thanks.

  2. tommy says:

    thanks,

    helpfully

  3. [...] of doing a ‘free’ scaling is to use a trick similar to what I used for an alternative line algorithm. That is, we’ll use the language to have a shift free [...]

  4. [...] course there are better ways of drawing a line between two points. Interpolation may be used to fill a continuum or only to get [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 73 other followers

%d bloggers like this: