Preparing lecture notes (or videos) sometimes brings you to revisit things you’ve known for a while, but haven’t really taken time to formalize properly. One of those things is fixed-point arithmetic.

Fixed-point arithmetic combines the ability to represent fractional values with great simplicity in the operations. Contrary to floating point, arithmetic on fixed-points can be done entirely with integer operations, which explains their popularity on DSPs of all sorts.

Basically, a fixed-point number is just a fixed-width number of bits, with an unmoving, virtual point. For example, in a 16-bit register, four bits could be devoted to the fractional part, and 12 to the integer part, making a Q12.4 fixed-point (the Qn.m notation says that we have n integer bits and m fractional bits). In a fixed-point, the fractional bits behave as you would expect: the first after the point is worth 1/2, the next one 1/4, etc. Furthermore, the whole register can be interpreted in two’s complement, giving us signed fixed-points.

What makes it especially nice is the arithmetic that comes with fixed-point numbers:

  • Addition/subtraction. It behaves exactly like two’s-complement integer addition and subtraction. Therefore, adding/subtracting two fixed-points can be done directly by adding/subtracting the registers as signed integers.
  • Multiplication. Multiplying together two Qn.m fixed-points results in a Q2n.2m fixed-point, which we must restore at least to Q2n.m (as always, if it overflows, that’s your problem). To multiply two Qn.m fixed-points, we use integer multiplication and shift the result right by m bits to restore it to Q2n.m. The nice thing is that this shift performs the correct truncation.
  • Division. Division is the tricky one (but isn’t much harder). Let

    \displaystyle \left(a+\frac{b}{2^m}\right)

    and

    \displaystyle \left(c+\frac{d}{2^m}\right)

    be two Qn.m values. Then

    \displaystyle \frac{a+\frac{b}{2^m}}{c+\frac{d}{2^m}} = \frac{a2^m+b}{c2^m+d}.

    The raw registers hold a2^m+b and c2^m+d, so dividing them as integers yields a plain integer quotient: it’s as if we lost the fractional bits. That’s a problem, because the result should be expressed in 2^mths. To express the result correctly in 2^mths, we first multiply the dividend (the number being divided) by 2^m, which is only a shift left.

So basically, fixed-point numbers are numbers expressed in 2^mths.

* *

So let’s implement that. First, we need a cute little template that, given an integer type, gives us the type with twice the number of bits (for multiplication and division):

// enough.hpp
#ifndef __MODULE_ENOUGH__
#define __MODULE_ENOUGH__

#include <cstdint> // int8_t, int16_t, int32_t, int64_t
namespace enough {
 template <typename T> struct __twice; // incomplete type

 template <> struct __twice<int8_t>  { using type = int16_t; };
 template <> struct __twice<int16_t> { using type = int32_t; };
 template <> struct __twice<int32_t> { using type = int64_t; };
 template <> struct __twice<int64_t> { using type = int64_t; }; // fails silently

 template <typename T>
 using twice=typename __twice<T>::type;

} // namespace enough

#endif // __MODULE_ENOUGH__

We can use C++2a concepts to make sure that the fixed-point template has consistent arguments (I’m just beginning to look into C++20), and we overload the basic operators:

// fixed.hpp
#ifndef __MODULE_FIXED__
#define __MODULE_FIXED__

#include <type_traits>
#include <enough.hpp>

template <typename T, int scale=4>
requires ((std::is_signed<T>::value==true) // c++2a
          && (scale<sizeof(T)*8))
class fixed
 {
 private:

      T number;

      // tag-dispatched constructor: builds directly from the raw representation
      fixed(int, T n) : number(n) {}

 public:

      // with another fixed
      fixed operator+(const fixed & f) const { return fixed(0,number+f.number); }
      fixed operator-(const fixed & f) const { return fixed(0,number-f.number); }
      fixed operator*(const fixed & f) const { return fixed(0,(enough::twice<T>{number}*f.number) >> scale); }
      fixed operator/(const fixed & f) const { return fixed(0,(enough::twice<T>{number}<<scale)/f.number); }

      // with a float
      fixed operator+(float f) const { return *this+fixed(f); }
      fixed operator-(float f) const { return *this-fixed(f); }
      fixed operator*(float f) const { return *this*fixed(f); }
      fixed operator/(float f) const { return *this/fixed(f); }

      fixed & operator=(const fixed & f) { number=f.number; return *this; }

      bool operator==(const fixed & f) const { return number==f.number; }
      operator float() const { return number/(float)(1<<scale); }

      fixed(): number(0) {}
      fixed(const fixed<T,scale> & other) : number(other.number) {}
      fixed(float f) : number(f * (1<<scale)) {}
 };

#endif // __MODULE_FIXED__

We just use it:

#include <iostream>
#include <fixed.hpp>

int main()
 {
  fixed<int16_t> a(-4),b(3);

  std::cout
   << a+b << std::endl
   << a*b << std::endl
   << a/b << std::endl
   << a-b << std::endl;

  return 0;
 }

* *

The complexity is visibly much, much less than that of floating-point numbers. Addition/subtraction is just signed integer add/sub; multiplication and division ask for an integer promotion (the “cast”) and a shift. If you choose the number of bits after the point wisely, you may even get away with no shifts and only memory operations. No wonder it’s used in low-complexity, low-power DSP chips.
