Fixed-Points

Preparing lecture notes (or videos) sometimes brings you to revisit things you’ve known for a while, but haven’t really taken time to formalize properly. One of those things is fixed-point arithmetic.

Fixed-point arithmetic joins the advantage of representing fractional values while maintaining a great simplicity in the operations. Contrary to floating point, arithmetic on fixed-points can be done using entirely integer operations, which explains their popularity on DSPs of all sorts.

Basically, a fixed-point number is just a fixed-width number of bits, with an unmoving, virtual point. For example, in a 16-bits register, four bits could be devoted to the fractional part, and 12 to the integer part, making a Q12.4 fixed-point (the Qn.m notation says that we have n integer bits, m fractional bits). In a fixed-point, the fractional bits behave as you would expect: the first after the point is worth 1/2, the next one 1/4, etc. Furthermore, the integers can be two’s-complement.

What makes it especially nice is the arithmetic that comes with fixed-point numbers:

• Addition/subtraction. It behaves exactly as two’s-complement integer addition and subtraction. Therefore, adding/subtracting two fixed-point can be done directly by adding/subtracting the registers as signed integers.
• multiplication. Multiplying together two Qn.m fixed-points results in a Q2n.2m fixed-point, that we must restore to Q2n.m at least. (as always, if it overflows, that’s your problem). To multiply two Qn.m fixed-points, we use integer multiplication and shift back (right) m bits to restore the results to Qx.m. The nice thing is that it results in the correct truncation.
• division. Division is the tricky one (but isn’t much harder). Let:

$\displaystyle \left(a+\frac{b}{2^m}\right)$

and

$\displaystyle \left(c+\frac{c}{2^m}\right)$

be two Qn.m values. Then

$\displaystyle \frac{\left(a+\frac{b}{2^m}\right)} {\left(c+\frac{d}{2^m}\right)}= \frac{a2^m+b}{c2^m+d}$.

It’s as if we lost the fractional bits. That’s a problem because the quotient isn’t an integer; and it should be represented in $2^m$ths. To express the result correctly in $2^m$ths, we multiply the dividend (the number being divided) by $2^m$, which is only a shift left.

So basically, fixed-point numbers are numbers expressed in $2^m$ths.

*
* *

So let’s implement that. First, we need a cute little template to get, given an integer type, the type containing the double number of bits (for multiplication and division):

// enough.hpp
#ifndef __MODULE_ENOUGH__
#define __MODULE_ENOUGH__

namespace enough {
template <typename T> struct __twice; // incomplete type

template <> struct __twice<int8_t>  { using type = int16_t; };
template <> struct __twice<int16_t> { using type = int32_t; };
template <> struct __twice<int32_t> { using type = int64_t; };
template <> struct __twice<int64_t> { using type = int64_t; }; // fails silently

template <typename T>
using twice=typename __twice<T>::type;

} // namespace enough

#endif
// __MODULE_ENOUGH__


We could use C++2a concepts to make sure that the fixed-point template has consistent arguments (I’m just beginning to look into C++20), and make sure we overload the basic operators:

// fixed.hpp
#ifndef __MODULE_FIXED__
#define __MODULE_FIXED__

#include <enough.hpp>

template <typename T, int scale=4>
requires ((std::is_signed<T>::value==true) // c++2a
&& (scale<sizeof(T)*8))
class fixed
{
private:

T number;

fixed(int, T n) :number(n) {}

public:

// avec un autre fixed
fixed operator+(const fixed & f) const { return fixed(0,number+f.number); }
fixed operator-(const fixed & f) const { return fixed(0,number-f.number); }
fixed operator*(const fixed & f) const { return fixed(0,(enough::twice<T>{number}*f.number) >> scale); }
fixed operator/(const fixed & f) const { return fixed(0,(enough::twice<T>{number}<<scale)/f.number); }

// avec float
fixed operator+(float f) const { return *this+fixed(f); }
fixed operator-(float f) const { return *this-fixed(f); }
fixed operator*(float f) const { return *this*fixed(f); }
fixed operator/(float f) const { return *this/fixed(f); }

fixed operator=(const fixed & f) { return number=f.number; }

bool operator==(const fixed & f) { return number==f.number; }
operator float() const { return number/(float)(1<<scale); }

fixed(): number(0) {}
fixed(const fixed<T,scale> & other) : number(other.number) {};
fixed(float f) : number(f * (1<<scale)) {}
};

#endif
// __MODULE_FIXED__


We just use it:

#include <iostream>
#include <fixed.hpp>

int main()
{
fixed<int16_t> a(-4),b(3);

std::cout
<< a+b << std::endl
<< a*b << std::endl
<< a/b << std::endl
<< a-b << std::endl
;

return 0;
}


*
* *

The complexity is visibly much, much less than floating point numbers. Addition/subtraction is just signed integer add/sub; multiplication and division ask for an integer promotion (the “cast”) and a shift. If you choose the number of bits after the point wisely, you may even get away with no-shifts and only memory operations. No wonder it’s used in low-complexity, low-power DSP chips.