## Fast Interpolation (Interpolation, part V)

In the last four installments of this series, we have seen linear interpolation, cubic interpolation, Hermite splines, and lastly cardinal splines.

In this installment (which should be the last of the series, at least for a while), let us have a look at how we can implement these interpolations efficiently.

Let us consider cubic interpolation. We know that both Hermite and cardinal splines can be computed in essentially the same way cubic interpolation is, so we do not lose generality here. The matrices differ, but not the general method.

Cubic interpolation is done by solving a system of four equations in four unknowns, each equation corresponding to the polynomial evaluated at a specific position. Solving is done by inverting the resulting Vandermonde matrix (re-read the cubic interpolation entry if need be).

The Vandermonde matrix, for the special case where $x_{-1}=-1$, $x_0=0$, $x_1=1$, and $x_2=2$ (which is also without loss of generality, surprisingly), is given by:

$M=\left[ \begin{array}{cccc} -1 & 1 & -1 & 1\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 1\\ 8 & 4 & 2 & 1 \end{array}\right]$

and its inverse by

$M^{-1}=\displaystyle\frac{1}{6}\left[ \begin{array}{cccc} -1 & 3 & -3 & 1\\ 3 & -6 & 3 & 0\\ -2 & -3 & 6 & -1\\ 0 & 6 & 0 & 0 \end{array}\right]$

To solve for the coefficients of the cubic polynomial $[~a~b~c~d~]$, we compute:

$\left[\begin{array}{c}a\\ b\\ c\\ d\end{array}\right]= M^{-1} y=\displaystyle\frac{1}{6}\left[ \begin{array}{cccc} -1 & 3 & -3 & 1\\ 3 & -6 & 3 & 0\\ -2 & -3 & 6 & -1\\ 0 & 6 & 0 & 0 \end{array} \right] \left[\begin{array}{c} y_{-1} \\ y_0 \\ y_1 \\ y_2 \end{array}\right]$

A direct implementation in C/C++ would look something like:

```c
////////////////////////////////////////
//
// The Vandermonde matrix for the cubic
// polynomial (with special case x=-1,0,1,2)
// is
//
// | -1  1 -1  1 |
// |  0  0  0  1 |
// |  1  1  1  1 |
// |  8  4  2  1 |
//
// and its inverse is
//
//       | -1  3 -3  1 |
// (1/6) |  3 -6  3  0 |
//       | -2 -3  6 -1 |
//       |  0  6  0  0 |
//
float_x cubic_interpolation( float_x y_0,
                             float_x y_1,
                             float_x y_2,
                             float_x y_3,
                             float_x x)
 {
  float_x a = (  -  y_0 + 3*y_1 - 3*y_2 + y_3 )/6;
  float_x b = (  3*y_0 - 6*y_1 + 3*y_2        )/6;
  float_x c = ( -2*y_0 - 3*y_1 + 6*y_2 - y_3  )/6;
  float_x d = y_1;

  // Horner!
  return ((a*x+b)*x+c)*x+d;
 }
```


This is a direct application of the matrix formula: it solves for the coefficients, then evaluates the polynomial at a given $x$ (the polynomial being evaluated using Horner's method, which reduces the cost of computing it quite a lot).

This implementation is not very efficient because 1) it solves the system all over again for every $x$ (it should solve the system once and evaluate the polynomial separately) and 2) it does not take advantage of a matrix/vector or matrix/matrix multiply from a high-performance library such as, say, ATLAS or OpenBLAS.
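To illustrate the first point, here is a minimal sketch of separating the solve from the evaluation. The split and the function names are mine, and `float_x` is assumed to be `double`; the coefficients are the same as in the function above.

```c
#include <assert.h>
#include <math.h>

typedef double float_x; /* assumed; the original float_x is defined elsewhere */

/* Solve once per patch for the coefficients [a b c d]:
   this is M^{-1} y, with y = [ y_0 y_1 y_2 y_3 ]. */
static void cubic_coefficients(float_x y_0, float_x y_1,
                               float_x y_2, float_x y_3,
                               float_x coeffs[4])
{
    coeffs[0] = ( -   y_0 + 3*y_1 - 3*y_2 + y_3 ) / 6; /* a */
    coeffs[1] = (   3*y_0 - 6*y_1 + 3*y_2       ) / 6; /* b */
    coeffs[2] = ( - 2*y_0 - 3*y_1 + 6*y_2 - y_3 ) / 6; /* c */
    coeffs[3] = y_1;                                   /* d */
}

/* Evaluate the cubic (Horner's method) as many times as needed,
   without re-solving the system. */
static float_x cubic_eval(const float_x c[4], float_x x)
{
    return ((c[0]*x + c[1])*x + c[2])*x + c[3];
}
```

Solving once and evaluating many times already removes the redundant work; the bundling below removes the per-call overhead as well.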

In fact, we could take maximal advantage of such libraries by solving not only one, but many systems at the same time. How can we do this?

We find the coefficients by computing $M^{-1}y$ for some values $y=[~y_{-1}~y_0~y_1~y_2~]$, then we solve another system by computing $M^{-1}y^\prime$ for another set of values $y^\prime=[~y_{-1}^\prime~y_0^\prime~y_1^\prime~y_2^\prime~]$. Now, we can notice that a matrix/matrix product works “by column”. Indeed, if we compute the product $AB$ for two conformant matrices $A$ and $B$, we can break the product into a series of matrix/vector products, each multiplying one column of $B$ by the whole matrix $A$. In other words, if $AB=C$, then the $i$th column of $C$ is given by $A$ times the $i$th column of $B$.

This means that the columns of $C$ can be computed independently (so, no, I’m not going to tell you that you can compute them in parallel, which is true, but that’s not the point I want to make). Therefore, the matrix $B$ can be seen as the concatenation of the $y$s. OK, let us have a look at a matlab/octave experiment to show this property:

```matlab
octave:1> m=[-1,1,-1,1;0,0,0,1;1,1,1,1;8,4,2,1]
m =

  -1   1  -1   1
   0   0   0   1
   1   1   1   1
   8   4   2   1

octave:2> mm=inverse(m)
mm =

  -0.16667   0.50000  -0.50000   0.16667
   0.50000  -1.00000   0.50000   0.00000
  -0.33333  -0.50000   1.00000  -0.16667
   0.00000   1.00000   0.00000   0.00000

octave:3> y1=[1;2;3;4]
y1 =

   1
   2
   3
   4

octave:4> y2=[3;5;7;2]
y2 =

   3
   5
   7
   2

octave:5> mm * y1
ans =

   0
   0
   1
   2

octave:6> mm * y2
ans =

  -1.16667
   0.00000
   3.16667
   5.00000

octave:7> mm * horzcat(y1,y2)
ans =

  -0.00000  -1.16667
   0.00000   0.00000
   1.00000   3.16667
   2.00000   5.00000
```



where `horzcat` (naturally) concatenates two matrices or vectors horizontally.

So we see that we can solve many systems simultaneously by bundling them into a single matrix/matrix product. In general, if we have $k-1$ patches to interpolate between $k$ points $\{(x_i,y_i)\}_{i=0}^{k-1}$, the bundled system will look like

$\left[ \begin{array}{cccc} a_0 & a_1 & & a_{k-2}\\ b_0 & b_1 & \ldots & b_{k-2}\\ c_0 & c_1 & & c_{k-2}\\ d_0 & d_1 & & d_{k-2} \end{array} \right]= M^{-1} \left[ \begin{array}{ccccc} y_0 & y_0 & y_1 & & y_{k-3}\\ y_0 & y_1 & y_2 & \ldots & y_{k-2}\\ y_1 & y_2 & y_3 & & y_{k-1}\\ y_2 & y_3 & y_4 & & y_{k-1}\end{array}\right]$

where the coefficients are indexed by the patch number. We also set $y_{-1}=y_0$ for the first patch and $y_{k}=y_{k-1}$ for the last patch. You can devise other strategies to fill in the missing values but, in general, simply putting zeros there is a bad idea.
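A sketch, in C, of how that right-hand-side matrix could be assembled from $n$ points, duplicating the edge values as described. The function name is mine, and the matrix is stored column-major (an assumption, chosen because BLAS routines such as `dgemm` default to column-major storage); a single matrix/matrix product of the $4\times 4$ inverse with this $4\times(n-1)$ matrix then yields all the coefficient columns at once.

```c
#include <assert.h>

/* Fill the 4 x (n-1) right-hand-side matrix B, column-major:
   column i holds [ y_{i-1}, y_i, y_{i+1}, y_{i+2} ] for the i-th patch,
   with y_{-1} taken as y_0 and y_n taken as y_{n-1} (edge duplication). */
static void build_rhs(const double *y, int n, double *B)
{
    for (int i = 0; i < n - 1; i++)
    {
        B[4*i + 0] = y[i > 0     ? i - 1 : 0    ]; /* y_{i-1}, clamped */
        B[4*i + 1] = y[i];
        B[4*i + 2] = y[i + 1];
        B[4*i + 3] = y[i + 2 < n ? i + 2 : n - 1]; /* y_{i+2}, clamped */
    }
}
```

With `B` in hand, one call to a BLAS `dgemm`-style routine replaces $n-1$ separate solves.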

*
* *

You can use a similar trick to evaluate many points in a single matrix/matrix product. We can rewrite the whole interpolation for a patch as a product $Xa=\hat{y}$, where $X$ is a matrix whose $j$th row is given by $[~\hat{x}_j^3~\hat{x}_j^2~\hat{x}_j~1~]$ (where $\hat{x}_j$ is a point lying somewhere between two of the original points, say $x_i \leqslant \hat{x}_j \leqslant x_{i+1}$), and where $a$ is a vector containing the coefficients of the cubic solved around $x_i$. This yields $\hat{y}$, a (column) vector containing all the interpolated points on the patch.
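The product $Xa=\hat{y}$ for one patch can be sketched as follows (the function name is mine; each row-times-vector product is simply written out in the loop):

```c
#include <assert.h>
#include <math.h>

/* yhat = X a, where row j of X is [ xhat_j^3  xhat_j^2  xhat_j  1 ]
   and a = [a b c d] holds the patch's cubic coefficients. */
static void eval_patch(const double a[4],
                       const double *xhat, int m,
                       double *yhat)
{
    for (int j = 0; j < m; j++)
    {
        double x = xhat[j];
        yhat[j] = a[0]*x*x*x + a[1]*x*x + a[2]*x + a[3];
    }
}
```

In a full implementation, one would of course materialize $X$ once and hand it, together with the whole coefficient matrix, to a single library-provided matrix/matrix product rather than loop patch by patch.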

*
* *

So this may seem an awfully complicated way to optimize interpolation, but we save on many levels. First, we save the overhead of calling the function that solves each system and the overhead of calling the function that evaluates the polynomial. Second, a good matrix/matrix multiply will vectorize (i.e., use SSE+ instructions) and parallelize the product in ways that are impossible if we perform a separate call for each matrix/vector product. In fact, the speed-ups can be quite impressive.

*
* *

So this concludes (for now) the series of entries on interpolation. It is by no means an exhaustive survey of the field, but it should have given you the gist of interpolation and of how polynomial interpolation, in particular, is done.