## Evaluating polynomials

Evaluating polynomials is not a thing I do very often. When I do, it’s for interpolation and splines; and traditionally those are done with relatively low degree polynomials—cubic at most. There are a few rather simple tricks you can use to evaluate them efficiently, and we’ll have a look at them.

A polynomial is an expression of the form

$a_nx^n+a_{n-1}x^{n-1}+\cdots+a_2x^2+a_1x+a_0$.

A naïve approach to evaluating a polynomial would be to compute, independently, each monomial $a_jx^j$, everyone at a cost of $j$ products ($j-1$ for $x^j$, plus one for $a_j\times{}x^j$). When you sum over all $j$, you get $\frac{1}{2}n(n+1)$ products (and $n-1$ additions) for a polynomial of degree $n$. That’s way too much.

Someone, in the early 19th century, Horner, remarked that you could rewrite any polynomial as

$(\cdots((a_nx+a_{n-1})x+a_{n-2})x+\cdots+a_1)x+a_0$,

giving us $n$ products and $n-1$ additions. That’s much better. Also turns out that if there’s nothing special about the polynomial (no zero coefficients) it’s optimal.

But Horner’s method (or Horner’s formula, depending on where you read about it) is inherently sequential, because all the products are nested. It’s not amenable to parallel processing, even in its simpler SIMD form.

However, preparing lectures notes, I found that Estrin proposed a parallel algorithm to evaluate polynomial that both minimize wait time and the total number of products [1]. The scheme is shown here:

The original paper presents a method for polynomials with a degree of $n=2^k-1$, but you can easily adapt the splitting for an arbitrary degree. The first step is to split the polynomial into binomials, each of which are evaluated in parallel. You then combine those into new binomials, also evaluated in parallel, and again, until you have only one term left, that is, the answer.

What immediately comes to mind is a SIMD implementation where we use wide registers to do all products in parallel, but maybe just relying on instruction-level parallelism and the compiler is quite efficient for low degree polynomials. Let’s try with degree seven. Let’s say, with:

$9x^7+5x^6+7x^5+4x^4+5x^3+3x^2+3x+2$,

because why not. Let’s test an implementation:

////////////////////////////////////////
int pow_naif(int x, int n)
{
int p=1;
for (int i=0;i<n;i++,p*=x);

return p;
}

////////////////////////////////////////
int eval_naif(int x)
{
return
9*pow_naif(x,7)
+5*pow_naif(x,6)
+7*pow_naif(x,5)
+4*pow_naif(x,4)
+5*pow_naif(x,3)
+3*pow_naif(x,2)
+3*x
+2;
}

/////////////////////////////////////////
int eval_horner(int x)
{
return ((((((9*x+5)*x+7)*x+4)*x+5)*x+3)*x+3)*x+2;
}

////////////////////////////////////////
int eval_estrin(int x)
{
int t0=3*x+2;
int t1=5*x+3;
int t2=7*x+4;
int t3=9*x+5;
x*=x;
int t4=t1*x+t0;
int t5=t3*x+t2;
x*=x;
return t5*x+t4;
}


We compile with all optimizations enabled, and evaluate the polynomial for $x$ from 0 to 1000000000. Times are:

 Method Time (s) Naïve 1.93 Horner 1.66 Estrin 1.41

So despite not being explicitly parallel, Estrin’s version performs significantly better because of instruction-level parallelism. Let’s look at the generated code:

0000000000000e00 <_Z11eval_estrini>:
e00:  8d 04 fd 00 00 00 00    lea    eax,[rdi*8+0x0]
e07:  89 f9                   mov    ecx,edi
e09:  0f af cf                imul   ecx,edi
e0c:  8d 54 07 05             lea    edx,[rdi+rax*1+0x5]
e10:  29 f8                   sub    eax,edi
e12:  0f af d1                imul   edx,ecx
e15:  8d 44 02 04             lea    eax,[rdx+rax*1+0x4]
e19:  89 ca                   mov    edx,ecx
e1b:  0f af d1                imul   edx,ecx
e1e:  0f af c2                imul   eax,edx
e21:  8d 54 bf 03             lea    edx,[rdi+rdi*4+0x3]
e25:  0f af d1                imul   edx,ecx
e28:  8d 4c 7f 02             lea    ecx,[rdi+rdi*2+0x2]