A couple of months ago (already!) 0xjfdube produced an excellent piece on table-based trigonometric function computations. The basic idea was that you can look up the value of a trigonometric function rather than actually computing it, on the premise that computing such functions directly is inherently slow. Turns out that's actually the case, even on fancy CPUs. He discusses the precision of the estimate based on the size of the table and shows that you don't need a terribly large number of entries to get decent precision. He muses over the possibility of using interpolation to augment precision, speculating that it might be slow anyway.

I started thinking about how to interpolate efficiently between table entries, but then I realized that it's not fundamentally more complicated to compute a polynomial approximation of the function than to search the table and then use polynomial interpolation between entries.
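For reference, the table-plus-interpolation approach contemplated in 0xjfdube's piece can be sketched as follows. The table size, the names, and the layout are mine, purely illustrative:

```cpp
#include <cmath>

// Purely illustrative: a 256-entry cosine table covering [0, pi/2],
// with linear interpolation between adjacent entries.
static const int TABLE_SIZE = 256;
static float table[TABLE_SIZE + 2]; // one extra entry so i+1 is always valid

void init_table()
{
  for (int i = 0; i < TABLE_SIZE + 2; i++)
    table[i] = std::cos(i * (float)M_PI / (2 * TABLE_SIZE));
}

float table_cos(float x) // expects 0 <= x <= pi/2
{
  const float t = x * (2 * TABLE_SIZE / (float)M_PI); // scale to table units
  const int i = (int)t;                               // entry index
  const float f = t - i;                              // fractional part
  return table[i] + f * (table[i + 1] - table[i]);    // linear interpolation
}
```

With 256 entries and linear interpolation, the error is on the order of $10^{-6}$; without interpolation it would be orders of magnitude larger, which is precisely the trade-off the original piece discusses.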

The classical textbook example is to use a Taylor series expansion of the function around the point of interest, let us say $x=a$. For $\cos x$, it leads to quite a mess:

$$\cos x = \cos a - \sin(a)(x-a) - \frac{\cos a}{2!}(x-a)^2 + \frac{\sin a}{3!}(x-a)^3 + \frac{\cos a}{4!}(x-a)^4 - \cdots$$

Not only does it require knowing $\cos a$ (and $\sin a$) for a suitable $a$ near the $x$ we want, the expression doesn't look easily factorizable. If we rather dispense with $a$ and expand around zero, we get a Maclaurin series, the special case of the Taylor series when $a=0$. For $\cos x$, we find that:

$$\cos x=\sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}x^{2n}=1-\frac{x^2}{2!}+\frac{x^4}{4!}-\frac{x^6}{6!}+\cdots$$

While this expression seems to require the calculation of large powers and factorials, a bit of factoring leads to a much nicer expression involving only small products and squarings. Indeed:

$$\cos x=1-\frac{x^2}{1\cdot 2}+\frac{x^2}{1\cdot 2}\,\frac{x^2}{3\cdot 4}-\frac{x^2}{1\cdot 2}\,\frac{x^2}{3\cdot 4}\,\frac{x^2}{5\cdot 6}+\cdots$$

which can be further factorized:

$$\cos x=1-\frac{x^2}{1\cdot 2}\left(1-\frac{x^2}{3\cdot 4}\left(1-\frac{x^2}{5\cdot 6}\left(1-\cdots\right)\right)\right)$$

where $\cdots$ stands for the infinitely many remaining terms. This expansion means that the $n$-th nested term is given by

$$\left(1-\frac{x^2}{2n(2n-1)}\left(\cdots\right)\right)$$

which doesn't look all that bad. Furthermore, since $x^2$ is the same everywhere, you can compute it once and reuse it. A straightforward implementation would look something like:

```cpp
inline float taylor(float x, int n)
{
  const float x2 = x * x;
  float s = 1.0f;
  while (n)
  {
    const float n2 = 2.0f * n;
    s = 1.0f - x2 / (n2 * (n2 - 1.0f)) * s;
    n--;
  }
  return s;
}
```

One may think of using different decompositions/approximations. In [1], we find the infinite product

$$\cos x=\prod_{n=1}^{\infty}\left(1-\frac{4x^2}{\pi^2(2n-1)^2}\right)$$

but we will see that it doesn’t converge fast enough (with a rather small number of terms) to be interesting.
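To see just how slowly it converges, here is a quick sketch (mine, not from [1]) that evaluates the truncated product:

```cpp
#include <cmath>

// Truncated infinite product for cos x:
// cos x ~ prod_{k=1}^{n} (1 - 4x^2 / (pi^2 (2k-1)^2))
float product_cos(float x, int n)
{
  const float c = 4.0f * x * x / (float)(M_PI * M_PI);
  float p = 1.0f;
  for (int k = 1; k <= n; k++)
  {
    const float odd = 2.0f * k - 1.0f;
    p *= 1.0f - c / (odd * odd);
  }
  return p;
}
```

Even a dozen factors still leave an error of a few thousandths around $x=1$; the tail of the product shrinks only as $\frac{1}{n}$.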

Since $\cos$ is a rather smooth function, one may think that it can be approximated with a rather low-order polynomial. Fitting a regular degree 4 polynomial, we obtain the approximation

$$\cos x \approx a_0 + a_2x^2 + a_4x^4$$

where

$$a_0 = 0.999579,$$

$$a_2 = -0.496381,$$

and

$$a_4 = 0.0372019,$$

which I found using the `Fit` function in *Mathematica* against $\cos x$ with $0 \le x \le \frac{\pi}{2}$. The terms for $x$ and $x^3$ are exceedingly small; enough so to be ignored. Why just $0 \le x \le \frac{\pi}{2}$? Because:

Because the Taylor expansion does very well on the range $0 \le x \le \frac{\pi}{2}$. In the image above: black is the true cosine, red is a Taylor series with only two terms, orange with three, and green with seven. The purple line is the product expansion discussed above with something like 12 terms (as I said, it converges extremely slowly). So it seems reasonable to compare on the range $0 \le x \le \frac{\pi}{2}$. And this is not that much of a limitation, as we will discuss later on.

The low-degree polynomial function is given by

```cpp
float poly(float x, int)
{
  const float x2 = x * x;
  return 0.999579f - 0.496381f * x2 + 0.0372019f * x2 * x2;
}
```

(Note the `f` suffixes, used to avoid the promotion of constants to `double`.)

* * *

What about speed? Are they faster than the `<math.h>` `cos` function? Fortunately, some are. For 1 million calls:

| Method | Terms | Elapsed | Per Call |
|--------|-------|---------|----------|
| Empty loop | — | 0.008891 s | 8.891e-09 s |
| `math.h` | — | 0.028873 s | 2.8873e-08 s |
| Polynomial | — | 0.011190 s | 1.119e-08 s |
| Taylor | 2 | 0.022102 s | 2.2102e-08 s |
| Taylor | 3 | 0.030911 s | 3.0911e-08 s |
| Taylor | 7 | 0.068710 s | 6.871e-08 s |
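The timings come from a loop along these lines. This is a sketch, not the exact harness used for the table; the `benchmark` name, the argument spacing, and the `volatile` sink are my choices:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Sketch of the timing loop: one million calls, total and per-call time.
template <typename F>
double benchmark(F f, const char *name)
{
  const int N = 1000000;
  volatile float sink = 0.0f; // prevents the calls from being optimized away
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < N; i++)
    sink = sink + f(i * 1.5e-6f); // arguments spread over [0, 1.5]
  const auto stop = std::chrono::steady_clock::now();
  const double s = std::chrono::duration<double>(stop - start).count();
  std::printf("%-12s %fs total, %es per call\n", name, s, s / N);
  return s;
}
```

One then calls, say, `benchmark([](float x){ return std::cos(x); }, "math.h");` and the same for the `poly` and `taylor` routines, subtracting the empty-loop time to isolate the cost of the call itself.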

So at three terms, already, the Taylor series is slower than the `<math.h>` `cos` function. The polynomial approximation does rather well. Examining its assembly:

```
<_Z4polydi>:
  0f 28 c8                 movaps xmm1,xmm0
  f3 0f 10 15 d1 07 00 00  movss  xmm2,DWORD PTR [rip+0x7d1]
  f3 0f 59 c8              mulss  xmm1,xmm0
  f3 0f 10 05 c9 07 00 00  movss  xmm0,DWORD PTR [rip+0x7c9]
  f3 0f 59 d1              mulss  xmm2,xmm1
  f3 0f 5c c2              subss  xmm0,xmm2
  f3 0f 10 15 bd 07 00 00  movss  xmm2,DWORD PTR [rip+0x7bd]
  f3 0f 59 d1              mulss  xmm2,xmm1
  f3 0f 59 d1              mulss  xmm2,xmm1
  f3 0f 58 c2              addss  xmm0,xmm2
  c3                       ret
```

we see that `g++` (4.6.1, for this experiment) uses SSE instructions to perform `float` arithmetic (it turns out that if I force 387-style instructions only, it's slightly faster, but, meh). The code is short (10 instructions) and doesn't use prefixes (with `double`, each instruction would have a `66` prefix, which seems to be more costly to decode, in addition to taking longer to execute with double-precision arithmetic). The memory references are to the three constants.

* * *

What about precision? The maximum error is probably what interests us the most. The maximum errors for the different methods are summarized in the following table:

| Method | Terms | Max Error |
|--------|-------|-----------|
| Taylor | 2 | 0.0198623 |
| Taylor | 3 | 0.0008873 |
| Taylor | 7 | 1.07e-7 |
| Polynomial | — | 0.0012689 |

So Taylor with three terms, which compares more or less with the CPU/library `cos` function in terms of time, has a relatively high error (yes, yes, I know, big-O notation isn't for constants), one that is only a bit smaller than that of the much faster polynomial approximation, which lands in the same range.
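The maximum errors can be reproduced by scanning the range densely; something like the following sketch, where the sample count and names are mine:

```cpp
#include <cmath>

// Sketch: scan [0, pi/2] and return the maximum absolute error of an
// approximation f against the library cosine.
float max_error(float (*f)(float), int samples = 100000)
{
  float worst = 0.0f;
  for (int i = 0; i <= samples; i++)
  {
    const float x = i * (float)(M_PI / 2) / samples;
    const float e = std::fabs(f(x) - std::cos(x));
    if (e > worst) worst = e;
  }
  return worst;
}

// the degree-4 polynomial approximation discussed above
float poly4(float x)
{
  const float x2 = x * x;
  return 0.999579f - 0.496381f * x2 + 0.0372019f * x2 * x2;
}
```

For `poly4`, the scan peaks near $x=\frac{\pi}{2}$, consistent with the 0.0012689 figure in the table.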

* * *

What about the range $0 \le x \le 2\pi$? Well, that can be most easily arranged when we observe that the complete range is in fact made of translations/mirrorings of the function on $0 \le x \le \frac{\pi}{2}$. Handwaving-ly:

Le badly drawn cos still shows that each colored section is a translation/mirror image of the first one. This is exploitable. If `f` is a function approximating cosine (or sine) on $0 \le x \le \frac{\pi}{2}$, the following function suffices to cover $0 \le x \le 2\pi$:

```cpp
typedef float (*cos_func)(float, int);

float mirror(cos_func f, int n, float x)
{
  if (x < M_PI / 2.0) return f(x, n);
  else if (x < M_PI) return -f(M_PI - x, n);
  else if (x < 3 * M_PI / 2) return -f(x - M_PI, n);
  else return f(2 * M_PI - x, n);
}
```
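Pairing the mirroring with the polynomial approximation (whose term-count argument is simply ignored) then gives a full-range cosine. A self-contained version of the pieces, repeated here so the snippet stands on its own:

```cpp
#include <cmath>

typedef float (*cos_func)(float, int);

// degree-4 polynomial approximation, as above; the int is unused
float poly(float x, int)
{
  const float x2 = x * x;
  return 0.999579f - 0.496381f * x2 + 0.0372019f * x2 * x2;
}

// range reduction by mirroring, as above; expects x in [0, 2*pi)
float mirror(cos_func f, int n, float x)
{
  if (x < M_PI / 2.0) return f(x, n);
  else if (x < M_PI) return -f((float)M_PI - x, n);
  else if (x < 3 * M_PI / 2) return -f(x - (float)M_PI, n);
  else return f(2 * (float)M_PI - x, n);
}
```

For instance, `mirror(poly, 0, 4.0f)` approximates $\cos 4 \approx -0.6536$ to within the polynomial's maximum error of about $1.3\times 10^{-3}$.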

* * *

So, what to conclude? My conclusion is to use the compiler-provided `cos` function and not worry about it, most of the time. Unless your CPU can do floats but can't do `cos`. In which case, the answer depends on the precision you need. The Taylor series seems reasonably efficient (especially once factored properly, although this is more or less hand-waving since I didn't test an implementation where the $\frac{1}{2n(2n-1)}$ factors are precomputed) and precise enough; with three terms the error is $\approx 0.0009$, just a bit smaller than the polynomial approximation's.

* * *

Abramowitz & Stegun give slightly different constants for the polynomial approximation and claim an error below $10^{-3}$. I am not sure how they arrived at their slightly different constants. I would guess *Mathematica* is fairly accurate on least-mean-squares fits. Maybe their constants were minimized under the $L_\infty$ norm instead?

* * *

I am sure we can find other very good approximations involving only inexpensive operations. But the very notion of "inexpensive operation" will limit the class of functions we can explore for approximations. If we restrict ourselves to addition (and therefore subtraction) and multiplication, we pretty much only consider polynomials, with the special case of linear functions. If we allow division, we get the rational functions. As we keep adding allowable operations, the class of possible functions grows, but their complexity may exceed that of the original function, the one we're trying to avoid computing exactly.
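To illustrate what a single division can buy (my example, not something tested above): the $[2/2]$ Padé approximant of cosine around zero, $\frac{12-5x^2}{12+x^2}$, reproduces the Maclaurin series up to the $x^4$ term:

```cpp
// [2/2] Pade approximant of cos around 0; its series expansion agrees
// with 1 - x^2/2 + x^4/24, the first three Maclaurin terms.
float pade_cos(float x)
{
  const float x2 = x * x;
  return (12.0f - 5.0f * x2) / (12.0f + x2);
}
```

It does noticeably better than the two-term Taylor sum near zero, but degrades toward $\frac{\pi}{2}$, where its error grows to about $0.02$; whether the division is cheaper than the extra multiplications of a higher-degree polynomial is exactly the kind of trade-off at issue here.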

[1] Milton Abramowitz, Irene A. Stegun — *Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables*, Dover, 1965.
