I always disliked when a book gives equations and formulas as if by fiat, without providing justification or demonstration; I don’t mind that they skip a few steps, as long as I can fill in the blanks myself if I care to. Linear regression is one of those formulas we see but would like to understand better.
So let us see how to derive the formulas for linear regression, and how they generalize to any number of unknowns.
First, you start with a bunch of points $(x_i, y_i)$ (for $i = 1, \dots, n$) that you suppose are more or less lying on a line—if you suspect that they do not lie on a line, don’t do a linear regression, try another model! You want to find $a$ and $b$ such that

$\hat{y}_i = a x_i + b.$
You do not want just any $a$ and $b$; you want the $a$ and $b$ for which the error between $\hat{y}_i$ and $y_i$ is minimized. (I use the “hat notation” to denote predicted or estimated quantities.) You can use any measure of error (as long as it is a metric), but analytically, the squared error is convenient. So you can write
$E(a, b) = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - a x_i - b)^2,$

the sum of the squared errors. There are three advantages to using this particular error form. First, every error is either zero or positive, so you don’t have “negative errors” that give a better fit despite being (negatively) greater errors. Second, in general, solving for the derivatives of the error relative to the unknowns is analytically friendly (I won’t say necessarily easy). Third, the error function, in our case, is convex, and allows us to solve

$\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0$

(simultaneously) for the best parameters $a$ and $b$. If you solve $\frac{\partial E}{\partial a} = 0$ and isolate $a$, then solve $\frac{\partial E}{\partial b} = 0$ and isolate $b$, you find

$a = \frac{\overline{xy} - b\,\bar{x}}{\overline{x^2}}, \qquad b = \bar{y} - a\,\bar{x},$
where $\bar{x}$ is shorthand for $\frac{1}{n}\sum_{i=1}^n x_i$ (and similarly for $\bar{y}$, $\overline{xy}$, and $\overline{x^2}$). These two equations are cute until you notice that they mutually refer to each other. So you have to substitute one into the other, or, as you find in most books, both. After basic (but rather lengthy) manipulations you arrive at the textbook forms

$a = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad b = \frac{\sum_i y_i \sum_i x_i^2 - \sum_i x_i \sum_i x_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},$
which are the equations found in, say, Mathworld. So the steps were:
- Establish a model. In our example, it’s the line $\hat{y} = ax + b$.
- Establish a (convenient) error function, in our case, the sum of squared errors: $E = \sum_i (y_i - \hat{y}_i)^2$.
- Because the sum of squared errors is convex in $a$ and $b$, we can solve for the minimum, where the derivatives are zero (since it’s convex, it conveniently has only one such zero), that is, solving

$\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0$

for $a$ and $b$.
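The three steps above can be sketched symbolically. Here is a minimal sketch using SymPy on a hypothetical toy dataset (the data and variable names are mine, not from the text): we build the error $E$, take both partial derivatives, and solve the resulting system.

```python
import sympy as sp

a, b = sp.symbols('a b')

# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

# Sum of squared errors E(a, b).
E = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Solve dE/da = 0 and dE/db = 0 simultaneously.
sol = sp.solve([sp.diff(E, a), sp.diff(E, b)], [a, b])
print(sol)  # {a: 2, b: 1}
```

Since the data lies exactly on a line, the minimizer recovers the slope and intercept exactly.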
We can move on to actually computing the parameters $a$ and $b$ efficiently by noticing there aren’t that many different sums composing the two formulas. We must be extra careful if the denominator $n \sum_i x_i^2 - \left(\sum_i x_i\right)^2$ turns out to be extremely small, but that’s about it.
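As a sketch of that computation, here is a hypothetical helper (the name and the crude smallness guard are mine) that accumulates the handful of sums and applies the textbook forms:

```python
def linear_regression(xs, ys):
    # Textbook closed forms:
    #   a = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    #   b = (Sy*Sxx - Sx*Sxy) / (n*Sxx - Sx^2)
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    d = n * sxx - sx * sx
    if abs(d) < 1e-12:  # crude guard: all x_i (nearly) equal, no unique line
        raise ValueError("degenerate data: x values are (almost) all equal")
    a = (n * sxy - sx * sy) / d
    b = (sy * sxx - sx * sxy) / d
    return a, b

# Noisy made-up data around y = 2x + 1: a, b come out close to 2 and 1.
a, b = linear_regression([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.1, 6.9])
print(a, b)
```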
That’s usually where the story ends. You have $a$ and $b$, so rejoice, the end. But the error-derivative-solve method has a serious limitation, at least from a pragmatic point of view. If you fit a (linear) model with 3 variables, you end up with 3 derivatives, which you solve for the variables, and you get 3 equations. But the thing is, each of the three equations is expressed in terms of the two other variables, just like you got two equations each expressed in terms of the other variable for simple 1D linear regression. So you can imagine how much fun it is to substitute, say, $b$ and $c$ into the equation for $a$. It’s entirely possible to do so, but the complexity of doing this by hand explodes.
But it turns out there’s another way that does not ask for quite as much effort and that scales to any number of variables (at least in principle; let us not discuss issues of numerical stability here).
The denominator of the textbook formulas, $n \sum_i x_i^2 - \left(\sum_i x_i\right)^2$, turns out to be the determinant of the matrix

$\begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix},$

and you can verify that

$\begin{pmatrix} b \\ a \end{pmatrix} = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}$

gives the same answer as the classical formulas.
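A quick numerical check of that claim with NumPy, on made-up data (the arrays are illustrative, not from the text): both routes should land on the same slope and intercept.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1
n = len(xs)

# Classical textbook formulas.
d = n * np.sum(xs**2) - np.sum(xs)**2
a = (n * np.sum(xs * ys) - np.sum(xs) * np.sum(ys)) / d
b = (np.sum(ys) * np.sum(xs**2) - np.sum(xs) * np.sum(xs * ys)) / d

# Matrix form: beta = (X^T X)^{-1} X^T y, with X = [1 | x], beta = (b, a).
X = np.column_stack([np.ones(n), xs])
beta = np.linalg.inv(X.T @ X) @ X.T @ ys

print(np.allclose(beta, [b, a]))  # True
```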
Now you’re thinking: wow, where did that come from?!
Well, let’s see.
Usually, we can solve a linear system of simultaneous equations if we have $n$ equations and $n$ variables. The coefficients of the equations form the square matrix $A$, the variables the column vector $x$, and the RHS the column vector $b$, and they are linked by the matrix equation

$Ax = b,$

which we solve by essentially inverting the matrix $A$, that is,

$x = A^{-1}b.$
But that works fine only when the matrix is square and non-singular, that is, when there’s a (classical) inverse. When it’s rectangular and/or singular, as is the case with linear regression, we might think we’re done for, but we’re not, at least, not always.
In the case of linear regression, we have a bunch of equations (all those $\hat{y}_i = a x_i + b$ equations), and instead of having a nice square matrix, we have a tall, skinny $n \times 2$ matrix representing the equations, which is not the type of matrix we’re used to.
In vector form, $\hat{y}_i = a x_i + b$ can be written as

$\hat{y}_i = \begin{pmatrix} 1 & x_i \end{pmatrix} \begin{pmatrix} b \\ a \end{pmatrix}.$
And if you stack all the equations in the same system, you get

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b \\ a \end{pmatrix},$

which can’t be solved as is. Before we continue, let

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$, $X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$, and $\beta = \begin{pmatrix} b \\ a \end{pmatrix}$

(it will make things a lot simpler from now on). So we have

$y = X\beta,$
which you can’t solve directly because $X$ is not a square matrix. It turns out there’s a nifty trick (due to Moore and Penrose?) called the normal equations.
Start from $y = X\beta$ and multiply both sides on the left by $X^T$ (necessary for the matrices to be compatible for multiplication), yielding

$X^T y = X^T X \beta,$
and since $X^T X$ is square and positive definite (provided the $x_i$ are not all identical), we can invert it:

$\hat{\beta} = (X^T X)^{-1} X^T y,$
with $\hat{\beta}$ containing the best values of $b$ and $a$, solved to minimize the squared errors. (Multiplying by $X^T$ is not just a dimensional trick: setting the gradient of $\|y - X\beta\|^2$ to zero yields exactly $X^T X \beta = X^T y$.) You can verify that this $\hat{\beta}$ gives the same $a$ and $b$ as the classical formulas.
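A minimal sketch of the normal equations in NumPy, on hypothetical data (the arrays are mine): in practice one solves the system rather than forming $(X^T X)^{-1}$ explicitly, and NumPy’s dedicated least-squares routine is preferable anyway for numerical stability.

```python
import numpy as np

xv = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * xv + 1.0  # made-up data on y = 2x + 1

X = np.column_stack([np.ones_like(xv), xv])

# Normal equations, solved without forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same fit via NumPy's least-squares solver (QR/SVD based).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # (b, a) = (1, 2)
```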
OK, this seems like a lot of work to get a simple linear regression, but the main point is that this method generalizes to any number of variables. To fit a plane, you’d use the function $\hat{z} = ax + by + c$ (one of the possible forms for the equation of a plane) and you’d have

$X = \begin{pmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix},$

the vectors $\beta = \begin{pmatrix} c \\ a \\ b \end{pmatrix}$ and $z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}$, and you solve the equation for $c$, $a$, $b$ using

$\hat{\beta} = (X^T X)^{-1} X^T z,$
and you’re done; we have moved from a tedious process to a convenient recipe.
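For instance, fitting a plane with the same recipe, on made-up points lying exactly on $z = 2x + 3y + 1$ (the data is illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 0.0, 1.0, 2.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 2.0])
z = 2.0 * x + 3.0 * y + 1.0  # exact plane z = 2x + 3y + 1

# One row (1, x_i, y_i) per point; beta = (c, a, b).
X = np.column_stack([np.ones_like(x), x, y])
beta = np.linalg.solve(X.T @ X, X.T @ z)

print(beta)  # (c, a, b) = (1, 2, 3)
```

The only thing that changed from the line fit is the shape of $X$; the recipe itself is identical.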
The special matrix $X$ is a Vandermonde matrix, which has plenty of good properties, such as, if it happens to be square, being invertible in $O(n^2)$ operations, which is much faster than the usual $O(n^3)$. The matrix $X^T X$ is also special (I think it is a Hankel matrix) and probably has efficient inversion algorithms, although I couldn’t find any by googling (OK, maybe I didn’t google too hard ;) ).