I always disliked when a book gives equations and formulas as if of fiat without providing justification or demonstration; and I don’t mind that they skip a few steps as long that I can fill in the blanks myself if I care to. Linear regression is one of those formula we see but we’d like to understand better.

So let us see how to derive the formulas for linear regression and see how they generalize to any number of unknowns.

First, you start with a bunch of points (for ) that you suppose are more or less lying on a line—if you suspect that they do *not* lie on a line, don’t do a *linear* regression, try another model! You want to find and such that

.

You do not want just any and , you want and so that the error between and is minimized. (I use the “hat notation” to denote predicted or estimated quantities.) You can use any measure of error (as long as it is a metric) but analytically, the squared error is convenient. So you can write

,

the sum of the squared error. There are two advantages to use this particular error form. First, every error is either zero or positive so you don’t have “negative errors” that gives a better fit despite being (negatively) greater errors. Second, in general, solving for the derivatives of the error relative to the unknowns is analytically friendly (I won’t say necessarily easy). Third, the error function, in our case, is convex, and allows us to solve

and

(simultaneously) for the best parameters and . If you solve and isolate then and isolate , you find

and

where is shorthand for . These two equations are cute until you notice that they mutually refer each other. So you have to expand one or the other, or, what you find in most books, both. After basic (but rather lengthy) manipulations you arrive at the textbook forms of

and

which are the equations found in, say, Mathworld. So the steps were:

- Establish a model. In our example, it’s the line .
- Establish a (convenient) error function, in our case, squared errors:
,

- Because the sum of squared errors is convex in and , we can solve for the minimum, where the derivatives are zero (since it’s convex, it conveniently has only one such zero), that is solving:
and

for and .

We can move on to actually compute the parameters and efficiently by noticing there aren’t that many different sums that compose the two formula. We may be extra careful if the denominator turns out extremely small, but that’s about it.

*

* *

That’s usually where the story ends. You have and , so rejoice, end. But the error-derivative-solve method has a serious limitation, at least from a pragmatic point of view. If you fit a (linear) model with 3 variables, you end up with 3 derivatives, for which you solve for the variables, and you get 3 equations. But the thing is that the three equations are expressed in terms of the *two other variables*, just like you got two equations expressed in terms of the other variable for simple 1D linear regression. So you can imagine how much fun it is to replace and in the equation for . It’s entirely possible to do so, but somehow the complexity of doing this by hand exploded.

But turns out there’s another way that does not quite ask for that much effort and that scales for any number of variables (at least in principle; let us not discuss issues of numerical stability here).

The denominator

turns out to be the determinant of the matrix

and you can verify that

gives the same answer as the classical formulas.

Now you’re thinking *wow, and from where did that come from?!*

Well, let’s see.

Usually, we can solve a linear system of simultaneous equations if we have equations and variables. The coefficients of the equations form the square matrix , the variable the column vector and on the RHS the column vector , and they are linked by the matrix equation

that we solve by essentially inverting the matrix , that is

But that works fine when the matrix is square and non singular, that is, there’s a (classical) inverse. When it’s not rectangular and/or singular, as it is the case with linear regression, we might think we’re done for, but we’re not, at least, not always.

In the case of linear regression, we have a bunch of equations (all those equations), and instead of having a nice square matrix, we have a matrix representing the equations which is not the type of matrix we’re used to.

In vector form, can be written as

And if you stack all the equations in the same system, you get

which can’t be solved as is. Before we continue, let

, , and

(it will make things a lot simpler from now on). So we have

which you can’t solve directly because is not a square matrix. Turns out there’s a nifty trick (due to Moore and Penrose?) called the normal equations.

We take

and multiply both sides by (it’s obligatory to have the matrices to be compatible for multiplication), yielding

and since is square and positive definite, we can invert it:

giving, finally

With being the best values of and , solved to minimize squared errors. You can verify that

*

* *

OK, this seems like a lot of work to get a simple linear regression, but the main point is that this method generalizes to any number of variables. To fit a plane, you’d use the function (one of the possible forms for the equation of a plane) and you’d have

,

and have vectors and and you solve the equation for , , using

and you’re done; we have moved from a tedious process to a convenient recipe.

*

* *

The special matrix is a Vandermonde matrix, which has plenty of good properties, such as if it happens to be square, can be inverted in which is *much* faster than the usual . The special matrix is also special (I think it is a Hankel matrix) and probably has efficient inversion algorithms, although I could find any by googling (OK, maybe I didn’t google too hard ;) ).

“if you suspect that they do not lie on a line, don’t do a linear regression, try another model”

I think that statement is not completely true. Linear regression means your model is linear with respect to the parameters. You can use it, for instance, to fit the data to a parabola ax^2 + bx + c, or even to a sum of sinusoids a*sin(w1*x) + b*cos(w1*x). (Fourier series are a generalization of this).

I meant the model , but you’re right: these techniques work for any

linear combinationof functions; and so is a linear combination of the functions , and the constant .[…] a previous post, I presented (generalized) linear regression from a linear algebra point of view (namely, the […]

[…] a number of previous occasions, I have used the pseudoinverse of a matrix solve systems of equations, and do other things such as […]

I think the expression for ‘a’ after the partial derivative should have Sigma[1] =n in the denominator instead of Sigma[x(i)*x(i)]

In fact, it’s much worse: the expressions are inverted. That’s fixed now. Thanks for spotting the error.

Well, it was still inverted, but because I (re)solved for and not . Also rework’d the rest so that it’s everywhere.