I always disliked when a book gives equations and formulas as if of fiat without providing justification or demonstration; and I don’t mind that they skip a few steps as long that I can fill in the blanks myself if I care to. Linear regression is one of those formula we see but we’d like to understand better.
So let us see how to derive the formulas for linear regression and see how they generalize to any number of unknowns.
First, you start with a bunch of points (for
) that you suppose are more or less lying on a line—if you suspect that they do not lie on a line, don’t do a linear regression, try another model! You want to find
and
such that
.
You do not want just any and
, you want
and
so that the error between
and
is minimized. (I use the “hat notation” to denote predicted or estimated quantities.) You can use any measure of error (as long as it is a metric) but analytically, the squared error is convenient. So you can write
,
the sum of the squared error. There are two advantages to use this particular error form. First, every error is either zero or positive so you don’t have “negative errors” that gives a better fit despite being (negatively) greater errors. Second, in general, solving for the derivatives of the error relative to the unknowns is analytically friendly (I won’t say necessarily easy). Third, the error function, in our case, is convex, and allows us to solve
and
(simultaneously) for the best parameters and
. If you solve
and isolate
then
and isolate
, you find
and
where is shorthand for
. These two equations are cute until you notice that they mutually refer each other. So you have to expand one or the other, or, what you find in most books, both. After basic (but rather lengthy) manipulations you arrive at the textbook forms of
and
which are the equations found in, say, Mathworld. So the steps were:
- Establish a model. In our example, it’s the line
.
- Establish a (convenient) error function, in our case, squared errors:
,
- Because the sum of squared errors is convex in
and
, we can solve for the minimum, where the derivatives are zero (since it’s convex, it conveniently has only one such zero), that is solving:
and
for
and
.
We can move on to actually compute the parameters and
efficiently by noticing there aren’t that many different sums that compose the two formula. We may be extra careful if the denominator turns out extremely small, but that’s about it.
*
* *
That’s usually where the story ends. You have and
, so rejoice, end. But the error-derivative-solve method has a serious limitation, at least from a pragmatic point of view. If you fit a (linear) model with 3 variables, you end up with 3 derivatives, for which you solve for the variables, and you get 3 equations. But the thing is that the three equations are expressed in terms of the two other variables, just like you got two equations expressed in terms of the other variable for simple 1D linear regression. So you can imagine how much fun it is to replace
and
in the equation for
. It’s entirely possible to do so, but somehow the complexity of doing this by hand exploded.
But turns out there’s another way that does not quite ask for that much effort and that scales for any number of variables (at least in principle; let us not discuss issues of numerical stability here).
The denominator
turns out to be the determinant of the matrix
and you can verify that
gives the same answer as the classical formulas.
Now you’re thinking wow, and from where did that come from?!
Well, let’s see.
Usually, we can solve a linear system of simultaneous equations if we have equations and
variables. The coefficients of the equations form the square matrix
, the variable the column vector
and on the RHS the column vector
, and they are linked by the matrix equation
that we solve by essentially inverting the matrix , that is
But that works fine when the matrix is square and non singular, that is, there’s a (classical) inverse. When it’s not rectangular and/or singular, as it is the case with linear regression, we might think we’re done for, but we’re not, at least, not always.
In the case of linear regression, we have a bunch of equations (all those equations), and instead of having a nice square matrix, we have a
matrix representing the equations which is not the type of matrix we’re used to.
In vector form, can be written as
And if you stack all the equations in the same system, you get
which can’t be solved as is. Before we continue, let
,
, and
(it will make things a lot simpler from now on). So we have
which you can’t solve directly because is not a square matrix. Turns out there’s a nifty trick (due to Moore and Penrose?) called the normal equations.
We take
and multiply both sides by (it’s obligatory to have the matrices to be compatible for multiplication), yielding
and since is square and positive definite, we can invert it:
giving, finally
With being the best values of
and
, solved to minimize squared errors. You can verify that
*
* *
OK, this seems like a lot of work to get a simple linear regression, but the main point is that this method generalizes to any number of variables. To fit a plane, you’d use the function (one of the possible forms for the equation of a plane) and you’d have
,
and have vectors and
and you solve the equation for
,
,
using
and you’re done; we have moved from a tedious process to a convenient recipe.
*
* *
The special matrix is a Vandermonde matrix, which has plenty of good properties, such as if it happens to be square, can be inverted in
which is much faster than the usual
. The special matrix
is also special (I think it is a Hankel matrix) and probably has efficient inversion algorithms, although I could find any by googling (OK, maybe I didn’t google too hard ;) ).

“if you suspect that they do not lie on a line, don’t do a linear regression, try another model”
I think that statement is not completely true. Linear regression means your model is linear with respect to the parameters. You can use it, for instance, to fit the data to a parabola ax^2 + bx + c, or even to a sum of sinusoids a*sin(w1*x) + b*cos(w1*x). (Fourier series are a generalization of this).
I meant the model
, but you’re right: these techniques work for any linear combination of functions; and so
is a linear combination of the functions
,
and the constant
.
[...] a previous post, I presented (generalized) linear regression from a linear algebra point of view (namely, the [...]