Linear Regression Revisited.

I always disliked it when a book gives equations and formulas as if by fiat, without providing justification or demonstration; I don't mind that they skip a few steps, as long as I can fill in the blanks myself if I care to. Linear regression is one of those formulas we see often but would like to understand better.

So let us derive the formulas for linear regression and see how they generalize to any number of unknowns.

First, you start with a bunch of points (x_i,y_i) (for i=1,2,\ldots,n) that you suppose are more or less lying on a line—if you suspect that they do not lie on a line, don’t do a linear regression, try another model! You want to find a and b such that

\hat{y}_i=\hat{y}(x_i)=ax_i+b.

You do not want just any a and b; you want the a and b that minimize the error between y_i and \hat{y}_i. (I use the "hat notation" to denote predicted or estimated quantities.) You could use pretty much any reasonable measure of error, but the squared error is analytically convenient. So you can write

E(a,b)=\sum_{i=1}^n \left(y_i-\hat{y}_i\right)^2,

the sum of the squared errors. There are three advantages to using this particular error form. First, every term is either zero or positive, so you don't have "negative errors" that cancel out and make a bad fit look better than it is. Second, in general, taking the derivatives of the error with respect to the unknowns is analytically friendly (I won't say necessarily easy). Third, the error function, in our case, is convex, which allows us to solve

\displaystyle\frac{\partial{E(a,b)}}{\partial{a}}=0 and \displaystyle\frac{\partial{E(a,b)}}{\partial{b}}=0

(simultaneously) for the best parameters a and b. Expanding the derivatives gives \partial{E(a,b)}/\partial{a}=-2\sum_i x_i(y_i-ax_i-b) and \partial{E(a,b)}/\partial{b}=-2\sum_i(y_i-ax_i-b); setting each to zero and isolating a and b respectively, you find

\displaystyle{a=\frac{\sum_ix_iy_i-b\sum_ix_i}{\sum_i x_i^2}}

and

\displaystyle{b=\frac{\sum_iy_i-a\sum_ix_i}{n}}

where \sum_i is shorthand for \sum_{i=1}^n. These two equations are cute until you notice that they refer to each other. So you have to expand one into the other, or, as you find in most books, both. After basic (but rather lengthy) manipulations you arrive at the textbook forms of

a=\displaystyle\frac{n\left(\sum_ix_iy_i\right)-\left(\sum_ix_i\right)\left(\sum_iy_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_ix_i\right)^2}

and

b=\displaystyle\frac{\left(\sum_iy_i\right)\left(\sum_ix_i^2\right)-\left(\sum_ix_i\right)\left(\sum_ix_iy_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_ix_i\right)^2}

which are the equations found in, say, MathWorld (note that MathWorld writes the line as a+bx, so the roles of a and b are swapped over there). So the steps were:

  1. Establish a model. In our example, it’s the line ax+b.
  2. Establish a (convenient) error function, in our case, squared errors:

    E(a,b)=\sum_{i=1}^n \left(y_i-\hat{y}_i\right)^2,

  3. Because the sum of squared errors is convex in a and b, we can solve for the minimum, where the derivatives are zero (since it's convex, it conveniently has only one such point), that is, by solving:

    \displaystyle\frac{\partial{E(a,b)}}{\partial{a}}=0 and \displaystyle\frac{\partial{E(a,b)}}{\partial{b}}=0

    for a and b.
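Before moving on, here is a quick sanity check of the textbook formulas on a tiny data set of my own (not part of the original derivation): take the three points (0,1), (1,3), (2,5), which lie exactly on the line y=2x+1. Then

\sum_ix_i=3,\quad\sum_iy_i=9,\quad\sum_ix_i^2=5,\quad\sum_ix_iy_i=13,

the denominator is n\left(\sum_ix_i^2\right)-\left(\sum_ix_i\right)^2=3\cdot5-3^2=6, and

a=\displaystyle\frac{3\cdot13-3\cdot9}{6}=2,\qquad b=\displaystyle\frac{9\cdot5-3\cdot13}{6}=1,

recovering the slope and intercept exactly, as expected.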

We can move on to actually computing the parameters a and b efficiently by noticing that there aren't that many different sums composing the two formulas. We may need to be extra careful if the denominator turns out to be extremely small, but that's about it.
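Here is a minimal Python sketch of that computation (my own illustration, not code from the post; the helper name linear_fit is made up), accumulating the four sums and applying the textbook formulas above:

    def linear_fit(xs, ys):
        """Fit y = a*x + b by least squares, using the closed-form sums."""
        n = len(xs)
        sx = sum(xs)                                # sum of x_i
        sy = sum(ys)                                # sum of y_i
        sxx = sum(x * x for x in xs)                # sum of x_i^2
        sxy = sum(x * y for x, y in zip(xs, ys))    # sum of x_i * y_i
        d = n * sxx - sx * sx                       # the common denominator
        if abs(d) < 1e-12:                          # all x_i (nearly) identical
            raise ValueError("degenerate data: the x values are (almost) all equal")
        a = (n * sxy - sx * sy) / d
        b = (sxx * sy - sx * sxy) / d
        return a, b

    # The three points of the sanity check above lie exactly on y = 2x + 1:
    print(linear_fit([0, 1, 2], [1, 3, 5]))         # (2.0, 1.0)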

* * *

That's usually where the story ends. You have a and b, so rejoice; the end. But the error-derivative-solve method has a serious limitation, at least from a pragmatic point of view. If you fit a (linear) model with three parameters, you end up with three derivatives, set them to zero, and get three equations. The trouble is that each of the three equations is expressed in terms of the other two parameters, just as, for simple 1D linear regression, you got two equations each expressed in terms of the other parameter. So you can imagine how much fun it is to substitute b and c into the equation for a. It's entirely possible to do, but the complexity of doing it by hand explodes.

But it turns out there's another way that does not ask for nearly as much effort and that scales to any number of variables (at least in principle; let us not discuss issues of numerical stability here).

The denominator

n\left(\sum_ix_i^2\right)-\left(\sum_ix_i\right)^2

turns out to be the determinant of the matrix

\left[\begin{array}{cc}\sum_ix_i^2&\sum_ix_i\\\sum_ix_i&n\end{array}\right]

and you can verify that

\left[\begin{array}{c}a\\b\end{array}\right]=\left[\begin{array}{cc}\sum_ix_i^2&\sum_ix_i\\\sum_ix_i&n\end{array}\right]^{-1}\left[\begin{array}{c}\sum_ix_iy_i\\\sum_iy_i\end{array}\right]

gives the same answer as the classical formulas.
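Here is a small numpy sketch of that 2\times2 formulation (again my own illustration, on the same three points as the earlier sanity check); it reproduces the closed-form values:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0])
    y = np.array([1.0, 3.0, 5.0])
    n = len(x)

    # The 2x2 system built from the sums, exactly as in the matrix equation above.
    M = np.array([[np.sum(x * x), np.sum(x)],
                  [np.sum(x),     n]])
    rhs = np.array([np.sum(x * y), np.sum(y)])

    a, b = np.linalg.solve(M, rhs)    # solving is preferable to forming the inverse
    print(a, b)                       # 2.0 1.0 for these points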

Now you're thinking: wow, where did that come from?!

Well, let’s see.

Usually, we can solve a linear system of simultaneous equations if we have m equations and m variables. The coefficients of the equations form the square matrix A, the unknowns form the column vector b, and the right-hand sides form the column vector y; they are linked by the matrix equation

Ab=y

that we solve by essentially inverting the matrix A, that is

b=A^{-1}y

But that works fine only when the matrix A is square and non-singular, that is, when there's a (classical) inverse. When it's rectangular and/or singular, as is the case with linear regression, we might think we're done for, but we're not; at least, not always.

In the case of linear regression, we have a bunch of equations (all those ax_i+b=y_i equations), and instead of having a nice square matrix, we have an n\times{}2 matrix representing the equations, which is not the type of matrix we're used to.

In vector form, ax+b=y can be written as

[x~~1]\left[\begin{array}{c}a\\b\end{array}\right]=y

And if you stack all the equations in the same system, you get

\left[\begin{array}{cc}x_1&1\\x_2&1\\\vdots&\vdots\\x_n&1\end{array}\right]\left[\begin{array}{c}a\\b\end{array}\right]=\left[\begin{array}{c}y_1\\y_2\\\vdots\\y_n\end{array}\right]

which can't be solved as is (there are n equations but only two unknowns, so in general no exact solution exists). Before we continue, let

X=\left[\begin{array}{cc}x_1&1\\x_2&1\\\vdots&\vdots\\x_n&1\end{array}\right], c=\left[\begin{array}{c}a\\b\end{array}\right], and y=\left[\begin{array}{c}y_1\\y_2\\\vdots\\y_n\end{array}\right]

(it will make things a lot simpler from now on). So we have

Xc=y

which you can't solve directly because X is not a square matrix. It turns out there's a nifty trick, known as the normal equations, which leads to the Moore–Penrose pseudoinverse of X.

We take

Xc=y

and multiply both sides on the left by X^{T} (the matrices have to be compatible for multiplication), yielding

X^{T}Xc=X^{T}y

and since X^{T}X is square and positive definite (provided the x_i are not all identical), we can invert it:

(X^{T}X)^{-1}(X^{T}X)c=(X^{T}X)^{-1}X^{T}y

giving, finally

c^{*}=(X^{T}X)^{-1}X^{T}y

with c^{*} containing the best values of a and b, that is, the values that minimize the squared error. You can verify that

(X^{T}X)^{-1}X^{T}y=\left[\begin{array}{cc}\sum_ix_i^2&\sum_ix_i\\\sum_ix_i&n\end{array}\right]^{-1}\left[\begin{array}{c}\sum_ix_iy_i\\\sum_iy_i\end{array}\right]=\left[\begin{array}{c}a\\b\end{array}\right]
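In practice, you would rarely form the inverse explicitly; a least-squares solver computes the same c^{*} more stably. A minimal numpy sketch, on example data of my own that lies roughly on y=2x+1:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 2.9, 5.2, 6.8])          # roughly y = 2x + 1, with noise

    X = np.column_stack([x, np.ones_like(x)])   # rows are [x_i, 1]

    # Textbook normal equations: c = (X^T X)^{-1} X^T y
    c_normal = np.linalg.inv(X.T @ X) @ X.T @ y

    # Numerically preferable: let a least-squares solver do the work.
    c_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(c_normal)   # [a, b]
    print(c_lstsq)    # the same values, computed more stably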

* * *

OK, this seems like a lot of work to get a simple linear regression, but the main point is that this method generalizes to any number of variables. To fit a plane, you’d use the function \hat{z}(x,y)=ax+by+c (one of the possible forms for the equation of a plane) and you’d have

X=\left[\begin{array}{ccc}x_1&y_1&1\\x_2&y_2&1\\\vdots&\vdots&\vdots\\x_n&y_n&1\end{array}\right],

and the column vectors d=[a~b~c]^{T} and z=[z_1~z_2~\cdots~z_n]^{T}; you solve for a, b, and c using

d=(X^{T}X)^{-1}X^{T}z

and you’re done; we have moved from a tedious process to a convenient recipe.
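A sketch of the plane fit in the same style (numpy again; the data below is made up for illustration):

    import numpy as np

    # Points sampled, with a bit of noise, from z = 2x - y + 3.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
    z = 2 * x - y + 3 + np.array([0.05, -0.02, 0.03, -0.04, 0.01])

    X = np.column_stack([x, y, np.ones_like(x)])   # rows are [x_i, y_i, 1]
    d, *_ = np.linalg.lstsq(X, z, rcond=None)      # d = [a, b, c]
    print(d)                                       # approximately [2, -1, 3]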

* * *

The matrix X of the one-variable (and, more generally, polynomial) case is a Vandermonde matrix, whose columns are powers of x. It has plenty of good properties; for example, if it happens to be square, it can be inverted in O(n^2) operations, which is much faster than the usual O(n^3). The matrix X^{T}X is also special (it is a Hankel matrix: its entries depend only on the sum of their indices) and probably has efficient inversion algorithms, although I couldn't find any by googling (OK, maybe I didn't google too hard ;) ).
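For what it's worth, numpy can build that Vandermonde matrix directly, so the same recipe fits a polynomial of any degree; a quick sketch on made-up data:

    import numpy as np

    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y = x**2 - 3 * x + 1                        # an exact parabola, for illustration

    X = np.vander(x, N=3)                       # columns are x^2, x^1, x^0
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coeffs)                               # approximately [1, -3, 1]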

Responses to Linear Regression Revisited.

  1. Alejandro says:

    “if you suspect that they do not lie on a line, don’t do a linear regression, try another model”

    I think that statement is not completely true. Linear regression means your model is linear with respect to the parameters. You can use it, for instance, to fit the data to a parabola ax^2 + bx + c, or even to a sum of sinusoids a*sin(w1*x) + b*cos(w1*x). (Fourier series are a generalization of this).

  2. Steven Pigeon says:

    I meant the model \hat{y}(x)=ax+b, but you’re right: these techniques work for any linear combination of functions; and so ax^2+bx+c is a linear combination of the functions x^2, x and the constant 1.

  3. PradeepK says:

    I think the expression for ‘a’ after the partial derivative should have Sigma[1] =n in the denominator instead of Sigma[x(i)*x(i)]
