In a previous post, I presented (generalized) linear regression from a linear algebra point of view (namely, the “normal equations”) and in this post, I discuss yet another take on the problem, this time using gradient descent.

Gradient descent isn’t particularly fascinating for this particular task (as we know closed, analytical expressions for obtaining the parameters), but linear regression is the simplest example of gradient descent I could come up with without being completely trivial.

So we have a bunch of points , and we want to fit a line of equation through them, minimizing some error function. Usually, we use the least mean square error (which corresponds to “vertical offsets“) because it usually yields easier analytical solutions. Let us define

,

and the error function

.

Since is convex in both and , we can solve

and

for and . After quite a bit of algebra, we find

and

.

Noticing that terms repeat a number of times, and possibly using special cases if the , we can arrive at an efficient, linear-time, implementation.

*

* *

While the relatively simple case of linear regression affords an analytical solution, we may (in real life) end up with another function 1) for which we can compute partial derivatives but 2) we cannot solve analytically for all its parameters, because, maybe, it is non-elementary, or it poses some other difficulty. The technique to use then is gradient descent.

Gradient descent is a relatively simple procedure conceptually—while in practice it does have its share of gotchas. Let’s say we have some function with parameter(s) which we want to minimize over the s. The variable(s) to adjust is . If we want to minimize , we must find that minimizes given the s which are fixed (or “observed” depending on how you look at it). If we can compute the partial derivative of with respect to , we know in which direction change so that decreases. We then have the update rule

,

which changes in the direction in which decreases.

However, if we know in which direction we must change it, we do not necessarily know by how much exactly, and so a function-dependent scaling factor is introduced. This scaling, the gradient step, will prevent us from updating too far and miss the minimum. If is too small it will take too long to find the minimum; if it is too large the procedure may diverge—essentially crash.

(The underlying assumption for gradient descent to work properly is that the function one wants to minimize is somewhat approximately convex, so that we follow the slope and find a (good local) minimum. If the function would highly non-convex, it would probably not work very well.)

If we have more than one parameter, then is a vector of parameters, and the derivative is replaced by a gradient, which is the vector formed by the partial derivatives in all the parameters. That is,

if we have parameters. It is therefore the general case where is a vector that gives the method its name, gradient descent.

*

* *

Back to the linear regression problem. Using , we find that the partial derivative w/ respect to the two parameters and are

and

and therefore

and the update rule is

However, for faster convergence, it is sometimes useful to use a different for each parameter. We then have

*

* *

Let us have a look at a C++ implementation. First, as with the general regression, we see that the expressions of the gradient have quite a few invariants we can factor out. Using the identities

and

,

we arrive at a constant-time computation of gradient in the iteration step (but it does take linear time, once, to initialize the invariants). Here’s how the actual code looks like:

void gradient_descent( const std::vector<float_x> & y, float_x & a, float_x & b ) { const size_t n=y.size(); const float_x sx=sum_of_int(n-1); const float_x sx2=sum_of_sqr(n-1); float_x sy=0; float_x sxy=0; for (size_t i=0;i<n;i++) { sy+=y[i]; sxy+=i*y[i]; } // initial guesses b=0; a=sy/n; // iterations float_x last_a,last_b; float_x sa=1e-3; // for n=100 float_x sb=1e-6; // for n=100 do { last_a=a; last_b=b; float_x db= -(sxy - a*sx - b*sx2); float_x da= -(sy - n*a - b*sx ); a-=sa*da; b-=sb*db; } while ( (std::fabs(a-last_a)/a > 0.0001) || (std::fabs(b-last_b)/b > 0.0001)); }

*

* *

Where are the factors in front of the partial derivatives? Why are they gone? Think about it for a while, I’ll come back to it in a moment.

*

* *

Let us try it out!

Let and (why not) and let us add some uniform random noise . With , and not particularly efficient initial values for and , we have the following convergence:

So we see the parameters adjusting smoothly (which will not always be the case; more complex functions may yield surprises) and that we arrive to very good estimates after iterations (which, in this case, are inexpensive as being constant-time). After iterations, the algorithm bails out as it has detected convergence.

*

* *

We said earlier that might be function-specific. In the case of linear regression by means of gradient descent, the following s seems to be working fine:

100 | 1e-3 | 1e-6 |

1000 | 1e-4 | 1e-9 |

10000 | 1e-5 | 1e-12 |

100000 | 1e-6 | 1e-15 |

So and are the reason why we can drop the in front of the derivatives: the constant is absorbed into the and and so can be ignored.

*

* *

While this seems to be quite laborious for linear regression, gradient descent proves most useful in all kind of optimization problems, including machine learning. In machine learning gradient descent (and back-propagation) is used to train neural networks of all kinds. Newton’s method for finding square roots (last week’s post) can be thought of as gradient-descent type algorithm with an expression for that depends explicitly on .

We will certainly discuss gradient-descent methods again future posts.

Can you please explain the same thing with a small example?

Ok. Let . If you want to find the minimum of , you can either solve analytically the equation (and find trivially that ) or use gradient descend, start at some reasonnable guess and use gradient descent to find . You will find

So you would iterate

,

starting at with reasonnable guess . Of course you can conflapulate into , and get .

Normally, you don’t get functions that are trivially analytically solvable. Usually, you get monsters with lots of internal parameters and non-separable expression, and you have to use gradient descent.

Let re-do the example with a “more realistic” example: —that’s still a basic quadratic, but now the minimum is not zero, it’s , found at . Applying gradient descent:

.

We start at some (possibly if you have

a prioriassumptions that should be small-ish), and we iterate:.

Note that formally, we do not observe , , and . We can however estimate

for a small-ish . The choice of is also very important and related to the (local) curvature of the function. If you pick it too large, you oscillate around the solution (and if the function is not very well-behaved, it can even bring you further away from the solution), if you pick it too small, it takes forever to get to the solution.

With the preceeding example, with , , and $c=3$, with initial guess , the gradient descent produces the following path (in magenta):

(right-click and “view image” to view image to full size).

I hope that helped a bit.

[…] 这种找出最佳权重的办法被称为批量梯度下降，上面是对它的高度概括。如果想搞懂细节，不要害怕，继续深入下去吧。 […]

[…] 这种找出最佳权重的办法被称为批量梯度下降，上面是对它的高度概括。如果想搞懂细节，不要害怕，继续深入下去吧。 […]

[…] way to find the best weights for your function called batch gradient descent. Don’t be afraid to dig deeper if you are interested on learning the […]

[…] 这种找出最佳权重的办法被称为批量梯度下降，上面是对它的高度概括。如果想搞懂细节，不要害怕，继续深入下去吧。 […]