Introduction to Gradient Descent

In a previous post, I presented (generalized) linear regression from a linear algebra point of view (namely, the “normal equations”) and in this post, I discuss yet another take on the problem, this time using gradient descent.

Gradient descent isn’t particularly fascinating for this particular task (as we know closed, analytical expressions for obtaining the parameters), but linear regression is the simplest example of gradient descent I could come up with without being completely trivial.

So we have a bunch of n points \{(x_i,y_i)\}_{i=1}^n, and we want to fit a line of equation a+bx through them, minimizing some error function. Usually, we use the least mean square error (which corresponds to “vertical offsets“) because it usually yields easier analytical solutions. Let us define


and the error function


Since E(a,b) is convex in both a and b, we can solve

\displaystyle\frac{\partial E(a,b)}{\partial a}=0 and \displaystyle\frac{\partial E(a,b)}{\partial b}=0

for a and b. After quite a bit of algebra, we find

\displaystyle a=\frac{\left(\sum_i y_i\right)\left(\sum_i x_i^2\right)-\left(\sum_i x_i\right)\left(\sum_i x_iy_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_i x_i\right)^2}


\displaystyle a=\frac{n\left(\sum_i x_iy_i\right)-\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_i x_i\right)^2}.

Noticing that terms repeat a number of times, and possibly using special cases if the \{x_i\}_{i=1}^n=\{0,1,2,\ldots,n-1\}, we can arrive at an efficient, linear-time, implementation.

* *

While the relatively simple case of linear regression affords an analytical solution, we may (in real life) end up with another function 1) for which we can compute partial derivatives but 2) we cannot solve analytically for all its parameters, because, maybe, it is non-elementary, or it poses some other difficulty. The technique to use then is gradient descent.

Gradient descent is a relatively simple procedure conceptually—while in practice it does have its share of gotchas. Let’s say we have some function f(x;\theta) with parameter(s) \theta which we want to minimize over the xs. The variable(s) to adjust is \theta. If we want to minimize f(x;\theta), we must find \theta^* that minimizes given the xs which are fixed (or “observed” depending on how you look at it). If we can compute the partial derivative of f(x;\theta) with respect to \theta, we know in which direction change \theta so that f(x;\theta) decreases. We then have the update rule

\displaystyle\theta^\prime=\theta-\eta\frac{\partial f(x;\theta)}{\partial \theta},

which changes \theta in the direction in which f(x;\theta) decreases.

However, if we know in which direction we must change it, we do not necessarily know by how much exactly, and so a function-dependent scaling factor \eta is introduced. This scaling, the gradient step, will prevent us from updating \theta too far and miss the minimum. If \theta is too small it will take too long to find the minimum; if it is too large the procedure may diverge—essentially crash.

(The underlying assumption for gradient descent to work properly is that the function one wants to minimize is somewhat approximately convex, so that we follow the slope and find a (good local) minimum. If the function would highly non-convex, it would probably not work very well.)

If we have more than one parameter, then \theta is a vector of parameters, and the derivative is replaced by a gradient, which is the vector formed by the partial derivatives in all the parameters. That is,

\displaystyle \nabla f=\left(\frac{\partial f(x;\theta)}{\partial \theta_1}, \frac{\partial f(x;\theta)}{\partial \theta_2},\cdots, \frac{\partial f(x;\theta)}{\partial \theta_k}\right),

if we have k parameters. It is therefore the general case where \theta is a vector that gives the method its name, gradient descent.

* *

Back to the linear regression problem. Using E(a,b), we find that the partial derivative w/ respect to the two parameters a and b are

\displaystyle\frac{\partial E(a,b)}{\partial a}=-2\sum_{i=1}^n(y_i-a-bx_i)


\displaystyle\frac{\partial E(a,b)}{\partial b}=-2\sum_{i=1}^n(y_i-a-bx_i)x_i

and therefore

\nabla E(a,b)=\left(\displaystyle\frac{\partial E(a,b)}{\partial a},\displaystyle\frac{\partial E(a,b)}{\partial b}\right)

and the update rule is

(a,b)^\prime=(a,b)-\eta\nabla E(a,b)

However, for faster convergence, it is sometimes useful to use a different \eta for each parameter. We then have

\displaystyle (a,b)^\prime=(a,b)-\left(\eta_a \frac{\partial E(a,b)}{\partial a},\eta_b \frac{\partial E(a,b)}{\partial b}\right)

* *

Let us have a look at a C++ implementation. First, as with the general regression, we see that the expressions of the gradient have quite a few invariants we can factor out. Using the identities

\sum_{i=1}^n i=\frac{1}{2}n(n+1)


\sum_{i=1}^n i^2=\frac{1}{6}n(n+1)(2n+1),

we arrive at a constant-time computation of gradient in the iteration step (but it does take linear time, once, to initialize the invariants). Here’s how the actual code looks like:

void gradient_descent( const std::vector<float_x> & y,
                       float_x & a,
                       float_x & b )
  const size_t n=y.size();
  const float_x sx=sum_of_int(n-1);
  const float_x sx2=sum_of_sqr(n-1);

  float_x sy=0;
  float_x sxy=0;

  for (size_t i=0;i<n;i++)

  // initial guesses

  // iterations
  float_x last_a,last_b;
  float_x sa=1e-3; // for n=100
  float_x sb=1e-6; // for n=100

    float_x db= -(sxy - a*sx - b*sx2);
    float_x da= -(sy - n*a - b*sx );

  while ( (std::fabs(a-last_a)/a > 0.0001)
          (std::fabs(b-last_b)/b > 0.0001));

* *

Where are the factors 2 in front of the partial derivatives? Why are they gone? Think about it for a while, I’ll come back to it in a moment.

* *

Let us try it out!

Let a=5 and b=3 (why not) and let us add some uniform random noise \varepsilon \sim \mathcal{U}(-1,1). With n=100, and not particularly efficient initial values for a and b, we have the following convergence:

So we see the parameters adjusting smoothly (which will not always be the case; more complex functions may yield surprises) and that we arrive to very good estimates after \approx 250 iterations (which, in this case, are inexpensive as being constant-time). After \approx 475 iterations, the algorithm bails out as it has detected convergence.

* *

We said earlier that \eta might be function-specific. In the case of linear regression by means of gradient descent, the following \etas seems to be working fine:

n \eta_a \eta_b
100 1e-3 1e-6
1000 1e-4 1e-9
10000 1e-5 1e-12
100000 1e-6 1e-15

So \eta_a and \eta_b are the reason why we can drop the 2 in front of the derivatives: the constant is absorbed into the \eta_a and \eta_b and so can be ignored.

* *

While this seems to be quite laborious for linear regression, gradient descent proves most useful in all kind of optimization problems, including machine learning. In machine learning gradient descent (and back-propagation) is used to train neural networks of all kinds. Newton’s method for finding square roots (last week’s post) can be thought of as gradient-descent type algorithm with an expression for \eta that depends explicitly on x.

We will certainly discuss gradient-descent methods again future posts.

11 Responses to Introduction to Gradient Descent

  1. Manish says:

    Can you please explain the same thing with a small example?

    • Ok. Let f(x)=x^2. If you want to find the minimum of f(x), you can either solve analytically the equation (and find trivially that x=0) or use gradient descend, start at some reasonnable guess x_0 and use gradient descent to find x_{min}. You will find

      \displaystyle \frac{\partial f(x)}{\partial x}=2x

      So you would iterate

      x_t = x_{t-1}-\eta  \displaystyle \frac{\partial f(x)}{\partial x} = x_{t-1}-2\eta x,

      starting at t=0 with reasonnable guess x_0. Of course you can conflapulate 2\eta into \eta', and get x_t=x_{t-1}-2\eta' x.

      Normally, you don’t get functions that are trivially analytically solvable. Usually, you get monsters with lots of internal parameters and non-separable expression, and you have to use gradient descent.

      Let re-do the x^2 example with a “more realistic” example: g(x)=(ax+b)^2+c—that’s still a basic quadratic, but now the minimum is not zero, it’s c, found at x=-\frac{b}{a}. Applying gradient descent:

      \displaystyle \frac{\partial g(x)}{\partial x}=2a(ax+b).

      We start at some x_0 (possibly x_0=0 if you have a priori assumptions that b should be small-ish), and we iterate:

      x_t=x_{t-1}+\displaystyle \frac{\partial g(x)}{\partial x}=x_{t-1}+\eta a(ax+b).

      Note that formally, we do not observe a, b, and c. We can however estimate

      \displaystyle \frac{\partial g(x)}{\partial x} \approx \frac{g(x)-g(x+\varepsilon)}{\varepsilon}

      for a small-ish \varepsilon>0. The choice of \eta is also very important and related to the (local) curvature of the function. If you pick it too large, you oscillate around the solution (and if the function is not very well-behaved, it can even bring you further away from the solution), if you pick it too small, it takes forever to get to the solution.

      With the preceeding example, with a=2, b=1.2, and $c=3$, with initial guess x_0=2, the gradient descent produces the following path (in magenta):

      Example of Gradient Descent

      (right-click and “view image” to view image to full size).

      I hope that helped a bit.

  2. […] 这种找出最佳权重的办法被称为批量梯度下降,上面是对它的高度概括。如果想搞懂细节,不要害怕,继续深入下去吧。 […]

  3. […]   这种找出最佳权重的办法被称为批量梯度下降,上面是对它的高度概括。如果想搞懂细节,不要害怕,继续深入下去吧。 […]

  4. […] way to find the best weights for your function called batch gradient descent. Don’t be afraid to dig deeper if you are interested on learning the […]

  5. […] 这种找出最佳权重的办法被称为批量梯度下降,上面是对它的高度概括。如果想搞懂细节,不要害怕,继续深入下去吧。 […]

  6. […] نام دارد. اگر به این روش علاقمند شدید، جزئیات بیشتری در این جا پیدا خواهید […]

  7. […] نام دارد. اگر به این روش علاقمند شدید، جزئیات بیشتری در این جا پیدا خواهید […]

  8. […] to find the best weights for your function called batch gradient descent. Don’t be afraid to dig deeper if you are interested on learning the […]

  9. […] la vostra funzione chiamato Batch Gradient Descent (discesa del gradiente). Non abbiate paura di scavare più a fondo, se siete interessati a capirne i […]

  10. […] to find the best weights for your function called batch gradient descent. Don’t be afraid to dig deeper if you are interested on learning the […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: