In a previous post, I presented (generalized) linear regression from a linear algebra point of view (namely, the “normal equations”) and in this post, I discuss yet another take on the problem, this time using gradient descent.

Gradient descent isn’t particularly fascinating for this particular task (as we know closed, analytical expressions for obtaining the parameters), but linear regression is the simplest example of gradient descent I could come up with without being completely trivial.

So we have a bunch of $n$ points $\{(x_i,y_i)\}_{i=1}^n$, and we want to fit a line of equation $a+bx$ through them, minimizing some error function. Usually, we use the least mean square error (which corresponds to “vertical offsets“) because it usually yields easier analytical solutions. Let us define

$\hat{y}_i=\hat{y}(x_i)=a+bx_i$,

and the error function

$E(a,b)=\displaystyle\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2$.

Since $E(a,b)$ is convex in both $a$ and $b$, we can solve

$\displaystyle\frac{\partial E(a,b)}{\partial a}=0$ and $\displaystyle\frac{\partial E(a,b)}{\partial b}=0$

for $a$ and $b$. After quite a bit of algebra, we find

$\displaystyle a=\frac{\left(\sum_i y_i\right)\left(\sum_i x_i^2\right)-\left(\sum_i x_i\right)\left(\sum_i x_iy_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_i x_i\right)^2}$

and

$\displaystyle a=\frac{n\left(\sum_i x_iy_i\right)-\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n\left(\sum_ix_i^2\right)-\left(\sum_i x_i\right)^2}$.

Noticing that terms repeat a number of times, and possibly using special cases if the $\{x_i\}_{i=1}^n=\{0,1,2,\ldots,n-1\}$, we can arrive at an efficient, linear-time, implementation.

*
* *

While the relatively simple case of linear regression affords an analytical solution, we may (in real life) end up with another function 1) for which we can compute partial derivatives but 2) we cannot solve analytically for all its parameters, because, maybe, it is non-elementary, or it poses some other difficulty. The technique to use then is gradient descent.

Gradient descent is a relatively simple procedure conceptually—while in practice it does have its share of gotchas. Let’s say we have some function $f(x;\theta)$ with parameter(s) $\theta$ which we want to minimize over the $x$s. The variable(s) to adjust is $\theta$. If we want to minimize $f(x;\theta)$, we must find $\theta^*$ that minimizes given the $x$s which are fixed (or “observed” depending on how you look at it). If we can compute the partial derivative of $f(x;\theta)$ with respect to $\theta$, we know in which direction change $\theta$ so that $f(x;\theta)$ decreases. We then have the update rule

$\displaystyle\theta^\prime=\theta-\eta\frac{\partial f(x;\theta)}{\partial \theta}$,

which changes $\theta$ in the direction in which $f(x;\theta)$ decreases.

However, if we know in which direction we must change it, we do not necessarily know by how much exactly, and so a function-dependent scaling factor $\eta$ is introduced. This scaling, the gradient step, will prevent us from updating $\theta$ too far and miss the minimum. If $\theta$ is too small it will take too long to find the minimum; if it is too large the procedure may diverge—essentially crash.

(The underlying assumption for gradient descent to work properly is that the function one wants to minimize is somewhat approximately convex, so that we follow the slope and find a (good local) minimum. If the function would highly non-convex, it would probably not work very well.)

If we have more than one parameter, then $\theta$ is a vector of parameters, and the derivative is replaced by a gradient, which is the vector formed by the partial derivatives in all the parameters. That is,

$\displaystyle \nabla f=\left(\frac{\partial f(x;\theta)}{\partial \theta_1}, \frac{\partial f(x;\theta)}{\partial \theta_2},\cdots, \frac{\partial f(x;\theta)}{\partial \theta_k}\right),$

if we have $k$ parameters. It is therefore the general case where $\theta$ is a vector that gives the method its name, gradient descent.

*
* *

Back to the linear regression problem. Using $E(a,b)$, we find that the partial derivative w/ respect to the two parameters $a$ and $b$ are

$\displaystyle\frac{\partial E(a,b)}{\partial a}=-2\sum_{i=1}^n(y_i-a-bx_i)$

and

$\displaystyle\frac{\partial E(a,b)}{\partial b}=-2\sum_{i=1}^n(y_i-a-bx_i)x_i$

and therefore

$\nabla E(a,b)=\left(\displaystyle\frac{\partial E(a,b)}{\partial a},\displaystyle\frac{\partial E(a,b)}{\partial b}\right)$

and the update rule is

$(a,b)^\prime=(a,b)-\eta\nabla E(a,b)$

However, for faster convergence, it is sometimes useful to use a different $\eta$ for each parameter. We then have

$\displaystyle (a,b)^\prime=(a,b)-\left(\eta_a \frac{\partial E(a,b)}{\partial a},\eta_b \frac{\partial E(a,b)}{\partial b}\right)$

*
* *

Let us have a look at a C++ implementation. First, as with the general regression, we see that the expressions of the gradient have quite a few invariants we can factor out. Using the identities

$\sum_{i=1}^n i=\frac{1}{2}n(n+1)$

and

$\sum_{i=1}^n i^2=\frac{1}{6}n(n+1)(2n+1)$,

we arrive at a constant-time computation of gradient in the iteration step (but it does take linear time, once, to initialize the invariants). Here’s how the actual code looks like:

void gradient_descent( const std::vector<float_x> & y,
float_x & a,
float_x & b )
{
const size_t n=y.size();
const float_x sx=sum_of_int(n-1);
const float_x sx2=sum_of_sqr(n-1);

float_x sy=0;
float_x sxy=0;

for (size_t i=0;i<n;i++)
{
sy+=y[i];
sxy+=i*y[i];
}

// initial guesses
b=0;
a=sy/n;

// iterations
float_x last_a,last_b;
float_x sa=1e-3; // for n=100
float_x sb=1e-6; // for n=100

do
{
last_a=a;
last_b=b;
float_x db= -(sxy - a*sx - b*sx2);
float_x da= -(sy - n*a - b*sx );

a-=sa*da;
b-=sb*db;
}
while ( (std::fabs(a-last_a)/a > 0.0001)
||
(std::fabs(b-last_b)/b > 0.0001));
}


*
* *

Where are the factors $2$ in front of the partial derivatives? Why are they gone? Think about it for a while, I’ll come back to it in a moment.

*
* *

Let us try it out!

Let $a=5$ and $b=3$ (why not) and let us add some uniform random noise $\varepsilon \sim \mathcal{U}(-1,1)$. With $n=100$, and not particularly efficient initial values for $a$ and $b$, we have the following convergence:

So we see the parameters adjusting smoothly (which will not always be the case; more complex functions may yield surprises) and that we arrive to very good estimates after $\approx 250$ iterations (which, in this case, are inexpensive as being constant-time). After $\approx 475$ iterations, the algorithm bails out as it has detected convergence.

*
* *

We said earlier that $\eta$ might be function-specific. In the case of linear regression by means of gradient descent, the following $\eta$s seems to be working fine:

 $n$ $\eta_a$ $\eta_b$ 100 1e-3 1e-6 1000 1e-4 1e-9 10000 1e-5 1e-12 100000 1e-6 1e-15

So $\eta_a$ and $\eta_b$ are the reason why we can drop the $2$ in front of the derivatives: the constant is absorbed into the $\eta_a$ and $\eta_b$ and so can be ignored.

*
* *

While this seems to be quite laborious for linear regression, gradient descent proves most useful in all kind of optimization problems, including machine learning. In machine learning gradient descent (and back-propagation) is used to train neural networks of all kinds. Newton’s method for finding square roots (last week’s post) can be thought of as gradient-descent type algorithm with an expression for $\eta$ that depends explicitly on $x$.

We will certainly discuss gradient-descent methods again future posts.

### 2 Responses to Introduction to Gradient Descent

1. Manish says:

Can you please explain the same thing with a small example?

• Ok. Let $f(x)=x^2$. If you want to find the minimum of $f(x)$, you can either solve analytically the equation (and find trivially that $x=0$) or use gradient descend, start at some reasonnable guess $x_0$ and use gradient descent to find $x_{min}$. You will find

$\displaystyle \frac{\partial f(x)}{\partial x}=2x$

So you would iterate

$x_t = x_{t-1}-\eta \displaystyle \frac{\partial f(x)}{\partial x} = x_{t-1}-2\eta x$,

starting at $t=0$ with reasonnable guess $x_0$. Of course you can conflapulate $2\eta$ into $\eta'$, and get $x_t=x_{t-1}-2\eta' x$.

Normally, you don’t get functions that are trivially analytically solvable. Usually, you get monsters with lots of internal parameters and non-separable expression, and you have to use gradient descent.

Let re-do the $x^2$ example with a “more realistic” example: $g(x)=(ax+b)^2+c$—that’s still a basic quadratic, but now the minimum is not zero, it’s $c$, found at $x=-\frac{b}{a}$. Applying gradient descent:

$\displaystyle \frac{\partial g(x)}{\partial x}=2a(ax+b)$.

We start at some $x_0$ (possibly $x_0=0$ if you have a priori assumptions that $b$ should be small-ish), and we iterate:

$x_t=x_{t-1}+\displaystyle \frac{\partial g(x)}{\partial x}=x_{t-1}+\eta a(ax+b)$.

Note that formally, we do not observe $a$, $b$, and $c$. We can however estimate

$\displaystyle \frac{\partial g(x)}{\partial x} \approx \frac{g(x)-g(x+\varepsilon)}{\varepsilon}$

for a small-ish $\varepsilon>0$. The choice of $\eta$ is also very important and related to the (local) curvature of the function. If you pick it too large, you oscillate around the solution (and if the function is not very well-behaved, it can even bring you further away from the solution), if you pick it too small, it takes forever to get to the solution.

With the preceeding example, with $a=2$, $b=1.2$, and $c=3$, with initial guess $x_0=2$, the gradient descent produces the following path (in magenta):

(right-click and “view image” to view image to full size).

I hope that helped a bit.