**Gradient Descent Algorithm**

Let $J(w,b)$ be any cost function of two variables. If we want to be specific, we can assume that $J(w,b)$ is the usual ** mean squared error cost function** from the

*linear regression model.*

**univariate**1) The algorithm first sets two arbitrary values into $w$ and $b$ (e.g. $w := 0,\ b := 0$)

2) Then a loop begins. At each step of that loop the algorithm applies these two updates simultaneously

2.1) $w := w - \alpha \cdot \frac{\partial J}{\partial w}(w,b)$

2.2) $b := b - \alpha \cdot \frac{\partial J}{\partial b}(w,b)$

Here $\alpha$ is some small positive number e.g. $\alpha=0.01$

This number $\alpha$ is called the ** learning rate**. It controls the size of the step which the algorithm takes downhill in each iteration of the loop.

3) The loop repeats step 2) until the values of $(w,b)$ eventually settle at some point $(w_0, b_0)$. Settle means that after a number of steps, the values of $w$ and $b$ * no longer change much *with each additional step of the algorithm. At that point, the algorithm exits the loop. The point $(w_0, b_0)$ is the point where $J(w,b)$ has a local minimum. We also say that the algorithm converges to the point $(w_0, b_0)$ after a number of steps.

**Note 1: **A correct implementation ensures that the updates in step 2) are done simultaneously. What does that mean? That means in the two right hand sides of the assignments 2.1) and 2.2), we should make sure we are using "the old values" of $w$ and $b$. Usually this is implemented in the following way.

$tmp\_w := w - \alpha \cdot \frac{\partial J}{\partial w}(w,b)$

$tmp\_b := b - \alpha \cdot \frac{\partial J}{\partial b}(w,b)$

$w := tmp\_w$

$b := tmp\_b$

**Note 2: **Choosing a good value for $\alpha$ is very important. If $\alpha$ is too small, the gradient descent algorithm will work correctly but might be too slow. If $\alpha$ is too big, the gradient descent algorithm may overshoot (as they call it), i.e. it may miss the local minimum, i.e. the algorithm may diverge instead of converge (this is usually detected by the condition that after several iterations of the algorithm, the cost $J(w,b)$ actually gets bigger instead of getting smaller).