**Gradient descent algorithm for univariate linear regression**

**(i.e. for linear regression with one variable)**

Let $J(w,b)$ be any cost function of two variables. To be specific, we can assume that $J(w,b)$ is the usual **mean squared error cost function** from the **univariate** *linear regression model*. See here: ML Notes 2
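As a concrete reference point, the MSE cost can be sketched like this (assuming the common convention with the $\frac{1}{2m}$ factor; the function name is my own):

```python
import numpy as np

def compute_cost(w, b, x, y):
    """Mean squared error cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    errors = w * x + b - y          # prediction minus target, for each example
    return np.sum(errors ** 2) / (2 * m)
```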

1) The algorithm first initializes $w$ and $b$ to arbitrary values (e.g. $w := 0,\ b := 0$)

2) Then a loop begins. At each step of that loop, the algorithm applies the following two updates simultaneously:

2.1) $w := w - \alpha \cdot \frac{\partial J}{\partial w}(w,b)$

2.2) $b := b - \alpha \cdot \frac{\partial J}{\partial b}(w,b)$

Here $\alpha$ is some small positive number, e.g. $\alpha = 0.01$.

This number $\alpha$ is called the **learning rate**. It controls the size of the step that the algorithm takes downhill in each iteration of the loop.

3) The loop repeats step 2) until the values of $(w,b)$ eventually settle at some point $(w_0, b_0)$. Settle means that after a number of steps, the values of $w$ and $b$ *no longer change much* with each additional step of the algorithm. At that point, the algorithm exits the loop. The point $(w_0, b_0)$ is a point where $J(w,b)$ has a local minimum. We also say that the algorithm converges to the point $(w_0, b_0)$ after a number of steps.
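Steps 1) through 3) can be sketched in code for the MSE cost of univariate linear regression. This is one possible implementation, not the only one; the tolerance-based stopping test is one common way to make "no longer change much" precise:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, tol=1e-8, max_iters=100_000):
    """Gradient descent for univariate linear regression with MSE cost."""
    w, b = 0.0, 0.0                         # step 1: arbitrary starting values
    m = len(x)
    for _ in range(max_iters):              # step 2: the update loop
        errors = w * x + b - y
        dj_dw = np.sum(errors * x) / m      # partial of J with respect to w
        dj_db = np.sum(errors) / m          # partial of J with respect to b
        new_w = w - alpha * dj_dw           # both updates computed from the
        new_b = b - alpha * dj_db           # old (w, b), as Note 1 requires
        if abs(new_w - w) < tol and abs(new_b - b) < tol:
            return new_w, new_b             # step 3: values settled -> converged
        w, b = new_w, new_b
    return w, b
```

For example, running it on data generated by the line $y = 2x + 1$ should recover $w \approx 2$ and $b \approx 1$.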

**Note 1:** A correct implementation ensures that the updates in step 2) are done simultaneously. What does that mean? It means that in the two right-hand sides of the assignments 2.1) and 2.2), we should make sure we are using "the old values" of $w$ and $b$. Usually this is implemented in the following way.

$tmp\_w := w - \alpha \cdot \frac{\partial J}{\partial w}(w,b)$

$tmp\_b := b - \alpha \cdot \frac{\partial J}{\partial b}(w,b)$

$w := tmp\_w$

$b := tmp\_b$
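As a small illustration of why the order matters, here is the temporary-variable pattern on a toy cost $J(w,b) = w^2 + b^2$ (chosen only because its partial derivatives, $2w$ and $2b$, are easy to check by hand):

```python
# Toy cost J(w, b) = w**2 + b**2; its partial derivatives are 2w and 2b.
def dj_dw(w, b):
    return 2 * w

def dj_db(w, b):
    return 2 * b

alpha = 0.1
w, b = 1.0, 1.0

# Correct (simultaneous): both right-hand sides use the OLD (w, b).
tmp_w = w - alpha * dj_dw(w, b)
tmp_b = b - alpha * dj_db(w, b)
w, b = tmp_w, tmp_b
# A sequential version (w updated first, then b computed from the new w)
# would feed the already-updated w into dj_db, which is the bug Note 1 warns about.
```

In Python specifically, the tuple assignment `w, b = tmp_w, tmp_b` performs both assignments in one step, which is why this idiom is a natural fit for the simultaneous update.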

**Note 2:** Choosing a good value for $\alpha$ is very important. If $\alpha$ is too small, the gradient descent algorithm will still work correctly but may be too slow. If $\alpha$ is too big, the gradient descent algorithm may *overshoot*, i.e. it may repeatedly step past the local minimum, so the algorithm may diverge instead of converge. Divergence is usually detected by the condition that after several iterations of the algorithm, the cost $J(w,b)$ actually gets bigger instead of smaller.
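Both behaviors from Note 2 can be seen on a toy one-variable cost $J(w) = w^2$, whose derivative is $2w$ (the function name and step counts here are just for illustration):

```python
def run(alpha, steps=10, w=1.0):
    """Gradient descent on the toy cost J(w) = w**2 (derivative 2w).
    Returns the cost after each step so we can watch it shrink or grow."""
    costs = []
    for _ in range(steps):
        w = w - alpha * 2 * w       # update rule: w := w - alpha * dJ/dw
        costs.append(w ** 2)
    return costs

small = run(alpha=0.01)   # cost shrinks every step: converging (but slowly)
big = run(alpha=1.5)      # cost grows every step: overshooting, i.e. diverging
```

Monitoring the cost after each iteration, exactly as this sketch does, is the standard practical check: a rising cost curve is the signal that $\alpha$ is too big.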

This algorithm is also known as **batch gradient descent**, which refers to the fact that each step of the algorithm uses all the training examples (i.e. at each step the entire training set is used). There are also other versions of the gradient descent algorithm which at each step use only a subset of the entire training set.
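One such variant, commonly called mini-batch gradient descent, performs each update using only a randomly chosen subset of the examples. A minimal sketch for the univariate MSE case (the function name and NumPy-based setup are my own):

```python
import numpy as np

def minibatch_step(w, b, x, y, alpha, batch_size, rng):
    """One gradient descent update computed from a random subset of the
    training set (mini-batch), instead of all m examples (batch)."""
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]                      # the sampled mini-batch
    errors = w * xb + b - yb
    w -= alpha * np.sum(errors * xb) / batch_size
    b -= alpha * np.sum(errors) / batch_size
    return w, b
```

Each step is cheaper than a full-batch step because it touches only `batch_size` examples, at the price of a noisier gradient estimate; repeated over many steps it still drives $(w,b)$ toward the minimum.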