
Machine Learning Notes 4

Gradient descent algorithm for multiple linear regression

(i.e. for linear regression with multiple variables)


Terminology (notation used):

$n$ - this is the number of features (the number of input variables)

$X_1, X_2, \dots, X_n$ - these are the features

$X_j$ - the $j$-th feature ($j = 1,2,\dots,n$)

$Y$ - the target (output) variable

$m$ - the size of the training set (the number of training examples)

$X_j^{(i)}$ - the $i$-th training value from the $j$-th feature ($i = 1,2,\dots,m$)

Let's assume that we have the following features and training examples.
 
| #   | $X_1$ | $X_2$ | $X_3$ | ... | $X_n$ | $Y$ |
|-----|-------|-------|-------|-----|-------|-----|
| 1   | 2100  | 5     | 1     | ... | 45    | 460 |
| 2   | 1400  | 3     | 2     | ... | 40    | 235 |
| 3   | 1530  | 3     | 2     | ... | 30    | 315 |
| ... | ...   | ...   | ...   | ... | ...   | ... |
| $m$ | 850   | 2     | 1     | ... | 35    | 175 |

Here we have, for example, that $X_1^{(3)} = 1530$, $X_2^{(2)} = 3$, $X_n^{(3)} = 30$, and $X_n^{(m)} = 35$.

$\overrightarrow{X}^{(i)}$ - this is a row vector containing the $i$-th training value of every feature (basically the $i$-th row from the table above without the $Y$ value), so this is the $i$-th training example

$\overrightarrow{X}^{(i)} = (X_1^{(i)}, X_2^{(i)}, \dots, X_n^{(i)})$, for $i=1,2,\dots,m$ 

Note: The training examples are the rows from the table above.
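
To make the indexing concrete, here is a minimal sketch in Python with NumPy (a choice of representation, not something prescribed by the notes) that stores the training set as an $m \times n$ matrix whose rows are the training examples, taking $n = 4$ and $m = 4$ and using only the values visible in the table above. NumPy indexing is 0-based, so $X_j^{(i)}$ corresponds to `X[i-1, j-1]`.

```python
import numpy as np

# Training examples as rows of an (m x n) matrix.
# For concreteness we take n = 4 and m = 4, keeping only the values
# that are visible in the table above (the elided rows/columns are dropped).
X = np.array([
    [2100.0, 5.0, 1.0, 45.0],
    [1400.0, 3.0, 2.0, 40.0],
    [1530.0, 3.0, 2.0, 30.0],
    [ 850.0, 2.0, 1.0, 35.0],
])
Y = np.array([460.0, 235.0, 315.0, 175.0])  # the target (output) values

m, n = X.shape  # m training examples, n features

# X_j^{(i)} corresponds to X[i-1, j-1] because NumPy indexing starts at 0:
print(X[2, 0])   # X_1^{(3)} = 1530.0
print(X[1, 1])   # X_2^{(2)} = 3.0
print(X[2, 3])   # X_n^{(3)} = 30.0 (with n = 4)
print(X[0, :])   # the first training example, i.e. the row vector X^{(1)}
```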

The linear regression model for a single feature was defined as follows.

$f_{w,b}(x) = w \cdot x + b \tag{1}$

The linear regression model for multiple features is defined as follows.

$f_{w,b}(x_1, x_2, \dots, x_n) = \sum_{j=1}^n w_j \cdot x_j + b \tag{2}$

We can rewrite this model in vector notation. Let us denote 

$\overrightarrow {w} = (w_1, w_2, \dots, w_n)$ - this is a row vector of $n$ scalars (these are the weights assigned to the features)

$b$ - this is a scalar (a number), this value $b$ is the bias

$\overrightarrow {x} = (x_1, x_2, \dots, x_n)$ - this is a row vector of $n$ scalars (the input variables to our model)

Then for the multiple linear regression model we obtain

$f_{\overrightarrow{w},b}(\overrightarrow{x}) = \overrightarrow{w} \cdot \overrightarrow{x} + b \tag{3}$

Here $\overrightarrow{w}$ and $b$ are the parameters of the multiple linear regression model.

In this notation $\overrightarrow{w}\cdot \overrightarrow{x}$ is the dot product of the vectors $\overrightarrow{w}$ and $\overrightarrow{x}$ i.e. 

$\overrightarrow{w} \cdot \overrightarrow{x} = \sum_{j=1}^n w_j \cdot x_j \tag{4}$

Note: This model is called multiple linear regression, not multivariate linear regression. In machine learning the term multivariate regression refers to something else (models that predict several output variables at once).

For example, in the table above $Y$ might be the price of a house (in 1000s of US dollars) which we are trying to estimate (i.e. build a model for), $X_1$ might be the size of the house in square feet, $X_2$ the number of bedrooms, $X_3$ the number of floors, and $X_n$ the age of the house in years (let us say $n=4$, so $X_n = X_4$). Then one possible linear model for the price $Y$ of the house could be as follows:

$f_{\overrightarrow{w},b}(\overrightarrow{x}) = 0.1 \cdot x_1 + 4 \cdot x_2 + 10 \cdot x_3 + (-2) \cdot x_4 + 80 \tag{5}$ 

Here we have $b=80$ (i.e. \$80{,}000, since the price is in 1000s of dollars), and this value is called the base price. 
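
As a quick sanity check of formulas (2), (3) and (4), the following sketch (Python/NumPy; the input house below is made up) evaluates the example model (5) both as an explicit sum over the features and as a dot product, and both forms give the same prediction:

```python
import numpy as np

# Parameters of the example model (5); the price is in 1000s of dollars
w = np.array([0.1, 4.0, 10.0, -2.0])  # weights for size, bedrooms, floors, age
b = 80.0                              # bias (the "base price")

# A hypothetical house: 1200 sq. ft., 3 bedrooms, 1 floor, 40 years old
x = np.array([1200.0, 3.0, 1.0, 40.0])

# Formula (2): explicit sum over the features
f_sum = sum(w[j] * x[j] for j in range(len(x))) + b

# Formulas (3)/(4): dot product form
f_dot = np.dot(w, x) + b

print(f_sum, f_dot)  # both give 0.1*1200 + 4*3 + 10*1 - 2*40 + 80 = 142.0
```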

We denote 

$\hat{y}^{(i)} = f_{\overrightarrow{w}, b} (\overrightarrow{x}^{(i)}) = \sum_{j=1}^n w_j \cdot x_j^{(i)} + b\tag{6}$ 
This is the value predicted by the linear regression model for the $i$-th training example (here $i=1,2,\dots,m$). 

We also have 

$y^{(i)}$ - the actual value corresponding to the $i$-th training example ($i=1,2,\dots,m$).

The squared error cost function (also called the mean squared error cost function) is defined as follows:

$$J(\overrightarrow{w},b) = J(w_1, w_2, \dots, w_n, b) = \frac{1}{2m}\sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2 \tag{7}$$

$$J(\overrightarrow{w},b) = \frac{1}{2m}\sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})^2 \tag{8}$$

$$J(\overrightarrow{w},b) = \frac{1}{2m} \sum_{i=1}^m (w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b - y ^ {(i)})^2 \tag{9}$$   

So we get that 

$$J(\overrightarrow{w},b) = \frac{1}{2m}\sum_{i=1}^m [  (\sum_{j=1}^n w_j \cdot x_j^{(i)}) + b - y^{(i)}]^2 \tag{10}$$ 
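
Formula (10) translates directly into code. Below is a minimal sketch of the cost computation in Python/NumPy (the function name and the array shapes are assumptions made for illustration: `X` is the $m \times n$ matrix of training examples, `Y` the vector of actual values, `w` the weight vector and `b` the bias):

```python
import numpy as np

def compute_cost(X, Y, w, b):
    """Cost J(w, b) from formula (10).

    X : (m, n) matrix whose rows are the training examples
    Y : (m,)   vector of the actual values y^{(i)}
    w : (n,)   weight vector
    b : scalar bias
    """
    m = X.shape[0]
    predictions = X @ w + b      # y_hat^{(i)} = w . x^{(i)} + b for every i
    errors = predictions - Y     # y_hat^{(i)} - y^{(i)}
    return np.sum(errors ** 2) / (2 * m)
```

For example, with the arrays from the first sketch, `compute_cost(X, Y, np.zeros(4), 0.0)` gives the cost of the model that predicts $0$ for every house.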

The gradient descent algorithm iteratively varies the parameters $w_j$ ($j=1,2,\dots,n$) and $b$ of the model in order to minimize the cost $J$ (recall that $J$ is a function of these $(n+1)$ parameters). 

So we compute 

$$\frac{\partial{J}}{\partial{w_j}} = \frac{1}{m}\sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \tag{11}$$ 

(here $j=1,2,\dots,n$)

and 

$$\frac{\partial{J}}{\partial{b}} = \frac{1}{m}\sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)}) \tag{12}$$ 
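
Formulas (11) and (12) can be evaluated for all $j$ at once with a matrix-vector product. Here is a minimal sketch, with the same assumed array shapes as in the cost function above:

```python
import numpy as np

def compute_gradients(X, Y, w, b):
    """Partial derivatives of J(w, b) from formulas (11) and (12)."""
    m = X.shape[0]
    errors = (X @ w + b) - Y      # f_{w,b}(x^{(i)}) - y^{(i)} for every i
    dJ_dw = (X.T @ errors) / m    # formula (11): one entry per weight w_j
    dJ_db = np.sum(errors) / m    # formula (12)
    return dJ_dw, dJ_db
```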

Now that we have formulas for these partial derivatives, the gradient descent algorithm for multiple linear regression goes as follows. 

1. Pick a learning rate $\alpha > 0$ (a small positive number). 
2. Repeat the following four steps until convergence: 

2.1. 
$$temporary\_w_{j} = w_{j} - \alpha \cdot \frac{\partial{J}}{\partial{w_j}}$$ for $j=1,2,\dots,n$

(use the partial derivative expression from (11))

2.2.

$$temporary\_b = b - \alpha \cdot \frac{\partial{J}}{\partial{b}}$$

(use the partial derivative expression from (12))

2.3.

$$w_j = temporary\_w_j$$ for $j=1,2,\dots,n$

2.4. 

$$b = temporary\_b$$

  • The temporary variables here emphasize that the algorithm performs a simultaneous update of the $(n+1)$ parameters $w_1, w_2, \dots, w_n, b$: none of the parameters is changed before the new values of all of them have been computed. 

  • Convergence means that the values of the $(n+1)$ parameters $w_1, w_2, \dots, w_n, b$ have settled, i.e. they all change only very little after a certain number of iterations (say after $K$ iterations/steps). To detect convergence we can use a threshold, e.g. we can assume the algorithm has converged if all parameters change by less than 0.05% after a certain number of iterations/steps.

This completes our description of the gradient descent algorithm for multiple linear regression.
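
To tie everything together, here is a minimal sketch of the whole procedure in Python/NumPy. The starting point $\overrightarrow{w} = \overrightarrow{0}$, $b = 0$, the default learning rate, the iteration cap and the relative convergence threshold are illustrative assumptions, not part of the algorithm itself; in practice $\alpha$ has to be tuned to the scale of the features for the iterations to converge.

```python
import numpy as np

def gradient_descent(X, Y, alpha=1e-7, num_iters=100_000, tol=1e-6):
    """Gradient descent for multiple linear regression (steps 1-2.4 above).

    X : (m, n) matrix of training examples, Y : (m,) vector of actual values.
    Starts from w = 0, b = 0 and stops when every one of the (n+1) parameters
    changes by less than the relative threshold tol, or after num_iters steps.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        errors = (X @ w + b) - Y          # f_{w,b}(x^{(i)}) - y^{(i)} for every i
        dJ_dw = (X.T @ errors) / m        # formula (11), all j at once
        dJ_db = np.sum(errors) / m        # formula (12)

        # Simultaneous update (steps 2.1-2.4): the new values are computed
        # from the old parameters before any parameter is overwritten.
        new_w = w - alpha * dJ_dw
        new_b = b - alpha * dJ_db

        # Convergence test (see the note above): all parameters have settled.
        converged = (np.all(np.abs(new_w - w) <= tol * (np.abs(w) + 1e-12))
                     and abs(new_b - b) <= tol * (abs(b) + 1e-12))
        w, b = new_w, new_b
        if converged:
            break
    return w, b
```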