Gradient descent algorithm for multiple linear regression
(i.e. for linear regression with multiple variables)
Terminology (notation used):
$n$ - this is the number of features (the number of input variables)
$X_1, X_2, \dots, X_n$ - these are the features
$X_j$ - the $j$-th feature ($j = 1,2,\dots,n$)
$Y$ - the target (output) variable
$m$ - the size of the training set (the number of training examples)
$X_j^{(i)}$ - the value of the $j$-th feature for the $i$-th training example ($i = 1,2,\dots,m$)
Let's assume that we have the following features and training examples.

| # | $X_1$ | $X_2$ | $X_3$ | ... | $X_n$ | $Y$ |
|---|-------|-------|-------|-----|-------|-----|
| 1 | 2100 | 5 | 1 | ... | 45 | 460 |
| 2 | 1400 | 3 | 2 | ... | 40 | 235 |
| 3 | 1530 | 3 | 2 | ... | 30 | 315 |
| ... | ... | ... | ... | ... | ... | ... |
| $m$ | 850 | 2 | 1 | ... | 35 | 175 |
Here we have that
$X_1^{(3)} = 1530$, $X_2^{(2)} = 3$, $X_n^{(3)} = 30$, $X_n^{(m)} = 35$
$\overrightarrow{X}^{(i)}$ - this is a row vector containing the feature values of the $i$-th training example (basically the $i$-th row from the table above without the $Y$ value)
$\overrightarrow{X}^{(i)} = (X_1^{(i)}, X_2^{(i)}, \dots, X_n^{(i)})$, for $i=1,2,\dots,m$
Note: The training examples are the rows from the table above.
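As an illustration, the training set above can be stored as an $m \times n$ matrix (one training example per row) together with a vector of the $Y$ values. A minimal NumPy sketch, using only the four rows and the four feature columns that appear explicitly in the table (so $m=4$ and $n=4$ in this sketch):

```python
import numpy as np

# The training examples and feature columns shown explicitly in the table above.
X = np.array([
    [2100, 5, 1, 45],
    [1400, 3, 2, 40],
    [1530, 3, 2, 30],
    [ 850, 2, 1, 35],
])
Y = np.array([460, 235, 315, 175])

m, n = X.shape       # m training examples, n features
print(X[2, 0])       # X_1^{(3)} = 1530  (NumPy uses 0-based indexing)
print(X[0, :])       # the row vector X^{(1)}, i.e. the 1st training example
```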
The linear regression model for a single feature was defined as follows.
$f_{w,b}(x) = w \cdot x + b \tag{1}$
The linear regression model for multiple features is defined as follows.
$f_{w,b}(x) = \sum_{j=1}^n w_j \cdot x_j + b \tag{2}$
We can rewrite this model in vector notation. Let us denote
$\overrightarrow {w} = (w_1, w_2, \dots, w_n)$ - this is a row vector of $n$ scalars (these are the weights assigned to the features)
$b$ - this is a scalar (a number); the value $b$ is called the bias
$\overrightarrow {x} = (x_1, x_2, \dots, x_n)$ - this is a row vector of $n$ scalars (the input variables to our model)
Then for the multiple linear regression model we obtain
$f_{\overrightarrow{w},b}(\overrightarrow{x}) = \overrightarrow{w} \cdot \overrightarrow{x} + b \tag{3}$
Here $\overrightarrow{w}$ and $b$ are the parameters of the multiple linear regression model.
In this notation $\overrightarrow{w}\cdot \overrightarrow{x}$ is the dot product of the vectors $\overrightarrow{w}$ and $\overrightarrow{x}$ i.e.
$\overrightarrow{w} \cdot \overrightarrow{x} = \sum_{j=1}^n w_j \cdot x_j \tag{4}$
Note: This is called multiple linear regression, and not multivariate linear regression. The term multivariate linear regression in machine learning is used for something else.
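In code, model (3) is just a dot product plus a scalar. A minimal NumPy sketch (the function name `predict` is ours, chosen for illustration):

```python
import numpy as np

def predict(x, w, b):
    """Multiple linear regression model f_{w,b}(x) = w . x + b, as in equation (3)."""
    return np.dot(w, x) + b
```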
For example, in the table above $Y$ might be the price of a house (in 1000s of US dollars) which we are trying to estimate (i.e. build a model for), $X_1$ might be the size of the house in square feet, $X_2$ the number of bedrooms, $X_3$ the number of floors, and $X_n$ the age of the house in years (let us say $n=4$). Then one possible linear model for the price $Y$ of the house could be as follows:
$f_{w,b}(x) = 0.1 \cdot x_1 + 4 \cdot x_2 + 10 \cdot x_3 + (-2) \cdot x_4 + 80 \tag{5}$
Here we have $b=80$, which can be interpreted as the base price of a house.
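For instance, plugging the first training example from the table into model (5) (with $n=4$, so $x_4 = 45$ is the age) gives the prediction

$$f_{w,b}(\overrightarrow{x}^{(1)}) = 0.1 \cdot 2100 + 4 \cdot 5 + 10 \cdot 1 + (-2) \cdot 45 + 80 = 210 + 20 + 10 - 90 + 80 = 230$$

i.e. an estimated price of 230 thousand US dollars. (The parameter values in (5) are only illustrative; they were not fitted to the table, which is why the prediction differs from the actual price 460 of that house.)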
We denote
$\hat{y}^{(i)} = f_{\overrightarrow{w}, b} (\overrightarrow{x}^{(i)}) = \sum_{j=1}^n w_j \cdot x_j^{(i)} + b\tag{6}$
This value is the predicted value (by the linear regression model) which
corresponds to the $i$-th training example (here $i=1,2,\dots,m$).
We also have
$y^{(i)}$ - the actual value corresponding to the $i$-th training example ($i=1,2,\dots,m$).
The least mean squares (LMS) cost function is defined as follows
$$J(\overrightarrow{w},b) = J(w_1, w_2, \dots, w_n, b) = \frac{1}{2m}\sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2 \tag{7}$$
$$J(\overrightarrow{w},b) = \frac{1}{2m}\sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})^2 \tag{8}$$
$$J(\overrightarrow{w},b) = \frac{1}{2m} \sum_{i=1}^m (w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b - y ^ {(i)})^2 \tag{9}$$
So we get that
$$J(\overrightarrow{w},b) = \frac{1}{2m}\sum_{i=1}^m [ (\sum_{j=1}^n w_j \cdot x_j^{(i)}) + b - y^{(i)}]^2 \tag{10}$$
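As an illustration, the cost (10) can be computed in a vectorized way. A minimal NumPy sketch (here `X` is the $m \times n$ matrix of training examples, `y` is the vector of actual values, and the function name `compute_cost` is ours):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Cost J(w, b) from equation (10)."""
    m = X.shape[0]
    errors = X @ w + b - y          # f_{w,b}(x^{(i)}) - y^{(i)} for every example i
    return np.sum(errors ** 2) / (2 * m)
```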
The gradient descent algorithm essentially varies the parameters $w_j$ ($j=1,2,\dots,n$) and $b$ of the model in order to find the values that minimize $J$ (recall that $J$ is a function of these $(n+1)$ parameters).
So we compute
$$\frac{\partial{J}}{\partial{w_j}} = \frac{1}{m}\sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \tag{11}$$
(here $j=1,2,\dots,n$)
and
$$\frac{\partial{J}}{\partial{b}} = \frac{1}{m}\sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)}) \tag{12}$$
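Likewise, the partial derivatives (11) and (12) can be computed for all $j$ at once. A sketch under the same assumptions as the cost function sketch above:

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Partial derivatives of J from equations (11) and (12)."""
    m = X.shape[0]
    errors = X @ w + b - y          # f_{w,b}(x^{(i)}) - y^{(i)} for every example i
    dj_dw = (X.T @ errors) / m      # equation (11), one component per w_j
    dj_db = np.sum(errors) / m      # equation (12)
    return dj_dw, dj_db
```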
Now that we have formulas for these partial derivatives, the gradient descent algorithm for multiple linear regression goes as follows.
1. Pick a small positive number $\alpha$ (the learning rate).
2. Repeat these four actions until convergence
2.1.
$$temporary\_w_{j} = w_{j} - \alpha \cdot \frac{\partial{J}}{\partial{w_j}}$$ for $j=1,2,\dots,n$
(use the partial derivative expression from (11))
2.2.
$$temporary\_b = b - \alpha \cdot \frac{\partial{J}}{\partial{b}}$$
(use the partial derivative expression from (12))
2.3.
$$w_j = temporary\_w_j$$ for $j=1,2,\dots,n$
2.4.
$$b = temporary\_b$$
- The temporary variables here are used to underline the fact that the algorithm is doing a simultaneous update of the (n+1) parameters $w_1, w_2, \dots, w_n, b$. That means we should not change any of the parameters before we have computed the new values for all the parameters.
- Convergence means that the values of the (n+1) parameters $w_1, w_2, \dots, w_n, b$ have settled, i.e. they all change very little after a certain number of iterations (say after K iterations/steps). To detect convergence we can use some sort of threshold, e.g. we can assume the algorithm has converged once every parameter changes by less than 0.05% between iterations.
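Putting the pieces together, here is a minimal NumPy sketch of the whole loop. It reuses the illustrative `compute_gradients` helper defined above; the learning rate, the iteration budget and the 0.05% relative-change threshold are just example choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=1e-7, max_iters=100_000, tol=0.0005):
    """Gradient descent for multiple linear regression with a simultaneous update."""
    m, n = X.shape
    w = np.zeros(n)                       # initial weights
    b = 0.0                               # initial bias
    for _ in range(max_iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b)
        # Simultaneous update: both new values are computed from the old w and b.
        new_w = w - alpha * dj_dw
        new_b = b - alpha * dj_db
        # Crude convergence check: every parameter changed by less than tol (0.05%).
        converged = (np.all(np.abs(new_w - w) <= tol * np.abs(w)) and
                     abs(new_b - b) <= tol * abs(b))
        w, b = new_w, new_b
        if converged:
            break
    return w, b
```

In practice the features in the table have very different scales (e.g. square feet vs. number of floors), so one would normally rescale them before running the loop; otherwise a very small learning rate is needed for the updates not to diverge.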
This completes our description of the gradient descent algorithm for multiple linear regression.