Gradient descent algorithm for multiple linear regression
(i.e. for linear regression with multiple variables)
Terminology (notation used):
$n$ - the number of features (the number of input variables)
$X_1, X_2, \dots, X_n$ - the features
$X_j$ - the $j$-th feature ($j = 1, 2, \dots, n$)
$Y$ - the target (output) variable
$m$ - the size of the training set (the number of training examples)
$X_j^{(i)}$ - the $i$-th training value of the $j$-th feature ($i = 1, 2, \dots, m$)
Let's assume that we have the following features and training examples.

# | X1 | X2 | X3 | ... | Xn | Y |
1 | 2100 | 5 | 1 | ... | 45 | 460 |
2 | 1400 | 3 | 2 | ... | 40 | 235 |
3 | 1530 | 3 | 2 | ... | 30 | 315 |
... | ... | ... | ... | ... | ... | ... |
m | 850 | 2 | 1 | ... | 35 | 175 |
Here we have that
$X_1^{(3)} = 1530$, $X_2^{(2)} = 3$, $X_n^{(3)} = 30$, $X_n^{(m)} = 35$
$\vec{X}^{(i)}$ - this is a row vector of all the $i$-th training values (basically the $i$-th row from the table above without the $Y$ value), so this is the $i$-th training example
$\vec{X}^{(i)} = (X_1^{(i)}, X_2^{(i)}, \dots, X_n^{(i)})$, for $i = 1, 2, \dots, m$
Note: The training examples are the rows from the table above.
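To make this indexing notation concrete, here is a minimal sketch of how the table above could be stored in NumPy. The names X_train and y_train are made up for this illustration, and the omitted "..." columns are dropped, so here n = 4 and m = 4. Also note that Python indexing is 0-based, while the notation above is 1-based.

```python
import numpy as np

# Training features as an m x n matrix: one row per training example,
# one column per feature (values taken from the table above, "..." columns dropped).
X_train = np.array([
    [2100, 5, 1, 45],
    [1400, 3, 2, 40],
    [1530, 3, 2, 30],
    [ 850, 2, 1, 35],
])
# Target values Y, one per training example.
y_train = np.array([460, 235, 315, 175])

# Python indexing is 0-based, so X_j^(i) corresponds to X_train[i - 1, j - 1].
print(X_train[2, 0])   # X_1^(3) = 1530
print(X_train[1, 1])   # X_2^(2) = 3
print(X_train[2, 3])   # X_n^(3) = 30  (here n = 4)
print(X_train[2])      # the whole 3rd training example, the row vector X^(3)
```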
The linear regression model for a single feature was defined as follows.
$$f_{w,b} = w \cdot x + b$$
The linear regression model for multiple features is defined as follows.
$$f_{w,b} = \sum_{j=1}^{n} w_j \cdot x_j + b$$
We can rewrite this model in vector notation. Let us denote
$\vec{w} = (w_1, w_2, \dots, w_n)$ - this is a row vector of $n$ scalars (these are the weights assigned to the features)
$b$ - this is a scalar (a number); this value $b$ is the bias
$\vec{x} = (x_1, x_2, \dots, x_n)$ - this is a row vector of $n$ scalars (the input variables to our model)
Then for the multiple linear regression model we obtain
$$f_{\vec{w},b} = \vec{w} \cdot \vec{x} + b$$
Here $\vec{w}$ and $b$ are the parameters of the multiple linear regression model.
In this notation $\vec{w} \cdot \vec{x}$ is the dot product of the vectors $\vec{w}$ and $\vec{x}$, i.e.
$$\vec{w} \cdot \vec{x} = \sum_{j=1}^{n} w_j \cdot x_j$$
Note: This is called multiple linear regression, and not multivariate linear regression. The term multivariate linear regression in machine learning is used for something else.
For example, in the table above $Y$ might be the price of a house (in 1000s of US dollars) which we are trying to estimate (i.e. build a model for), $X_1$ might be the size of the house in square feet, $X_2$ the number of bedrooms, $X_3$ the number of floors, and $X_4$ the age of the house in years (so here $n = 4$). Then one possible linear model for the price ($Y$) of the house could be as follows:
$$f_{\vec{w},b}(\vec{x}) = 0.1 \cdot x_1 + 4 \cdot x_2 + 10 \cdot x_3 + (-2) \cdot x_4 + 80$$
Here we have $b = 80$, and this value is called the base price.
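As a quick illustration, this example model can be evaluated with a dot product, exactly as in the vector form $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$ above. The following is a minimal NumPy sketch; the function name predict and the input house x are made up for this illustration, while the weights and bias are the ones from the example model above.

```python
import numpy as np

def predict(x, w, b):
    # f_{w,b}(x) = w . x + b  (dot product of weights and features, plus the bias)
    return np.dot(w, x) + b

w = np.array([0.1, 4.0, 10.0, -2.0])  # weights from the example model above
b = 80.0                              # the bias (the "base price")
x = np.array([1200, 3, 1, 40])        # hypothetical house: 1200 sq ft, 3 bedrooms, 1 floor, 40 years old
print(predict(x, w, b))               # 0.1*1200 + 4*3 + 10*1 + (-2)*40 + 80 = 142.0, i.e. about 142,000 USD
```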
We denote
$$\hat{y}^{(j)} = f_{\vec{w},b}(\vec{x}^{(j)}) = \sum_{i=1}^{n} w_i \cdot x_i^{(j)} + b$$
This is the value predicted by the linear regression model for the $j$-th training example (here $j = 1, 2, \dots, m$).
We also have
$y^{(j)}$ - the actual value corresponding to the $j$-th training example ($j = 1, 2, \dots, m$).
The least mean squares (LMS) cost function is defined as follows.
$$J(\vec{w}, b) = J(w_1, w_2, \dots, w_n, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$
$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2$$
$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( w_1 x_1^{(i)} + w_2 x_2^{(i)} + \dots + w_n x_n^{(i)} + b - y^{(i)} \right)^2$$
So we get that
$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left[ \left( \sum_{j=1}^{n} w_j \cdot x_j^{(i)} \right) + b - y^{(i)} \right]^2$$
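As a sketch of how this cost could be computed in practice, here is a minimal NumPy version. The names compute_cost, X and y are made up for this illustration; X is the $m \times n$ matrix of training examples and y is the vector of actual values.

```python
import numpy as np

def compute_cost(X, y, w, b):
    # X: (m, n) matrix of training examples, y: (m,) actual values,
    # w: (n,) weights, b: scalar bias.
    m = X.shape[0]
    predictions = X @ w + b        # f_{w,b}(x^(i)) for all i at once
    errors = predictions - y       # f_{w,b}(x^(i)) - y^(i)
    return np.sum(errors ** 2) / (2 * m)
```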
The gradient descent algorithm essentially varies the parameters $w_j$ and $b$ of the model in order to find the values which minimize the cost $J$ (since $J$ is a function of the parameters $b$ and $w_j$, $j = 1, 2, \dots, n$).
So we compute
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$$
(here $j = 1, 2, \dots, n$)
and
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
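These two formulas can be computed for all $j$ at once with a couple of matrix operations. Here is a minimal NumPy sketch, under the same assumptions as the cost sketch above; the name compute_gradients is made up for this illustration.

```python
import numpy as np

def compute_gradients(X, y, w, b):
    # Returns (dJ_dw, dJ_db): the partial derivatives of J w.r.t. w_1, ..., w_n and b.
    m = X.shape[0]
    errors = X @ w + b - y         # f_{w,b}(x^(i)) - y^(i), shape (m,)
    dJ_dw = X.T @ errors / m       # j-th entry: (1/m) * sum_i errors_i * x_j^(i)
    dJ_db = np.sum(errors) / m     # (1/m) * sum_i errors_i
    return dJ_dw, dJ_db
```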
Now that we have formulas for these partial derivatives, the gradient descent algorithm for multiple linear regression goes as follows.
1. Pick a small positive number $\alpha > 0$ (the learning rate).
2. Repeat these four actions until convergence
2.1.
$\text{temporary\_w}_j = w_j - \alpha \cdot \frac{\partial J}{\partial w_j}$ for $j = 1, 2, \dots, n$
(use the expression for $\frac{\partial J}{\partial w_j}$ derived above)
2.2.
$\text{temporary\_b} = b - \alpha \cdot \frac{\partial J}{\partial b}$
(use the expression for $\frac{\partial J}{\partial b}$ derived above)
2.3.
$w_j = \text{temporary\_w}_j$ for $j = 1, 2, \dots, n$
2.4.
$b = \text{temporary\_b}$
- The temporary variables here are used to underline the fact that the algorithm is doing a simultaneous update of the $(n+1)$ parameters $w_1, w_2, \dots, w_n, b$. That means we should not change any of the parameters before we have computed the new values for all the parameters.
- Convergence means that the values of the $(n+1)$ parameters $w_1, w_2, \dots, w_n, b$ have settled, i.e. they all change by very little after a certain number of iterations (say after $K$ iterations/steps). To detect convergence we can use some sort of threshold, e.g. we can assume the algorithm has converged if all parameters change by less than 0.05% after a certain number of iterations/steps (one possible way to implement such a check is sketched below).
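Putting the pieces together, here is a minimal NumPy sketch of the whole algorithm. The function name gradient_descent, the default values of alpha, max_iters and tol, and the per-iteration convergence check are all choices made up for this illustration; the 0.05% threshold mentioned above corresponds to tol = 0.0005.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, max_iters=10000, tol=0.0005):
    # X: (m, n) matrix of training examples, y: (m,) actual values.
    m, n = X.shape
    w = np.zeros(n)                   # initial weights w_1, ..., w_n
    b = 0.0                           # initial bias
    for _ in range(max_iters):
        errors = X @ w + b - y        # f_{w,b}(x^(i)) - y^(i)
        dJ_dw = X.T @ errors / m      # partial derivatives w.r.t. w_1, ..., w_n
        dJ_db = np.sum(errors) / m    # partial derivative w.r.t. b
        # Simultaneous update: compute all new values before overwriting anything.
        new_w = w - alpha * dJ_dw
        new_b = b - alpha * dJ_db
        # Convergence check: every parameter changed by less than tol (relative change).
        old = np.append(w, b)
        new = np.append(new_w, new_b)
        converged = np.all(np.abs(new - old) <= tol * np.abs(old) + 1e-12)
        w, b = new_w, new_b
        if converged:
            break
    return w, b
```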
This completes our description of the gradient descent algorithm for multiple linear regression.