In this report, we solve linear regression with both the closed-form solution and the gradient descent method on a small dataset. We then tune hyper-parameters such as the learning rate to optimize the gradient descent model.
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.[2] Closed-form solution and gradient descent are two methods for solving simple linear regression. Motivations of the report are listed below:
- Further understand linear regression, the closed-form solution, and gradient descent.
- Conduct some experiments on a small-scale dataset.
- Get hands-on experience with the optimization process and hyper-parameter tuning.
The equation of simple linear regression can be described as

$$f(x_i) = w^\top x_i + b. \qquad (1)$$

Let

$$\hat{w} = \begin{pmatrix} b \\ w \end{pmatrix}, \quad \hat{x}_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}, \quad X = \begin{pmatrix} \hat{x}_1^\top \\ \vdots \\ \hat{x}_n^\top \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},$$

and then equation (1) can be changed into

$$f(\hat{x}_i) = \hat{w}^\top \hat{x}_i. \qquad (2)$$

The mean square loss of simple linear regression is

$$L_{reg}(\hat{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{w}^\top \hat{x}_i \right)^2 = \frac{1}{n} \left\lVert y - X\hat{w} \right\rVert_2^2. \qquad (3)$$

The corresponding gradient with respect to $\hat{w}$ in simple linear regression is

$$\frac{\partial L_{reg}}{\partial \hat{w}} = -\frac{2}{n} X^\top \left( y - X\hat{w} \right). \qquad (4)$$
To minimize the mean square loss $L_{reg}$, we can use the closed-form solution or the gradient descent method.
Setting the gradient (4) to zero yields

$$\hat{w}^{*} = \left( X^\top X \right)^{-1} X^\top y. \qquad (5)$$

If the matrix $X^\top X$ is a full-rank matrix or a positive definite matrix, then its inverse exists, so we can use equation (5) to calculate the best weight vector $\hat{w}^{*}$.
However, in many cases the inverse of $X^\top X$ may not exist, so the closed-form solution cannot always work. Fortunately, gradient descent can help.
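Before turning to gradient descent, it is worth noting a common fallback when $X^\top X$ is singular: the Moore-Penrose pseudo-inverse still returns a least-squares solution. This fallback is not used in our own code below, which simply reports the failure; the following is only a minimal NumPy sketch of the idea (the helper name is ours).

import numpy as np

def closed_form_pinv(X, y):
    """Least-squares weights via the pseudo-inverse; works even when X.T @ X is singular."""
    X = np.hstack((np.ones((y.shape[0], 1)), X))  # absorb the bias term into the weight vector
    w = np.linalg.pinv(X) @ y                     # minimum-norm least-squares solution
    return w[1:, :], w[0, 0]                      # weights and bias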
Gradient Descent (GD) tries to minimize the loss function by repeatedly updating the weight vector in the direction opposite to the gradient, scaled by the learning rate $\eta$:

$$\hat{w} \leftarrow \hat{w} - \eta \frac{\partial L_{reg}}{\partial \hat{w}}. \qquad (6)$$

In our linear regression model, with the constant factor absorbed into $\eta$, it looks like this:

$$\hat{w} \leftarrow \hat{w} + \eta\, X^\top \left( y - X\hat{w} \right). \qquad (7)$$

With $L_2$ regularization, the loss function (3) can be changed into the objective function

$$J(\hat{w}) = L_{reg}(\hat{w}) + \frac{\lambda}{2} \left\lVert \hat{w} \right\rVert_2^2,$$

where $\lambda$ is the penalty factor. Then equation (7) becomes

$$\hat{w} \leftarrow \hat{w} + \eta \left( X^\top \left( y - X\hat{w} \right) - \lambda \hat{w} \right). \qquad (8)$$
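As a quick sanity check on equation (8), the sketch below compares the update direction $X^\top(y - X\hat{w}) - \lambda\hat{w}$ against a finite-difference gradient of the objective. The synthetic data, the seed, and the constant-factor convention $J(w) = \tfrac{1}{2}\lVert y - Xw\rVert^2 + \tfrac{\lambda}{2}\lVert w\rVert^2$ (the same convention our GD code below uses) are illustrative choices of this check, not part of the original derivation.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))      # small synthetic design matrix (bias column omitted for brevity)
y = rng.normal(size=(20, 1))
w = rng.normal(size=(4, 1))
lam = 0.5                         # penalty factor

def objective(w):
    # J(w) = 1/2 * ||y - Xw||^2 + lam/2 * ||w||^2
    r = y - X @ w
    return 0.5 * np.sum(r ** 2) + 0.5 * lam * np.sum(w ** 2)

# descent direction from equation (8): the negative gradient of J
d = X.T @ (y - X @ w) - lam * w

# central finite-difference approximation of the gradient of J
eps = 1e-6
grad_fd = np.zeros_like(w)
for i in range(w.shape[0]):
    e = np.zeros_like(w)
    e[i] = eps
    grad_fd[i] = (objective(w + e) - objective(w - e)) / (2 * eps)

print(np.max(np.abs(d + grad_fd)))  # close to zero: (8) indeed follows the negative gradient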
In this experiment, we use the housing_scale dataset from LIBSVM Data to perform linear regression; it contains 506 samples, each with 13 features. The dataset is divided into a training set and a validation set.
- Load the housing_scale dataset and divide it into training set and validation set.
- Initialize the linear model parameters: set all parameters to zero, or initialize them randomly (e.g., from a uniform or normal distribution).
- Select the mean square loss as the loss function and calculate it on the training set with the initial weight vector, denoted as Loss.
- Use the closed-form solution (5) to get the best weight vector.
- Get Loss_train under the training set and Loss_val by validating under the validation set, and output them together with Loss.
import numpy as np
from sklearn.metrics import mean_squared_error
def linear_reg_closed_form(X_train, y_train, X_val, y_val):
    '''Use the closed-form solution to optimize simple linear regression.

    Attention: This function may not work because the inverse of a given
    matrix may not exist.

    Parameters
    ----------
    X_train: array-like of shape = (n_train_samples, n_features)
        Samples of training set.
    y_train: array-like of shape = (n_train_samples, 1)
        Groundtruth labels of training set.
    X_val: array-like of shape = (n_val_samples, n_features)
        Samples of validation set.
    y_val: array-like of shape = (n_val_samples, 1)
        Groundtruth labels of validation set.

    Returns
    -------
    w: array-like of shape = (n_features, 1)
        The weight vector.
    b: float
        The bias of linear regression.
    losses_dict: dict
        A dict containing losses evaluated before and after training.
    '''
    n_features = X_train.shape[1]
    # prepend a column of ones so that the bias is absorbed into the weight vector
    X_train = np.hstack((np.ones(y_train.shape), X_train))
    X_val = np.hstack((np.ones(y_val.shape), X_val))
    # init weight vector
    w = np.zeros((n_features + 1, 1))  # zero-initialized weight vector
    # w = np.random.random((n_features + 1, 1))  # initialize with small random values
    # w = np.random.normal(1, 1, size=(n_features + 1, 1))
    losses_dict = {}
    losses_dict['losses_train_origin'] = mean_squared_error(
        y_true=y_train, y_pred=np.dot(X_train, w))
    losses_dict['losses_val_origin'] = mean_squared_error(
        y_true=y_val, y_pred=np.dot(X_val, w))
    # use the closed-form solution (5) to update w
    # the @ operator is equivalent to np.dot
    try:
        w = np.mat(X_train.T @ X_train).I.getA() @ X_train.T @ y_train
    except Exception as ex:
        print('The inverse of the matrix X_train.T @ X_train doesn\'t exist.')
        print(ex)
    losses_dict['losses_train'] = mean_squared_error(y_train, np.dot(X_train, w))
    losses_dict['losses_val'] = mean_squared_error(y_val, np.dot(X_val, w))
    # split the augmented weight vector back into weights and bias
    w, b = w[1:, :], w[0, 0]
    return w, b, losses_dict
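The function above can be exercised end to end with the housing_scale file. The following driver is only a minimal sketch: the local file name 'housing_scale', the 70/30 split, and the random_state are our assumptions, not values taken from the original experiment.

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split

# load the LIBSVM-format file (path is an assumption)
X, y = load_svmlight_file('housing_scale', n_features=13)
X = X.toarray()
y = y.reshape(-1, 1)

# split into training and validation sets (70/30 is an illustrative choice)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

w, b, losses = linear_reg_closed_form(X_train, y_train, X_val, y_val)
print('closed-form solution for linear regression')
for name, value in losses.items():
    print(f'{name} = {value:.6f}')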
- Load and divide the dataset.
- Initialize the linear model parameters: set all parameters to zero, or initialize them randomly (e.g., from a uniform or normal distribution).
- Choose the mean square loss as the loss function.
- Compute the descent direction D (the negative gradient of the objective function) over the training examples, as in equation (8).
- Update the model: $\hat{w} \leftarrow \hat{w} + \eta D$.
- Get the loss loss_train under the training set and loss_val by validating under the validation set.
- Repeat steps 4 to 6 several times, and use the recorded values of loss_train and loss_val to plot the loss curves.
import numpy as np
from sklearn.metrics import mean_squared_error
def linear_reg_GD(X_train,
                  y_train,
                  X_val,
                  y_val,
                  max_epoch=200,
                  learning_rate=0.01,
                  penalty_factor=0.5):
    '''Use the gradient descent method to solve simple linear regression.

    Parameters
    ----------
    X_train, X_val : array-like of shape = (n_train_samples, n_features) and (n_val_samples, n_features)
        Samples of training set and validation set.
    y_train, y_val : array-like of shape = (n_train_samples, 1) and (n_val_samples, 1)
        Groundtruth labels of training set and validation set.
    max_epoch : int
        The max epoch for training.
    learning_rate : float
        The hyper-parameter that controls the velocity of the gradient descent
        process, also called step_size.
    penalty_factor : float
        The L2 regularization factor of the objective function.

    Returns
    -------
    w: array-like of shape = (n_features, 1)
        The weight vector.
    b: float
        The bias of linear regression.
    losses_dict: dict
        A dict containing the training and validation losses of every epoch.
    '''
    n_features = X_train.shape[1]
    # prepend a column of ones so that the bias is absorbed into the weight vector
    X_train = np.hstack((np.ones(y_train.shape), X_train))
    X_val = np.hstack((np.ones(y_val.shape), X_val))
    # init weight vector
    w = np.zeros((n_features + 1, 1))  # zero-initialized weight vector
    # w = np.random.random((n_features + 1, 1))  # initialize with small random values
    # w = np.random.normal(1, 1, size=(n_features + 1, 1))
    losses_train, losses_val = [], []
    for epoch in range(max_epoch):
        # descent direction of the regularized objective, see equation (8)
        d = -penalty_factor * w + X_train.T @ (y_train - X_train @ w)
        w += learning_rate * d
        # update the learning rate here if necessary
        # learning_rate /= (epoch + 1)  # decays too aggressively
        # learning_rate /= 1 + learning_rate * penalty_factor * (epoch + 1)
        loss_train = mean_squared_error(y_true=y_train, y_pred=np.dot(X_train, w))
        loss_val = mean_squared_error(y_true=y_val, y_pred=np.dot(X_val, w))
        losses_train.append(loss_train)
        losses_val.append(loss_val)
    # split the augmented weight vector back into weights and bias
    w, b = w[1:, :], w[0, 0]
    losses_dict = {'losses_train': losses_train, 'losses_val': losses_val}
    return w, b, losses_dict
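A driver for the gradient descent variant, assuming the data has already been loaded and split as in the closed-form driver above; the learning rate 0.001 used here and the plotting details are illustrative choices rather than the exact values behind the figures in this report.

import matplotlib.pyplot as plt

# X_train, y_train, X_val, y_val prepared as in the closed-form driver above
w, b, losses = linear_reg_GD(X_train, y_train, X_val, y_val,
                             max_epoch=200, learning_rate=0.001,
                             penalty_factor=0.5)

# plot the per-epoch loss curves (step 7 of the procedure)
plt.plot(losses['losses_train'], label='loss_train')
plt.plot(losses['losses_val'], label='loss_val')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()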
For this small dataset with 13 features and 506 samples, the closed-form solution can easily and quickly calculate the desired weight vector and generate output results.
closed-form solution for linear regression
losses_train_origin = 605.93385
losses_val_origin = 23.476533
losses_train = 23.476533
losses_val = 18.176029
By carefully selecting suitable hyper-parameters learning_rate and penalty_factor, the gradient descent method can reduce the mean square loss to a low level within a certain number of epochs.
In the gradient descent method, the learning rate is an important hyper-parameter that controls the velocity of the gradient descent process and has a large impact on convergence. The following two graphs show that:
- A too-large η causes oscillation and may even lead to divergence.
- A too-small η makes convergence too slow.
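This effect can be reproduced on a given split by rerunning linear_reg_GD with several learning rates and comparing the loss curves. The candidate values below are illustrative, not the ones used to produce the original graphs.

import matplotlib.pyplot as plt

# illustrative candidate learning rates; too large may diverge, too small converges slowly
for lr in (0.01, 0.001, 0.0001):
    _, _, losses = linear_reg_GD(X_train, y_train, X_val, y_val,
                                 max_epoch=200, learning_rate=lr)
    plt.plot(losses['losses_val'], label=f'learning_rate={lr}')
plt.xlabel('epoch')
plt.ylabel('validation loss')
plt.yscale('log')  # divergent runs would otherwise dominate the axis
plt.legend()
plt.show()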
We can use an adaptive learning rate:
- Set a larger learning rate at the beginning.
- Use a relatively smaller learning rate in the later epochs.
- Decrease the learning rate over the epochs, for example with an inverse-time decay $\eta_t = \dfrac{\eta_0}{1 + \alpha t}$ (see the sketch below).
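A minimal sketch of such a schedule wired into a gradient descent loop; the inverse-time form and the decay constant are illustrative assumptions, not the exact schedule used in our experiments.

import numpy as np

def gd_with_decay(X, y, max_epoch=200, eta0=0.001, decay=0.01, penalty_factor=0.5):
    """Gradient descent with an inverse-time learning-rate decay (illustrative)."""
    X = np.hstack((np.ones((y.shape[0], 1)), X))  # absorb the bias term
    w = np.zeros((X.shape[1], 1))
    for epoch in range(max_epoch):
        eta = eta0 / (1.0 + decay * epoch)           # larger steps early, smaller steps later
        d = -penalty_factor * w + X.T @ (y - X @ w)  # descent direction, equation (8)
        w += eta * d
    return w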
In this report, we perform a simple linear regression experiment on a small dataset, using both the closed-form solution and the gradient descent method. We then run several simple experiments to evaluate their performance and, in addition, learn to tune the learning rate hyper-parameter.
- Basic Concepts about Machine Learning: Regression and Optimization, Prof. Mingkui Tan
- Linear regression, Wikipedia