Implementing Multiple Linear Regression from scratch with Python
Multiple linear regression is a linear model that estimates the relationship between a quantitative dependent variable and two or more independent variables. The general equation of a multiple linear regression model is as follows: y = β0 + β1x1 + β2x2 + … + βnxn + e, where n represents the number of independent variables, the βi are the coefficients, and e is the error term.
Implementing the model in Python
For the purpose of this demonstration, we will use the House Sales in King County dataset that is available here. To import and explore the data, we will need three Python libraries: Pandas for data manipulation, NumPy for mathematical operations, and Matplotlib for visualizing the data.
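A minimal sketch of the imports and the loading step. The filename and the `load_data` helper are my own assumptions; adjust the path to wherever you saved the download:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def load_data(path):
    """Read the CSV into a DataFrame and print a quick look at it."""
    df = pd.read_csv(path)
    print(df.shape)    # number of rows and columns
    print(df.head())   # first five rows
    return df
```

With the file in the working directory, something like `df = load_data("kc_house_data.csv")` reads it in (the filename here is the usual one for the Kaggle download, not something the article specifies).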
The dataset has one dependent variable (the median price), while the rest of the variables are independent (used for prediction). We therefore assign the median price to the ‘y’ variable, while the rest are assigned to the ‘x’ variable.
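One way to make that split, assuming the target column is named price and the remaining columns are numeric predictors (both assumptions about the loaded DataFrame):

```python
import numpy as np
import pandas as pd

def split_features_target(df, target="price"):
    """Separate the dependent variable y from the predictor matrix x."""
    y = df[target].to_numpy(dtype=float)             # target vector
    x = df.drop(columns=[target]).to_numpy(dtype=float)  # everything else
    return x, y
```

Calling `x, y = split_features_target(df)` then gives the arrays used for training.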
Building the model
Gradient descent is an optimization algorithm for locating a local minimum of a differentiable function. In machine learning, gradient descent is simply used to find the values of a function’s parameters (coefficients) that minimize a cost function as much as possible.
Cost function
We start by understanding the intuition behind the loss function. From simple linear regression, we had the regression equation defined as y = wx + b, where w and b are parameters. From this, we define the loss function, which tells us how bad the fit is: L(y, t) = ½(y − t)², where y − t is the residual. The cost function is defined as the loss function averaged over all the training examples, and is implemented as follows in Python.
As seen from the code above, we vectorize the computation, whereby the values are arranged into arrays so the predictions can be computed in one step as x.dot(w) + b. Vectorization is essential as it replaces explicit Python loops with fast array operations. This cost function is what we would like to minimize in order to get the parameters that best fit our data. We do this by using the method of gradient descent.
Gradient descent
Gradient descent is one of the methods that can be used to minimize the cost function. Through gradient descent, we start with initial weight values (w) that we continuously update until we reach our desired criterion. For example, we may initialize the weights to all zeros and repeatedly adjust them in the direction of steepest descent.
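A sketch of such an update loop; the gradients below come from differentiating the averaged ½(prediction − target)² cost with respect to w and b:

```python
import numpy as np

def gradient_descent(x, y, w, b, learning_rate, epochs):
    """Repeatedly step w and b against the gradient of the cost."""
    m = len(y)
    epoch_list = []
    for epoch in range(epochs):
        residuals = x.dot(w) + b - y          # prediction errors
        w_grad = x.T.dot(residuals) / m       # d(cost)/dw
        b_grad = residuals.mean()             # d(cost)/db
        w = w - learning_rate * w_grad
        b = b - learning_rate * b_grad
        epoch_list.append(np.mean(0.5 * residuals ** 2))  # cost this epoch
    return w, b, epoch_list
```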
Our gradient descent function above has six variables:
- x and y are the predictor and dependent variables that we use to train our model.
- w and b are the initial weights that we set to 0, and progressively update until we reach a minimum.
- learning_rate defines the size of the steps taken to descend to the minimum, with a small learning rate meaning small steps and a large learning rate meaning larger steps.
- epochs is the number of steps that we wish to take to reach the minimum.
For each epoch, the w and b values are updated, and the cost is recorded in the epoch_list. After we have exhausted all the epochs, we have our final values of w and b, which are the final coefficients of our model. We also return the epoch_list, which will be used to see how the cost changed with each epoch.
Now that our model is ready, we fit it to our dataset.
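A runnable sketch of the fitting step, under two stated assumptions: the features are standardized first (raw house-scale columns make plain gradient descent numerically unstable), and a small synthetic x and y stand in for the real arrays so the snippet runs on its own. The update loop repeats the gradient descent update for completeness:

```python
import numpy as np

# Stand-in data; in the real script, x and y come from the dataset split.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x.dot(np.array([3.0, -2.0, 0.5])) + 1.0

# Standardize each feature so all gradients have comparable scale.
x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)

w, b = np.zeros(x_scaled.shape[1]), 0.0
learning_rate, epochs = 0.1, 500
epoch_list = []
for epoch in range(epochs):
    residuals = x_scaled.dot(w) + b - y
    w -= learning_rate * x_scaled.T.dot(residuals) / len(y)
    b -= learning_rate * residuals.mean()
    epoch_list.append(np.mean(0.5 * residuals ** 2))  # record cost per epoch

print(f"final cost: {epoch_list[-1]:.6f}")
```

On real data, the learning rate and the number of epochs usually need some trial and error before the cost settles.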
The graph of the cost against the number of iterations shows us how the cost decreases with each subsequent iteration.
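A sketch of that plot. The decaying curve below is a stand-in for the epoch_list returned by gradient descent, and the Agg backend is selected so the script also runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to show the plot interactively
import matplotlib.pyplot as plt

# Stand-in for the per-epoch costs recorded during training.
epoch_list = 1000.0 * 0.99 ** np.arange(500)

plt.plot(range(len(epoch_list)), epoch_list)
plt.xlabel("Number of iterations (epochs)")
plt.ylabel("Cost")
plt.title("Cost vs. iterations")
plt.savefig("cost_vs_iterations.png")
```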
That’s it! We have built our first multiple linear regression model from scratch! In the next post, we shall make predictions using the model, assess how good it is, and appropriately tune the parameters to make it better.