Implementing linear regression practically: For beginners like me
In my previous post, I shared some theoretical information about linear regression.
Today, we will work through a simple example to implement it practically on the Boston housing price dataset using Google Colab.
We will learn how to import libraries, use datasets, tune parameters, and a lot more. So, let's begin!!!
First, we will import all the libraries (their use will be demonstrated later) and the dataset to work on. Earlier, we could have imported the dataset simply using the following command from the sklearn library:
from sklearn.datasets import load_boston
But since this dataset has been deprecated in newer versions of sklearn, we will install scikit-learn 1.1.3 to import it.
!pip install scikit-learn==1.1.3 #older version that still includes load_boston
import numpy as np
import pandas as pd
Now restart the Colab runtime and then use the following commands to load the dataset and display it.
from sklearn.datasets import load_boston
load_boston()
The output will be in the form of key-value pairs, which give us information like data (the values of the features in our dataset), target (our output feature, which tells us the expected house prices), feature_names (the names of the features), and DESCR (a description of the features and other information about the dataset).
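If you want to poke around those keys yourself, here is a quick sketch (the variable name boston is just for this snippet, and the exact set of keys may vary slightly between sklearn versions):
boston=load_boston()
print(boston.keys()) #e.g. data, target, feature_names, DESCR, ...
print(boston.feature_names) #names of the 13 features (CRIM, ZN, INDUS, ...)
print(boston.DESCR[:300]) #first part of the dataset description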
We will now create a dataframe from this dataset and set the column names equal to the feature names so that it is easier for us to interpret the data.
from sklearn.datasets import load_boston
df=load_boston() #saving the data
dataset=pd.DataFrame(df.data) #creating the dataframe from the data
dataset.columns=df.feature_names #column names = feature names of our data
dataset.head() #displaying first 5 rows of our dataset
We can see all our independent features in our above output.
We can store our independent features and the dependent feature (our target, the house price) separately:
x=dataset
y=df.target
We will split our dataset into training and testing data. For this example, we will treat 30% of our dataset as testing data (represented by test_size).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30,random_state=42)
Now, our next step is to apply linear regression to the above dataset. Before doing so, we will standardize our dataset. To do this, we will import StandardScaler from the sklearn library, apply scaler.fit_transform to our training data, and scaler.transform to our testing data.
The goal of scaling (or standardizing) the features is to ensure that all features have the same scale and variance, so that no feature dominates the others and the algorithm can learn from all of them equally. This is particularly important when the features have different units or ranges, as they do in this dataset.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
In the above code, we have used fit_transform for the training data because we want the scaler to learn the scaling parameters (the mean and standard deviation of each feature) from this data. We need this because we are performing the scaling for the first time, so the parameters have to be computed here.
For the testing data, we have used transform instead of fit_transform, because we want the testing data to be scaled with the same parameters as the training data.
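As a quick check (a minimal sketch, relying on StandardScaler's fitted attributes), the learned parameters live on the scaler object and are reused, unchanged, for the test set:
print(scaler.mean_) #per-feature means learned by fit_transform from the training data
print(scaler.scale_) #per-feature standard deviations
#transform simply applies these same statistics to any new data:
#scaled_value = (value - scaler.mean_) / scaler.scale_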
To apply linear regression to this transformed data, import LinearRegression from sklearn.linear_model.
#Import linear regression
from sklearn.linear_model import LinearRegression
regression=LinearRegression()
regression.fit(X_train,y_train) #training our linear regression model
Next, we will apply cross-validation to our model and calculate the final mean squared error (MSE).
Here's a step-by-step breakdown of the concept:
Cross-validation: Cross-validation is a technique used to evaluate the performance of a machine learning model on an independent dataset. In k-fold cross-validation, the dataset is split into k equal folds. The model is trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, with each fold being used as the validation set exactly once. The performance of the model is then averaged across the k folds.
Negative MSE: MSE is a common metric used to evaluate the performance of a regression model. It measures the average squared difference between the predicted and actual values. sklearn's scorers follow the convention that a higher score is always better, so it reports the negative of the MSE. This means that a negative MSE closer to zero indicates better performance.
Cross-validated negative MSE: Cross-validated negative MSE is obtained by performing k-fold cross-validation and computing the mean of the negative MSE scores obtained at each fold. This gives us a single metric that represents the performance of the model on the entire dataset.
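If you are curious what the cross_val_score call below does under the hood, here is a minimal hand-rolled sketch of 5-fold cross-validation using KFold (the variable names are my own, just for illustration):
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
kf=KFold(n_splits=5)
fold_mse=[]
for train_idx, val_idx in kf.split(X_train):
    fold_model=LinearRegression() #train on 4 of the 5 folds
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    preds=fold_model.predict(X_train[val_idx]) #evaluate on the held-out fold
    fold_mse.append(mean_squared_error(y_train[val_idx], preds))
np.mean(fold_mse) #average MSE across the 5 folds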
#cross_validation
from sklearn.model_selection import cross_val_score
mse=cross_val_score(regression,X_train,y_train,scoring="neg_mean_squared_error",cv=5)
#Mean of the cross-validated scores (negative MSE). The closer to zero, the better the prediction
np.mean(mse)
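Because the scores come back as negative MSE, flipping the sign (a small addition of mine, not part of the original flow) gives the usual positive error, and taking the square root gives the RMSE in the same units as the house prices:
-np.mean(mse) #mean squared error across the folds
np.sqrt(-np.mean(mse)) #root mean squared error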
Now it's time for the fun part!!! We will predict the values of the target variable for the testing data.
#prediction
reg_pred=regression.predict(X_test)
reg_pred
We can compare the predictions to the expected values and then visualize the difference (the residuals) using a graph.
#Now compare it to truth value
import seaborn as sns
sns.displot(reg_pred-y_test,kind='kde')
Our residuals are centered around zero with low spread (variance), which suggests our model is well trained. We can also calculate other metrics, like the R2 score, for our linear regression model.
from sklearn.metrics import r2_score
score=r2_score(y_test,reg_pred) #r2_score expects (y_true, y_pred)
score
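As an optional extra (not required for the walkthrough), sklearn also provides mean absolute error and mean squared error, which we can compute on the same test predictions:
from sklearn.metrics import mean_absolute_error, mean_squared_error
print(mean_absolute_error(y_test,reg_pred)) #average absolute error, in price units
print(mean_squared_error(y_test,reg_pred)) #average squared error on the test set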
That's it. This is the basic implementation of a linear regression model. I have not dived deep into the theoretical concepts of data transformation and cross-validation in this blog, because I wanted to focus more on the coding part of it without making it overcomplicated.
Congrats on making it all the way to the end!! Feel free to share any suggestions in the comments