Multicollinearity: & why it is a problem

Multicollinearity: & why it is a problem

Multicollinearity is a commonly heard term in data science. Today we will see:

  • What is MultiCollinearity?

  • Why does it happen?

  • Why is it a problem?

  • How can we remove it?

So our first question is what is Multicollinearity?

It is a situation where the independent features are co-related to each other. That is, increasing the value of one feature also increases the value of the another feature (positive co-relation) or increasing the value of one feature decreases the value of another feature (negative co-relation). It is generally observed in regression models (we will see why)

Why does multicollinearity happen?

There Are two common causes: first, the selected feature is a component of another feature. For example, if we tack Advertisment_cost and TV_Advertisement cost as our input features, then there will be stronger collinearity because TV_advertisement_cost is the component of Advertisement_Cost.

Second: structure of our dataset. If we have a feature like Production_cost and (production_cost)^2, then also we get a strongly co-related feature.

Why is it a problem?

In regression models, our goal is to find the effect of each independent feature on our dependent feature. we give the value of the output feature in the following form:

Y = m1X1 + m2X2 + m3X3........

here m1 represents the no of units by which Y will change if we change X1 alone keeping all other features unchanged. But if X1 and X2 are strongly co-related, we cannot make this observation. It is because changing the value of X1 will also change the value of X2 and vice versa. This will spoil the coefficients which we obtain after training our model and decrease its accuracy.

How to remove it?

We generally construct a correlation matrix,then find out which features are strongly co-related. Suppose X1 and X2 are strongly correlated, then we remove one of them depending upon how much they impact our dependent variable. Another method is to use a regression model like Lasso regression or Ridge regression which penalizes us for using duplicate information.

That's it. These were the basics of collinearity in machine learning models. I hope this was helpful and provided an understanding of this concept.