What is Multicollinearity, how to test it in R and how to handle the problems caused by it?

hat is Multicollinearity?

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if present in the data the statistical inferences made about the data may not be reliable.

There are certain reasons why multicollinearity occurs:

  1. It is caused by the inclusion of a variable which is computed from other variables in the data set. For ex: in the dataset of physical attributes of a person, Weight, Height, BMI are three of the variables in the dataset, where BMI=Weight/Height. So, it is a derived variable and is positively correlated with weight. So, we can say multicollinearity occurs between Weight and BMI.
  2. Multicollinearity can also result from the repetition of the same kind of variable.
  3. Generally occurs when the variables are highly correlated to each other.

How to test Multicollinearity in R? 

For a given predictor (p), multicollinearity can assessed by computing a score called the variance inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity..

R code:

ind = sample(2, nrow(winequality), replace = TRUE, prob=c(0.7, 0.3))
trainset = winequality[ind == 1,]
testset = winequality[ind == 2,]
model=lm(quality~., data=trainset)
predictions <- model %>% predict(testset)
RMSE = RMSE(predictions, testset$quality),
R2 = R2(predictions, testset$quality)

Output: RMSE R2
1 0.6304587 0.3623385
> car::vif(model)
fixed.acidity volatile.acidity citric.acid residual.sugar
7.842042 1.835592 3.169716 1.670375
chlorides free.sulfur.dioxide total.sulfur.dioxide density
1.410560 2.073626 2.235259 6.422401
pH sulphates alcohol
3.268406 1.417728 2.972688

So, here, fixed.acidity and density have high multicollinearity.

Problems with Multicollinearity:

Multicollinearity causes the following two basic types of problems:

  • The coefficientestimates can swing wildly making those become very sensitive to small changes in the model.
  • Multicollinearity reduces the precision of the estimate coefficients, which weakens the statistical powerof your regression model. We cannot trust on the p-values to identify independent variables that are statistically significant.
  • The partial regression coefficient due to multicollinearity may not be estimated precisely. The standard errors are likely to be high.
  • Multicollinearity results in a change in the signs as well as in the magnitudes of the partial regression coefficients from one sample to another sample.
  • Multicollinearity makes it tedious to assess the relative importance of the independent variables in explaining the variation caused by the dependent variable.

In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the statistics tend to be very small. It becomes difficult to reject the null hypothesis of any study when multicollinearity is present in the data under study.

Types of Multicollinearity:

There are two basic kinds of multicollinearity:

  1. Structural multicollinearity: This type occurs when we create a model term using other terms. In other words, it’s a byproduct of the model that we specify rather than being present in the data itself. For example, in the dataset of physical attributes of a person, Weight, Height, BMI are three of the variables in the dataset, where BMI=Weight/Height. So, it is a derived variable and is positively correlated with weight. So, we can say multicollinearity occurs between Weight and BMI.
  • Data multicollinearity: This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.

Fixing Multicollinearity:

Multicollinearity makes it hard to interpret the model coefficients, and it reduces the power of the model to identify independent variables that are statistically significant. These are definitely serious problems.

The need to reduce multicollinearity depends on its severity and our primary goal for the regression model by keeping the following three points in mind:

  1. The severity of the problems increases with the degree of the multicollinearity. Therefore, if we have only moderate multicollinearity, you may not need to resolve it.
  2. Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that we are particularly interested in, we may not need to resolve it. Suppose our model contains the experimental variables of interest and some control variables. If high multicollinearity exists for the control variables but not the experimental variables, then we can interpret the experimental variables without problems.
  3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If our primary goal is to make predictions, and we don’t need to understand the role of each independent variable, we don’t need to reduce severe multicollinearity.

How to Deal with Multicollinearity

But, what if you have severe multicollinearity in our data and you find that we must deal with it? What do we do then? Even though, there are a variety of methods that we can try, but each one has some drawbacks. We have need to use our subject-area knowledge and factor in the goals of our study to pick the solution that provides the best mix of advantages and disadvantages.

The potential solutions include the following:

  • Remove some of the highly correlated independent variables.
  • Linearly combine the independent variables, such as adding them together.
  • Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s