Detecting Multicollinearity with VIF - Python

Last Updated : 22 May, 2025

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, which leads to unstable coefficient estimates and reduces model reliability. This makes it difficult to identify the individual effect of each predictor on the dependent variable. The Variance Inflation Factor (VIF) is used to detect multicollinearity in regression analysis. In this article, we'll look at what VIF is and how to use it in Python to identify multicollinearity.

Mathematics Behind the Variance Inflation Factor (VIF) Formula

The Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated by multicollinearity among the predictor variables. It does this by regressing each independent variable against all the other independent variables in the model and computing the coefficient of determination, R^2, of that auxiliary regression.

The formula for VIF is:

VIF=\frac{1}{1-R^2}

where R-squared (R^2) is the coefficient of determination of that auxiliary regression; it represents how well one feature can be predicted from the others, with values ranging between 0 and 1. A higher R^2 means a stronger relationship with the other variables and therefore a higher VIF.

If R-squared is close to 1, multicollinearity is high because the other variables almost entirely explain that variable.

As the formula shows, the greater the R-squared, the greater the VIF, so a larger VIF denotes stronger correlation with the other predictors. A VIF of 1 means a feature is uncorrelated with the others, while a value above 5 is generally taken as a sign of high multicollinearity.
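
For example, if a feature's R^2 against the remaining predictors is 0.8, then

VIF = \frac{1}{1-0.8} = 5

which means the variance of that feature's coefficient estimate is five times larger than it would be if the feature were completely uncorrelated with the other predictors.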

By understanding the VIF formula, we can accurately detect multicollinearity in our regression models and take the necessary steps to address it.

Multicollinearity Detection using VIF in Python

To detect multicollinearity in regression analysis, we can compute the Variance Inflation Factor (VIF) with the statsmodels library. Its variance_inflation_factor function calculates the VIF value for one feature at a time, helping us identify multicollinearity across the dataset.

Syntax: statsmodels.stats.outliers_influence.variance_inflation_factor(exog, exog_idx)

Parameters:

  • exog: Array or DataFrame of independent variables (features).
  • exog_idx: Index of the feature for which VIF is calculated.
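
To see the call signature in isolation before applying it to a real dataset, here is a small self-contained sketch; the array below is synthetic and made up purely for illustration:

Python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix: the third column is almost a linear combination of the first two
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
X_demo = np.column_stack([x1, x2, x3])

# Compute the VIF for every column of the design matrix
for i in range(X_demo.shape[1]):
    print(f"Column {i}: VIF = {variance_inflation_factor(X_demo, i):.2f}")

Because the third column is nearly a linear combination of the first two, all three columns come out with large VIF values.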

Consider a dataset of 500 individuals containing their gender, height, weight and Body Mass Index (BMI). Here, the Index column (BMI) is the dependent variable, and Gender, Height and Weight are the independent variables. You can download the dataset from here. We will be using the Pandas library for the implementation.

Python
import pandas as pd

# Load the BMI dataset (the path assumes the CSV was uploaded to the working environment)
data = pd.read_csv('/content/BMI.csv')

# Inspect the first few rows
print(data.head())

Output:

[Output image: first five rows of the dataset showing the Gender, Height, Weight and Index columns]

Here we use the following approach:

  • Converting categorical variables like Gender into numeric form.
  • Passing each feature index to variance_inflation_factor() to calculate the VIF.
  • Storing the results in a Pandas DataFrame for easy interpretation.
Python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Encode the categorical Gender column as numeric
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Independent variables only
X = data[['Gender', 'Height', 'Weight']]

# Calculate the VIF for each feature and collect the results in a DataFrame
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(len(X.columns))]
print(vif_data)

Output:

[Output image: VIF values for each feature; Height and Weight show high VIF values]

The high VIF values for Height and Weight show strong multicollinearity between these two variables, which makes sense because a person's height influences their weight. Detecting such relationships helps us understand and improve the stability of our regression models.
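
To connect this result back to the formula, the VIF of a single feature can be reproduced by hand: regress that feature on the remaining predictors and plug the resulting R^2 into 1/(1 - R^2). A minimal sketch, assuming the X DataFrame built above:

Python
import statsmodels.api as sm

# Auxiliary regression of Weight on the other predictors.
# variance_inflation_factor() regresses the column on the others as given
# (no constant is added), so we do the same here.
aux_model = sm.OLS(X['Weight'], X[['Gender', 'Height']]).fit()
vif_weight = 1.0 / (1.0 - aux_model.rsquared)
print(f"VIF for Weight from the auxiliary R^2: {vif_weight:.2f}")

The printed value matches the Weight entry produced by variance_inflation_factor above.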

What to do if VIF is High?

Here are several effective strategies to address high VIF values and improve model performance:

1. Removing Highly Correlated Features

  • Use a correlation matrix to identify features with strong correlations, typically above 0.7 or 0.8.
  • Drop one of the correlated features, usually the one that is less important or has the higher VIF. Removing such features reduces redundancy and improves model interpretability and stability (see the sketch after this list).
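
A minimal sketch of this step, assuming the X DataFrame from the example above (the choice to drop Weight is illustrative, not a rule):

Python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Absolute pairwise correlations between the predictors (assumes X from above)
corr = X.corr().abs()
print(corr)

# Drop one feature from a highly correlated pair, e.g. keep Height and drop Weight
X_reduced = X.drop(columns=['Weight'])

# Recompute the VIF on the reduced feature set
vif_reduced = pd.DataFrame()
vif_reduced["feature"] = X_reduced.columns
vif_reduced["VIF"] = [variance_inflation_factor(X_reduced.values, i)
                      for i in range(len(X_reduced.columns))]
print(vif_reduced)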

2. Combining Variables or Using Dimensionality Reduction Techniques

  • Create new variables by combining correlated features, for example calculating Body Mass Index (BMI) from height and weight.
  • Apply Principal Component Analysis (PCA) to transform correlated variables into uncorrelated components. These components can replace the original features, removing multicollinearity while preserving most of the data's variance (a sketch follows this list).
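
A hedged sketch of the PCA route using scikit-learn, assuming the same X predictors as above; the 0.95 variance threshold is an illustrative choice, and standardizing first is the usual practice because PCA is scale-sensitive:

Python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the predictors so each feature contributes on a comparable scale (assumes X from above)
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain roughly 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
# The components are uncorrelated by construction, so multicollinearity is removed;
# they can be used as regression inputs in place of the original features.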

Understanding and correcting multicollinearity in regression is important for improving model accuracy in fields like econometrics, where relationships between variables play an important role.

