Detecting Multicollinearity with VIF - Python
Last Updated: 22 May, 2025
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, which leads to unstable coefficient estimates and reduces model reliability. This makes it difficult to identify the individual effect of each predictor on the dependent variable. The Variance Inflation Factor (VIF) is a standard tool for detecting multicollinearity in regression analysis. In this article, we'll look at what VIF is and how to use it in Python to identify multicollinearity.
Variance Inflation Factor (VIF) measures the increase in the variance of a regression coefficient caused by multicollinearity among the predictor variables. It does this by regressing each independent variable against all the other independent variables in the model and computing the coefficient of determination, R^2, of that regression.
The formula for VIF is:
VIF=\frac{1}{1-R^2}
where R-squared (R^2) is the coefficient of determination of that regression, which represents how well one feature can be predicted from the others, with values ranging between 0 and 1. A higher R^2 means a stronger relationship with the other variables, which leads to a higher VIF.
If R-squared is close to 1, multicollinearity is high because the other variables almost entirely explain that variable.
As we can see from the formula, the greater the value of R-squared, the greater the VIF, so a higher VIF indicates stronger correlation with the other predictors. As a rule of thumb, a VIF above 5 indicates high multicollinearity (some practitioners use 10 as the cutoff).
By understanding the VIF formula we can accurately detect multicollinearity in our regression models and take necessary steps to address it.
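To make the formula concrete, here is a minimal sketch that computes a VIF by hand on small synthetic data (not the BMI dataset used later): each column is regressed on the others with statsmodels OLS and 1/(1 - R^2) is applied to the resulting R-squared. The column names x1, x2, x3 and the coefficients are purely illustrative.
Python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data: x2 is deliberately built from x1, so the two are highly correlated
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x3": rng.normal(size=200)})
df["x2"] = 0.9 * df["x1"] + 0.1 * rng.normal(size=200)

def vif_by_hand(df, feature):
    """Regress one column on all the others and apply VIF = 1 / (1 - R^2)."""
    y = df[feature]
    others = sm.add_constant(df.drop(columns=[feature]))
    r_squared = sm.OLS(y, others).fit().rsquared
    return 1.0 / (1.0 - r_squared)

# x1 and x2 get large VIFs, while x3 stays close to 1
for col in df.columns:
    print(col, round(vif_by_hand(df, col), 2))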
Multicollinearity Detection using VIF in Python
To detect multicollinearity in regression analysis, we can use the variance_inflation_factor() function from the statsmodels library. This function calculates the VIF value for a given feature of the dataset, helping us identify multicollinearity.
Syntax : statsmodels.stats.outliers_influence.variance_inflation_factor(exog, exog_idx)
Parameters:
- exog: Array or DataFrame of independent variables (features).
- exog_idx: Index of the feature for which VIF is calculated.
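Before applying it to a real dataset, note that a single call returns the VIF of one column as a plain float. A minimal sketch on a small synthetic array (not the BMI data used below), where the second column is deliberately built to track the first:
Python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix: column 1 is almost a copy of column 0
rng = np.random.default_rng(42)
a = rng.normal(size=100)
exog = np.column_stack([a, a + 0.05 * rng.normal(size=100), rng.normal(size=100)])

# VIF of the first column (exog_idx=0); it is large because column 1 nearly duplicates it
print(variance_inflation_factor(exog, 0))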
Consider a dataset of 500 individuals containing their gender, height, weight and Body Mass Index (BMI). Here, Index is the dependent variable and Gender, Height and Weight are the independent variables. You can download the dataset from here. We will use the Pandas library to load and work with it.
Python
import pandas as pd

# Load the BMI dataset and preview the first few rows
data = pd.read_csv('/content/BMI.csv')
print(data.head())
Output:
The head() output shows the first five rows of the dataset. Here we are using the approach below:
- Converting categorical variables like Gender into numeric form.
- Passing each feature index to variance_inflation_factor() to calculate the VIF.
- Storing the results in a Pandas DataFrame for easy interpretation.
Python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Encode the categorical Gender column as numeric values
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Independent variables
X = data[['Gender', 'Height', 'Weight']]

# Compute the VIF for each feature and store the results in a DataFrame
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(len(X.columns))]
print(vif_data)
Output:
The high VIF values for Height and Weight show strong multicollinearity between these two variables, which makes sense because a person’s height influences their weight. Detecting such relationships helps us understand and improve the stability of our regression models.
What to do if VIF is High?
Here are several effective strategies to address high VIF values and improve model performance:
1. Removing Highly Correlated Features
- Use a correlation matrix to identify features with strong correlations, typically above 0.7 or 0.8.
- Drop one of the correlated features, usually the one that is less important or has the higher VIF; a sketch of this workflow follows this list. Removing such features reduces redundancy and improves model interpretability and stability.
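A minimal sketch of this workflow, continuing from the data and X objects created in the example above and dropping Height purely for illustration (which column to drop is a judgment call based on the domain and the model):
Python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Inspect pairwise correlations among the predictors
print(X.corr())

# Drop one of the highly correlated columns (Height here, for illustration only)
X_reduced = X.drop(columns=['Height'])

# Recompute the VIFs on the reduced feature set
vif_reduced = pd.DataFrame()
vif_reduced["feature"] = X_reduced.columns
vif_reduced["VIF"] = [variance_inflation_factor(X_reduced.values, i)
                      for i in range(len(X_reduced.columns))]
print(vif_reduced)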
2. Combining Variables or Using Dimensionality Reduction Techniques
- Create new variables by combining correlated features like calculating Body Mass Index (BMI) from height and weight.
- Apply Principal Component Analysis (PCA) to transform correlated variables into uncorrelated components. These components can replace the original features, which removes multicollinearity while preserving most of the data’s variance; both of these ideas are sketched below.
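The sketch below illustrates both options, again continuing from the data and X objects used earlier. The derived BMI_derived column and the use of scikit-learn's StandardScaler and PCA are illustrative assumptions: the BMI formula assumes Height is recorded in centimetres and Weight in kilograms, and the 0.95 variance threshold is an arbitrary choice.
Python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Option 1: combine the correlated columns into a single derived feature
# (assumes Height is in centimetres and Weight in kilograms)
data['BMI_derived'] = data['Weight'] / (data['Height'] / 100) ** 2
X_combined = data[['Gender', 'BMI_derived']]

# Option 2: replace the predictors with uncorrelated principal components
X_scaled = StandardScaler().fit_transform(X)   # scale so no single feature dominates
pca = PCA(n_components=0.95)                   # keep components explaining ~95% of variance
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratios:", pca.explained_variance_ratio_)
print("Shape of the PCA-transformed features:", X_pca.shape)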
Understanding and correcting multicollinearity in regression is important for improving model accuracy in fields like econometrics, where relationships between variables play an important role.