
How can we Handle Multicollinearity in Linear Regression?

Last Updated : 23 Jul, 2025

Multicollinearity occurs when two or more independent variables in a linear regression model are highly correlated. To address multicollinearity, here are a few simple strategies:

  1. Increase the sample size: more data reduces the variance of the coefficient estimates, making it easier to separate the effects of different predictors.
  2. Remove highly correlated predictors: use the Variance Inflation Factor (VIF) to identify predictors that are strongly correlated with the others. If a predictor's VIF is too high, consider removing it to improve model stability.
  3. Combine correlated variables into a single, more meaningful predictor. This can be done with techniques like Principal Component Analysis (PCA) or factor analysis, which reduce redundancy by creating new variables that represent the combined information.

These methods help simplify the model and make sure it provides reliable and interpretable results.

[Figure: correlation matrix and VIF plot for a demo dataset in which X1, X2, and X3 are highly correlated]
  • High Correlation: The correlation matrix clearly shows that variables X1, X2, and X3 are highly correlated with each other. This is evident from the near-perfect correlation coefficients close to 1.
  • Impact on Regression: Multicollinearity can lead to unstable and unreliable regression models. The coefficients of the correlated variables become sensitive to small changes in the data, making it difficult to interpret their individual effects.
  • Variance Inflation Factor (VIF): The VIF plot confirms the presence of multicollinearity. The high VIF values for X1, X2, and X3 indicate that these variables are highly correlated with each other and with other predictors in the model.

Detecting Multicollinearity with an Example

Multicollinearity can be difficult to spot from the raw numbers alone, but examining the relationships between the independent variables provides valuable insight. Common ways to detect multicollinearity are:

  • Variance Inflation Factor (VIF): The VIF is a common and effective way to detect multicollinearity. It measures how much the variance of an estimated regression coefficient is inflated by correlation among the predictors: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. Here's how to interpret VIF values:
    • VIF = 1: Indicates no correlation.
    • 1 < VIF < 5: Suggests moderate correlation.
    • VIF > 5: Indicates high correlation.
  • Correlation Matrix: A correlation matrix can visually show the relationships between all pairs of independent variables. If the correlation between two variables is high (typically above 0.8), it may indicate multicollinearity. However, this method can miss multicollinearity involving three or more variables.
  • Scatterplot Matrix: A scatterplot matrix provides a visual representation of the relationships between all pairs of independent variables. If one of the scatterplots shows a strong linear relationship between two predictors, it could indicate multicollinearity (a minimal sketch follows this list).
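
A quick way to produce such a scatterplot matrix is with pandas. The snippet below is a minimal sketch on a small synthetic dataset (the variables x1, x2, and x3 are made up for illustration): x2 is constructed as a noisy multiple of x1, so their panel shows a near-straight line, which is the visual signature of multicollinearity.

Python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Synthetic predictors: x2 is deliberately a noisy copy of x1, so the
# x1-vs-x2 panel shows a strong linear pattern, while x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 200)
X = pd.DataFrame({
    'x1': x1,
    'x2': 2 * x1 + rng.normal(0, 0.5, 200),  # highly correlated with x1
    'x3': rng.uniform(0, 10, 200),           # uncorrelated with x1 and x2
})

scatter_matrix(X, figsize=(7, 7), diagonal='hist')
plt.suptitle("Scatterplot matrix of the predictors")
plt.show()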

Consider a scenario where a fitness goods manufacturer is analyzing the impact of several variables on sales: price, advertising, location in the store, and total store volume. A VIF greater than 5 (or 10 in some cases) suggests problematic multicollinearity.

Python
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(0)  # For reproducibility
data = {
    'price': np.random.uniform(10, 50, 100),
    'advertising': np.random.uniform(100, 500, 100),
    'location_in_store': np.random.uniform(1, 10, 100),
    'total_store_volume': np.random.uniform(1000, 5000, 100),
    'sales': np.random.uniform(100, 1000, 100)  # Dependent variable
}

df = pd.DataFrame(data)

def calc_vif(X):
    # Compute the VIF for every column of the predictor DataFrame X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

# Select independent variables
X = df[['price', 'advertising', 'location_in_store', 'total_store_volume']]
vif_df = calc_vif(X)
print(vif_df)

Output:

            variables       VIF
0               price  5.555893
1         advertising  5.609033
2   location_in_store  3.974464
3  total_store_volume  5.428867
  • Price (VIF = 5.56), Advertising (VIF = 5.61), Total Store Volume (VIF = 5.43): These variables show high multicollinearity, indicated by VIF values above 5, which suggests their variance is inflated by correlations with the other predictors and makes their individual effects harder to isolate. We can try removing one or more of these variables, or combining them using techniques like PCA.
  • Location in Store (VIF = 3.97): This suggests moderate multicollinearity. While it’s not a major issue, it’s still worth monitoring. We can keep this variable but watch for potential correlation with other predictors.

To determine which predictors are highly correlated with one another, calculate the correlation matrix for the predictors and examine the pairwise relationships, as in the snippet below.
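
A minimal sketch, continuing with the X DataFrame of predictors from the VIF example above (with this synthetic data no pair is expected to cross the 0.8 threshold, but on real data this highlights the offending pairs):

Python
# X and np come from the VIF example above
corr_matrix = X.corr()
print(corr_matrix.round(2))

# Flag pairs whose absolute correlation exceeds a chosen threshold (e.g. 0.8)
threshold = 0.8
high_corr = (
    corr_matrix.where(~np.eye(len(corr_matrix), dtype=bool))  # ignore the diagonal
    .stack()
    .loc[lambda s: s.abs() > threshold]
)
print(high_corr)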

Solutions to Multicollinearity in Linear Regression

When multicollinearity is detected in your regression model, there are several ways to address it. Some common solutions:

  • Remove Highly Correlated Predictors: If two or more variables are highly correlated, consider removing one of them. This helps reduce redundancy and simplifies the model without losing too much information. Example: If price and advertising have high VIFs and are correlated, you might choose to remove one or combine their effects into a single variable.
  • Combine Variables (Feature Engineering):
    • Principal Component Analysis (PCA): Reduces the dimensionality of the dataset by combining correlated predictors into a smaller number of uncorrelated components. This retains most of the explanatory power while eliminating multicollinearity.
    • Factor Analysis: Another dimensionality reduction technique that combines correlated variables into latent factors.
  • Use Regularization (see the sketch after this list):
    • Ridge Regression (L2 regularization): Adds a penalty on the size of the coefficients, reducing the impact of correlated predictors by shrinking their values.
    • Lasso Regression (L1 regularization): Shrinks coefficients and forces some of them to exactly zero, effectively removing less important predictors. This helps select the most relevant variables and reduces multicollinearity.
  • Increase the Sample Size: Increasing the sample size can reduce the variance of the coefficient estimates and improve the reliability of the regression model. More data can help minimize the effect of multicollinearity by providing more variation among the predictors.
  • Center the Data: Subtracting the mean from each predictor can reduce multicollinearity in some cases, especially when the model includes interaction or polynomial terms built from those predictors. Centering does not always eliminate the problem, but it can make the coefficients easier to interpret.
  • Stepwise Selection (Forward or Backward): Stepwise selection can help you identify the most important features while potentially dropping variables that cause high multicollinearity. This is done iteratively by adding or removing features based on their contribution to the model.
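
To illustrate the regularization option, here is a minimal sketch using scikit-learn on the same synthetic fitness-goods data. The alpha values are arbitrary choices for illustration; in practice they would be tuned, for example with RidgeCV or LassoCV.

Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Recreate the synthetic fitness-goods data used in the examples above
np.random.seed(0)
df = pd.DataFrame({
    'price': np.random.uniform(10, 50, 100),
    'advertising': np.random.uniform(100, 500, 100),
    'location_in_store': np.random.uniform(1, 10, 100),
    'total_store_volume': np.random.uniform(1000, 5000, 100),
    'sales': np.random.uniform(100, 1000, 100),
})
X = df[['price', 'advertising', 'location_in_store', 'total_store_volume']]
y = df['sales']

# Standardize the predictors, then fit plain OLS, Ridge (L2) and Lasso (L1)
models = {
    'OLS': make_pipeline(StandardScaler(), LinearRegression()),
    'Ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'Lasso': make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model[-1].coef_, 3))  # compare how the coefficients shrink

With strongly correlated predictors, the ridge and lasso coefficients are typically more stable than the OLS ones, and lasso may drive some of them to zero.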

To address the multicollinearity in this case, where price, advertising, and total_store_volume have high VIFs, one option is to remove one or more of the highly correlated variables and then recalculate the VIF to see the effect; a minimal sketch of this removal step follows. However, if all three variables are important for the analysis, we can keep them and reduce the multicollinearity by applying PCA instead, as in the full example after the sketch.
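
A minimal sketch of the removal approach, reusing df and calc_vif from the VIF example above (total_store_volume is dropped purely for illustration):

Python
# Drop one of the high-VIF predictors and recompute the VIFs
X_dropped = df[['price', 'advertising', 'location_in_store']]
print(calc_vif(X_dropped))

The full PCA-based approach is shown next.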

Python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The dataset is the same as in the code above
np.random.seed(0)  # For reproducibility
data = {
    'price': np.random.uniform(10, 50, 100),
    'advertising': np.random.uniform(100, 500, 100),
    'location_in_store': np.random.uniform(1, 10, 100),
    'total_store_volume': np.random.uniform(1000, 5000, 100),
    'sales': np.random.uniform(100, 1000, 100)  # Dependent variable
}
df = pd.DataFrame(data)
def calc_vif(X):
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif
# Select independent variables (copy so a new column can be added later
# without modifying df or triggering pandas' SettingWithCopyWarning)
X = df[['price', 'advertising', 'location_in_store', 'total_store_volume']].copy()

# Calculate VIF before removal
vif_before = calc_vif(X)
print("VIF before removal:")
print(vif_before)

# Visualizing the correlation matrix before removal
corr_before = X.corr()

# Apply PCA to the highly correlated features: price, advertising, and total_store_volume
# (these features are on very different scales; in practice you would often standardize
# them before PCA, but the original approach is kept here)
pca = PCA(n_components=1)  # combine them into a single component
X_combined = X[['price', 'advertising', 'total_store_volume']]
X_pca = pca.fit_transform(X_combined)

# Add the PCA result as a new feature to the dataset
X['combined_pca'] = X_pca.ravel()

# Now, we will drop the original features and use the PCA component
X_reduced = X.drop(columns=['price', 'advertising', 'total_store_volume'])
corr_after_pca = X_reduced.corr()
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
sns.heatmap(corr_before, annot=True, cmap="coolwarm", fmt=".2f", vmin=-1, vmax=1, ax=ax[0])
ax[0].set_title("Correlation Matrix Before PCA")

sns.heatmap(corr_after_pca, annot=True, cmap="coolwarm", fmt=".2f", vmin=-1, vmax=1, ax=ax[1])
ax[1].set_title("Correlation Matrix After PCA")

plt.tight_layout()
plt.show()

# Calculate VIF after PCA (for the reduced set)
vif_after_pca = calc_vif(X_reduced)
print("\nVIF after PCA:")
print(vif_after_pca)

Output:

VIF before removal:
            variables       VIF
0               price  5.555893
1         advertising  5.609033
2   location_in_store  3.974464
3  total_store_volume  5.428867
[Figure: correlation matrix heatmaps of the predictors before and after applying PCA]

PCA is an effective method for addressing multicollinearity. In this case, PCA has reduced the multicollinearity among the variables price, advertising, and total_store_volume. This is important because multicollinearity can lead to unstable and unreliable regression models.
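
As a possible next step, the reduced feature set can be used to fit the regression itself. A minimal sketch with statsmodels, reusing X_reduced and df from the example above:

Python
import statsmodels.api as sm

# Fit an OLS regression of sales on the de-correlated predictors
X_model = sm.add_constant(X_reduced)  # add an intercept term
model = sm.OLS(df['sales'], X_model).fit()
print(model.summary())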

