Understanding Feature Importance in Logistic Regression Models
Logistic regression is a fundamental classification algorithm in machine learning and statistics. It is widely used for binary classification tasks and can be extended to multiclass problems. Understanding which features influence a logistic regression model's predictions is crucial for interpretability and for improving the model's performance.
This article delves into various methods to determine feature importance in logistic regression, providing a comprehensive guide for data scientists and machine learning practitioners.
Overview of Logistic Regression
Logistic regression is a statistical technique for binary classification problems, where the categorical outcome variable has two possible values (e.g., yes/no, true/false, 0/1). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that a given input belongs to a specific class.
The logistic regression model converts the linear combination of input features into a probability value between 0 and 1 by using the logistic (or sigmoid) function.
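As a minimal sketch (the coefficient and feature values below are purely illustrative), the model first computes a linear score from the inputs and then passes it through the sigmoid to obtain a probability:
Python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

# Linear combination of inputs: z = b0 + b1*x1 + b2*x2 (illustrative values)
z = 0.5 + 1.2 * 0.8 + (-0.7) * 1.5
print(f"Predicted probability of the positive class: {sigmoid(z):.3f}")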
Next, we will delve into the methods used to determine the importance of features in a logistic regression model.
Feature Importance Techniques for Logistic Models
1. Coefficient Magnitude
The simplest way to assess feature importance in logistic regression is to look at the magnitude of the coefficients (β). Features with larger absolute coefficient values are considered more important. Each coefficient represents the change in the log odds of the outcome for a one-unit change in the predictor variable, holding all other variables constant.
- Positive Coefficient: Indicates that an increase in the predictor variable increases the log odds of the positive class.
- Negative Coefficient: Indicates that an increase in the predictor variable decreases the log odds of the positive class.
For standardized features, the magnitude of the coefficients can be directly compared to assess the relative importance of each feature.
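Below is a minimal sketch of this idea, assuming the scikit-learn breast cancer dataset used later in this article: the features are standardized first so that their coefficient magnitudes are directly comparable.
Python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Standardize features so coefficient magnitudes can be compared directly
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pipe.fit(X, y)

# Rank features by absolute standardized coefficient
coefs = pd.Series(pipe[-1].coef_[0], index=X.columns)
print(coefs.abs().sort_values(ascending=False).head())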
2. Odds Ratios
Another way to interpret the coefficients is through odds ratios. The odds ratio for a feature is obtained by exponentiating its coefficient (e^β):
- Odds Ratio > 1: The feature increases the odds of the outcome.
- Odds Ratio < 1: The feature decreases the odds of the outcome.
- Odds Ratio = 1: The feature does not affect the odds of the outcome.
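For example, a coefficient of 0.69 corresponds to an odds ratio of exp(0.69) ≈ 2.0, meaning a one-unit increase in that feature roughly doubles the odds of the positive class, holding the other features constant.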
3. Recursive Feature Elimination (RFE)
RFE fits the model repeatedly, removing the least important feature (or features) at each iteration until the desired number of features remains.
- Fit the model and rank the features.
- Eliminate the least important feature(s).
- Repeat until the desired number of features remains.
4. L1 Regularization (Lasso)
L1 regularization adds a penalty proportional to the absolute value of the coefficients, which shrinks some coefficients exactly to zero and thus produces a sparse model.
- Fit the logistic regression model with L1 regularization.
- Features with non-zero coefficients are considered important (see the sketch below).
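A minimal sketch, again assuming the breast cancer dataset and standardized features; the penalty strength C=0.1 is an arbitrary illustrative choice (smaller C means stronger regularization and more coefficients pushed to zero):
Python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# L1-penalized logistic regression; the liblinear solver supports the L1 penalty
lasso_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=10000)
)
lasso_lr.fit(X, y)

# Features whose coefficients were not shrunk to zero
coefs = pd.Series(lasso_lr[-1].coef_[0], index=X.columns)
print("Features kept by L1 regularization:")
print(coefs[coefs != 0])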
5. Cross-Validation
Cross-validation is useful for evaluating the model's stability and performance across different feature subsets. Features that consistently lead to strong performance are considered important.
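As a rough sketch of this idea, the snippet below compares 5-fold cross-validation accuracy for the full feature set against a hand-picked subset; the subset chosen here ('worst area', 'worst concave points', 'mean texture') is purely illustrative.
Python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Compare an illustrative feature subset against the full feature set
subset = ['worst area', 'worst concave points', 'mean texture']
full_scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, cv=5)
subset_scores = cross_val_score(LogisticRegression(max_iter=10000), X[subset], y, cv=5)

print(f"All features:  {full_scores.mean():.3f}")
print(f"Subset only:   {subset_scores.mean():.3f}")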
6. Permutation Importance
Permutation importance involves randomly shuffling the values of a feature and measuring the resulting drop in model performance. A larger drop indicates a more important feature.
- Fit the model and record the baseline performance.
- Shuffle the values of a single feature and re-evaluate the model.
- Compute the drop in performance.
- Repeat for every feature and rank the features by the performance drop.
Feature Importance in Logistic Regression with Scikit-Learn
Here is a Python code example using scikit-learn to demonstrate how to assess feature importance in a logistic regression model. This example includes coefficient magnitudes, odds ratios, and permutation importance.
Step 1: Import Libraries
Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFE
Step 2: Load and Prepare Dataset
Python
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
Step 3: Split Dataset into Training and Test Sets
Python
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 4: Create and Fit Logistic Regression Model
Python
# Create and fit logistic regression model
model = LogisticRegression(max_iter=10000, solver='liblinear')
model.fit(X_train, y_train)
Step 5: Calculate Model Accuracy
Python
# Calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
Output:
Model Accuracy: 0.96
Step 6: Compute Coefficients and Odds Ratios
Python
# Coefficients and Odds Ratios
coefficients = model.coef_[0]
odds_ratios = np.exp(coefficients)
# Display feature importance using coefficients and odds ratios
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print("\nFeature Importance (Coefficient and Odds Ratio):")
print(feature_importance.sort_values(by='Coefficient', ascending=False))
Output:
Feature Importance (Coefficient and Odds Ratio):
Feature Coefficient Odds Ratio
0 mean radius 2.175329 8.805078
11 texture error 1.403643 4.070000
20 worst radius 1.155193 3.174638
1 mean texture 0.159658 1.173109
12 perimeter error 0.117866 1.125094
19 fractal dimension error -0.000769 0.999231
3 mean area -0.004002 0.996006
14 smoothness error -0.014646 0.985461
23 worst area -0.021324 0.978902
15 compactness error -0.024838 0.975468
9 mean fractal dimension -0.029289 0.971135
17 concave points error -0.041148 0.959687
18 symmetry error -0.048783 0.952388
16 concavity error -0.063487 0.938487
10 radius error -0.066118 0.936020
22 worst perimeter -0.076792 0.926082
13 area error -0.109265 0.896493
29 worst fractal dimension -0.110785 0.895131
2 mean perimeter -0.125372 0.882168
4 mean smoothness -0.130413 0.877733
8 mean symmetry -0.202222 0.816914
24 worst smoothness -0.242144 0.784943
7 mean concave points -0.350106 0.704613
21 worst texture -0.390328 0.676835
5 mean compactness -0.411271 0.662807
27 worst concave points -0.617351 0.539371
6 mean concavity -0.655026 0.519429
28 worst symmetry -0.729143 0.482322
25 worst compactness -1.139760 0.319896
26 worst concavity -1.579345 0.206110
Step 7: Compute Permutation Importance
Python
# Permutation Importance
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)
perm_importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance Mean': perm_importance.importances_mean,
'Importance Std': perm_importance.importances_std
})
print("\nPermutation Importance:")
print(perm_importance_df.sort_values(by='Importance Mean', ascending=False))
Output:
Permutation Importance:
Feature Importance Mean Importance Std
23 worst area 0.475244 0.037474
2 mean perimeter 0.147173 0.023298
13 area error 0.119493 0.022431
22 worst perimeter 0.111696 0.020051
0 mean radius 0.098441 0.023102
21 worst texture 0.082066 0.018523
20 worst radius 0.053216 0.018204
3 mean area 0.024172 0.013651
1 mean texture 0.003509 0.008075
11 texture error 0.001559 0.006213
17 concave points error 0.000000 0.000000
28 worst symmetry 0.000000 0.000000
27 worst concave points 0.000000 0.000000
24 worst smoothness 0.000000 0.000000
19 fractal dimension error 0.000000 0.000000
18 symmetry error 0.000000 0.000000
15 compactness error 0.000000 0.000000
16 concavity error 0.000000 0.000000
14 smoothness error 0.000000 0.000000
10 radius error 0.000000 0.000000
9 mean fractal dimension 0.000000 0.000000
8 mean symmetry 0.000000 0.000000
7 mean concave points 0.000000 0.000000
5 mean compactness 0.000000 0.000000
4 mean smoothness 0.000000 0.000000
29 worst fractal dimension 0.000000 0.000000
26 worst concavity -0.000195 0.003197
6 mean concavity -0.000195 0.001050
25 worst compactness -0.000975 0.002179
12 perimeter error -0.001949 0.003143
Step 8: Apply Recursive Feature Elimination (RFE)
Python
# Recursive Feature Elimination (RFE)
rfe_model = LogisticRegression(max_iter=10000, solver='liblinear')
rfe = RFE(rfe_model, n_features_to_select=5)
rfe.fit(X_train, y_train)
rfe_features = X.columns[rfe.support_]
print("\nSelected Features by RFE:")
print(rfe_features)
Output:
Selected Features by RFE:
Index(['mean radius', 'mean concavity', 'worst radius', 'worst concavity',
'worst concave points'],
dtype='object')
Full Implementation Code:
Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and fit logistic regression model
model = LogisticRegression(max_iter=10000, solver='liblinear')
model.fit(X_train, y_train)
# Calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Coefficients and Odds Ratios
coefficients = model.coef_[0]
odds_ratios = np.exp(coefficients)
# Display feature importance using coefficients and odds ratios
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print("\nFeature Importance (Coefficient and Odds Ratio):")
print(feature_importance.sort_values(by='Coefficient', ascending=False))
# Permutation Importance
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)
perm_importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance Mean': perm_importance.importances_mean,
'Importance Std': perm_importance.importances_std
})
print("\nPermutation Importance:")
print(perm_importance_df.sort_values(by='Importance Mean', ascending=False))
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
rfe_model = LogisticRegression(max_iter=10000, solver='liblinear')
rfe = RFE(rfe_model, n_features_to_select=5)
rfe.fit(X_train, y_train)
rfe_features = X.columns[rfe.support_]
print("\nSelected Features by RFE:")
print(rfe_features)
Comparison of Methods: When to Use
| Technique | Description | Interpretation | When to Use |
|---|---|---|---|
| Coefficient Magnitude | Evaluate feature significance by coefficient magnitude | Positive/negative coefficient | Quick initial evaluation |
| Odds Ratios | Interpret coefficients through odds ratios | Odds ratio greater/less than 1 | Interpretable measure of feature importance |
| Recursive Feature Elimination (RFE) | Iteratively remove the least significant features | Rank features, eliminate the least important | Selecting a subset of the most important features |
| L1 Regularization (Lasso) | Add a penalty that encourages sparsity | Features with non-zero coefficients | High-dimensional datasets, feature selection |
| Cross-Validation | Evaluate model stability and performance with different feature subsets | Identify consistently strong features | Model stability and performance evaluation |
| Permutation Importance | Measure the performance drop when feature values are permuted | Rank features by performance drop | Detailed understanding of feature contributions |
Handling Multicollinearity
Multicollinearity occurs when predictor variables are highly correlated, which can inflate the variance of the coefficient estimates and make the model unstable. To handle multicollinearity, consider the following approaches:
- Remove Highly Correlated Features: Identify pairs of highly correlated features and remove one feature from each pair (a simple sketch follows this list).
- Combine Features: Create a new feature that combines the information from the correlated features.
- Regularization: Use techniques like L1 (Lasso) or L2 (Ridge) regularization to penalize large coefficients.
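Here is a minimal sketch of the first approach, assuming the breast cancer dataset; the 0.95 correlation threshold is an arbitrary illustrative cutoff.
Python
import pandas as pd
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Flag feature pairs whose absolute correlation exceeds the chosen threshold
corr = X.corr().abs()
threshold = 0.95  # illustrative cutoff
to_drop = set()
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > threshold and col_b not in to_drop:
            to_drop.add(col_b)  # keep the first feature, drop the second

print("Candidate features to drop:", sorted(to_drop))
X_reduced = X.drop(columns=list(to_drop))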
Applications of Understanding Feature Importance
Understanding feature importance is crucial in various applications, such as:
- Medical Diagnosis: Identifying the most important features contributing to the diagnosis of diseases can help in developing more accurate and efficient diagnostic tools.
- Marketing: Determining the most important features influencing customer behavior can aid in creating targeted marketing campaigns.
- Financial Analysis: Evaluating the importance of features in predicting stock prices or credit risk can improve investment decisions.
Conclusion
Determining feature importance in logistic regression is essential for model interpretability and improvement. Various methods, including coefficient magnitudes, odds ratios, recursive feature elimination, L1 regularization, cross-validation, and permutation importance, provide insights into which features are most influential. Handling multicollinearity and correctly interpreting the coefficients are crucial steps in this process. By understanding and applying these techniques, data scientists can build more transparent and effective logistic regression models.