Understanding Feature Importance in Logistic Regression Models

Last Updated : 18 Jul, 2024

Logistic regression is a fundamental classification algorithm in machine learning and statistics. It is widely used for binary classification tasks and can be extended to multiclass problems. Understanding which features drive a logistic regression model's predictions is crucial for interpretability, for diagnosing the model, and for improving its performance.

This article delves into various methods to determine feature importance in logistic regression, providing a comprehensive guide for data scientists and machine learning practitioners.

Overview of Logistic Regression

Logistic regression is a statistical technique for binary classification problems, where the categorical outcome variable has two possible values (e.g., yes/no, true/false, 0/1). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability that a given input belongs to a specific class.

The logistic regression model converts the linear combination of input features into a probability value between 0 and 1 by using the logistic (or sigmoid) function.
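
For an input with feature values x₁, …, xₙ and learned coefficients β₀, β₁, …, βₙ, the predicted probability of the positive class is:

P(y = 1 | x) = 1 / (1 + e^(−z)),  where z = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ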

Next, we will delve into the methods used to determine the importance of features in a logistic regression model.

Feature Importance Techniques for Logistic Models

1. Coefficient Magnitude

The simplest way to gauge a feature's importance in logistic regression is to examine the magnitude of its coefficient (β). Features with larger absolute coefficient values are considered more influential. Each coefficient represents the change in the log odds of the outcome for a one-unit change in the predictor variable, holding all other variables constant.

  • Positive Coefficient: Indicates that an increase in the predictor variable increases the log odds of the positive class.
  • Negative Coefficient: Indicates that an increase in the predictor variable decreases the log odds of the positive class.

For standardized features, the magnitude of the coefficients can be directly compared to assess the relative importance of each feature.
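
A minimal sketch of this idea, assuming the breast cancer dataset used later in this article and scikit-learn's StandardScaler (standardizing first so the coefficient magnitudes are on a comparable scale):

Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load the dataset and standardize the features
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_scaled = StandardScaler().fit_transform(X)

# Fit logistic regression on the standardized features
model = LogisticRegression(max_iter=10000)
model.fit(X_scaled, y)

# Rank features by the absolute size of their coefficients
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False).head(10))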

2. Odds Ratios

Another way to interpret the coefficients is through odds ratios. The odds ratio for a feature is obtained by exponentiating its coefficient (e^β); for example, an odds ratio of 2 means that a one-unit increase in the feature doubles the odds of the positive class.

  • Odds Ratio > 1: The feature increases the odds of the outcome.
  • Odds Ratio < 1: The feature decreases the odds of the outcome.
  • Odds Ratio = 1: The feature does not affect the odds of the outcome.

3. Recursive Feature Elimination (RFE)

RFE repeatedly fits the model and removes the least significant feature (or features) until the desired number of features remains:

  • Fit the model and rank the features.
  • Eliminate the least important feature(s).
  • Repeat until the desired number of features remains.

4. L1 Regularization (Lasso)

L1 regularization adds a penalty proportional to the absolute value of the coefficients, which shrinks some coefficients exactly to zero and thereby produces a sparse model.

  • Fit the logistic regression model with L1 regularization.
  • Features with non-zero coefficients are considered important (see the sketch below).
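
A minimal sketch of this approach with scikit-learn, again on the breast cancer dataset; the regularization strength C=0.1 is an illustrative choice (smaller C means a stronger penalty and more zeroed coefficients):

Python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Standardize so the penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

# L1-penalized logistic regression; the liblinear solver supports the L1 penalty
lasso_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=10000)
lasso_model.fit(X_scaled, y)

# Features whose coefficients survived the penalty are treated as important
selected = X.columns[lasso_model.coef_[0] != 0]
print("Features kept by L1 regularization:", list(selected))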

5. Cross-Validation

Cross-validation helps evaluate the model's stability and performance across different feature subsets. Features that consistently contribute to strong performance are considered important.
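
A minimal sketch of this idea, using scikit-learn's cross_val_score on the breast cancer dataset: compare the cross-validated accuracy with all features against the accuracy after dropping a candidate feature (the dropped feature here, 'worst radius', is an illustrative choice):

Python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = LogisticRegression(max_iter=10000, solver='liblinear')

# Baseline: 5-fold cross-validated accuracy with all features
baseline = cross_val_score(model, X, y, cv=5).mean()

# Score again after dropping one candidate feature
reduced = cross_val_score(model, X.drop(columns=['worst radius']), y, cv=5).mean()

print(f"All features:           {baseline:.3f}")
print(f"Without 'worst radius': {reduced:.3f}")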

6. Permutation Importance

Permutation importance involves randomly shuffling the values of a feature and measuring the resulting drop in model performance. The larger the drop, the more important the feature.

  • Fit the model and record the baseline performance.
  • Shuffle the values of a single feature and re-evaluate the model.
  • Compute the drop in performance.
  • Repeat for every feature, then rank the features by the size of the drop.

Feature Importance in Logistic Regression with Scikit-Learn

Here is a Python code example using scikit-learn to demonstrate how to assess feature importance in a logistic regression model. This example includes coefficient magnitudes, odds ratios, and permutation importance.

Step 1: Import Libraries

Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFE

Step 2: Load and Prepare Dataset

Python
# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

Step 3: Split Dataset into Training and Test Sets

Python
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 4: Create and Fit Logistic Regression Model

Python
# Create and fit logistic regression model
model = LogisticRegression(max_iter=10000, solver='liblinear')
model.fit(X_train, y_train)

Step 5: Calculate Model Accuracy

Python
# Calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Output:

Model Accuracy: 0.96

Step 6: Compute Coefficients and Odds Ratios

Python
# Coefficients and Odds Ratios
coefficients = model.coef_[0]
odds_ratios = np.exp(coefficients)


# Display feature importance using coefficients and odds ratios
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': coefficients,
    'Odds Ratio': odds_ratios
})
print("\nFeature Importance (Coefficient and Odds Ratio):")
print(feature_importance.sort_values(by='Coefficient', ascending=False))

Output:

Feature Importance (Coefficient and Odds Ratio):
Feature Coefficient Odds Ratio
0 mean radius 2.175329 8.805078
11 texture error 1.403643 4.070000
20 worst radius 1.155193 3.174638
1 mean texture 0.159658 1.173109
12 perimeter error 0.117866 1.125094
19 fractal dimension error -0.000769 0.999231
3 mean area -0.004002 0.996006
14 smoothness error -0.014646 0.985461
23 worst area -0.021324 0.978902
15 compactness error -0.024838 0.975468
9 mean fractal dimension -0.029289 0.971135
17 concave points error -0.041148 0.959687
18 symmetry error -0.048783 0.952388
16 concavity error -0.063487 0.938487
10 radius error -0.066118 0.936020
22 worst perimeter -0.076792 0.926082
13 area error -0.109265 0.896493
29 worst fractal dimension -0.110785 0.895131
2 mean perimeter -0.125372 0.882168
4 mean smoothness -0.130413 0.877733
8 mean symmetry -0.202222 0.816914
24 worst smoothness -0.242144 0.784943
7 mean concave points -0.350106 0.704613
21 worst texture -0.390328 0.676835
5 mean compactness -0.411271 0.662807
27 worst concave points -0.617351 0.539371
6 mean concavity -0.655026 0.519429
28 worst symmetry -0.729143 0.482322
25 worst compactness -1.139760 0.319896
26 worst concavity -1.579345 0.206110

Step 7: Compute Permutation Importance

Python
# Permutation Importance
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)
perm_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance Mean': perm_importance.importances_mean,
    'Importance Std': perm_importance.importances_std
})
print("\nPermutation Importance:")
print(perm_importance_df.sort_values(by='Importance Mean', ascending=False))

Output:

Permutation Importance:
Feature Importance Mean Importance Std
23 worst area 0.475244 0.037474
2 mean perimeter 0.147173 0.023298
13 area error 0.119493 0.022431
22 worst perimeter 0.111696 0.020051
0 mean radius 0.098441 0.023102
21 worst texture 0.082066 0.018523
20 worst radius 0.053216 0.018204
3 mean area 0.024172 0.013651
1 mean texture 0.003509 0.008075
11 texture error 0.001559 0.006213
17 concave points error 0.000000 0.000000
28 worst symmetry 0.000000 0.000000
27 worst concave points 0.000000 0.000000
24 worst smoothness 0.000000 0.000000
19 fractal dimension error 0.000000 0.000000
18 symmetry error 0.000000 0.000000
15 compactness error 0.000000 0.000000
16 concavity error 0.000000 0.000000
14 smoothness error 0.000000 0.000000
10 radius error 0.000000 0.000000
9 mean fractal dimension 0.000000 0.000000
8 mean symmetry 0.000000 0.000000
7 mean concave points 0.000000 0.000000
5 mean compactness 0.000000 0.000000
4 mean smoothness 0.000000 0.000000
29 worst fractal dimension 0.000000 0.000000
26 worst concavity -0.000195 0.003197
6 mean concavity -0.000195 0.001050
25 worst compactness -0.000975 0.002179
12 perimeter error -0.001949 0.003143

Step 8: Perform Recursive Feature Elimination (RFE)

Python
# Recursive Feature Elimination (RFE)
rfe_model = LogisticRegression(max_iter=10000, solver='liblinear')
rfe = RFE(rfe_model, n_features_to_select=5)
rfe.fit(X_train, y_train)


rfe_features = X.columns[rfe.support_]
print("\nSelected Features by RFE:")
print(rfe_features)

Output:

Selected Features by RFE:
Index(['mean radius', 'mean concavity', 'worst radius', 'worst concavity',
'worst concave points'],
dtype='object')

Full Implementation Code:

Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and fit logistic regression model
model = LogisticRegression(max_iter=10000, solver='liblinear')
model.fit(X_train, y_train)

# Calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Coefficients and Odds Ratios
coefficients = model.coef_[0]
odds_ratios = np.exp(coefficients)

# Display feature importance using coefficients and odds ratios
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': coefficients,
    'Odds Ratio': odds_ratios
})
print("\nFeature Importance (Coefficient and Odds Ratio):")
print(feature_importance.sort_values(by='Coefficient', ascending=False))

# Permutation Importance
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1)
perm_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance Mean': perm_importance.importances_mean,
    'Importance Std': perm_importance.importances_std
})
print("\nPermutation Importance:")
print(perm_importance_df.sort_values(by='Importance Mean', ascending=False))

# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE

rfe_model = LogisticRegression(max_iter=10000, solver='liblinear')
rfe = RFE(rfe_model, n_features_to_select=5)
rfe.fit(X_train, y_train)

rfe_features = X.columns[rfe.support_]
print("\nSelected Features by RFE:")
print(rfe_features)

Comparison of Methods: When to Use

| Technique | Description | Interpretation | When to Use |
|---|---|---|---|
| Coefficient Magnitude | Evaluate feature significance by coefficient magnitude | Positive/negative coefficient | Quick initial evaluation |
| Odds Ratios | Interpret coefficients through odds ratios | Odds ratio >/< 1 | Interpretable measure of feature importance |
| Recursive Feature Elimination (RFE) | Iteratively remove least significant features | Rank features, eliminate least important | Selecting a subset of the most important features |
| L1 Regularization (Lasso) | Add penalty to increase sparsity | Features with non-zero coefficients | High-dimensional datasets, feature selection |
| Cross-Validation | Evaluate model stability and performance with different feature subsets | Identify consistently strong features | Assessing model stability and performance |
| Permutation Importance | Measure performance decline by permuting feature values | Rank features by performance decline | Detailed understanding of feature contributions |

Handling Multicollinearity

Multicollinearity occurs when predictor variables are highly correlated, which can inflate the variance of the coefficient estimates and make the model unstable. To handle multicollinearity, consider the following approaches:

  • Remove Highly Correlated Features: Identify pairs of highly correlated features and drop one feature from each pair (see the sketch after this list).
  • Combine Features: Create a new feature that combines the information from the correlated features.
  • Regularization: Use techniques like L1 (Lasso) or L2 (Ridge) regularization to penalize large coefficients.
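
A minimal sketch of the first approach, using a pandas correlation matrix on the breast cancer dataset to flag and drop one feature from each highly correlated pair (the 0.9 threshold is an illustrative choice):

Python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Absolute pairwise correlations; keep only the upper triangle to avoid double counting
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} highly correlated features:", to_drop)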

Applications of Understanding Feature Importance

Understanding feature importance is crucial in various applications, such as:

  • Medical Diagnosis: Identifying the most important features contributing to the diagnosis of diseases can help in developing more accurate and efficient diagnostic tools.
  • Marketing: Determining the most important features influencing customer behavior can aid in creating targeted marketing campaigns.
  • Financial Analysis: Evaluating the importance of features in predicting stock prices or credit risk can improve investment decisions.

Conclusion

Determining feature importance in logistic regression is essential for model interpretability and improvement. Various methods, including coefficient magnitudes, odds ratios, recursive feature elimination, L1 regularization, cross-validation, and permutation importance, provide insight into which features are most influential. Handling multicollinearity and interpreting the coefficients correctly are crucial steps in this process. By understanding and applying these techniques, data scientists can build more transparent and effective logistic regression models.

