Variable selection for
classification and regression using R
Gregg Barrett
Introduction
This document provides a brief summary of several variable selection methods that can be utilised
within the R environment. Examples are provided for classification and regression.
Models for classification:
Logistic model
- glmulti
Linear Discriminant Analysis model
- Tabu search
- Memetic
Random Forest model
- Variable importance assessment
Models for regression:
Least squares model
- Best subset
Lasso and Elastic net models
- glmnet
GAMLSS model
- gamboostLSS
Boosting model
- Relative influence assessment
Methods for classification
Logistic model
glmulti
When fitting a logistic regression model, the R package glmulti uses an information-theoretic approach
to model selection. From a list of explanatory variables, the function glmulti builds all
possible unique models involving these variables and, optionally, their pairwise interactions. Models
are fitted with standard R functions like glm. The n best models and their support (e.g., (Q)AIC, (Q)AICc,
or BIC) are returned, allowing model selection and multi-model inference through standard R
functions.
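A minimal sketch of this workflow, assuming a hypothetical data frame df with a binary response y and predictors x1, x2, x3:

```r
# Sketch: exhaustive model selection for a logistic model with glmulti.
# "df", "y", and the predictor names are hypothetical placeholders.
library(glmulti)

fit <- glmulti(y ~ x1 + x2 + x3, data = df,
               level = 1,            # main effects only (level = 2 adds pairwise interactions)
               fitfunction = glm,    # models are fitted with the standard glm function
               family = binomial,    # logistic regression
               crit = "aicc",        # information criterion used to rank models
               confsetsize = 5)      # retain the 5 best models

print(fit)         # overview of the best models found
weightable(fit)    # candidate models with their criterion values and weights
```

The retained model set can then be used for multi-model inference, e.g. via model-averaged coefficients.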
Resources:
CRAN: Package “glmulti”
https://cran.r-project.org/web/packages/glmulti/glmulti.pdf
Paper: glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
https://www.jstatsoft.org/index.php/jss/article/view/v034i12/v34i12.pdf
Tutorial: Model Selection using the glmulti Package
http://www.metafor-project.org/doku.php/tips:model_selection_with_glmulti#variable_importance
Tutorial: Model Selection and Multi-Model Inference
http://www.noamross.net/blog/2013/2/20/model-selection-drug.html
Linear Discriminant Analysis model
Tabu Search
This procedure explores the solution space beyond the local optimum. Once a local optimum is
reached, upward moves and those worsening the solutions are allowed. Simultaneously, the last
moves are marked as tabu during the following iterations to avoid cycling.
(Pacheco, Casado, Gomez, Nunez, 2005, pg. 9)
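For variable selection, each candidate solution can be encoded as a binary inclusion vector scored by a wrapper objective. A sketch using the tabuSearch package with a leave-one-out LDA accuracy objective; df (predictors) and cls (class labels) are hypothetical placeholders:

```r
# Sketch: tabu search over binary variable-inclusion vectors for an LDA model.
library(tabuSearch)
library(MASS)

# Fitness of a candidate subset: leave-one-out accuracy of LDA
# on the selected columns (lda() provides this via CV = TRUE).
evaluate <- function(bits) {
  if (sum(bits) == 0) return(0)   # empty subsets get the worst score
  pred <- lda(df[, bits == 1, drop = FALSE], grouping = cls, CV = TRUE)$class
  mean(pred == cls)
}

res <- tabuSearch(size = ncol(df),   # one bit per candidate variable
                  iters = 50,
                  objFunc = evaluate)
best <- res$configKeep[which.max(res$eUtilityKeep), ]
names(df)[best == 1]                 # variables in the best subset found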
Resources:
CRAN: Package “tabuSearch”
https://cran.r-project.org/web/packages/tabuSearch/tabuSearch.pdf
Paper: Analysis of new variable selection methods for discriminant analysis
http://globalcampus.ie.edu/webes/servicios/descarga_sgd_intranet/envia_doc.asp?id=2207&nomb
re=2207.pdf&clave=WP05-33&clave2=DF8-120-I
Paper: Tabu Search, Fred Glover and Rafael Martí
http://www.uv.es/rmarti/paper/docs/ts2.pdf
Memetic
In simple terms, Memetic Algorithms (MAs) have been shown to be very effective in finding
near-optimum solutions to hard combinatorial optimization problems.
Memetic Algorithms are population-based methods and they have proved to be faster than Genetic
Algorithms for certain types of problems, (Moscato and Laguna, 1996). In brief, they combine local
search procedures with crossing or mutating operators; due to their structure some researchers have
called them Hybrid Genetic Algorithms, Parallel Genetic Algorithms (PGAs) or Genetic Local Search
methods. The method is gaining wide acceptance particularly for the well-known problems of
combinatorial optimization.
(Pacheco et al., 2005, pg. 6)
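The Rmalschains package implements the MA-LS-Chains memetic algorithm over a continuous search space, so a binary selection problem must be relaxed, e.g. by thresholding each coordinate at 0.5. A hedged sketch, with df and cls again hypothetical placeholders:

```r
# Sketch: memetic search (MA-LS-Chains) for variable selection, encoding
# inclusion as continuous weights in [0, 1] thresholded at 0.5.
library(Rmalschains)
library(MASS)

objective <- function(w) {
  bits <- as.numeric(w > 0.5)
  if (sum(bits) == 0) return(1)   # penalise the empty subset
  pred <- lda(df[, bits == 1, drop = FALSE], grouping = cls, CV = TRUE)$class
  mean(pred != cls)               # misclassification rate, to be minimised
}

res <- malschains(objective,
                  lower = rep(0, ncol(df)),
                  upper = rep(1, ncol(df)),
                  maxEvals = 2000)
names(df)[res$sol > 0.5]          # selected variables
```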
Resources:
CRAN: Package “Rmalschains”
https://cran.r-project.org/web/packages/Rmalschains/Rmalschains.pdf
Paper: A Gentle Introduction to Memetic Algorithms
http://www.lcc.uma.es/~ccottap/papers/handbook03memetic.pdf
Paper: Memetic Algorithms for the Traveling Salesman Problem
http://www.complex-systems.com/pdf/13-4-1.pdf
Random Forest model
Variable importance assessment
Using the importance() function in the package “randomForest”, two measures of variable importance
are reported, %IncMSE and IncNodePurity. The former is based upon the mean decrease of accuracy
in predictions on the out of bag samples when a given variable is excluded from the model. The latter
is a measure of the total decrease in node impurity that results from splits over that variable, averaged
over all trees. In the case of classification trees, the node impurity is measured by the deviance.
(James, Witten, Hastie, Tibshirani, 2013, pg. 330)
One approach is to use backward elimination (also called recursive feature elimination) where the
least important variables are removed until out-of-bag prediction accuracy drops.
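Both steps can be sketched with the randomForest package; df with response y is a hypothetical placeholder (for classification the importance columns are MeanDecreaseAccuracy and MeanDecreaseGini rather than %IncMSE and IncNodePurity):

```r
# Sketch: fitting a random forest and inspecting both importance measures,
# then cross-validated backward elimination with rfcv().
library(randomForest)

rf <- randomForest(y ~ ., data = df,
                   importance = TRUE)   # needed for the permutation-based measure
importance(rf)                          # both importance columns
varImpPlot(rf)                          # plot the two measures side by side

# rfcv() drops the least important variables step by step and reports
# the cross-validated error at each subset size.
cv <- rfcv(trainx = df[, setdiff(names(df), "y")], trainy = df$y)
with(cv, plot(n.var, error.cv, type = "b", log = "x"))
```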
Resources:
CRAN: Package “randomForest”
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Presentation: Why and how to use random forest variable importance measures (and how you
shouldn’t)
http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
Paper: Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution
http://www.statistik.lmu.de/sfb386/papers/dsp/paper490.pdf
Methods for regression
Least squares model
Best subset
leaps() performs an exhaustive search for the best subsets of the variables in x for predicting y in linear
regression, using an efficient branch-and-bound algorithm. It is a compatibility wrapper for regsubsets
that does the same thing better. Since the algorithm returns a best model of each size, the results do
not depend on a penalty model for model size: it doesn’t make any difference whether you want to
use AIC, BIC etc.
(Miller, 2015, pg. 1)
Criteria such as AIC or BIC can then be used to select among the resulting models. Best subset
selection, however, becomes computationally infeasible once the number of variables exceeds
roughly 40.
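A minimal sketch using regsubsets() directly, with BIC as the selection criterion; df with response y is a hypothetical placeholder:

```r
# Sketch: exhaustive best-subset search with regsubsets().
library(leaps)

fits <- regsubsets(y ~ ., data = df, nvmax = 10)  # best model of each size up to 10
summ <- summary(fits)
summ$bic                            # BIC of the best model at each size
best_size <- which.min(summ$bic)
coef(fits, best_size)               # coefficients of the BIC-selected model
```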
Resources:
CRAN: Package “leaps”
https://cran.r-project.org/web/packages/leaps/leaps.pdf
Lasso model
In the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be
exactly equal to zero when the tuning parameter λ is sufficiently large. Cross-validation can be used
to select the value of λ.
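In glmnet this amounts to a short sketch; x (a numeric predictor matrix) and y are hypothetical placeholders:

```r
# Sketch: lasso fit with lambda chosen by cross-validation.
library(glmnet)

cvfit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
plot(cvfit)                           # CV error across the lambda path
coef(cvfit, s = "lambda.1se")         # sparse coefficients; zeros are dropped variables
```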
Elastic net model
Similar to the lasso, the elastic net simultaneously does automatic variable selection and continuous
shrinkage, and it can select groups of correlated variables. It is like a stretchable fishing net that retains
“all the big fish”. Simulation studies and real data examples show that the elastic net often
outperforms the lasso in terms of prediction accuracy.
(Zou, Hastie, 2005, pg. 302)
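In glmnet the elastic net is obtained by setting the mixing parameter alpha between 0 (ridge) and 1 (lasso); a sketch that tunes alpha over a small grid by cross-validation, with x and y hypothetical placeholders:

```r
# Sketch: elastic net, tuning alpha over a grid with cross-validation.
library(glmnet)

alphas <- c(0.2, 0.5, 0.8)
cvs  <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a))
errs <- sapply(cvs, function(cv) min(cv$cvm))   # best CV error for each alpha
best <- cvs[[which.min(errs)]]
coef(best, s = "lambda.min")        # selected variables at the chosen alpha/lambda
```

For a strictly fair comparison across alpha values, the same fold assignment should be reused by passing a fixed foldid to each cv.glmnet() call.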
Resources:
CRAN: Package “glmnet”
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
glmnet webinar: Sparse Linear Models with demonstrations in GLMNET, presented by Trevor Hastie.
https://www.youtube.com/watch?v=BU2gjoLPfDc&feature=youtu.be
glmnet vignette: By Trevor Hastie and Junyang Qian
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Paper - Lasso: Strong Rules for Discarding Predictors in Lasso-type Problems
http://statweb.stanford.edu/~tibs/ftp/strong.pdf
Paper – Elastic net: Regularization and variable selection via the elastic net
http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-
320%20Zou%20&%20Hastie.pdf
Technical presentation – Elastic net: Regularization and Variable Selection via the Elastic net
http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf
GAMLSS model
gamboostLSS
Generalized additive models for location, scale and shape (GAMLSS) are a flexible class of regression
models that allow for the modelling of multiple parameters of a distribution function, such as the
mean and the standard deviation, simultaneously. The R package gamboostLSS provides a boosting
method to fit these models.
(Hofner, Mayr, Schmid, 2014, pg. 1)
Variable selection is built into the fitting process through the main tuning parameter, the number of
boosting (stopping) iterations "mstop", which controls both which variables are selected and the
amount of shrinkage. The value of "mstop" can be chosen by cross-validation.
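A sketch of this fit-then-tune workflow for a Gaussian GAMLSS with linear base-learners; df with response y is a hypothetical placeholder:

```r
# Sketch: boosting a Gaussian GAMLSS and choosing mstop by cross-validation.
library(gamboostLSS)

fit <- glmboostLSS(y ~ ., data = df, families = GaussianLSS())
cvr <- cvrisk(fit)      # cross-validated risk over the boosting iterations
fit[mstop(cvr)]         # set the model to the CV-optimal stopping iteration
coef(fit)               # only variables selected up to mstop appear
```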
Resources:
CRAN: Package “gamboostLSS”
https://cran.r-project.org/web/packages/gamboostLSS/gamboostLSS.pdf
Presentation: An R Package for Model Building and Variable Selection in the GAMLSS Framework
http://arxiv.org/pdf/1407.1774.pdf
Paper: gamboostLSS - boosting generalized additive models for location scale and shape
https://web.warwick.ac.uk/statsdept/user-2011/TalkSlides/Contributed/16Aug_1600_FocusII_5-
DimReduction_3-Hofner.pdf
Boosting model
Relative influence assessment
The summary() function in the package "gbm" reports the relative influence of each variable, i.e.,
the empirical improvement in the loss function attributable to splits on that variable, averaged over
all trees. One approach is then backward elimination (also called recursive feature elimination),
where the least influential variables are removed until prediction accuracy drops. This can be
combined with a randomised feature selection approach.
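A sketch of fitting a boosted model and extracting the relative influence ranking; df with response y is a hypothetical placeholder:

```r
# Sketch: gradient boosting with gbm; summary() reports relative influence.
library(gbm)

fit <- gbm(y ~ ., data = df,
           distribution = "gaussian",
           n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
           cv.folds = 5)
best_iter <- gbm.perf(fit, method = "cv")  # CV-selected number of trees
summary(fit, n.trees = best_iter)          # relative influence of each variable
```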
Resources:
CRAN: Package “gbm”
https://cran.r-project.org/web/packages/gbm/gbm.pdf
Paper: Feature selection for ranking using boosted trees
https://pdfs.semanticscholar.org/156e/3c979e7bc25381fdd0614d1bab60b7aa5dfd.pdf
Reference
Pacheco, J., Casado, S., Gomez, O., Nunez, L. (2005). Analysis of new variable selection methods for
discriminant analysis. [pdf]. Retrieved from
http://globalcampus.ie.edu/webes/servicios/descarga_sgd_intranet/envia_doc.asp?id=2207&nombr
e=2207.pdf&clave=WP05-33&clave2=DF8-120-I
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An introduction to statistical learning: with applications in R.
[ebook]. Retrieved from http://www-bcf.usc.edu/~gareth/ISL/getbook.html
Miller, A. (2015). Package “leaps”. [pdf]. Retrieved from
https://cran.r-project.org/web/packages/leaps/leaps.pdf
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. [pdf]. Retrieved from
http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-
320%20Zou%20&%20Hastie.pdf
Hofner, B., Mayr, A., Schmid, M. (2014). gamboostLSS: An R package for model building and variable selection
in the GAMLSS framework. [pdf]. Retrieved from http://arxiv.org/pdf/1407.1774.pdf