Variable selection for
classification and regression using R
Gregg Barrett
Introduction
This document provides a brief summary of several variable selection methods that can be utilised
within the R environment. Examples are provided for classification and regression.
Models for classification:
Logistic model
- glmulti
Linear Discriminant Analysis model
- Tabu search
- Memetic
Random Forest model
- Variable importance assessment
Models for regression:
Least squares model
- Best subset
Lasso and Elastic net models
- glmnet
GAMLSS model
- gamboostLSS
Boosting model
- Relative influence assessment
Methods for classification
Logistic model
glmulti
When fitting a logistic regression model, the R package glmulti uses an information-theoretic approach
to model selection. From a list of explanatory variables, the function glmulti builds all
possible unique models involving these variables and, optionally, their pairwise interactions. Models
are fitted with standard R functions like glm. The n best models and their support (e.g., (Q)AIC, (Q)AICc,
or BIC) are returned, allowing model selection and multi-model inference through standard R
functions.
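A minimal sketch of this workflow, assuming a hypothetical data frame df with a binary response y and predictors x1, x2, x3:

```r
# Sketch: exhaustive model selection for a logistic model with glmulti.
# "df", "y", and the predictor names are hypothetical placeholders.
library(glmulti)

fit <- glmulti(y ~ x1 + x2 + x3, data = df,
               level = 1,            # main effects only (level = 2 adds pairwise interactions)
               fitfunction = glm,    # models are fitted with the standard glm function
               family = binomial,    # logistic regression
               crit = "aicc",        # information criterion used to rank models
               confsetsize = 5)      # retain the 5 best models

print(fit)         # overview of the best models found
weightable(fit)    # candidate models with their criterion values and weights
```

The retained model set can then be used for multi-model inference, e.g. via model-averaged coefficients.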
Resources:
CRAN: Package “glmulti”
https://cran.r-project.org/web/packages/glmulti/glmulti.pdf
Paper: glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
https://www.jstatsoft.org/index.php/jss/article/view/v034i12/v34i12.pdf
Tutorial: Model Selection using the glmulti Package
http://www.metafor-project.org/doku.php/tips:model_selection_with_glmulti#variable_importance
Tutorial: Model Selection and Multi-Model Inference
http://www.noamross.net/blog/2013/2/20/model-selection-drug.html
Linear Discriminant Analysis model
Tabu Search
This procedure explores the solution space beyond the local optimum. Once a local optimum is
reached, upward moves and those worsening the solutions are allowed. Simultaneously, the last
moves are marked as tabu during the following iterations to avoid cycling.
(Pacheco, Casado, Gomez, Nunez, 2005, pg. 9)
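For variable selection, each candidate solution can be encoded as a binary inclusion vector scored by a wrapper objective. A sketch using the tabuSearch package with a leave-one-out LDA accuracy objective; df (predictors) and cls (class labels) are hypothetical placeholders:

```r
# Sketch: tabu search over binary variable-inclusion vectors for an LDA model.
library(tabuSearch)
library(MASS)

# Fitness of a candidate subset: leave-one-out accuracy of LDA
# on the selected columns (lda() provides this via CV = TRUE).
evaluate <- function(bits) {
  if (sum(bits) == 0) return(0)   # empty subsets get the worst score
  pred <- lda(df[, bits == 1, drop = FALSE], grouping = cls, CV = TRUE)$class
  mean(pred == cls)
}

res <- tabuSearch(size = ncol(df),   # one bit per candidate variable
                  iters = 50,
                  objFunc = evaluate)
best <- res$configKeep[which.max(res$eUtilityKeep), ]
names(df)[best == 1]                 # variables in the best subset found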
Resources:
CRAN: Package “tabuSearch”
https://cran.r-project.org/web/packages/tabuSearch/tabuSearch.pdf
Paper: Analysis of new variable selection methods for discriminant analysis
http://globalcampus.ie.edu/webes/servicios/descarga_sgd_intranet/envia_doc.asp?id=2207&nomb
re=2207.pdf&clave=WP05-33&clave2=DF8-120-I
Paper: Tabu Search, Fred Glover and Rafael Martí
http://www.uv.es/rmarti/paper/docs/ts2.pdf
Memetic
In simple terms, Memetic Algorithms (MAs) have been shown to be very effective in finding
near-optimum solutions to hard combinatorial optimization problems.
Memetic Algorithms are population-based methods and they have proved to be faster than Genetic
Algorithms for certain types of problems, (Moscato and Laguna, 1996). In brief, they combine local
search procedures with crossing or mutating operators; due to their structure some researchers have
called them Hybrid Genetic Algorithms, Parallel Genetic Algorithms (PGAs) or Genetic Local Search
methods. The method is gaining wide acceptance particularly for the well-known problems of
combinatorial optimization.
(Pacheco et al., 2005, pg. 6)
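The Rmalschains package implements the MA-LS-Chains memetic algorithm over a continuous search space, so a binary selection problem must be relaxed, e.g. by thresholding each coordinate at 0.5. A hedged sketch, with df and cls again hypothetical placeholders:

```r
# Sketch: memetic search (MA-LS-Chains) for variable selection, encoding
# inclusion as continuous weights in [0, 1] thresholded at 0.5.
library(Rmalschains)
library(MASS)

objective <- function(w) {
  bits <- as.numeric(w > 0.5)
  if (sum(bits) == 0) return(1)   # penalise the empty subset
  pred <- lda(df[, bits == 1, drop = FALSE], grouping = cls, CV = TRUE)$class
  mean(pred != cls)               # misclassification rate, to be minimised
}

res <- malschains(objective,
                  lower = rep(0, ncol(df)),
                  upper = rep(1, ncol(df)),
                  maxEvals = 2000)
names(df)[res$sol > 0.5]          # selected variables
```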
Resources:
CRAN: Package “Rmalschains”
https://cran.r-project.org/web/packages/Rmalschains/Rmalschains.pdf
Paper: A Gentle Introduction to Memetic Algorithms
http://www.lcc.uma.es/~ccottap/papers/handbook03memetic.pdf
Paper: Memetic Algorithms for the Traveling Salesman Problem
http://www.complex-systems.com/pdf/13-4-1.pdf
Random Forest model
Variable importance assessment
Using the importance() function in the package “randomForest”, two measures of variable importance
are reported, %IncMSE and IncNodePurity. The former is based upon the mean decrease of accuracy
in predictions on the out of bag samples when a given variable is excluded from the model. The latter
is a measure of the total decrease in node impurity that results from splits over that variable, averaged
over all trees. In the case of classification trees, the node impurity is measured by the deviance.
(James, Witten, Hastie, Tibshirani, 2013, pg. 330)
One approach is to use backward elimination (also called recursive feature elimination) where the
least important variables are removed until out-of-bag prediction accuracy drops.
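Both steps can be sketched with the randomForest package; df with response y is a hypothetical placeholder (for classification the importance columns are MeanDecreaseAccuracy and MeanDecreaseGini rather than %IncMSE and IncNodePurity):

```r
# Sketch: fitting a random forest and inspecting both importance measures,
# then cross-validated backward elimination with rfcv().
library(randomForest)

rf <- randomForest(y ~ ., data = df,
                   importance = TRUE)   # needed for the permutation-based measure
importance(rf)                          # both importance columns
varImpPlot(rf)                          # plot the two measures side by side

# rfcv() drops the least important variables step by step and reports
# the cross-validated error at each subset size.
cv <- rfcv(trainx = df[, setdiff(names(df), "y")], trainy = df$y)
with(cv, plot(n.var, error.cv, type = "b", log = "x"))
```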
Resources:
CRAN: Package “randomForest”
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Presentation: Why and how to use random forest variable importance measures (and how you
shouldn’t)
http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
Paper: Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution
http://www.statistik.lmu.de/sfb386/papers/dsp/paper490.pdf
Methods for regression
Least squares model
Best subset
leaps() performs an exhaustive search for the best subsets of the variables in x for predicting y in linear
regression, using an efficient branch-and-bound algorithm. It is a compatibility wrapper for regsubsets
that does the same thing better. Since the algorithm returns a best model of each size, the results do
not depend on a penalty model for model size: it doesn’t make any difference whether you want to
use AIC, BIC etc.
(Miller, 2015, pg. 1)
Criteria such as AIC or BIC can then be used to select among the resulting models. Best subset
selection, however, becomes computationally infeasible once the number of variables exceeds
roughly 40.
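A minimal sketch using regsubsets() directly, with BIC as the selection criterion; df with response y is a hypothetical placeholder:

```r
# Sketch: exhaustive best-subset search with regsubsets().
library(leaps)

fits <- regsubsets(y ~ ., data = df, nvmax = 10)  # best model of each size up to 10
summ <- summary(fits)
summ$bic                            # BIC of the best model at each size
best_size <- which.min(summ$bic)
coef(fits, best_size)               # coefficients of the BIC-selected model
```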
Resources:
CRAN: Package “leaps”
https://cran.r-project.org/web/packages/leaps/leaps.pdf
Lasso model
In the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be
exactly equal to zero when the tuning parameter λ is sufficiently large. Cross-validation can be used
to select the value of λ.
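In glmnet this amounts to a short sketch; x (a numeric predictor matrix) and y are hypothetical placeholders:

```r
# Sketch: lasso fit with lambda chosen by cross-validation.
library(glmnet)

cvfit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
plot(cvfit)                           # CV error across the lambda path
coef(cvfit, s = "lambda.1se")         # sparse coefficients; zeros are dropped variables
```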
Elastic net model
Similar to the lasso, the elastic net simultaneously does automatic variable selection and continuous
shrinkage, and it can select groups of correlated variables. It is like a stretchable fishing net that retains
“all the big fish”. Simulation studies and real data examples show that the elastic net often
outperforms the lasso in terms of prediction accuracy.
(Zou, Hastie, 2005, pg. 302)
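In glmnet the elastic net is obtained by setting the mixing parameter alpha between 0 (ridge) and 1 (lasso); a sketch that tunes alpha over a small grid by cross-validation, with x and y hypothetical placeholders:

```r
# Sketch: elastic net, tuning alpha over a grid with cross-validation.
library(glmnet)

alphas <- c(0.2, 0.5, 0.8)
cvs  <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a))
errs <- sapply(cvs, function(cv) min(cv$cvm))   # best CV error for each alpha
best <- cvs[[which.min(errs)]]
coef(best, s = "lambda.min")        # selected variables at the chosen alpha/lambda
```

For a strictly fair comparison across alpha values, the same fold assignment should be reused by passing a fixed foldid to each cv.glmnet() call.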
Resources:
CRAN: Package “glmnet”
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
glmnet webinar: Sparse Linear Models with demonstrations in GLMNET, presented by Trevor Hastie.
https://www.youtube.com/watch?v=BU2gjoLPfDc&feature=youtu.be
glmnet vignette: By Trevor Hastie and Junyang Qian
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Paper - Lasso: Strong Rules for Discarding Predictors in Lasso-type Problems
http://statweb.stanford.edu/~tibs/ftp/strong.pdf
Paper – Elastic net: Regularization and variable selection via the elastic net
http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-
320%20Zou%20&%20Hastie.pdf
Technical presentation – Elastic net: Regularization and Variable Selection via the Elastic net
http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf
GAMLSS model
gamboostLSS
Generalized additive models for location, scale and shape (GAMLSS) are a flexible class of regression
models that allow for the modelling of multiple parameters of a distribution function, such as the
mean and the standard deviation, simultaneously. The R package gamboostLSS provides a boosting
method to fit these models.
(Hofner, Mayr, Schmid, 2014, pg. 1)
Variable selection is built into the fitting process through the main tuning parameter, the number of
boosting (stopping) iterations "mstop", which controls both which variables are selected and the
amount of shrinkage. The value of "mstop" can be chosen by cross-validation.
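A sketch of this fit-then-tune workflow for a Gaussian GAMLSS with linear base-learners; df with response y is a hypothetical placeholder:

```r
# Sketch: boosting a Gaussian GAMLSS and choosing mstop by cross-validation.
library(gamboostLSS)

fit <- glmboostLSS(y ~ ., data = df, families = GaussianLSS())
cvr <- cvrisk(fit)      # cross-validated risk over the boosting iterations
fit[mstop(cvr)]         # set the model to the CV-optimal stopping iteration
coef(fit)               # only variables selected up to mstop appear
```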
Resources:
CRAN: Package “gamboostLSS”
https://cran.r-project.org/web/packages/gamboostLSS/gamboostLSS.pdf
Presentation: An R Package for Model Building and Variable Selection in the GAMLSS Framework
http://arxiv.org/pdf/1407.1774.pdf
Paper: gamboostLSS - boosting generalized additive models for location scale and shape
https://web.warwick.ac.uk/statsdept/user-2011/TalkSlides/Contributed/16Aug_1600_FocusII_5-
DimReduction_3-Hofner.pdf
Boosting model
Relative influence assessment
The summary() function in the package "gbm" reports the relative influence of each variable, i.e.,
the empirical improvement in the loss function attributable to splits on that variable, averaged over
all trees. One approach is then backward elimination (also called recursive feature elimination),
where the least influential variables are removed until prediction accuracy drops. This can be
combined with a randomised feature selection approach.
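A sketch of fitting a boosted model and extracting the relative influence ranking; df with response y is a hypothetical placeholder:

```r
# Sketch: gradient boosting with gbm; summary() reports relative influence.
library(gbm)

fit <- gbm(y ~ ., data = df,
           distribution = "gaussian",
           n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
           cv.folds = 5)
best_iter <- gbm.perf(fit, method = "cv")  # CV-selected number of trees
summary(fit, n.trees = best_iter)          # relative influence of each variable
```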
Resources:
CRAN: Package “gbm”
https://cran.r-project.org/web/packages/gbm/gbm.pdf
Paper: Feature selection for ranking using boosted trees
https://pdfs.semanticscholar.org/156e/3c979e7bc25381fdd0614d1bab60b7aa5dfd.pdf
Reference
Pacheco, J., Casado, S., Gomez, O., Nunez, L. (2005). Analysis of new variable selection methods for
discriminant analysis. [pdf]. Retrieved from
http://globalcampus.ie.edu/webes/servicios/descarga_sgd_intranet/envia_doc.asp?id=2207&nombr
e=2207.pdf&clave=WP05-33&clave2=DF8-120-I
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An introduction to statistical learning: with applications in R.
[ebook]. Retrieved from http://www-bcf.usc.edu/~gareth/ISL/getbook.html
Miller, A. (2015). Package “leaps”. [pdf]. Retrieved from
https://cran.r-project.org/web/packages/leaps/leaps.pdf
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. [pdf]. Retrieved from
http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-
320%20Zou%20&%20Hastie.pdf
Hofner, B., Mayr, A., Schmid, M. (2014). gamboostLSS: An R package for model building and variable selection
in the GAMLSS framework. [pdf]. Retrieved from http://arxiv.org/pdf/1407.1774.pdf