NUS DataScience
Introduction to Data Analytics with R
TOH Wei Zhong
31/10/2015
A little bit about me
• Graduated from NUS, Computational Biology
• Statistics and computing applied to biology and healthcare
• E.g. –omics
• Data Scientist in NCS
• Smart Nation projects (defense and public safety)
NUS-DataScience TOH Wei Zhong 31/10/2015
Agenda for this afternoon
• Overview of data analytics
• Introduce key concepts for hands-on session
• Logistic regression
• Decision tree
• Random forest
• Evaluation metrics
• Cross-validation
• Short break
• Hands-on
Overview of data
analytics
What is data analytics?
• A collection of established methods/techniques
that
• Seeks to make sense of and generate insights and
knowledge from collected data (Big Data or otherwise)
• Is statistically sound and rigorous
• Preferably scalable
• Is used to support decision making
Data Science
• Data analytics
• Data visualization: Tableau, communication, Grammar of Graphics
• Big Data technologies: Hadoop, data streaming, Spark
• Text and network analytics: NLP, semantics, social media
A common way to think about data analytics
• Descriptive: given existing data, generate some form of summary / aggregated view so that the data can be consumed
• Predictive: given existing data, construct models so that predictions can be made on future, yet-to-be-collected data
• Prescriptive: given constructed models, recommend future decisions
Key aspects that businesses are concerned about
• Accuracy: value-adding
• Interpretability: “factors associated”
Key techniques in data analytics
• Feature selection
• Clustering
• Linear models
• Tree-based models
• Evaluation metrics
• Resampling methods
• Hypothesis testing
• Association rule mining
• Time series analysis
• Feature engineering
Statistical learning
Sometimes neglected, but nonetheless
powerful
Statistical learning
• Supervised and unsupervised learning (also semi-supervised learning)
• Supervised learning: learning with ground
truth/answers available (response variable)
• Classification: response variable is categorical
• Regression: response variable is continuous or numerical
• Unsupervised learning: finding intrinsic
relationships between samples in the dataset
• Clustering algorithms
Supervised learning: linear
models
• Generalized linear models (GLM): mainstay tool in
data analytics
• Generalized in the sense of the type of response
variable:
• Continuous response variable: ordinary least squares
(OLS) regression
• Binary / multinomial response: logistic regression
• Count response: Poisson regression
• Gives an equation: y = β0 + β1x1 + β2x2 + … + ε
• Regularization (ridge regression, LASSO regression)
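The choice of response type on this slide maps onto the `family` argument of base R's `glm()`. The sketch below is illustrative only: the data are simulated, and the variable names and coefficients are assumptions made up for the example rather than anything from the workshop material.

```r
set.seed(10)
n <- 200
x <- runif(n)

# Continuous response: ordinary least squares.
y_cont  <- 1 + 2 * x + rnorm(n, sd = 0.3)
fit_ols <- lm(y_cont ~ x)

# The same model written as a GLM with a Gaussian family
# yields the same coefficients as lm().
fit_gauss <- glm(y_cont ~ x, family = gaussian)

# Count response: Poisson regression (log link by default).
y_count  <- rpois(n, lambda = exp(0.5 + 1.0 * x))
fit_pois <- glm(y_count ~ x, family = poisson)

coef(fit_ols)
coef(fit_gauss)
coef(fit_pois)   # coefficients are on the log scale
```

Logistic regression follows the same pattern with `family = binomial`.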
Supervised learning: tree-based
models
• Models that use
decision trees as
fundamental building
blocks
• Random forest
• Gradient boosting
machines
• Rotation forest
• More on decision trees
and random forest later
Unsupervised learning: clustering
• Clustering: empirically grouping observations /
samples / rows in a dataset into groups (clusters),
such that similar observations fall in the same group
• Unsupervised because there is no ground truth to
guide the process, unlike e.g. regression
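As a small illustration of clustering without ground truth, here is a sketch using base R's `kmeans()`. The slide does not name a specific algorithm, so k-means is an assumption chosen for the example, and the two groups of points are simulated.

```r
set.seed(1)

# Two simulated groups of 50 points each in 2-D (made-up data).
group_a <- matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2)
group_b <- matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
x <- rbind(group_a, group_b)

# k-means with k = 2; nstart repeats the random initialisation
# and keeps the best of the runs.
km <- kmeans(x, centers = 2, nstart = 10)

# No labels were passed to kmeans(). The simulated groups are used
# only afterwards, to check what the clusters recovered.
true_group <- rep(c("a", "b"), each = 50)
table(cluster = km$cluster, truth = true_group)
```

The cluster numbers themselves are arbitrary; only the grouping matters.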
Feature selection
• Feature: a variable / attribute in the dataset
• Feature selection: the process of selecting relevant
features that aid the modelling process, used
especially when there are too many features in the
dataset to work with
• Curse of dimensionality: the more irrelevant
features are used in a model, the weaker the model
Evaluation metrics
• Measures by which constructed models are
assessed
• Examples include accuracy and ROC-AUC
• Covered in more detail later
Key concepts
Hands-on session
• For the hands-on session, we will look at a
dataset of emails, consisting of both spam and
non-spam
• The objective is to construct models that can
predict whether a given email is spam or non-spam
A bit on R
• R is a statistical computing language that was
developed with statistical analysis in mind
• One of the most popular tools in the data science
community
• R scripts: sequences of commands that enable
step-by-step, customized data crunching
• R packages: collections of R functions that we
can leverage to do various, more complex tasks
easily, e.g. manipulate data and construct models
• R and RStudio
Key concepts to be used
• Logistic regression
• Decision tree
• Random forest
• Cross-validation
• Evaluation metrics: accuracy and ROC-AUC
Logistic regression
• A type of generalized linear model (GLM)
• Assigns each variable used in the model a
coefficient; the weighted sum of the variables
predicts the log-odds
• log-odds = β0 + β1x1 + β2x2 + …
• Log-odds = log(odds) = log(p / (1 − p)), which increases monotonically with the probability p
• In our case, p is the probability of an email being a spam
email
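A minimal sketch of fitting a logistic regression with base R's `glm()`, to show how coefficients, log-odds and probabilities relate. The data are simulated: the features `n_exclaim` and `n_caps` and the coefficients used to generate them are hypothetical and are not the workshop's spam dataset.

```r
set.seed(42)

# Simulated stand-in for an email dataset (hypothetical features).
n <- 500
n_exclaim <- rpois(n, lambda = 2)   # number of "!" in the email
n_caps    <- rpois(n, lambda = 5)   # number of all-caps words

# Simulation model: the log-odds of spam are linear in the features.
log_odds <- -3 + 0.8 * n_exclaim + 0.2 * n_caps
p_spam   <- 1 / (1 + exp(-log_odds))          # inverse logit: log-odds -> probability
is_spam  <- rbinom(n, size = 1, prob = p_spam)

emails <- data.frame(is_spam, n_exclaim, n_caps)

# family = binomial gives logistic regression (logit link by default).
fit <- glm(is_spam ~ n_exclaim + n_caps, data = emails, family = binomial)

coef(fit)   # estimated beta0, beta1, beta2 on the log-odds scale

# Predictions on the log-odds scale and on the probability scale.
eta  <- predict(fit, type = "link")
prob <- predict(fit, type = "response")

# The two scales are linked by the logistic function plogis().
all.equal(unname(prob), unname(plogis(eta)))
```

A positive coefficient means that larger values of that variable raise the log-odds, and therefore the probability, of spam.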
Pros and cons of logistic
regression
• Pros:
• Easy to interpret – the idea of regression is familiar and
intuitive
• Cons:
• Requires certain statistical assumptions to hold true in
the data
• Often lower predictive accuracy than more flexible models
Decision trees
• A simple model used in supervised learning
• CART, C4.5 – amongst top 10 most popular data
mining algorithms
• Can handle both classification and regression
• The tree package that we are using uses the
recursive partitioning algorithm
Equivalents
• Tree == Binary partitioning of dataset
• Each partition is represented by the mode
(classification) or mean (regression)
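To make the "binary partitioning" idea concrete, the sketch below finds a single split on one feature in base R and labels each partition by its mode. It is a hand-written illustration, not the `tree` package's implementation: the Gini impurity criterion, the helper names (`gini`, `mode_class`, `best_split`) and the toy data are all assumptions made for the example.

```r
# Gini impurity of a vector of class labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Most frequent class (the mode) in a vector of labels.
mode_class <- function(y) names(which.max(table(y)))

# Search all thresholds on one numeric feature for the split that
# minimises the weighted Gini impurity of the two partitions.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between values
  impurity <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  thresholds[which.min(impurity)]
}

# Toy data (made up): emails with many "!" tend to be spam.
n_exclaim <- c(0, 1, 1, 2, 3, 5, 6, 7, 8, 9)
label     <- c(rep("nonspam", 5), rep("spam", 5))

t_star <- best_split(n_exclaim, label)
t_star                                    # threshold of the binary partition
mode_class(label[n_exclaim <= t_star])    # prediction for the left partition
mode_class(label[n_exclaim >  t_star])    # prediction for the right partition
```

Recursive partitioning repeats this search inside each resulting partition until a stopping rule is met.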
Terminologies
• Depth
• Node
• Leaf nodes
• Non-leaf nodes
• The size of a tree
sometimes refers
to the number of
leaf nodes
• Parents and
children
• Branching factor
Pruning
• Typically after
the construction
of a decision
tree, we would
want to prune
the tree,
because the tree
may be overly
complicated
Pruning (2)
• Pruning refers to the
process of trimming the
tree to a more compact
and concise one,
without sacrificing
much performance
• The tree package uses
cost-complexity
pruning
• Comparing the
relationship between
number of leaf nodes
and performance of
model
Pros and cons of decision trees
• Pros:
• Very easy to interpret and communicate to others,
because it is similar to how humans think and make
decisions
• Easy to construct
• Cons:
• Generally unstable
• Generally low predictive accuracy
Random forest
• In the RF model, instead of using one decision tree
to make predictions, we use many of them
• The idea is to build decision trees on different
random subsets of the training data
• Each subset, known as a “bag”, is a bootstrap sample (rows drawn with replacement)
• Each bag yields one decision tree
• To make a prediction, we ask each tree to make a
prediction
• To get the overall prediction of the RF model, we take a
majority vote (classification) or the average (regression)
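The bag-and-vote mechanism can be sketched in base R as follows. This is a deliberately simplified illustration and not a full random forest: each "tree" is a one-split stump, the data are simulated, and the helper names are made up. A real random forest grows deeper trees and also considers only a random subset of features at each split; in practice one would typically use a dedicated package such as `randomForest`.

```r
set.seed(7)

# Toy data (simulated): one feature, two classes that overlap slightly.
n <- 200
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3))
y <- rep(c("nonspam", "spam"), each = n / 2)

# A one-split tree ("stump"): pick the threshold with the fewest training
# errors, predicting "spam" above it and "nonspam" at or below it.
fit_stump <- function(x, y) {
  candidates <- sort(unique(x))
  errors <- sapply(candidates, function(t) sum(ifelse(x > t, "spam", "nonspam") != y))
  candidates[which.min(errors)]
}
predict_stump <- function(threshold, x) ifelse(x > threshold, "spam", "nonspam")

# One stump per bag; each bag is a bootstrap sample (rows drawn with replacement).
n_trees <- 25
thresholds <- replicate(n_trees, {
  bag <- sample(n, size = n, replace = TRUE)
  fit_stump(x[bag], y[bag])
})

# Each tree votes; the ensemble prediction is the majority class.
predict_forest <- function(thresholds, x_new) {
  votes <- sapply(thresholds, predict_stump, x = x_new)
  votes <- matrix(votes, nrow = length(x_new))   # rows: observations, columns: trees
  apply(votes, 1, function(v) names(which.max(table(v))))
}

predict_forest(thresholds, c(-1, 1.4, 4))
mean(predict_forest(thresholds, x) == y)   # training accuracy of the ensemble
```

Using an odd number of trees avoids tied votes when there are two classes.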
Pros and cons of random forest
• Pros:
• One of the top-performing models in supervised
learning
• With some basic understanding of sampling and
bootstrapping, RF can be easy to communicate. The
intuition of voting as a mechanism to make decisions is
simple
• Able to derive variable importance measures
• Cons:
• Computationally intensive
Evaluation metrics: assessing the
performance of a supervised learning
model
• In order to know whether the models constructed can
perform well in reality, we need metrics to
assess their performance
• Classification: accuracy / error rate
• Sensitivity, specificity etc.
• Regression: mean squared error
• MSE = (1/n) Σ (prediction − actual)², summed over the n observations
• Also, there are two types of classification models:
(1) Those that output classes / categories as predictions
(2) Those that output probabilities as predictions
• (2): can use ROC-AUC as a measure of performance
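The metrics on this slide can be computed by hand in base R, as in the sketch below. The true classes, predicted probabilities and regression values are made-up numbers, and the 0.5 cut-off used to turn probabilities into classes is an assumption chosen for the example.

```r
# Made-up example: true classes (1 = spam) and predicted probabilities.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prob   <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05)

# Type (1) output: classes, obtained here by thresholding at 0.5.
pred_class <- as.integer(prob >= 0.5)

tp <- sum(pred_class == 1 & actual == 1)   # true positives
tn <- sum(pred_class == 0 & actual == 0)   # true negatives
fp <- sum(pred_class == 1 & actual == 0)   # false positives
fn <- sum(pred_class == 0 & actual == 1)   # false negatives

accuracy    <- (tp + tn) / length(actual)
sensitivity <- tp / (tp + fn)   # proportion of spam correctly flagged
specificity <- tn / (tn + fp)   # proportion of non-spam correctly passed

# Type (2) output: probabilities. ROC-AUC is the probability that a randomly
# chosen positive scores higher than a randomly chosen negative (ties count
# as 1/2), computed here from ranks (the Mann-Whitney U statistic).
r     <- rank(prob)             # average ranks handle ties
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Regression: mean squared error on made-up numeric predictions.
y_actual <- c(3.0, 5.0, 2.5)
y_pred   <- c(2.5, 5.0, 4.0)
mse <- mean((y_pred - y_actual)^2)

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, auc = auc, mse = mse)
```

Unlike accuracy, ROC-AUC does not depend on the choice of cut-off, which is why it suits models that output probabilities.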
Cross-validation
• Gives rise to the idea of training and testing
datasets
• Rationale:
• Recall that the constructed models are ultimately meant
to do predictions on future, unknown observations
• Models are constructed/trained using input datasets.
We call them training data
• If the constructed models are too attuned to the training
data, they predict poorly on new data: this is overfitting
http://www.turingfinance.com/regression-analysis-using-python-statsmodels-and-quandl/
Cross-validation
• In order to know whether our models are
overfitted to the data, we use cross-validation
• Split the dataset into two parts: training and testing
• Use the training set to build the models
• Use the models to make predictions on the testing set
• k-fold cross-validation repeats this with k different splits, so that every observation is used for testing once
• A way to think about this: studying for an
examination
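The split-train-predict steps above can be sketched in base R as follows, first as a single training/testing split and then as k-fold cross-validation. The data are simulated, and the 70/30 split, k = 5, the logistic regression model and the helper name `holdout_accuracy` are assumptions chosen for illustration.

```r
set.seed(123)

# Simulated data (made up): one feature that is related to the class.
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
d <- data.frame(x, y)

# Accuracy of a logistic regression trained on `train`, evaluated on `test`.
holdout_accuracy <- function(train, test) {
  fit  <- glm(y ~ x, data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")
  mean(as.integer(prob >= 0.5) == test$y)
}

# Single split: 70% of rows for training, the remaining 30% for testing.
train_idx   <- sample(n, size = round(0.7 * n))
acc_holdout <- holdout_accuracy(d[train_idx, ], d[-train_idx, ])

# k-fold cross-validation: every observation is used for testing exactly once.
k    <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label per row
acc_folds <- sapply(1:k, function(i) holdout_accuracy(d[fold != i, ], d[fold == i, ]))

acc_holdout
mean(acc_folds)   # cross-validated estimate of accuracy
```

Averaging over folds gives a less noisy estimate than a single split, at the cost of fitting the model k times.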
Hands-on
Thanks!
Questions?
github.com/tohweizhong/NUS-DataScience
sg.linkedin.com/in/tohweizhong
tohweizhong@u.nus.edu
