Classification
Dr. Mostafa A. Elhosseini
Revise
Ꚛ Regression task
Ꚛ Predicting Housing values using
▪ Linear Regression
▪ How to fix underfitting
▪ Decision Trees.
▪ Random Forest
Ꚛ Cross-validation
Ꚛ Fine-tune your model
▪ Grid Search
▪ Randomized Search
▪ Ensemble Methods
Ꚛ Hyperparameters
Agenda
Ꚛ Handwritten digits dataset MNIST
MNIST
Ꚛ Set of 70,000 small images of digits handwritten by high school
students and employees of the US Census Bureau
Ꚛ Each image is labeled with the digit it represents
Ꚛ It is often called the “Hello World” of Machine Learning
Ꚛ Each image has 28×28 pixels (784 features)
Ꚛ Each feature simply represents one pixel’s intensity, from 0 (white)
to 255 (black)
MNIST
Ꚛ Datasets loaded by Scikit-
Learn generally have a similar
dictionary structure
including:
▪ A DESCR key describing the
dataset
▪ A data key containing an array
with one row per instance and
one column per feature
▪ A target key containing an array
with the labels
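▪ A minimal sketch of loading MNIST and inspecting that structure (fetch_openml is one way to obtain the dataset in recent Scikit-Learn versions):

import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
print(mnist.keys())                      # includes 'data', 'target', 'DESCR', ...

X, y = mnist["data"], mnist["target"]
y = y.astype(np.uint8)                   # OpenML returns the labels as strings; cast to integers
print(X.shape, y.shape)                  # (70000, 784) and (70000,)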
Peek at one digit from the dataset
▪ To get a feel for the complexity of the classification task
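▪ One way to display a single image, assuming X holds the 784-pixel rows loaded above (uses matplotlib):

import matplotlib.pyplot as plt

some_digit = X[0]                              # one image: 784 pixel intensities
some_digit_image = some_digit.reshape(28, 28)  # back to a 28×28 grid

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
print(y[0])                                    # its label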
MNIST Training & Testing Set
Ꚛ You should always create a test set and set it aside before inspecting
the data closely.
Ꚛ The MNIST dataset is actually already split into a training set (the first
60,000 images) and a test set (the last 10,000 images):
Ꚛ Shuffle the training set; this will guarantee that…
▪ All cross-validation folds will be similar (you don’t want one fold to be missing
some digits).
▪ Moreover, some learning algorithms are sensitive to the order of the training
instances, and they perform poorly if they get many similar instances in a row.
Shuffling the dataset ensures that this won’t happen:
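Ꚛ A minimal sketch of that split and shuffle, assuming X and y hold the full 70,000 images and labels:

import numpy as np

X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

shuffle_index = np.random.permutation(60000)          # random order of the 60,000 training indices
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]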
Training Binary Classifier
Ꚛ Let’s simplify the problem for now and only try to identify one digit
— for example, the number 5.
Ꚛ This “5-detector” will be an example of a binary classifier, capable of
distinguishing between just two classes, 5 and not-5. Let’s create the
target vectors for this classification task:
▪ y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.
▪ y_test_5 = (y_test == 5)
Ꚛ Okay, now let’s pick a classifier and train it. A good place to start is
with a Stochastic Gradient Descent (SGD) classifier, using Scikit-
Learn’s SGDClassifier class
Stochastic Gradient Descent Classifier
Ꚛ This classifier has the advantage of being capable of handling very large datasets
efficiently.
Ꚛ This is in part because SGD deals with training instances independently, one at a time
(which also makes SGD well suited for online learning), as we will see later.
▪ from sklearn.linear_model import SGDClassifier
▪ sgd_clf = SGDClassifier(random_state=42)
▪ sgd_clf.fit(X_train, y_train_5)
Ꚛ The SGDClassifier relies on randomness during training (hence the name “stochastic”).
Ꚛ If you want reproducible results, you should set the random_state parameter.
Ꚛ The classifier guesses that this image represents a 5 (True)
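Ꚛ For instance, assuming some_digit is the image displayed earlier:

sgd_clf.predict([some_digit])   # array([ True]) means the classifier says "it is a 5"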
Performance Measures
Ꚛ Evaluating a classifier is often significantly trickier than evaluating a
regressor
Ꚛ Let’s use the cross_val_score() function to evaluate your SGDClassifier
model using K-fold cross-validation, with three folds.
Ꚛ Remember that K-fold cross-validation means splitting the training set
into K folds (in this case, three), then making predictions and evaluating
them on each fold using a model trained on the remaining folds
▪ from sklearn.model_selection import cross_val_score
▪ cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
▪ Out[24]: array([0.94555, 0.9012 , 0.9625 ])
Ꚛ Wow! Around 95% accuracy (ratio of correct predictions) on all cross-
validation folds? This looks amazing, doesn’t it?
Dumb classifier
▪ Well, before you get too excited, let’s look at a very dumb classifier
that just classifies every single image in the “not-5” class
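▪ A minimal sketch of such a baseline (the class name Never5Classifier is just an illustrative choice):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self                          # nothing to learn
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # always predict "not a 5"

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")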
Dumb classifier
Ꚛ It has over 90% accuracy! This is simply because only about 10% of
the images are 5s, so if you always guess that an image is not a 5,
you will be right about 90% of the time.
Ꚛ This demonstrates why accuracy is generally not the preferred
performance measure for classifiers, especially when you are dealing
with skewed datasets (i.e., when some classes are much more
frequent than others)
Confusion Matrix
Ꚛ A much better way to evaluate the performance of a classifier is to look at
the confusion matrix.
Ꚛ The general idea is to count the number of times instances of class A are
classified as class B
▪ For example, to know the number of times the classifier confused images of 5s with
3s, you would look in the 5th row and 3rd column of the confusion matrix
Ꚛ To compute the confusion matrix, you first need to have a set of
predictions, so they can be compared to the actual targets.
Ꚛ You could make predictions on the test set, but let’s keep it untouched for
now
(remember that you want to use the test set only at the very end of your
project, once you have a classifier that you are ready to launch).
▪ Instead, you can use the cross_val_predict() function:
Confusion Matrix
▪ from sklearn.model_selection import cross_val_predict
▪ y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
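Ꚛ With those out-of-fold predictions in hand, a minimal sketch of computing the matrix:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)
# rows = actual classes, columns = predicted classes; the counts discussed on the
# next slide (53,272 TN, 1,307 FP, 1,077 FN, 4,344 TP) come from such a call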
Confusion Matrix
Ꚛ Each row in a confusion matrix represents an actual class, while each
column represents a predicted class.
Ꚛ The first row of this matrix considers non-5 images (the negative class):
53,272 of them were correctly classified as non-5s (they are called true
negatives TN),
▪ while the remaining 1,307 were wrongly classified as 5s (false positives FP).
Ꚛ The second row considers the images of 5s (the positive class): 1,077 were
wrongly classified as non-5s (false negatives FN), while the remaining
4,344 were correctly classified as 5s (true positives TP).
Ꚛ A perfect classifier would have only true positives and true negatives, so
its confusion matrix would have nonzero values only on its main diagonal
(top left to bottom right)
▪ Precision = TP / (TP + FP)
▪ Precision is the accuracy of the positive predictions
▪ Recall = TP / (TP + FN)
▪ Recall is also called sensitivity, or the true positive rate (TPR)
Confusion Matrix
▪ Now your 5-detector does not look as shiny as it did when you
looked at its accuracy.
▪ When it claims an image represents a 5, it is correct only 77% of the
time. Moreover, it only detects 79% of the 5s
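▪ A minimal sketch of computing these metrics with Scikit-Learn, assuming the out-of-fold predictions y_train_pred from the confusion-matrix step:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)   # about 0.77 in this lecture's run
recall_score(y_train_5, y_train_pred)      # about 0.79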
F1 Score
• It is often convenient to combine precision and recall into a single metric
called the F1 score, in particular if you need a simple way to compare two
classifiers.
• The F1 score is the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
• Whereas the regular mean treats all values equally, the harmonic mean
gives much more weight to low values. As a result, the classifier will only
get a high F1 score if both recall and precision are high
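• A minimal sketch of computing it, again assuming y_train_pred from earlier:

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)   # same as 2 * precision * recall / (precision + recall)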
Which is more important – Precision / Recall?
Ꚛ The 𝐹1 score favors classifiers that have similar precision and recall. This is
not always what you want: in some contexts you mostly care about
precision, and in other contexts you really care about recall.
Ꚛ For example, if you trained a classifier to detect videos that are safe for
kids, you would probably prefer a classifier that rejects many good videos
(low recall) but keeps only safe ones (high precision), rather than a
classifier that has a much higher recall but lets a few really bad videos
show up in your product
Ꚛ On the other hand, suppose you train a classifier to detect shoplifters on
surveillance images: it is probably fine if your classifier has only 30%
precision as long as it has 99% recall (sure, the security guards will get a
few false alerts, but almost all shoplifters will get caught).
Precision/Recall Tradeoff
Ꚛ To understand this tradeoff, let’s look at how the SGDClassifier makes its classification
decisions.
▪ For each instance, it computes a score based on a decision function, and if that score is greater
than a threshold, it assigns the instance to the positive class, or else it assigns it to the negative
class
Ꚛ Figure below shows a few digits positioned from the lowest score on the left to the
highest score on the right.
▪ Suppose the decision threshold is positioned at the central arrow (between the two 5s): you will
find 4 true positives (actual 5s) on the right of that threshold, and one false positive (actually a 6).
▪ Therefore, with that threshold, the precision is 80% (4 out of 5). But out of 6 actual 5s, the
classifier only detects 4, so the recall is 67% (4 out of 6).
Ꚛ Now if you raise the threshold (move it to the arrow on the right), the false positive (the
6) becomes a true negative, thereby increasing precision (up to 100% in this case), but
one true positive becomes a false negative, decreasing recall down to 50%. Conversely,
lowering the threshold increases recall and reduces precision
Precision/Recall Tradeoff
Ꚛ Scikit-Learn does not let you
set the threshold directly,
but it does give you access to
the decision scores that it
uses to make predictions.
Ꚛ Instead of calling the
classifier’s predict() method,
you can call its
decision_function() method,
which returns a score for
each instance, and then
make predictions based on
those scores using any
threshold you want:
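Ꚛ A minimal sketch, assuming some_digit is the image from before (the 200,000 value is the one used on the next slide):

y_scores = sgd_clf.decision_function([some_digit])   # one raw score per instance

threshold = 0
(y_scores > threshold)        # True: predicted as a 5

threshold = 200000
(y_scores > threshold)        # False: with the higher threshold, the 5 is missed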
Precision/Recall Tradeoff
Ꚛ This confirms that raising the threshold decreases recall. The image
actually represents a 5, and the classifier detects it when the
threshold is 0, but it misses it when the threshold is increased to
200,000.
Ꚛ So how can you decide which threshold to use? For this you will first
need to get the scores of all instances in the training set using the
cross_val_predict() function again, but this time specifying that
you want it to return decision scores instead of predictions:
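Ꚛ A minimal sketch of getting those scores and then computing precision and recall for every possible threshold with precision_recall_curve:

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

plt.plot(thresholds, precisions[:-1], "b--", label="Precision")   # precision vs threshold
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")          # recall vs threshold
plt.xlabel("Threshold")
plt.legend()
plt.show()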
Precision/Recall Tradeoff
Ꚛ You may wonder why the precision curve is bumpier than the recall
curve in Figure 3-4. The reason is that precision may sometimes go
down when you raise the threshold (although in general it will go
up).
Ꚛ To understand why, look back at Figure and notice what happens
when you start from the central threshold and move it just one digit
to the right: precision goes from 4/5 (80%) down to 3/4 (75%).
Ꚛ On the other hand, recall can only go down when the threshold is
increased, which explains why its curve looks smooth
Precision/Recall Tradeoff
Ꚛ Now you can simply select
the threshold value that
gives you the best
precision/recall tradeoff for
your task.
Ꚛ Another way to select a
good precision/recall
tradeoff is to plot precision
directly against recall
You can see that precision really starts to fall sharply around 80% recall. You will probably
want to select a precision/recall tradeoff just before that drop — for example, at around 60%
recall. But of course the choice depends on your project
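Ꚛ A short sketch of that second plot, assuming precisions and recalls from the earlier precision_recall_curve call:

import matplotlib.pyplot as plt

plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()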
Precision/Recall Tradeoff
Ꚛ So let’s suppose you decide to aim for 90% precision.
Ꚛ You look up the first plot (zooming in a bit) and find that you need to
use a threshold of about 230,000. To make predictions (on the
training set for now), instead of calling the classifier’s predict()
method, you can just run this code:
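Ꚛ A minimal sketch (the ~230,000 threshold is the value read off this lecture's plot; it depends on the particular training run):

y_train_pred_90 = (y_scores > 230000)         # predict "5" only when the score clears the threshold

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred_90)   # should be close to 0.90
recall_score(y_train_5, y_train_pred_90)      # and this tells you what recall you paid for it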
Precision/Recall Tradeoff
Ꚛ Great, you have a 90% precision classifier (or close enough)! As you
can see, it is fairly easy to create a classifier with virtually any
precision you want: just set a high enough threshold, and you’re
done.
Ꚛ Hmm, not so fast. A high-precision classifier is not very useful if its
recall is too low!
Ꚛ If someone says “let’s reach 99% precision,” you should ask, “at
what recall?”
The ROC Curve
Ꚛ The Receiver Operating Characteristic (ROC) curve is another
common tool used with binary classifiers.
Ꚛ The ROC curve plots the true positive rate (another name for recall)
against the false positive rate (FPR)
Ꚛ The FPR is the ratio of negative instances that are incorrectly
classified as positive.
▪ It is equal to one minus the true negative rate, which is the ratio of negative
instances that are correctly classified as negative.
Ꚛ The TNR is also called specificity.
Ꚛ Hence the ROC curve plots sensitivity (recall) versus 1 – specificity.
The ROC Curve
▪ Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier
produces.
▪ The dotted line represents the ROC curve of a purely random classifier
▪ A good classifier stays as far away from that line as possible (toward the top-left corner).
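▪ A minimal sketch of computing and plotting it for the SGD classifier, using the decision scores from before:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr, label="SGD")
plt.plot([0, 1], [0, 1], "k--")                 # the dotted diagonal: a purely random classifier
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()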
The ROC Curve
Ꚛ One way to compare classifiers is to measure the Area Under the
Curve (AUC).
Ꚛ A perfect classifier will have a ROC AUC equal to 1, whereas a purely
random classifier will have a ROC AUC equal to 0.5.
Ꚛ Scikit-Learn provides a function to compute the ROC AUC:
▪ from sklearn.metrics import roc_auc_score
▪ roc_auc_score(y_train_5, y_scores)
Ꚛ As a rule of thumb, you should prefer the PR curve whenever the
positive class is rare or when you care more about the false positives
than the false negatives, and the ROC curve otherwise
The ROC Curve
Ꚛ Let’s train a RandomForestClassifier and compare its ROC curve and ROC
AUC score to the SGDClassifier.
Ꚛ First, you need to get scores for each instance in the training set.
▪ But due to the way it works, the RandomForestClassifier class does not have a
decision_function() method.
Ꚛ Instead it has a predict_proba() method. Scikit-Learn classifiers generally
have one or the other.
Ꚛ The predict_proba() method returns an array containing a row per
instance and a column per class, each containing the probability that the
given instance belongs to the given class (e.g., 70% chance that the image
represents a 5):
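Ꚛ A minimal sketch of getting those probabilities via cross-validation:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
# one row per instance, one column per class: [P(not-5), P(5)]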
The ROC Curve
▪ But to plot a ROC curve, you need scores, not probabilities. A simple
solution is to use the positive class’s probability as the score:
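▪ For example, assuming y_probas_forest from the previous sketch and the SGD curve (fpr, tpr) computed earlier:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_scores_forest = y_probas_forest[:, 1]                     # probability of the positive class ("5")
fpr_forest, tpr_forest, _ = roc_curve(y_train_5, y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plt.plot(fpr_forest, tpr_forest, label="Random Forest")
plt.plot([0, 1], [0, 1], "k--")
plt.legend(loc="lower right")
plt.show()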
The ROC Curve
Ꚛ The RandomForestClassifier’s
ROC curve looks much better
than the
SGDClassifier’s: it comes much
closer to the top-left corner.
Ꚛ As a result, its ROC AUC score is
also significantly better:
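Ꚛ A quick check, assuming y_scores (SGD) and y_scores_forest from above:

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)           # SGD classifier
roc_auc_score(y_train_5, y_scores_forest)    # random forest: noticeably higher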