Clustering:
A Scikit-Learn Tutorial
Damian Mingle
About Me
• Chief Data Scientist, WPC Healthcare
• Speaker
• Researcher
• Writer
Outline
• What is k-means clustering?
• How does it work?
• When is it appropriate to use it?
• K-means clustering in scikit-learn
• Basic
• Basic with adjustments
Clustering
• It is unsupervised learning (inferring a function that describes not-so-obvious structure in unlabeled data)
• Groups data objects
• Measures distance between data points
• Helps in examining the data
K-means Clustering
• Formally: a method of vector quantization
• Informally: a mapping of a large set of inputs to a smaller, countable set
• Separates data into groups of equal variance
• Uses the Euclidean distance metric
K-means Clustering
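Iterative refinement
Three basic steps:
• Step 1: Choose k (how many groups)
• Repeat:
• Step 2: Assignment (label each data point with its nearest centroid)
• Step 3: Update (recompute each centroid as the mean of the points assigned to it)
The process continues until the assignments stop changing or the iteration limit is reached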
K-means Clustering
• Assignment
• Update
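The assignment and update steps can be sketched in plain NumPy. This is a minimal illustration rather than the scikit-learn implementation; the function name `kmeans_sketch`, the random-row initialization, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means loop (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: k is given; start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2 (assignment): label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (made-up data for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids.round(2))
```

Starting from different random rows can lead to a different final grouping, which is the local-optimum behavior that repeated initializations are meant to offset.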
K-means Clustering
• Advantages
• Large data accepted
• Fast
• Always converges to a solution
• Disadvantages
• The number of groups must be chosen in advance, and a wrong choice gives poor clusters
• May converge to a local optimum rather than the global one
K-means Clustering
• When to use
• Normally distributed data
• Large number of samples
• Not too many clusters
• Distance can be measured in a linear fashion
Scikit-Learn
• Python
• Open-source machine learning library
• Very well documented
Scikit-Learn
• model = EstimatorObject()
• Unsupervised:
• model.fit(dataset.data)
• dataset.data holds the feature matrix of the dataset
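As a concrete instance of the estimator pattern, here is a short sketch using `KMeans`. The bundled iris data and the choice of three clusters are assumptions made for illustration, not the dataset used in this talk.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

dataset = load_iris()           # bundled example data; its labels are ignored here
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(dataset.data)         # unsupervised: only the feature matrix is passed

print(model.cluster_centers_.shape)  # (n_clusters, n_features)
print(model.labels_[:10])            # cluster index assigned to the first 10 rows
```

After fitting, the learned centroids are available as `cluster_centers_` and the per-row assignments as `labels_`.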
K-means in Scikit-Learn
• Very fast
• Data scientist: picks the number of clusters
• scikit-learn KMeans: finds the initial centroids of the groups
Dataset
Name: Household Power Consumption by Individuals
Number of attributes: 9
Number of instances: 2,075,259
Missing values: Yes
K-means in Scikit-Learn
K-means in Scikit-Learn
• Results
K-means Parameters
• n_clusters
• Number of clusters to form
• max_iter
• Maximum number of iterations of the algorithm in a single run
• n_init
• Number of times the k-means algorithm is run with different initial centroids; the best run is kept
• init
• Initialization method
• precompute_distances
• Yes, no, or let the library decide (deprecated in scikit-learn 0.23 and later removed)
• tol
• Tolerance on the change in cluster centers used to declare convergence
• n_jobs
• Number of CPUs to use when running the algorithm (deprecated in scikit-learn 0.23 and later removed)
• random_state
• Seed for the random number generation used in centroid initialization
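A sketch of these parameters passed explicitly to `KMeans`. The toy data from `make_blobs` and the specific values are assumptions for illustration. `precompute_distances` and `n_jobs` are left out because recent scikit-learn releases no longer accept them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up 2-D data with four blobs, for illustration only.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

model = KMeans(
    n_clusters=4,        # number of clusters to form
    init="k-means++",    # initialization method
    n_init=10,           # independent runs with different starting centroids
    max_iter=300,        # cap on iterations within a single run
    tol=1e-4,            # convergence tolerance
    random_state=42,     # seed, so repeated runs give the same result
)
model.fit(X)
print(model.n_iter_, model.inertia_)
```

`n_iter_` reports how many iterations the best run needed, and `inertia_` is its within-cluster sum of squares.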
n_clusters: choosing k
• View the variance
• cdist (scipy.spatial.distance) computes the distance between each pair of observations drawn from two sets
• pdist (scipy.spatial.distance) computes the pairwise distances between observations within a single set
n_clusters: choosing k
Step 1: Determine your k range
Step 2: Fit the k-means model for each n_clusters = k
Step 3: Pull out the cluster centers for each model
n_clusters: choosing k
Step 4: Calculate Euclidean distance from each point to each cluster center
Step 5: Total within-cluster sum of squares
Step 6: Total sum of squares
Step 7: Between-cluster sum of squares (the difference: total minus within-cluster)
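Steps 1 through 7 might look roughly like the following. This is a sketch on made-up data from `make_blobs`, not the code from the slides. The range of k and the variable names are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

# Step 1: the range of k to try.
k_range = range(1, 11)
# Step 2: fit one model per k.
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in k_range]
# Step 3: pull out the cluster centers.
centers = [m.cluster_centers_ for m in models]
# Step 4: Euclidean distance from every point to every center,
# keeping the distance to the nearest one.
nearest = [cdist(X, c, metric="euclidean").min(axis=1) for c in centers]
# Step 5: total within-cluster sum of squares for each k.
wcss = np.array([np.sum(d ** 2) for d in nearest])
# Step 6: total sum of squares, via pairwise distances within X.
tss = np.sum(pdist(X) ** 2) / X.shape[0]
# Step 7: between-cluster sum of squares is the difference.
bss = tss - wcss

# Fraction of variance explained for each k.
for k, frac in zip(k_range, bss / tss):
    print(k, round(frac, 3))
```

Plotting `bss / tss` against k gives the curve usually inspected for an "elbow", the point where adding more clusters stops explaining much additional variance.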
n_clusters: choosing k
• Graphing the variance
n_clusters: choosing k
n_clusters = 4 n_clusters = 7
n_clusters: choosing k
• n_clusters = 8 (default)
init
Methods and their meaning:
• k-means++
• Selects initial cluster centers that are spread apart, which speeds up convergence
• random
• Chooses k rows at random as the initial centroids
• ndarray giving the initial centers
• Shape (n_clusters, n_features)
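The three kinds of `init` value can be tried side by side. This is a sketch on toy data. Using the first three rows as manual centers is an arbitrary choice made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

# Hand-picked starting centers: shape (n_clusters, n_features).
manual_centers = X[:3].copy()

inits = {"k-means++": "k-means++", "random": "random", "ndarray": manual_centers}
results = {}
for name, init in inits.items():
    # n_init=1 so each model reflects a single initialization.
    model = KMeans(n_clusters=3, init=init, n_init=1, random_state=0).fit(X)
    results[name] = (model.n_iter_, model.inertia_)
    print(name, model.n_iter_, round(model.inertia_, 2))
```

Comparing `n_iter_` and `inertia_` across the three runs gives a rough sense of how much the starting point matters on a given dataset.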
K-means (8)
n_clusters = 8, init = k-means++ n_clusters = 8, init = random
K-means (7)
n_clusters = 7, init = k-means++ n_clusters = 7, init = random
Comparing Results: Silhouette Score
• Silhouette coefficient
• Not black and white, lots of gray
• a: average distance between an observation and the other observations in its own cluster
• b: average distance between an observation and all points in the next-nearest cluster
• Coefficient for one observation: (b - a) / max(a, b)
• Silhouette score in scikit-learn
• Average silhouette coefficient over all observations
• Ranges from -1 to 1; the closer to 1, the better the fit
• Computation time increases with larger datasets
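A sketch of computing the silhouette score for a few values of k. The toy data and the `sample_size` of 300 are assumptions for illustration. Scoring a random subset is one way to limit the cost on large datasets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

scores = {}
for k in (4, 7, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # sample_size scores a random subset, which keeps the cost down on large data.
    scores[k] = silhouette_score(X, labels, metric="euclidean",
                                 sample_size=300, random_state=0)
    print(k, round(scores[k], 3))
```

The k with the highest score is the best fit by this measure, though the score is only one input to the decision.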
Result Comparison: Silhouette Score
What Do the Results Say?
• Data patterns may in fact exist
• Similar observations can be grouped
• We need additional discovery
A Few Hacks
• Clustering is a great way to explore your data and
develop intuition
• Too many features create a problem for
understanding
• Use dimensionality reduction
• Use clustering with other methods
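One way to combine dimensionality reduction with clustering is a pipeline that scales the features, projects them with PCA, and then runs k-means. This is a sketch. The iris data, the two components, and the three clusters are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # 4 features, bundled example data

pipeline = make_pipeline(
    StandardScaler(),                 # put features on a common scale
    PCA(n_components=2),              # keep the top 2 principal components
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print(labels[:10])
```

With only two components left, the clusters can also be drawn on a scatter plot, which helps with building intuition about the groups.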
Let’s Connect
• Twitter: @DamianMingle
• LinkedIn: DamianRMingle
• Sign up for Data Science Hacks