Confidential. © Stream Intelligence Ltd. All rights reserved.
Introduction to Clustering
Agenda
1 Introduction: Business Case
2 Clustering
3 Hierarchical Clustering
4 K-means Clustering
1 Business Case
Business Case – Predicting Successful Music Production
[Figure: four clusters labelled music A, music B, music C and music D]
• The target is to appear in Billboard's weekly Top 40
• The cost per single can be up to 300K USD
• Music Intelligence Solution uses clustering to predict whether a song will be
accepted by the market
• This increases the success rate from 1 out of 10 to 8 out of 10
2 Clustering
Statistical Learning Categorization
Statistical Learning
• Unsupervised Learning: Clustering
• Supervised Learning: Predictive Model
Clustering
• The process of grouping a set of physical or abstract objects into clusters
(for example: customers, products, etc.)
• A cluster is a collection of data objects that are similar to one another within the same
cluster and dissimilar to the objects in other clusters
• Similarity is calculated from the distance between points
• A common distance measure is the Euclidean distance
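As a small illustration of the distance measure, the sketch below computes the Euclidean distance between two points, first by hand and then with base R's dist(). The two points are made-up values, not taken from the course data.

```r
# Two illustrative objects described by two numeric attributes
# (hypothetical values chosen for the example).
a <- c(1, 2)
b <- c(4, 6)

# Euclidean distance by hand: square root of the summed squared differences.
manual <- sqrt(sum((a - b)^2))

# The same distance via dist(), which works on the rows of a matrix.
pts <- rbind(a = a, b = b)
d <- dist(pts, method = "euclidean")

print(manual)
print(as.numeric(d))
```

The smaller the distance, the more similar the two objects are considered to be.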
3 Hierarchical Clustering
Hierarchical Clustering
• Start with each data point in its own cluster
• Combine the two nearest clusters (for example, by Euclidean distance between centroids)
• Repeat until all data points are in a single cluster
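The merging procedure can be sketched with base R's dist() and hclust(). The five points below are made up for illustration, and the choice of centroid linkage on squared distances is an assumption about what the slide intends, not the content of the course script.

```r
# Five illustrative 2-D points forming two visible groups (made-up data).
pts <- rbind(
  p1 = c(1.0, 1.0),
  p2 = c(1.5, 1.2),
  p3 = c(1.2, 0.8),
  p4 = c(8.0, 8.0),
  p5 = c(8.5, 7.6)
)

# Pairwise Euclidean distances between all points.
d <- dist(pts, method = "euclidean")

# Agglomerative clustering: every point starts as its own cluster and the
# two nearest clusters are merged at each step. Centroid linkage is
# conventionally applied to squared Euclidean distances.
hc <- hclust(d^2, method = "centroid")

# Each row of hc$merge records one merge; negative numbers are single points.
print(hc$merge)

# Cut the tree into two clusters and show the membership of each point.
groups <- cutree(hc, k = 2)
print(groups)
```

With n points there are always n - 1 merges, so the merge matrix here has four rows.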
Let's Practice
• The data for this exercise was downloaded from www.movielens.org
• Open “clustering_movie.R”
• The movies in the dataset are categorized by genre:
a. Action
b. Comedy
c. Sci-Fi
d. etc.
Dendrogram
Heights represent the distance between the points or clusters being merged
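To make the meaning of the heights concrete, the sketch below builds a small tree and prints the height of every merge. The points, the complete-linkage setting and the cut height of 5 are all assumptions chosen for the example.

```r
# Made-up 2-D points: two tight groups that lie far apart.
pts <- rbind(
  p1 = c(1.0, 1.0), p2 = c(1.5, 1.2), p3 = c(1.2, 0.8),
  p4 = c(8.0, 8.0), p5 = c(8.5, 7.6)
)
hc <- hclust(dist(pts), method = "complete")

# One height per merge, in the order the merges happened. The last value is
# the distance at which the two groups are finally joined.
print(round(hc$height, 2))

# Draw the dendrogram into a temporary PDF so the sketch also runs
# without an interactive graphics device.
pdf(file = tempfile(fileext = ".pdf"))
plot(hc, main = "Dendrogram", ylab = "Distance")
dev.off()

# A horizontal cut at height 5 lies below the final merge only, so it
# separates the data into the two groups.
groups <- cutree(hc, h = 5)
print(groups)
```

A large jump between consecutive heights suggests that the clusters joined at that step are well separated.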
Finding Meaningful Clusters
• How can you see which cluster has the most action movies?
Use this command:
tapply(movies$Action, clusterGroups, mean)
• Exercise: Can you find the characteristics of each cluster?
Hint:
- Add the cluster as one of the variables in the data
- Load the dplyr library
- Use the aggregate and summarise functions
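One possible way to approach the exercise is sketched below on a toy stand-in for the movies data. The data frame, its column names other than Action, and the complete-linkage setting are assumptions made for the example; only the tapply() call comes from the slide. The sketch uses base R's aggregate() so that it runs without extra packages.

```r
# Toy stand-in for the movies data: 0/1 genre flags per movie (made up).
movies <- data.frame(
  Title  = c("M1", "M2", "M3", "M4", "M5", "M6"),
  Action = c(1, 1, 0, 0, 1, 0),
  Comedy = c(0, 0, 1, 1, 0, 1),
  SciFi  = c(1, 0, 0, 0, 1, 0)
)

# Cluster on the genre columns only, then cut the tree into two groups.
genres <- movies[, c("Action", "Comedy", "SciFi")]
hc <- hclust(dist(genres), method = "complete")
clusterGroups <- cutree(hc, k = 2)

# Share of action movies in each cluster (the command from the slide).
print(tapply(movies$Action, clusterGroups, mean))

# Add the cluster as a variable, then average every genre flag per cluster.
movies$cluster <- clusterGroups
profile <- aggregate(cbind(Action, Comedy, SciFi) ~ cluster,
                     data = movies, FUN = mean)
print(profile)
```

Because the genre columns are 0/1 flags, each mean is the share of movies in that cluster belonging to the genre. With dplyr loaded, group_by(cluster) followed by summarise() should produce an equivalent table.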
Common scenario
Movie  Action  Romance  Rating  Revenue (in USD)
A      1       1        5       200
B      0       1        4       150
C      0       0        3       50
D      1       1        4       120
Tips:
- Normalize the data; otherwise variables on a large scale, such as Revenue, dominate the distance
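The effect of normalization can be checked on the table above. The sketch below is one way to do it, assuming z-score normalization with base R's scale(); other normalizations (for example min-max scaling) would also fit the tip.

```r
# The table from the slide.
movies <- data.frame(
  Movie   = c("A", "B", "C", "D"),
  Action  = c(1, 0, 0, 1),
  Romance = c(1, 1, 0, 1),
  Rating  = c(5, 4, 3, 4),
  Revenue = c(200, 150, 50, 120)
)
features <- movies[, c("Action", "Romance", "Rating", "Revenue")]
rownames(features) <- movies$Movie

# Without normalization, the distances are driven almost entirely by Revenue.
print(round(dist(features), 2))

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so every variable contributes on a comparable scale.
normalized <- scale(features)
print(round(dist(normalized), 2))
```

Comparing the two distance tables shows how much the ordering of "nearest" movies depends on the scale of the variables.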
4 K-means Clustering
K-Means Clustering
Objective: High-Level Description
1. Group the data into k clusters by:
a. Determining the k centroids
b. Assigning each data point to the nearest centroid
2. The algorithm works by iterating between these two stages until the cluster assignments stop changing
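In R, this procedure is available as the built-in kmeans() function. The sketch below applies it to made-up data with two obvious groups; the data, the seed and the nstart value are assumptions for the example.

```r
set.seed(42)  # k-means starts from random centroids, so fix the seed

# Made-up data: two well-separated groups of 20 points each in 2-D.
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# centers = k; nstart reruns the algorithm from several random starts and
# keeps the result with the lowest within-cluster sum of squares.
fit <- kmeans(pts, centers = 2, nstart = 10)

print(fit$centers)         # the k centroids
print(table(fit$cluster))  # number of points assigned to each centroid
```

The fitted object also exposes fit$tot.withinss, the total within-cluster sum of squares, which is useful when comparing different values of k.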
K-Means Illustrations
Suppose k = 3.
1. Start with random positions of centroids (iteration 0).
2. Assign each data point to the closest centroid (iteration 1).
3. Move each centroid to the center of its assigned points (recalculating C).
4. Iterate until the cost no longer decreases (iteration 3 in the illustration).
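The four steps can be written out by hand to see what happens in each iteration. This is a minimal sketch under stated assumptions: the data are made up, the random start picks k data points as initial centroids, and the cost is taken to be the total squared distance from each point to its centroid. In practice the built-in kmeans() would be used instead.

```r
set.seed(1)

# Made-up data: three groups of 20 points in 2-D, so k = 3.
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.4), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.4), rnorm(20, mean = 4, sd = 0.4))
)
k <- 3

# Step 1: start from random positions (here: k randomly chosen data points).
centroids <- pts[sample(nrow(pts), k), , drop = FALSE]

for (iter in 1:100) {
  # Step 2: assign each point to its closest centroid
  # (d2 holds squared distances, one column per centroid).
  d2 <- sapply(1:k, function(j) rowSums(sweep(pts, 2, centroids[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")

  # Step 3: move each centroid to the mean of its assigned points
  # (a centroid that lost all its points stays where it is).
  new_centroids <- centroids
  for (j in 1:k) {
    members <- pts[cluster == j, , drop = FALSE]
    if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
  }

  # Step 4: stop once the centroids no longer move.
  if (all(abs(new_centroids - centroids) < 1e-9)) break
  centroids <- new_centroids
}

# Cost: total squared distance from each point to its assigned centroid.
cost <- sum((pts - centroids[cluster, ])^2)
print(iter)
print(cost)
```

Rerunning the sketch with a different seed changes the starting centroids, and the final clusters and cost may change with them.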
What could potentially go wrong?
Optimum Number of Clusters
[Plot: TSS (total sum of squared error) against K (number of clusters), with the optimum number of clusters indicated]
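A plot of this kind can be produced by running k-means for several values of K and recording the error each time. The sketch below assumes that TSS refers to the total within-cluster sum of squares (tot.withinss in R's kmeans() output) and uses made-up data with three groups.

```r
set.seed(7)

# Made-up data with three groups of 30 points each in 2-D.
pts <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 5, sd = 0.5))
)

# Total within-cluster sum of squares for K = 1..8.
ks <- 1:8
tss <- sapply(ks, function(k) kmeans(pts, centers = k, nstart = 20)$tot.withinss)

print(round(data.frame(K = ks, TSS = tss), 1))

# Plot TSS against K into a temporary PDF and look for the "elbow",
# the point after which adding clusters reduces the error only slightly.
pdf(file = tempfile(fileext = ".pdf"))
plot(ks, tss, type = "b", xlab = "K (number of clusters)", ylab = "TSS")
dev.off()
```

TSS always falls as K grows, so the aim is not the minimum but the value of K where the curve flattens.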
Let's Practice
• We will use the credit card profile data (cc-profile.csv)
• Open “segmenting_customer.R”
Exercise:
• What is the optimum number of clusters?
• Describe the characteristics of each segment. Do you think the segments are meaningful?

Customer Segmentation using Clustering
