CLUSTERING
CHAPTER 7: MACHINE LEARNING – THEORY & PRACTICE
Introduction
• Clustering refers to the process of arranging or
organizing objects according to specific criteria.
• Partitioning of data
o Grouping of data in database applications improves data access
• Data reorganization
• Data compression
Introduction
• Summarization
oStatistical measures like mean, mode, and median
can provide summarized information.
• Matrix factorization
o Let there be n data points in an l-dimensional space. We can
represent them as a matrix X of size n × l.
o It is possible to approximate X as a product of two matrices
B of size n × K and C of size K × l.
o So, X ≈ BC, where B is the cluster-assignment matrix and
C is the representatives matrix.
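As an illustration (not from the slides), k-means can be read as exactly this factorization: B holds one-hot cluster assignments and C holds the centroids. A minimal sketch, assuming NumPy and scikit-learn are available:

```python
# Viewing k-means clustering as the factorization X ≈ B C.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, l, K = 100, 4, 3                      # n points, l dimensions, K clusters
X = rng.normal(size=(n, l))

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# B (n x K): one-hot cluster-assignment matrix
B = np.zeros((n, K))
B[np.arange(n), km.labels_] = 1.0

# C (K x l): representatives matrix (one centroid per row)
C = km.cluster_centers_

# Each row of B @ C is the centroid of that point's cluster,
# so X ≈ B C in the least-squares sense.
print("approximation error:", np.linalg.norm(X - B @ C))
```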
Clustering process
• The clustering process ensures that the distance
between any two points within the same cluster
(the intra-cluster distance), as measured by a
dissimilarity measure such as Euclidean distance, is
smaller than the distance between any two points
belonging to different clusters (the inter-cluster
distance).
• Any two points are placed in the same cluster if the
distance between them is below a given threshold
(an input to the algorithm). Squared Euclidean
distance is one of the measures used to compute the
distance between points.
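A small worked example of these two ideas, using squared Euclidean distance on a hypothetical two-cluster dataset (not from the slides):

```python
# Squared Euclidean distance, and a check that the intra-cluster
# distance is smaller than the inter-cluster distance.
import numpy as np

def sq_euclidean(a, b):
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d @ d)

A = [(0.0, 0.0), (0.5, 0.2)]      # cluster A
B = [(5.0, 5.0), (5.3, 4.8)]      # cluster B

intra = sq_euclidean(A[0], A[1])  # within-cluster distance
inter = sq_euclidean(A[0], B[0])  # between-cluster distance
print(intra, inter, intra < inter)   # 0.29 50.0 True
```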
Hard and soft clustering
Data abstraction
• Clustering is a useful method for data abstraction: it
generates clusters of data points, each of which can be
represented by its centroid, medoid, leader, or some
other suitable entity.
• The centroid is computed as the sample mean of the
data points in a cluster.
• The medoid is the point that minimizes the sum of
distances to all other points in the cluster.
• Note: The centroid can shift dramatically depending on the
position of an outlier, while the medoid remains stable
within the boundaries of the original cluster.
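A minimal sketch of this contrast, on illustrative data (not from the slides): one tight cluster plus an outlier, with its centroid and medoid:

```python
# Centroid vs. medoid on a cluster containing one outlier.
import numpy as np

pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [10.0, 10.0]])  # last point is an outlier

centroid = pts.mean(axis=0)

# Medoid: the member point minimizing the sum of distances to all others.
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
medoid = pts[dist.sum(axis=1).argmin()]

print("centroid:", centroid)  # pulled toward the outlier
print("medoid:  ", medoid)    # stays inside the original cluster
```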
Clustering algorithms
Divisive clustering
• Divisive algorithms are either polythetic, where the
division is based on more than one feature, or
monothetic, where only one feature is considered at
a time.
• The polythetic scheme is based on finding all
possible 2-partitions of the data and choosing the
best among them. If there are n patterns, the number
of distinct 2-partitions is given by (2^n − 2)/2 = 2^(n−1) − 1.
Divisive clustering
• Among all possible 2-partitions, the partition with the
least sum of the sample variances of the two clusters is
chosen as the best.
• From the resulting partition, the cluster with the
maximum sample variance is selected and split into
an optimal 2-partition.
• This process is repeated until we get singleton clusters.
• If a collection of patterns (data points) is split into two
clusters with p patterns x1, · · · , xp in one cluster and q
patterns y1, · · · , yq in the other cluster, with the
centroids of the two clusters being C1 and C2
respectively, then the sum of the sample variances is
||x1 − C1||² + · · · + ||xp − C1||² + ||y1 − C2||² + · · · + ||yq − C2||²
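A toy sketch of the polythetic scheme, assuming NumPy: all 2^(n−1) − 1 distinct 2-partitions of a small dataset are enumerated and the one with the least total scatter (squared deviations from the two centroids) is kept. Exhaustive search is feasible only for small n:

```python
# Exhaustive polythetic split over all distinct 2-partitions.
import numpy as np
from itertools import combinations

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9]])
n = len(X)

def scatter(P):
    """Sum of squared distances of points in P to their centroid."""
    return ((P - P.mean(axis=0)) ** 2).sum()

best = None
# Fixing point 0 in the first cluster avoids counting each partition
# twice, giving exactly 2^(n-1) - 1 candidates.
for r in range(1, n):
    for rest in combinations(range(1, n), r - 1):
        left = (0,) + rest
        right = tuple(i for i in range(n) if i not in left)
        cost = scatter(X[list(left)]) + scatter(X[list(right)])
        if best is None or cost < best[0]:
            best = (cost, left, right)

print("best 2-partition:", best[1], best[2], "cost:", round(best[0], 3))
```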
Divisive clustering
Monothetic clustering
• Each feature direction is considered individually, and the
data is divided into two clusters based on the gap in
projected values along that feature direction.
• Specifically, the dataset is split into two parts at the
point corresponding to the mean value of the
maximum gap observed among the feature values.
• This process is then repeated sequentially for the
remaining features, further partitioning each
cluster.
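A sketch of a single monothetic split along one feature, assuming NumPy (the helper monothetic_split is illustrative, not from the slides):

```python
# One monothetic split: find the largest gap between consecutive sorted
# values of one feature and split at its midpoint (mean of the gap).
import numpy as np

def monothetic_split(X, feature):
    vals = np.sort(X[:, feature])
    gaps = np.diff(vals)
    i = gaps.argmax()
    cut = (vals[i] + vals[i + 1]) / 2.0   # mean value of the maximum gap
    return X[X[:, feature] <= cut], X[X[:, feature] > cut], cut

X = np.array([[0.1, 2.0], [0.3, 1.8], [4.0, 2.1], [4.2, 1.9]])
left, right, cut = monothetic_split(X, feature=0)
print("cut at", cut)          # midpoint of the gap between 0.3 and 4.0 -> 2.15
print(left, right, sep="\n")
```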
Monothetic clustering
Agglomerative clustering
• An agglomerative clustering algorithm generally
proceeds as follows:
1. Compute the proximity matrix for all pairs of patterns
in the dataset.
2. Find the closest pair of clusters based on the
computed proximity measure and merge them into a
single cluster. Update the proximity matrix to reflect
the merge, adjusting the distances between the
newly formed cluster and the remaining clusters.
3. If all patterns belong to a single cluster, terminate the
algorithm. Otherwise, go back to Step 2 and repeat
the process until all patterns are in one cluster.
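A minimal sketch of these steps with single-link proximity, assuming SciPy is available:

```python
# Agglomerative clustering with single-link (minimum) proximity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])

D = pdist(X)                      # step 1: proximity for all pairs
Z = linkage(D, method="single")   # steps 2-3: merge closest clusters until one remains

# Cut the dendrogram to recover flat clusters, e.g. 3 of them.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```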
Agglomerative clustering
k-Means clustering
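The slide presents the algorithm as a figure; below is a minimal NumPy sketch of the standard k-means (Lloyd) iteration it corresponds to:

```python
# Plain k-means: alternate assignment and centroid-update steps.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: each center becomes its cluster's centroid
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (20, 2)) for m in (0, 5)])
centers, labels = kmeans(X, k=2)
print(centers)
```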
Elbow method to select k
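The elbow plot itself is a figure on the slide; a minimal sketch of producing one, assuming scikit-learn and matplotlib are available:

```python
# Elbow method: plot within-cluster sum of squares (inertia) against k
# and pick k where the curve bends.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (50, 2)) for m in ((0, 0), (4, 4), (8, 0))])

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.title("Pick k at the 'elbow' where the curve flattens")
plt.show()
```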
k-Means++ clustering
• The k-means++ algorithm is used to choose good initial
cluster centers for k-means: after the first center is picked
at random, each subsequent center is sampled with
probability proportional to its squared distance from the
nearest center chosen so far.
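A minimal NumPy sketch of this D²-weighted seeding (an illustrative helper, not the slides' code):

```python
# k-means++ initialization: each new center is sampled with probability
# proportional to the squared distance to the nearest existing center.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first center: uniform at random
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        probs = d2 / d2.sum()                      # D^2 weighting
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(200, 2))
print(kmeans_pp_init(X, k=3))
```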
Agglomerative clustering
Soft partitioning
1. Fuzzy clustering: Each data point may be assigned to
more than one cluster, based on a membership value
computed from the data point and the corresponding
cluster centroid.
2. Rough clustering: Each cluster is assumed to have
both a non-overlapping part and an overlapping part.
Data points in the non-overlapping portion
exclusively belong to that cluster, while data points in
the overlapping part may belong to multiple clusters.
3. Neural network-based clustering: In this method,
varying weights associated with the data points are
used to obtain a soft partition.
Soft partitioning – Contd.
1. Simulated annealing: The current solution is randomly
perturbed, and the resulting solution is accepted with a
certain probability: if it is better than the current solution,
it is always accepted; otherwise, it is accepted with a
probability between 0 and 1.
2. Tabu search: Unlike simulated annealing, multiple
solutions are stored, and the current solution is perturbed
in various ways to determine the next configuration.
3. Evolutionary algorithms: This method maintains a
population of solutions. In addition to the fitness values
of individuals, a random search based on the interaction
among solutions with mutation is employed to generate
the next population.
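As an illustration of the first technique, a toy sketch of simulated annealing applied to a cluster-label assignment, assuming NumPy (the cooling schedule and perturbation are illustrative choices, not from the slides):

```python
# Simulated annealing for clustering: perturb the label assignment and
# accept a worse solution with probability exp(-Δ/T).
import numpy as np

def sa_cluster(X, k, iters=2000, T0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))

    def cost(lab):
        # within-cluster sum of squared distances to centroids
        return sum(((X[lab == j] - X[lab == j].mean(0)) ** 2).sum()
                   for j in range(k) if (lab == j).any())

    cur = cost(labels)
    for t in range(iters):
        T = T0 * (1 - t / iters) + 1e-6                 # cooling schedule
        cand = labels.copy()
        cand[rng.integers(len(X))] = rng.integers(k)    # random perturbation
        c = cost(cand)
        # accept if better; otherwise with probability exp(-(c - cur)/T)
        if c < cur or rng.random() < np.exp(-(c - cur) / T):
            labels, cur = cand, c
    return labels

data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(m, 0.3, (15, 2)) for m in (0, 3)])
print(sa_cluster(X, k=2))
```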
Fuzzy c-means clustering
Fuzzy c-means clustering – contd.
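The update equations on these slides are figures; the sketch below follows the standard fuzzy c-means formulation with fuzzifier m = 2 (an assumption, since the slides' exact notation is not reproduced here):

```python
# Fuzzy c-means: alternate between centroid and membership updates.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0, eps=1e-9):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + eps
        # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(-1)
    return U, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.4, (30, 2)) for mu in (0, 4)])
U, centers = fuzzy_c_means(X, c=2)
print(centers)
```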
Rough clustering
Rough k-means clustering algorithm
Rough k-means clustering algorithm
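The algorithm on these slides is shown as a figure; the sketch below follows the commonly cited rough k-means of Lingras and West, with illustrative weights w_low, w_up and threshold epsilon (assumptions, not values from the slides):

```python
# Rough k-means: each cluster has a lower approximation (exclusive
# members) and a boundary (shared members); centroids mix the two.
import numpy as np

def rough_kmeans(X, k, w_low=0.7, w_up=0.3, epsilon=1.2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        nearest = d.argmin(1)
        lower = [[] for _ in range(k)]
        upper = [[] for _ in range(k)]
        for i, j in enumerate(nearest):
            # clusters whose distance is within a factor epsilon of the nearest
            close = np.where(d[i] <= epsilon * d[i, j])[0]
            if len(close) > 1:
                for c in close:            # boundary point: upper approximations
                    upper[c].append(i)
            else:
                lower[j].append(i)         # exclusive point: lower approximation
        for j in range(k):
            lo = X[lower[j]].mean(0) if lower[j] else None
            up = X[upper[j]].mean(0) if upper[j] else None
            if lo is not None and up is not None:
                centers[j] = w_low * lo + w_up * up
            elif lo is not None:
                centers[j] = lo
            elif up is not None:
                centers[j] = up
    return centers, lower, upper

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, (25, 2)) for m in (0, 4)])
centers, lower, upper = rough_kmeans(X, k=2)
print(centers, [len(u) for u in upper])
```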
Clustering large datasets
• Issues:
o Number of dataset scans
o Incremental changes in the dataset
• Solutions:
o Single dataset-scan clustering algorithms
o Incremental clustering algorithms
o Abstraction-based clustering
• Examples:
o PC-Clustering algorithm
o Leader clustering algorithm (sketched below)
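A minimal sketch of the leader algorithm named above: one scan of the data, with a distance threshold as input (illustrative code, not from the slides):

```python
# Leader clustering: a point joins the first leader within the distance
# threshold; otherwise it becomes a new leader. Single pass over the data.
import numpy as np

def leader_clustering(X, threshold):
    leaders = [X[0]]
    labels = [0]
    for x in X[1:]:
        for j, l in enumerate(leaders):
            if np.linalg.norm(x - l) <= threshold:
                labels.append(j)              # assign to an existing leader
                break
        else:
            leaders.append(x)                 # x becomes a new leader
            labels.append(len(leaders) - 1)
    return np.array(leaders), np.array(labels)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [0.1, 0.2]])
leaders, labels = leader_clustering(X, threshold=1.0)
print(len(leaders), labels)   # 2 leaders; one pass over the data
```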
Divide-and-conquer method
• The divide-and-conquer approach is an effective
strategy for addressing the challenge of clustering
large datasets that cannot be stored entirely in
main memory.
• To overcome this limitation, a common solution is
to process a portion of the dataset at a time and
store the relevant cluster representatives in
memory.
Divide-and-conquer method
Divide-and-conquer method
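A sketch of this strategy, assuming scikit-learn: cluster each memory-sized chunk, keep only its centroids as representatives, then cluster the representatives (chunk_size and k_per_chunk are illustrative parameters):

```python
# Divide-and-conquer clustering: process one chunk at a time, keep only
# cluster representatives in memory, then cluster the representatives.
import numpy as np
from sklearn.cluster import KMeans

def divide_and_conquer_kmeans(X, k, chunk_size=1000, k_per_chunk=10):
    reps = []
    for start in range(0, len(X), chunk_size):        # one chunk in memory at a time
        chunk = X[start:start + chunk_size]
        kc = min(k_per_chunk, len(chunk))
        reps.append(KMeans(n_clusters=kc, n_init=5).fit(chunk).cluster_centers_)
    reps = np.vstack(reps)
    # Final pass: cluster the representatives themselves.
    return KMeans(n_clusters=k, n_init=10).fit(reps).cluster_centers_

X = np.random.default_rng(0).normal(size=(5000, 2))
print(divide_and_conquer_kmeans(X, k=3))
```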