Chapter 5: Clustering
Definition
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Applications
• Group related documents for browsing
• Group genes and proteins that have similar
functionality
• Group stocks with similar price fluctuations
• Group users with similar buying mentalities
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• Land use: Identification of areas of similar land use in an earth
observation database
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Clustering is ambiguous
• There is no correct or incorrect solution for
clustering.
How many clusters?
Four Clusters
Two Clusters
Six Clusters
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Requirements of Clustering in Data
Mining
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Major Clustering Approaches
• Partitioning approach:
– Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or
objects) using some criterion either agglomerative or
divisive
– Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Partitional Clustering
Original Points A Partitional Clustering
Hierarchical Clustering
(Figures: the same four points p1–p4 shown as a traditional hierarchical clustering with its traditional dendrogram, and as a non-traditional hierarchical clustering with its non-traditional dendrogram.)
Clustering Algorithms
• Partitional
– K-means
– K-medoids
• Hierarchical
– Agglomerative
– Divisive
– BIRCH
• Density-based methods:
– DBSCAN
– OPTICS
K-Means Algorithm
• Each cluster is represented by the mean value of
the objects in the cluster
• Input : set of n objects, number of clusters (k)
• Output : set of k clusters
• Algo
– Arbitrarily select k objects from D & mark them as
initial cluster centers
– Repeat
• (re)assign each object to the cluster to which the object is
the most similar, based on the mean value of the objects in
the cluster;
• update the cluster means, that is, calculate the mean value
of the objects for each cluster;
• until no change;
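A minimal sketch of these steps in Python (assuming NumPy is available; the function name and variable names are illustrative, not from the slides):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Arbitrarily select k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the cluster means (keep the old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # until no change
            break
        centers = new_centers
    return labels, centers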
K-Means (second method)
• Step 1: Randomly assign objects to k clusters
• Step 2: Find the mean of each cluster
• Step 3: Re-assign objects to the cluster with
closest mean.
• Step 4: Go to step2
Repeat until no change.
Example 1
Given: {2,3,6,8,9,12,15,18,22} Assume k=3.
• Solution:
– Randomly partition given data set:
• K1 = 2,8,15 mean = 8.3
• K2 = 3,9,18 mean = 10
• K3 = 6,12,22 mean = 13.3
– Reassign
• K1 = 2,3,6,8,9 mean = 5.6
• K2 = { } (empty cluster; its mean is taken as 0)
• K3 = 12,15,18,22 mean = 16.75
– Reassign
• K1 = 3,6,8,9 mean = 6.5
• K2 = 2 mean = 2
• K3 = 12,15,18,22 mean = 16.75
– Reassign
• K1 = 6,8,9 mean = 7.67
• K2 = 2,3 mean = 2.5
• K3 = 12,15,18,22 mean = 16.75
– Reassign
• K1 = 6,8,9 mean = 7.67
• K2 = 2,3 mean = 2.5
• K3 = 12,15,18,22 mean = 16.75
• STOP
Example 2
Given {2,4,10,12,3,20,30,11,25}
Assume k=2.
Solution:
K1 = 2,3,4,10,11,12
K2 = 20, 25, 30
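For one-dimensional data like the two examples above, each number is treated as a point with a single feature. A quick check with scikit-learn's KMeans (assuming scikit-learn is available; with several random restarts it settles on the partition shown):

import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
for j in range(2):
    # expected clusters: [2, 3, 4, 10, 11, 12] and [20, 25, 30]
    print(sorted(data[km.labels_ == j].ravel()))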
K-Means (graph)
• Step1: Form k centroids, randomly
• Step2: Calculate the distance between the centroids and
each object
– Use the Euclidean distance to determine the minimum distance:
d(A,B) = √((x2-x1)² + (y2-y1)²)
• Step3: Assign objects to the k clusters based on minimum
distance
• Step4: Calculate the centroid of each cluster using
C = ((x1+x2+…+xn)/n , (y1+y2+…+yn)/n)
– Go to step 2.
– Repeat until no change in centroids.
Example 1
• There are four types of medicines and each
have two attributes, as shown below. Find a
way to group them into 2 groups based on
their features.
Medicine Weight(x) pH(y)
A 1 1
B 2 1
C 4 3
D 5 4
Solution
• Plot the values on a graph
• Mark any k centroids; here C1 = A = (1,1) and C2 = B = (2,1)
• Calculate the Euclidean distance of each point (A, B, C, D) from
the centroids:
• D = 0     1     3.61  5       (distances from C1)
      1     0     2.83  4.24    (distances from C2)
• Based on minimum distance, we assign points to
clusters: C1 = A
C2 = B, C, D
• Calculate new centroids: C1 = (1,1)
• C2 = ((2+4+5)/3 , (1+3+4)/3) = (11/3 , 8/3)
• D1 = 0     1     3.61  5       (distances from C1)
       3.14  2.36  0.47  1.89    (distances from the new C2)
• Objects clustering :
– A,B = group 1
– C,D = group 2
• Calculate new centroids:
• C1 = (3/2, 1)
• C2 = (9/2, 7/2)
• Mark the new centroids
• Continue the iteration until there is no change in the
centroids or clusters.
Final solution
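As a sanity check on this worked example (assuming scikit-learn is available), KMeans can be started from the same two initial centroids A and B:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D
init = np.array([[1, 1], [2, 1]], dtype=float)                # initial centroids: A and B
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # expected grouping: {A, B} and {C, D}
print(km.cluster_centers_)   # expected centroids: (1.5, 1) and (4.5, 3.5)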
Example 2
• Use K-means algorithm to create two clusters.
Given:
Example 3
Group the below points into 3 clusters
K-Medoid (PAM)
• Also called Partitioning Around Medoids.
• Step1: choose k medoids
• Step2: assign all points to the closest medoid
• Step3: form the distance matrix for each cluster
and choose the next best medoid, i.e., the
point closest to all other points in the cluster
– go to step2.
– Repeat until no change in any medoid (a code sketch follows)
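A minimal sketch of these steps (assuming NumPy; this is the simplified variant described above, which picks, inside each cluster, the point with the smallest total distance to the other members as the next medoid, not the full PAM swap search):

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)          # assign each point to the closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # next medoid = the member closest to all other points in the cluster
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):   # no change in any medoid
            break
        medoids = new_medoids
    return labels, medoids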
Hierarchical methods
• Works by grouping data objects into a tree of
clusters
• Uses distance (similarity) matrix as clustering
criteria
• We provide a termination condition
• Classified further as :
– Agglomerative hierarchical clustering
– Divisive hierarchical clustering
Agglomerative hierarchical clustering
• Data objects are grouped in a bottom-up
fashion
• Initially each data object is in its own cluster
• Then merge these atomic clusters into larger
and larger clusters, until all objects are in a
single cluster or until certain termination
conditions are satisfied
• The termination condition can be specified by the
user, e.g., a desired number of clusters
• An HAC clustering is typically visualized as a
dendrogram
• Dendrogram : a tree structure which
illustrates the hierarchical clustering
• Each level shows the clusters for that level
– Leaves : individual objects (singleton clusters)
– Root : one cluster containing all objects
Agglomerative hierarchical clustering
(Figure: points p1–p4 as a traditional hierarchical clustering and the corresponding traditional dendrogram.)
Agglomerative hierarchical clustering
Use the agglomerative algorithm on the
following data and draw the single-link
dendrogram.
Object X Y
A 2 2
B 3 2
C 1 1
D 3 1
E 1.5 0.5
Construct distance matrix
Use the Euclidean distance to determine the minimum distance:
d(A,B) = √((x2-x1)² + (y2-y1)²)
A B C D E
A 0
B 1 0
C 1.41 2.24 0
D 1.41 1 2 0
E 1.58 2.12 0.71 1.58 0
• SINGLE LINK :
• Step 1:
• Since d(C, E) is the minimum, we combine
clusters C and E
A B C,E D
A 0
B 1 0
C,E 1.41 2.12 0
D 1.41 1 1.58 0
• Step 2:
• A, B has minimum value therefore we merge
these two clusters
A,B C,E D
A,B 0
C,E 1.41 0
D 1 1.58 0
• Step 3:
• (A, B) and D have the minimum value, therefore we
merge these two clusters
• A,B,D C,E
A,B,D 0
C,E 1.41 0
• Step 4:
• We have two clusters to be combined
• Construct a dendrogram for the same
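The same single-link merges (and the dendrogram) can be reproduced with SciPy (assuming SciPy and matplotlib are available; the coordinates are those of objects A–E above):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])   # A, B, C, D, E
Z = linkage(points, method='single')   # single-link agglomerative clustering
print(Z)   # merge heights should match the steps above: 0.71, 1, 1, 1.41
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.show()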
• Different approaches to defining the distance
between clusters distinguish the different
algorithms :
• Single linkage clustering: check for the minimum
distance
• Complete linkage : check for the maximum distance
• Average linkage : the distance between any two
clusters A and B is taken to be the average of all
distances between pairs of objects “i” in A and “j”
in B, that is, the mean distance between elements
of each cluster
Agglomerative
• Step1: Make each object as a cluster
• Step2: Calculate the Euclidean distance from
every point to every other point. i.e.,
construct a Distance Matrix
• Step3: Identify two clusters with shortest
distance.
– Merge them
– Go to Step 2
– Repeat until all objects are in one cluster
Example
• Use the single-link technique to find clusters in
the given database.
Data points   X      Y
1             0.4    0.53
2             0.22   0.38
3             0.35   0.32
4             0.26   0.19
5             0.08   0.41
6             0.45   0.3
Plot given data
Construct a distance matrix
     1     2     3     4     5     6
1    0
2    0.24  0
3    0.22  0.15  0
4    0.37  0.20  0.15  0
5    0.34  0.14  0.28  0.29  0
6    0.23  0.25  0.11  0.22  0.39  0
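This distance matrix can be reproduced with SciPy as a quick check (assuming SciPy is available):

import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])
D = squareform(pdist(pts))    # full symmetric Euclidean distance matrix
print(np.round(D, 2))         # the lower triangle matches the matrix above (up to rounding)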
Identify two nearest clusters
Repeat process until all objects in
same cluster
Complete link
• Max distance matrix
Average link
• Average distance matrix
Object X Y
A 1 1
B 1.5 1.5
C 5 5
D 3 4
E 4 4
F 3 3.5
Solution
• D,F at distance 0.50
• A,B at distance 0.71
• E and (D,F) at 1
• (D, F , E) and C at distance 1.41
• (D,F,E,C) and (A,B) at distance 2.50
Divisive hierarchical clustering
• Data objects are grouped in a top-down manner
• Initially all objects are in one cluster
• The cluster is then subdivided into smaller pieces,
until each object forms a cluster on its own
or until some termination condition is satisfied
• Advantages :
– It is simple and outputs a hierarchy, a structure that
is more informative
– It does not require the number of clusters to be
specified in advance
• Disadvantages: the selection of a merge or split
point is critical, because once a group of objects is
merged or split, the method operates on the newly
generated clusters and cannot undo what was
done previously
Agglomerative vs Divisive
Agglomerative                                        Divisive
Start with points as individual clusters             Start with one, all-inclusive cluster
At each step, merge the closest pair of clusters,    At each step, split a cluster,
until only one cluster is left                       until each cluster contains a point
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
– Can never undo what was done previously
– Do not scale well: time complexity of at least O(n²), where n
is the total number of objects
• Integration of hierarchical & distance-based clustering
– BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
– CHAMELEON (1999): hierarchical clustering using dynamic
modeling
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
– Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
– Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
• Weakness: handles only numeric data, and is sensitive to the order of the data
records
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points: LS = X1 + X2 + … + XN
SS: square sum of the N points: SS = X1² + X2² + … + XN²
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
• CF = <n, LS, SS>
• where LS = the linear sum of the n points and SS = the square
sum of the n data points
• For 2 disjoint clusters C1 and C2, the CF of the merged cluster is
obtained by adding the CFs of C1 and C2,
• i.e., CF1+CF2 = <n1+n2, LS1+LS2, SS1+SS2>
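A small sketch of computing a CF from a set of points and merging two CFs (assuming NumPy; the function names are illustrative). It reproduces the example CF = (5, (16,30), (54,190)) above:

import numpy as np

def cf_from_points(points):
    """Clustering Feature of a set of points: (N, LS, SS)."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CF additivity: CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

n, ls, ss = cf_from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, ls, ss)   # 5, [16. 30.], [54. 190.]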
CF-Tree in BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies)
• Clustering feature:
– Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
– Registers crucial measurements for computing cluster and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
– A nonleaf node in a tree has descendants or “children”
– The nonleaf nodes store sums of the CFs of their children
• A CF tree has two parameters
– Branching factor: max no. of children
– Threshold: max diameter of sub-clusters stored at the leaf
nodes
The CF Tree Structure
• Root node: entries [CF1 | child1] [CF2 | child2] [CF3 | child3] … [CF6 | child6]
• Non-leaf node: entries [CF1 | child1] [CF2 | child2] [CF3 | child3] … [CF5 | child5]
• Leaf nodes: chained via prev/next pointers and hold CF entries only,
e.g. prev | CF1 | CF2 | … | CF6 | next and prev | CF1 | CF2 | … | CF4 | next
The Birch Algorithm
• Cluster diameter (average pairwise distance within the cluster):
D = √( (1 / (n(n-1))) · Σi Σj (xi - xj)² )
• For each point in the input
– Find the closest leaf entry
– Add the point to the leaf entry and update the CF
– If the entry diameter > max_diameter, then split the leaf, and possibly its parents
• Algorithm is O(n)
• Concerns
– Sensitive to the insertion order of data points
– Since the size of leaf nodes is fixed, clusters may not be so natural
– Clusters tend to be spherical given the radius and diameter measures
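As a usage sketch (assuming scikit-learn is available), the Birch estimator takes exactly the CF-tree parameters described above, the leaf threshold (maximum sub-cluster diameter) and the branching factor:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# three synthetic blobs of 50 points each (illustrative data, not from the slides)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 4))])

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)   # phase 1 builds the CF tree; the leaf entries are then clustered
print(np.bincount(labels))      # roughly 50 points per cluster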
BIRCH: The Idea by Example
(The original slides show the CF tree after each insertion of data objects 1–6; the recoverable narrative is summarized below.)
• Object 1 is inserted and forms Cluster1 in a leaf node.
• Object 2: if Cluster1 becomes too large (not compact) by adding object 2, then split the cluster. The leaf node now holds two entries, Cluster1 and Cluster2.
• Object 3: entry 1 is the closest to object 3. If Cluster1 becomes too large by adding object 3, then split the cluster again. The leaf node now holds three entries, Cluster1, Cluster2 and Cluster3.
• Object 4: entry 3 is the closest to object 4. Cluster2 remains compact when adding object 4, so object 4 is added to Cluster2.
• Object 5: entry 2 is the closest to object 5. Cluster3 becomes too large by adding object 5, so split Cluster3? BUT there is a limit to the number of entries a node can have, so the node itself is split: the root becomes a non-leaf node with two entries, each pointing to a leaf node (with entries 1.1, 1.2 and 2.1, 2.2), and Cluster4 is created.
• Object 6: entry 1.2 is the closest to object 6. Cluster3 remains compact when adding object 6, so object 6 is added to Cluster3.
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts
• Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition:
|NEps (q)| ≥ MinPts
(Figure: an example Eps-neighbourhood with Eps = 1 cm and MinPts = 5, showing points p and q.)
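A minimal sketch of these two definitions (assuming NumPy; the function names are illustrative):

import numpy as np

def eps_neighbourhood(X, q, eps):
    """Indices of points of X within distance eps of point X[q], i.e. N_Eps(q)."""
    d = np.linalg.norm(X - X[q], axis=1)
    return np.where(d <= eps)[0]

def is_core_point(X, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighbourhood(X, q, eps)) >= min_pts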
Density-Reachable and Density-Connected
• Density-reachable:
– A point p is density-reachable from a
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn =
p such that pi+1 is directly density-
reachable from pi
• Density-connected
– A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are
density-reachable from o w.r.t. Eps
and MinPts
(Figures: a chain of points p1, …, pn with p1 = q and pn = p illustrating density-reachability, and a point o from which both p and q are density-reachable, illustrating density-connectivity.)
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
• Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
(Figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
• If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)
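A usage sketch with scikit-learn (assuming it is available); DBSCAN takes exactly the two parameters above, and points labelled -1 are noise:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus scattered noise (illustrative data, not from the slides)
X = np.vstack([rng.normal((0, 0), 0.2, (100, 2)),
               rng.normal((3, 3), 0.2, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))   # e.g. {0, 1, -1}: two clusters plus noise points (-1)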
DBSCAN: Sensitive to Parameters
OPTICS (Ordering Points To Identify the Clustering Structure)
– Based on DBSCAN.
– Does not produce clusters explicitly.
– Rather, it generates an ordering of the data objects representing the
density-based clustering structure.
• Produces a special order of the database w.r.t. its density-
based clustering structure
• This cluster-ordering contains information equivalent to the density-
based clusterings corresponding to a broad range of
parameter settings
• Good for both automatic and interactive cluster analysis,
including finding the intrinsic clustering structure
• Can be represented graphically or using visualization
techniques
OPTICS: Some Extension from DBSCAN
• Two values need to be stored for each object—core-distance and
reachability-distance:
– The core-distance of an object p is the smallest Ɛ’ value that makes
{p} a core object. If p is not a core object, the core-distance of
p is undefined.
– The reachability-distance of an object q with respect to another
object p is the greater value of the core-distance of p and the
Euclidean distance between p and q. If p is not a core object,
the reachability-distance between p and q is undefined.
Core Distance & Reachability Distance
OPTICS: Some Extension from DBSCAN
• The OPTICS algorithm creates an ordering of the objects in a
database, additionally storing the core-distance and a suitable
reachability distance for each object.
• An algorithm was proposed to extract clusters based on the
ordering information produced by OPTICS.
• Such information is sufficient for the extraction of all density-
based clusterings with respect to any distance Ɛ’ that is smaller
than the distance Ɛ used in generating the order.
(Reachability plot: reachability-distance plotted against the cluster-order of the objects; undefined reachability values appear as gaps.)
OPTICS
If we change the minimum number of points, we still get the same clusters; only the
reachability plot changes according to MinPts
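A minimal sketch (assuming scikit-learn and matplotlib are available) that computes the cluster ordering and reachability distances behind such a reachability plot:

import numpy as np
from sklearn.cluster import OPTICS
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# two blobs of different density (illustrative data, not from the slides)
X = np.vstack([rng.normal((0, 0), 0.2, (100, 2)),
               rng.normal((4, 4), 0.6, (100, 2))])

opt = OPTICS(min_samples=5).fit(X)
reach = opt.reachability_[opt.ordering_]   # reachability-distance in cluster order (inf = undefined)
plt.plot(reach)                            # valleys in this plot correspond to clusters
plt.xlabel("cluster-order of the objects")
plt.ylabel("reachability-distance")
plt.show()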
Important Questions
• Explain K-means algorithm
• Difference between clustering and
classification
• Write a note on DBSCAN and BIRCH
• Difference between agglomerative and
divisive
• Numerical on:
– k-means (both single-attribute and graphical)
– Agglomerative clustering