Clustering
2
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
3
Supervised learning vs.
unsupervised learning
• Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
– These patterns are then utilized to predict the
values of the target attribute in future data
instances.
• Unsupervised learning: The data have no
target attribute.
– We want to explore the data to find some intrinsic
structures in them.
4
Clustering
• Clustering is a technique for finding similarity groups in
data, called clusters. I.e.,
– it groups data instances that are similar to (near) each other in
one cluster and data instances that are very different (far away)
from each other into different clusters.
• Clustering is often called an unsupervised learning task as
no class values denoting an a priori grouping of the data
instances are given, which is the case in supervised
learning.
• Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
– In fact, association rule mining is also unsupervised
5
An illustration
• The data set has three natural groups of data points, i.e.,
3 natural clusters.
6
What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes
together to make “small”, “medium” and
“large” T-Shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers
according to their similarities
– To do targeted marketing.
7
What is clustering for? (cont…)
• Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
– To produce a topic hierarchy
• In fact, clustering is one of the most utilized
data mining techniques.
– It has a long history, and is used in almost every field,
e.g., medicine, psychology, botany, sociology,
biology, archeology, marketing, insurance, libraries,
etc.
– In recent years, due to the rapid increase of online
documents, text clustering has become important.
8
Aspects of clustering
• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
– …
• A distance (similarity, or dissimilarity) function
• Clustering quality
– Inter-cluster distance → maximized
– Intra-cluster distance → minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
Road map
9
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
10
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r,
and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given
data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user
11
K-means algorithm
• Given k, the k-means algorithm works as
follows:
1)Randomly choose k data points (seeds) to be the
initial centroids, cluster centers
2)Assign each data point to the closest centroid
3)Re-compute the centroids using the current
cluster memberships.
4)If a convergence criterion is not met, go to 2).
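As a concrete illustration, here is a minimal sketch of these four steps in Python/NumPy (the data matrix `X`, with one row per data point, and the number of clusters `k` are assumed inputs; this is a teaching sketch, not a tuned implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means on an (n, r) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # 1) randomly choose k data points (seeds) as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each data point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) re-compute the centroids from the current cluster memberships
        #    (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```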
12
K-means algorithm – (cont …)
13
Stopping/convergence criterion
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE):

SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2        (1)

– Cj is the jth cluster, mj is the centroid of cluster Cj
(the mean vector of all the data points in Cj), and
dist(x, mj) is the distance between data point x
and centroid mj.
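A short sketch of how the SSE in Eq. (1) could be computed for a given assignment, using Euclidean distance and the same `X`, `labels`, `centroids` names as the k-means sketch above (assumed names, not from the slides):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum over all points of the squared distance to their own cluster centroid."""
    diffs = X - centroids[labels]        # x - m_j, where j is the point's cluster
    return float((diffs ** 2).sum())
```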
14
An example
15
An example (cont …)
16
An example distance function
17
A disk version of k-means
• K-means can be implemented with data on
disk
– In each iteration, it scans the data once.
– as the centroids can be computed incrementally
• It can be used to cluster large datasets that do
not fit in main memory
• We need to control the number of iterations
– In practice, a limit is set (e.g., < 50).
• Not the best method. There are other scale-up
algorithms, e.g., BIRCH.
18
A disk version of k-means (cont …)
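A rough sketch (not the deck's exact algorithm) of one disk-based iteration, assuming a hypothetical `read_points` callable that streams data points from disk one at a time:

```python
import numpy as np

def disk_kmeans_iteration(read_points, centroids):
    """One scan over the data: accumulate per-cluster sums and counts,
    then recompute centroids incrementally without loading X into memory."""
    k, r = centroids.shape
    sums = np.zeros((k, r))
    counts = np.zeros(k, dtype=int)
    for x in read_points():                          # one pass over the disk file
        j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        sums[j] += x
        counts[j] += 1
    nonempty = counts > 0                            # keep old centroid if a cluster is empty
    centroids = centroids.copy()
    centroids[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centroids
```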
19
Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are usually small, k-means is considered a linear
algorithm.
• K-means is the most popular clustering algorithm.
• Note that: it terminates at a local optimum if SSE is used.
The global optimum is hard to find due to complexity.
20
Weaknesses of k-means
• The algorithm is only applicable if the mean is
defined.
– For categorical data, use k-modes: the centroid is
represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different values.
21
Weaknesses of k-means: Problems with
outliers
22
Weaknesses of k-means: To deal with
outliers
• One method is to remove some data points in the
clustering process that are much further away from the
centroids than other data points.
– To be safe, we may want to monitor these possible outliers over
a few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data
points, the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification
23
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.
24
Weaknesses of k-means (cont …)
• If we use different seeds: good results
There are some methods to help choose good seeds.
25
Weaknesses of k-means (cont …)
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
26
K-means summary
• Despite weaknesses, k-means is still the most
popular algorithm due to its simplicity,
efficiency and
– other clustering algorithms have their own lists of
weaknesses.
• No clear evidence that any other clustering
algorithm performs better in general
– although they may be more suitable for some
specific types of data or applications.
• Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters!
27
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
28
Common ways to represent
clusters
• Use the centroid of each cluster to represent
the cluster.
– compute the radius and
– standard deviation of the cluster to determine its
spread in each dimension
– The centroid representation alone works well if
the clusters are of the hyper-spherical shape.
– If clusters are elongated or are of other shapes,
centroids are not sufficient
29
Using classification model
• All the data points in a
cluster are regarded to
have the same class label,
e.g., the cluster ID.
– run a supervised learning
algorithm on the data to
find a classification model.
30
Use frequent values to represent
cluster
• This method is mainly for clustering of
categorical data (e.g., k-modes clustering).
• Main method used in text clustering, where a
small set of frequent words in each cluster is
selected to represent the cluster.
31
Clusters of arbitrary shapes
• Hyper-elliptical and hyper-spherical
clusters are usually easy to
represent, using their centroid
together with spreads.
• Irregular shape clusters are hard to
represent. They may not be useful in
some applications.
– Using centroids is not suitable (upper
figure) in general
– K-means clusters may be more useful
(lower figure), e.g., for making two sizes of
T-shirts.
32
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
33
Hierarchical Clustering
• Produce a nested sequence of clusters, a tree, also
called Dendrogram.
34
Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single cluster
(i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data
points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster is
recursively divided further
– stops when only singleton clusters of individual data points
remain, i.e., each cluster with only a single point
35
Agglomerative clustering
It is more popular than divisive methods.
• At the beginning, each data point forms a
cluster (also called a node).
• Merge nodes/clusters that have the least
distance.
• Go on merging
• Eventually all nodes belong to one cluster
36
Agglomerative clustering algorithm
37
An example: working of the
algorithm
38
Measuring the distance of two
clusters
• A few ways to measure distances of two
clusters.
• Results in different variations of the algorithm.
– Single link
– Complete link
– Average link
– Centroids
– …
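For illustration, SciPy's hierarchical-clustering routines implement these variations directly; a minimal sketch on toy data (the data and the parameter choices are just for demonstration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                    # 20 toy points in 2-D

Z_single   = linkage(X, method="single")     # distance = closest pair of points
Z_complete = linkage(X, method="complete")   # distance = furthest pair of points
Z_average  = linkage(X, method="average")    # distance = average of all pairwise distances
Z_centroid = linkage(X, method="centroid")   # distance = between cluster centroids

# cut the single-link dendrogram into 3 flat clusters
labels = fcluster(Z_single, t=3, criterion="maxclust")
```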
39
Single link method
• The distance between two
clusters is the distance
between two closest data
points in the two clusters,
one data point from each
cluster.
• It can find arbitrarily
shaped clusters, but
– It may cause the
undesirable “chain effect”
by noisy points
Two natural clusters are
split into two
40
Complete link method
• The distance between two clusters is the distance of
two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away
41
Average link and centroid methods
• Average link: A compromise between
– the sensitivity of complete-link clustering to
outliers and
– the tendency of single-link clustering to form long
chains that do not correspond to the intuitive
notion of clusters as compact, spherical objects.
– In this method, the distance between two clusters
is the average distance of all pair-wise distances
between the data points in two clusters.
• Centroid method: In this method, the distance
between two clusters is the distance between
their centroids
42
The complexity
• All the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average links can be done in O(n² log n).
• Due to the complexity, they are hard to use for large data sets.
– Sampling
– Scale-up methods (e.g., BIRCH).
43
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
44
Distance functions
• Key to clustering. “Similarity” and
“dissimilarity” are also commonly used terms.
• There are numerous distance functions for
– Different types of data
• Numeric data
• Nominal data
– Different specific applications
45
Distance functions for numeric attributes
• Most commonly used functions are
– Euclidean distance and
– Manhattan (city block) distance
• We denote distance with: dist(xi, xj), where xi
and xj are data points (vectors)
• They are special cases of the Minkowski distance,
where h is a positive integer:

dist(x_i, x_j) = \left( (x_{i1} - x_{j1})^h + (x_{i2} - x_{j2})^h + \cdots + (x_{ir} - x_{jr})^h \right)^{1/h}
46
Euclidean distance and Manhattan
distance
• If h = 2, it is the Euclidean distance
• If h = 1, it is the Manhattan distance
• Weighted Euclidean distance
dist(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2}   (Euclidean)

dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ir} - x_{jr}|   (Manhattan)

dist(x_i, x_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_r (x_{ir} - x_{jr})^2}   (weighted Euclidean)
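A quick sketch of these distance functions in NumPy (`xi` and `xj` are 1-D arrays of length r; the weight vector `w` for the weighted variant is an assumed input):

```python
import numpy as np

def minkowski(xi, xj, h):
    return float((np.abs(xi - xj) ** h).sum() ** (1.0 / h))

def euclidean(xi, xj):                       # Minkowski with h = 2
    return float(np.sqrt(((xi - xj) ** 2).sum()))

def manhattan(xi, xj):                       # Minkowski with h = 1
    return float(np.abs(xi - xj).sum())

def weighted_euclidean(xi, xj, w):           # w: one weight per attribute
    return float(np.sqrt((w * (xi - xj) ** 2).sum()))
```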
47
Squared distance and Chebychev
distance
• Squared Euclidean distance: places
progressively greater weight on data points
that are further apart.
• Chebychev distance: used when one wants to define two
data points as “different” if they differ
on any one of the attributes.
dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2   (squared Euclidean)

dist(x_i, x_j) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \ldots, |x_{ir} - x_{jr}|)   (Chebychev)
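Continuing the sketch above for these two variants:

```python
import numpy as np

def squared_euclidean(xi, xj):
    return float(((xi - xj) ** 2).sum())

def chebychev(xi, xj):
    return float(np.abs(xi - xj).max())
```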
48
Distance functions for binary and
nominal attributes
• Binary attribute: has two values or states but
no ordering relationships, e.g.,
– Gender: male and female.
• We use a confusion matrix to introduce the
distance functions/measures.
• Let the ith and jth data points be xi and xj
(vectors)
49
Confusion matrix
50
Symmetric binary attributes
• A binary attribute is symmetric if both of its
states (0 and 1) have equal importance, and
carry the same weights, e.g., male and female
of the attribute Gender
• Distance function: Simple Matching
Coefficient, proportion of mismatches of their
values
dist(x_i, x_j) = \frac{b + c}{a + b + c + d}
51
Symmetric binary attributes:
example
52
Asymmetric binary attributes
• Asymmetric: if one of the states is more
important or more valuable than the other.
– By convention, state 1 represents the more
important state, which is typically the rare or
infrequent state.
– Jaccard coefficient is a popular measure
– We can have some variations, adding weights
dist(x_i, x_j) = \frac{b + c}{a + b + c}
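A small sketch of both binary distance measures, using the usual convention that a counts the attributes where both points are 1, b and c count the two kinds of mismatch, and d the attributes where both are 0:

```python
import numpy as np

def binary_distances(xi, xj):
    """xi, xj: 1-D 0/1 arrays over the same binary attributes."""
    a = int(np.sum((xi == 1) & (xj == 1)))   # both 1
    b = int(np.sum((xi == 1) & (xj == 0)))   # mismatch: 1 in xi, 0 in xj
    c = int(np.sum((xi == 0) & (xj == 1)))   # mismatch: 0 in xi, 1 in xj
    d = int(np.sum((xi == 0) & (xj == 0)))   # both 0
    smc = (b + c) / (a + b + c + d)          # symmetric: simple matching
    jaccard = (b + c) / (a + b + c)          # asymmetric: Jaccard distance (ignores d)
    return smc, jaccard
```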
53
Nominal attributes
• Nominal attributes: with more than two states
or values.
– the commonly used distance measure is also
based on the simple matching method.
– Given two data points xi and xj, let the number of
attributes be r, and the number of values that
match in xi and xj be q.
dist(x_i, x_j) = \frac{r - q}{r}
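A one-function sketch of this simple matching distance for nominal attributes:

```python
import numpy as np

def nominal_distance(xi, xj):
    """xi, xj: sequences of nominal values over the same r attributes."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    r = len(xi)
    q = int(np.sum(xi == xj))                # number of matching values
    return (r - q) / r
```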
54
Distance function for text documents
• A text document consists of a sequence of sentences and
each sentence consists of a sequence of words.
• To simplify: a document is usually considered a “bag” of
words in document clustering.
– Sequence and position of words are ignored.
• A document is represented with a vector just like a
normal data point.
• It is common to use similarity to compare two
documents rather than distance.
– The most commonly used similarity function is the cosine
similarity. We will study this later.
55
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
56
Data standardization
• In the Euclidean space, standardization of attributes is
recommended so that all attributes can have equal
impact on the computation of distances.
• Consider the following pair of data points
– xi: (0.1, 20) and xj: (0.9, 720).
• The distance is almost completely dominated by (720-20)
= 700.
• Standardize attributes: to force the attributes to have a
common value range
dist(x_i, x_j) = \sqrt{(0.9 - 0.1)^2 + (720 - 20)^2} = 700.000457
57
Interval-scaled attributes
• Their values are real numbers following a
linear scale.
– The difference in Age between 10 and 20 is the
same as that between 40 and 50.
– The key idea is that intervals keep the same
importance through out the scale
• Two main approaches to standardize interval
scaled attributes: range and z-score. Let f be an attribute:

range(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)}
58
Interval-scaled attributes (cont …)
• Z-score: transforms the attribute values so that they have
a mean of zero and a mean absolute deviation of 1. The
mean absolute deviation of attribute f, denoted by sf, is
computed as follows:

s_f = \frac{1}{n}\left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right),

where m_f = \frac{1}{n}\left( x_{1f} + x_{2f} + \cdots + x_{nf} \right).

Z-score: z(x_{if}) = \frac{x_{if} - m_f}{s_f}
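A sketch of both standardizations applied column-wise to a numeric data matrix (NumPy; constant columns would need a guard against division by zero):

```python
import numpy as np

def range_standardize(X):
    """Scale each attribute (column) to [0, 1] using its min and max."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def zscore_standardize(X):
    """Center each attribute on its mean, divide by its mean absolute deviation."""
    m = X.mean(axis=0)
    s = np.abs(X - m).mean(axis=0)           # mean absolute deviation s_f
    return (X - m) / s
```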
59
Ratio-scaled attributes
• Numeric attributes, but unlike interval-scaled
attributes, their scales are exponential,
• For example, the total number of microorganisms
that evolve in a time t is approximately given by
A e^{Bt},
– where A and B are some positive constants.
• Do a log transform: \log(x_{if})
– Then treat it as an interval-scaled attribute
60
Nominal attributes
• Sometimes, we need to transform nominal
attributes to numeric attributes.
• Transform nominal attributes to binary
attributes.
– The number of values of a nominal attribute is v.
– Create v binary attributes to represent them.
– If a data instance for the nominal attribute takes a
particular value, the value of its binary attribute is
set to 1, otherwise it is set to 0.
• The resulting binary attributes can be used as
numeric attributes, with two values, 0 and 1.
61
Nominal attributes: an example
• Nominal attribute fruit: has three values,
– Apple, Orange, and Pear
• We create three binary attributes called,
Apple, Orange, and Pear in the new data.
• If a particular data instance in the original data
has Apple as the value for fruit,
– then in the transformed data, we set the value of
the attribute Apple to 1, and
– the values of attributes Orange and Pear to 0
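A tiny sketch of this transformation for the fruit example (plain Python; the attribute names are just illustrative):

```python
def one_hot(value, categories=("Apple", "Orange", "Pear")):
    """Map one nominal value to v binary attributes (here v = 3)."""
    return {c: int(value == c) for c in categories}

print(one_hot("Apple"))   # {'Apple': 1, 'Orange': 0, 'Pear': 0}
```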
62
Ordinal attributes
• Ordinal attribute: an ordinal attribute is like a
nominal attribute, but its values have a
numerical ordering. E.g.,
– Age attribute with values: Young, MiddleAge and
Old. They are ordered.
– Common approach to standardization: treat it as
an interval-scaled attribute.
63
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
64
Mixed attributes
• Our distance functions given are for data with
all numeric attributes, or all nominal
attributes, etc.
• Practical data can have attributes of different types:
– Any subset of the 6 types of attributes,
• interval-scaled,
• symmetric binary,
• asymmetric binary,
• ratio-scaled,
• ordinal and
• nominal
65
Convert to a single type
• One common way of dealing with mixed
attributes is to
– Decide the dominant attribute type, and
– Convert the other types to this type.
• E.g., if most attributes in a data set are
interval-scaled,
– we convert ordinal attributes and ratio-scaled
attributes to interval-scaled attributes.
– It is also appropriate to treat symmetric binary
attributes as interval-scaled attributes.
66
Convert to a single type (cont …)
• It does not make much sense to convert a
nominal attribute or an asymmetric binary
attribute to an interval-scaled attribute,
– but it is still frequently done in practice by
assigning some numbers to them according to
some hidden ordering, e.g., prices of the fruits
• Alternatively, a nominal attribute can be
converted to a set of (symmetric) binary
attributes, which are then treated as numeric
attributes.
67
Combining individual distances
• This approach computes individual attribute
distances and then combines them:

dist(x_i, x_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f}\, d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}}

– Here d_{ij}^{f} is the distance between x_i and x_j on attribute f,
and \delta_{ij}^{f} is 1 if attribute f contributes to the computation
for this pair and 0 otherwise.
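A hedged sketch of this combination (the per-attribute distances and indicators are assumed to be computed beforehand, e.g. with the functions sketched in the earlier distance sections):

```python
def combined_distance(per_attr_dists, indicators):
    """per_attr_dists: d_ij^f values (each in [0, 1]), one per attribute f.
    indicators: delta_ij^f values (1 = attribute f contributes, 0 = skip)."""
    num = sum(d * delta for d, delta in zip(per_attr_dists, indicators))
    den = sum(indicators)
    return num / den if den > 0 else 0.0
```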
68
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
69
How to choose a clustering
algorithm
• Clustering research has a long history. A vast collection of
algorithms are available.
– We only introduced several main algorithms.
• Choosing the “best” algorithm is a challenge.
– Every algorithm has limitations and works well with certain data
distributions.
– It is very hard, if not impossible, to know what distribution the
application data follow. The data may not fully follow any “ideal”
structure or distribution required by the algorithms.
– One also needs to decide how to standardize the data, to
choose a suitable distance function and to select other
parameter values.
70
Choose a clustering algorithm (cont
…)
• Due to these complexities, the common practice is to
– run several algorithms using different distance functions and
parameter settings, and
– then carefully analyze and compare the results.
• The interpretation of the results must be based on
insight into the meaning of the original data together
with knowledge of the algorithms used.
• Clustering is highly application dependent and to certain
extent subjective (personal preferences).
71
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
72
Cluster Evaluation: hard problem
• The quality of a clustering is very hard to
evaluate because
– We do not know the correct clusters
• Some methods are used:
– User inspection
• Study centroids, and spreads
• Rules from a decision tree.
• For text documents, one can read some documents in
clusters.
73
Cluster evaluation: ground truth
• We use some labeled data (for classification)
• Assumption: Each class is a cluster.
• After clustering, a confusion matrix is
constructed. From the matrix, we compute
various measurements, entropy, purity,
precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck).
The clustering method produces k clusters, which
divides D into k disjoint subsets, D1, D2, …, Dk.
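As an illustration of the entropy and purity measures mentioned above, here is a small sketch using their standard definitions, weighted by cluster size (the cluster-vs-class count matrix is the confusion matrix referred to on this slide):

```python
import numpy as np

def entropy_purity(counts):
    """counts[i, j] = number of points of class c_j that fall in cluster D_i."""
    counts = np.asarray(counts, dtype=float)
    sizes = counts.sum(axis=1)                      # cluster sizes |D_i|
    probs = counts / sizes[:, None]                 # Pr_i(c_j) within each cluster
    safe = np.where(probs > 0, probs, 1.0)          # log2(1) = 0, so zero entries vanish
    cluster_entropy = -(probs * np.log2(safe)).sum(axis=1)
    cluster_purity = probs.max(axis=1)
    weights = sizes / sizes.sum()                   # |D_i| / |D|
    return float(weights @ cluster_entropy), float(weights @ cluster_purity)
```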
74
Evaluation measures: Entropy
75
Evaluation measures: purity
76
An example
77
A remark about ground truth
evaluation
• Commonly used to compare different clustering
algorithms.
• A real-life data set for clustering has no class labels.
– Thus although an algorithm may perform very well on some
labeled data sets, there is no guarantee that it will perform well on the
actual application data at hand.
• The fact that it performs well on some labeled data sets
does give us some confidence in the quality of the
algorithm.
• This evaluation method is said to be based on external
data or information.
78
Evaluation based on internal information
• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a
cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used
measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids
should be far away from one another.
• In most applications, expert judgments are
still the key.
79
Indirect evaluation
• In some applications, clustering is not the primary task,
but used to help perform another task.
• We can use the performance on the primary task to
compare clustering methods.
• For instance, in an application, the primary task is to
provide recommendations on book purchasing to online
shoppers.
– If we can cluster books according to their features, we might be
able to provide better recommendations.
– We can evaluate different clustering algorithms based on how
well they help with the recommendation task.
– Here, we assume that the recommendation can be reliably
evaluated.
80
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
81
Holes in data space
• All the clustering algorithms only group data.
• Clusters only represent one aspect of the
knowledge in the data.
• Another aspect that we have not studied is
the holes.
– A hole is a region in the data space that contains
no or few data points. Reasons:
• insufficient data in certain areas, and/or
• certain attribute-value combinations are not possible or
seldom occur.
82
Holes are useful too
• Although clusters are important, holes in the
space can be quite useful too.
• For example, in a disease database
– we may find that certain symptoms and/or test
values do not occur together, or
– when a certain medicine is used, some test values
never go beyond certain ranges.
• Discovery of such information can be
important in medical domains because
– it could mean the discovery of a cure to a disease
or some biological laws.
83
Data regions and empty regions
• Given a data space, separate
– data regions (clusters) and
– empty regions (holes, with few or no data points).
• Use a supervised learning technique, i.e.,
decision tree induction, to separate the two
types of regions.
• Due to the use of a supervised learning
method for an unsupervised learning task,
– an interesting connection is made between the
two types of learning paradigms.
84
Supervised learning for unsupervised
learning
• Decision tree algorithm is not directly applicable.
– it needs at least two classes of data.
– A clustering data set has no class label for each data point.
• The problem can be dealt with by a simple idea.
– Regard each point in the data set to have a class label Y.
– Assume that the data space is uniformly distributed with
another type of points, called non-existing points. We give
them the class, N.
• With the N points added, the problem of partitioning the
data space into data and empty regions becomes a
supervised classification problem.
85
An example
A decision tree method is used for partitioning in (B).
86
Can it be done without adding N
points?
• Yes.
• Physically adding N points increases the size of
the data and thus the running time.
• More importantly: it is unlikely that we can
have points truly uniformly distributed in a
high dimensional space as we would need an
exponential number of points.
• Fortunately, no need to physically add any N
points.
– We can compute them when needed
87
Characteristics of the approach
• It provides representations of the resulting data and
empty regions in terms of hyper-rectangles, or rules.
• It detects outliers automatically. Outliers are data points in
an empty region.
• It may not use all attributes in the data just as in a normal
decision tree for supervised learning.
– It can automatically determine what attributes are useful.
Subspace clustering …
• Drawback: data regions of irregular shapes are hard to
handle since decision tree learning only generates hyper-
rectangles (formed by axis-parallel hyper-planes), which
are rules.
88
Building the Tree
• The main computation in decision tree building is to
evaluate entropy (for information gain):
• Can it be evaluated without adding N points? Yes.
• Pr(cj) is the probability of class cj in data set D, and |C| is
the number of classes, Y and N (2 classes).
– To compute Pr(cj), we only need the number of Y (data) points
and the number of N (non-existing) points.
– We already have Y (or data) points, and we can compute the
number of N points on the fly. Simple: as we assume that the N
points are uniformly distributed in the space.
entropy(D) = -\sum_{j=1}^{|C|} \Pr(c_j) \log_2 \Pr(c_j)
89
An example
• The space has 25 data (Y) points and 25 N points.
Assume the system is evaluating a possible cut S.
– # N points on the left of S is 25 * 4/10 = 10. The number of Y
points is 3.
– Likewise, # N points on the right of S is 15 (= 25 - 10).The
number of Y points is 22.
• With these numbers, entropy can be computed.
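A quick check of this calculation with the entropy formula above (a minimal sketch; the counts are those given on the slide):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

left, right = [3, 10], [22, 15]          # [Y, N] counts on each side of cut S
n = sum(left) + sum(right)               # 50 points in total
after = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
# entropy([25, 25]) = 1.0 before the cut; after ≈ 0.92,
# so the information gain of cut S is about 0.08
```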
90
How many N points to add?
• We add a different number of N points at each different
node.
– The number of N points for the current node E is determined
by the following rule (note that at the root node, the number
of inherited N points is 0):
91
An example
92
How many N points to add?
(cont…)
• Basically, for a Y node (which has more data
points), we increase N points so that
#Y = #N
• The number of N points is not reduced if the
current node is an N node (an N node has
more N points than Y points).
– A reduction may cause outlier Y points to form Y
nodes (a Y node has an equal number of Y points
as N points or more).
– Then data regions and empty regions may not be
separated well.
93
Building the decision tree
• Using the above ideas, a decision tree can be
built to separate data regions and empty
regions.
• The actual method is more sophisticated as a
few other tricky issues need to be handled in
– tree building and
– tree pruning.
94
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
95
Summary
• Clustering has a long history and is still an active research area
– There are a huge number of clustering algorithms
– More are still coming every year.
• We only introduced several main algorithms. There are
many others, e.g.,
– density based algorithm, sub-space clustering, scale-up methods,
neural networks based methods, fuzzy clustering, co-clustering,
etc.
• Clustering is hard to evaluate, but very useful in practice.
This partially explains why there are still a large number of
clustering algorithms being devised every year.
• Clustering is highly application dependent and to some
extent subjective.

More Related Content

PDF
[ML]-Unsupervised-learning_Unit2.ppt.pdf
PPTX
Unsupervised Learning.pptx
PPT
15857 cse422 unsupervised-learning
PPT
K_MeansK_MeansK_MeansK_MeansK_MeansK_MeansK_Means.ppt
PPTX
machine learning - Clustering in R
PPTX
Unsupervised%20Learninffffg (2).pptx. application
PPT
CS583-unsupervised-learning.ppt learning
PPT
CS583-unsupervised-learning.ppt
[ML]-Unsupervised-learning_Unit2.ppt.pdf
Unsupervised Learning.pptx
15857 cse422 unsupervised-learning
K_MeansK_MeansK_MeansK_MeansK_MeansK_MeansK_Means.ppt
machine learning - Clustering in R
Unsupervised%20Learninffffg (2).pptx. application
CS583-unsupervised-learning.ppt learning
CS583-unsupervised-learning.ppt

Similar to clustering using different methods in .pdf (20)

PPT
CS583-unsupervised-learning CS583-unsupervised-learning.ppt
PPT
unsupervised-learning AI information and analytics
PPT
Lecture11_ Intro to clustering and K-means algorithm.ppt
PPT
Lecture11_ Intro to clustering and K-means algorithm.ppt
PPT
Chap8 basic cluster_analysis
PPTX
Unsupervised learning (clustering)
PPTX
K means clustring @jax
PPTX
Cluster Analysis
PPTX
Cluster Analysis
PPTX
Cluster Analysis
PPTX
K-Means clustring @jax
PDF
Unsupervised learning clustering
PPTX
K means Clustering - algorithm to cluster n objects
PDF
Machine Learning, Statistics And Data Mining
PPT
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
PPTX
Unsupervised learning Algorithms and Assumptions
PPTX
K MEANS CLUSTERING - UNSUPERVISED LEARNING
PDF
Machine Learning - Clustering
PPT
26-Clustering MTech-2017.ppt
PPTX
Clustering on DSS
CS583-unsupervised-learning CS583-unsupervised-learning.ppt
unsupervised-learning AI information and analytics
Lecture11_ Intro to clustering and K-means algorithm.ppt
Lecture11_ Intro to clustering and K-means algorithm.ppt
Chap8 basic cluster_analysis
Unsupervised learning (clustering)
K means clustring @jax
Cluster Analysis
Cluster Analysis
Cluster Analysis
K-Means clustring @jax
Unsupervised learning clustering
K means Clustering - algorithm to cluster n objects
Machine Learning, Statistics And Data Mining
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Unsupervised learning Algorithms and Assumptions
K MEANS CLUSTERING - UNSUPERVISED LEARNING
Machine Learning - Clustering
26-Clustering MTech-2017.ppt
Clustering on DSS
Ad

Recently uploaded (20)

PDF
Lecture1 pattern recognition............
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
PPTX
Computer network topology notes for revision
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
SAP 2 completion done . PRESENTATION.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
Database Infoormation System (DBIS).pptx
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
Introduction to the R Programming Language
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
1_Introduction to advance data techniques.pptx
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
Lecture1 pattern recognition............
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
Computer network topology notes for revision
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Introduction to Knowledge Engineering Part 1
IBA_Chapter_11_Slides_Final_Accessible.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
IB Computer Science - Internal Assessment.pptx
SAP 2 completion done . PRESENTATION.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Database Infoormation System (DBIS).pptx
oil_refinery_comprehensive_20250804084928 (1).pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Introduction to the R Programming Language
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
1_Introduction to advance data techniques.pptx
ISS -ESG Data flows What is ESG and HowHow
Business Ppt On Nestle.pptx huunnnhhgfvu
Ad

clustering using different methods in .pdf

  • 2. 2 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 3. 3 Supervised learning vs. unsupervised learning • Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute. – These patterns are then utilized to predict the values of the target attribute in future data instances. • Unsupervised learning: The data have no target attribute. – We want to explore the data to find some intrinsic structures in them.
  • 4. 4 Clustering • Clustering is a technique for finding similarity groups in data, called clusters. I.e., – it groups data instances that are similar to (near) each other in one cluster and data instances that are very different (far away) from each other into different clusters. • Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning. • Due to historical reasons, clustering is often considered synonymous with unsupervised learning. – In fact, association rule mining is also unsupervised
  • 5. 5 An illustration • The data set has three natural groups of data points, i.e., 3 natural clusters.
  • 6. 6 What is clustering for? • Let us see some real-life examples • Example 1: groups people of similar sizes together to make “small”, “medium” and “large” T-Shirts. – Tailor-made for each person: too expensive – One-size-fits-all: does not fit all. • Example 2: In marketing, segment customers according to their similarities – To do targeted marketing.
  • 7. 7 What is clustering for? (cont…) • Example 3: Given a collection of text documents, we want to organize them according to their content similarities, – To produce a topic hierarchy • In fact, clustering is one of the most utilized data mining techniques. – It has a long history, and used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc. – In recent years, due to the rapid increase of online documents, text clustering becomes important.
  • 8. 8 Aspects of clustering • A clustering algorithm – Partitional clustering – Hierarchical clustering – … • A distance (similarity, or dissimilarity) function • Clustering quality – Inter-clusters distance  maximized – Intra-clusters distance  minimized • The quality of a clustering result depends on the algorithm, the distance function, and the application.
  • 9. Road map 9 • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 10. 10 K-means clustering • K-means is a partitional clustering algorithm • Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real- valued space X  Rr, and r is the number of attributes (dimensions) in the data. • The k-means algorithm partitions the given data into k clusters. – Each cluster has a cluster center, called centroid. – k is specified by the user
  • 11. 11 K-means algorithm • Given k, the k-means algorithm works as follows: 1)Randomly choose k data points (seeds) to be the initial centroids, cluster centers 2)Assign each data point to the closest centroid 3)Re-compute the centroids using the current cluster memberships. 4)If a convergence criterion is not met, go to 2).
  • 13. 13 Stopping/convergence criterion 1. no (or minimum) re-assignments of data points to different clusters, 2. no (or minimum) change of centroids, or 3. minimum decrease in the sum of squared error (SSE), – Ci is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.     k j C j j dist SSE 1 2 ) , ( x m x (1)
  • 17. 17 A disk version of k-means • K-means can be implemented with data on disk – In each iteration, it scans the data once. – as the centroids can be computed incrementally • It can be used to cluster large datasets that do not fit in main memory • We need to control the number of iterations – In practice, a limited is set (< 50). • Not the best method. There are other scale-up algorithms, e.g., BIRCH.
  • 18. 18 A disk version of k-means (cont …)
  • 19. 19 Strengths of k-means • Strengths: – Simple: easy to understand and to implement – Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. – Since both k and t are small. k-means is considered a linear algorithm. • K-means is the most popular clustering algorithm. • Note that: it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
  • 20. 20 Weaknesses of k-means • The algorithm is only applicable if the mean is defined. – For categorical data, k-mode - the centroid is represented by most frequent values. • The user needs to specify k. • The algorithm is sensitive to outliers – Outliers are data points that are very far away from other data points. – Outliers could be errors in the data recording or some special data points with very different values.
  • 21. 21 Weaknesses of k-means: Problems with outliers
  • 22. 22 Weaknesses of k-means: To deal with outliers • One method is to remove some data points in the clustering process that are much further away from the centroids than other data points. – To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them. • Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small. – Assign the rest of the data points to the clusters by distance or similarity comparison, or classification
  • 23. 23 Weaknesses of k-means (cont …) • The algorithm is sensitive to initial seeds.
  • 24. 24 Weaknesses of k-means (cont …) • If we use different seeds: good results There are some methods to help choose good seeds
  • 25. 25 Weaknesses of k-means (cont …) • The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres). +
  • 26. 26 K-means summary • Despite weaknesses, k-means is still the most popular algorithm due to its simplicity, efficiency and – other clustering algorithms have their own lists of weaknesses. • No clear evidence that any other clustering algorithm performs better in general – although they may be more suitable for some specific types of data or applications. • Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!
  • 27. 27 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 28. 28 Common ways to represent clusters • Use the centroid of each cluster to represent the cluster. – compute the radius and – standard deviation of the cluster to determine its spread in each dimension – The centroid representation alone works well if the clusters are of the hyper-spherical shape. – If clusters are elongated or are of other shapes, centroids are not sufficient
  • 29. 29 Using classification model • All the data points in a cluster are regarded to have the same class label, e.g., the cluster ID. – run a supervised learning algorithm on the data to find a classification model.
  • 30. 30 Use frequent values to represent cluster • This method is mainly for clustering of categorical data (e.g., k-modes clustering). • Main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.
  • 31. 31 Clusters of arbitrary shapes • Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads. • Irregular shape clusters are hard to represent. They may not be useful in some applications. – Using centroids are not suitable (upper figure) in general – K-means clusters may be more useful (lower figure), e.g., for making 2 size T- shirts.
  • 32. 32 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 33. 33 Hierarchical Clustering • Produce a nested sequence of clusters, a tree, also called Dendrogram.
  • 34. 34 Types of hierarchical clustering • Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and – merges the most similar (or nearest) pair of clusters – stops when all the data points are merged into a single cluster (i.e., the root cluster). • Divisive (top down) clustering: It starts with all data points in one cluster, the root. – Splits the root into a set of child clusters. Each child cluster is recursively divided further – stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point
  • 35. 35 Agglomerative clustering It is more popular then divisive methods. • At the beginning, each data point forms a cluster (also called a node). • Merge nodes/clusters that have the least distance. • Go on merging • Eventually all nodes belong to one cluster
  • 37. 37 An example: working of the algorithm
  • 38. 38 Measuring the distance of two clusters • A few ways to measure distances of two clusters. • Results in different variations of the algorithm. – Single link – Complete link – Average link – Centroids – …
  • 39. 39 Single link method • The distance between two clusters is the distance between two closest data points in the two clusters, one data point from each cluster. • It can find arbitrarily shaped clusters, but – It may cause the undesirable “chain effect” by noisy points Two natural clusters are split into two
  • 40. 40 Complete link method • The distance between two clusters is the distance of two furthest data points in the two clusters. • It is sensitive to outliers because they are far away
  • 41. 41 Average link and centroid methods • Average link: A compromise between – the sensitivity of complete-link clustering to outliers and – the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects. – In this method, the distance between two clusters is the average distance of all pair-wise distances between the data points in two clusters. • Centroid method: In this method, the distance between two clusters is the distance between their centroids
  • 42. 42 The complexity • All the algorithms are at least O(n2). n is the number of data points. • Single link can be done in O(n2). • Complete and average links can be done in O(n2logn). • Due the complexity, hard to use for large data sets. – Sampling – Scale-up methods (e.g., BIRCH).
  • 43. 43 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 44. 44 Distance functions • Key to clustering. “similarity” and “dissimilarity” can also commonly used terms. • There are numerous distance functions for – Different types of data • Numeric data • Nominal data – Different specific applications
  • 45. 45 Distance functions for numeric attributes • Most commonly used functions are – Euclidean distance and – Manhattan (city block) distance • We denote distance with: dist(xi, xj), where xi and xj are data points (vectors) • They are special cases of Minkowski distance. h is positive integer. h h jr ir h j i h j i j i x x x x x x dist 1 2 2 1 1 ) ) ( ... ) ( ) (( ) , (        x x
  • 46. 46 Euclidean distance and Manhattan distance • If h = 2, it is the Euclidean distance • If h = 1, it is the Manhattan distance • Weighted Euclidean distance 2 2 2 2 2 1 1 ) ( ... ) ( ) ( ) , ( jr ir j i j i j i x x x x x x dist        x x | | ... | | | | ) , ( 2 2 1 1 jr ir j i j i j i x x x x x x dist        x x 2 2 2 2 2 2 1 1 1 ) ( ... ) ( ) ( ) , ( jr ir r j i j i j i x x w x x w x x w dist        x x
  • 47. 47 Squared distance and Chebychev distance • Squared Euclidean distance: to place progressively greater weight on data points that are further apart. • Chebychev distance: one wants to define two data points as "different" if they are different on any one of the attributes. 2 2 2 2 2 1 1 ) ( ... ) ( ) ( ) , ( jr ir j i j i j i x x x x x x dist        x x |) | ..., |, | |, max(| ) , ( 2 2 1 1 jr ir j i j i j i x x x x x x dist     x x
  • 48. 48 Distance functions for binary and nominal attributes • Binary attribute: has two values or states but no ordering relationships, e.g., – Gender: male and female. • We use a confusion matrix to introduce the distance functions/measures. • Let the ith and jth data points be xi and xj (vectors)
  • 50. 50 Symmetric binary attributes • A binary attribute is symmetric if both of its states (0 and 1) have equal importance, and carry the same weights, e.g., male and female of the attribute Gender • Distance function: Simple Matching Coefficient, proportion of mismatches of their values d c b a c b dist j i      ) , ( x x
  • 52. 52 Asymmetric binary attributes • Asymmetric: if one of the states is more important or more valuable than the other. – By convention, state 1 represents the more important state, which is typically the rare or infrequent state. – Jaccard coefficient is a popular measure – We can have some variations, adding weights c b a c b dist j i     ) , ( x x
  • 53. 53 Nominal attributes • Nominal attributes: with more than two states or values. – the commonly used distance measure is also based on the simple matching method. – Given two data points xi and xj, let the number of attributes be r, and the number of values that match in xi and xj be q. r q r dist j i   ) , ( x x
  • 54. 54 Distance function for text documents • A text document consists of a sequence of sentences and each sentence consists of a sequence of words. • To simplify: a document is usually considered a “bag” of words in document clustering. – Sequence and position of words are ignored. • A document is represented with a vector just like a normal data point. • It is common to use similarity to compare two documents rather than distance. – The most commonly used similarity function is the cosine similarity. We will study this later.
  • 55. 55 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 56. 56 Data standardization • In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances. • Consider the following pair of data points – xi: (0.1, 20) and xj: (0.9, 720). • The distance is almost completely dominated by (720-20) = 700. • Standardize attributes: to force the attributes to have a common value range , 700.000457 ) 20 720 ( ) 1 . 0 9 . 0 ( ) , ( 2 2      j i dist x x
  • 57. 57 Interval-scaled attributes • Their values are real numbers following a linear scale. – The difference in Age between 10 and 20 is the same as that between 40 and 50. – The key idea is that intervals keep the same importance through out the scale • Two main approaches to standardize interval scaled attributes, range and z-score. f is an attribute , ) min( ) max( ) min( ) ( f f f x x range if if   
  • 58. 58 Interval-scaled attributes (cont …) • Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute f, denoted by sf, is computed as follows  , | | ... | | | | 1 2 1 f nf f f f f f m x m x m x n s         , ... 1 2 1 nf f f f x x x n m     . ) ( f f if if s m x x z   Z-score:
  • 59. 59 Ratio-scaled attributes • Numeric attributes, but unlike interval-scaled attributes, their scales are exponential, • For example, the total amount of micro organisms that evolve in a time t is approximately given by AeBt, – where A and B are some positive constants. • Do log transform: – Then treat it as an interval-scaled attribuete ) log( if x
  • 60. 60 Nominal attributes • Sometime, we need to transform nominal attributes to numeric attributes. • Transform nominal attributes to binary attributes. – The number of values of a nominal attribute is v. – Create v binary attributes to represent them. – If a data instance for the nominal attribute takes a particular value, the value of its binary attribute is set to 1, otherwise it is set to 0. • The resulting binary attributes can be used as numeric attributes, with two values, 0 and 1.
  • 61. 61 Nominal attributes: an example • Nominal attribute fruit: has three values, – Apple, Orange, and Pear • We create three binary attributes called, Apple, Orange, and Pear in the new data. • If a particular data instance in the original data has Apple as the value for fruit, – then in the transformed data, we set the value of the attribute Apple to 1, and – the values of attributes Orange and Pear to 0
  • 62. 62 Ordinal attributes • Ordinal attribute: an ordinal attribute is like a nominal attribute, but its values have a numerical ordering. E.g., – Age attribute with values: Young, MiddleAge and Old. They are ordered. – Common approach to standardization: treat is as an interval-scaled attribute.
  • 63. 63 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 64. 64 Mixed attributes • Our distance functions given are for data with all numeric attributes, or all nominal attributes, etc. • Practical data has different types: – Any subset of the 6 types of attributes, • interval-scaled, • symmetric binary, • asymmetric binary, • ratio-scaled, • ordinal and • nominal
  • 65. 65 Convert to a single type • One common way of dealing with mixed attributes is to – Decide the dominant attribute type, and – Convert the other types to this type. • E.g, if most attributes in a data set are interval-scaled, – we convert ordinal attributes and ratio-scaled attributes to interval-scaled attributes. – It is also appropriate to treat symmetric binary attributes as interval-scaled attributes.
  • 66. 66 Convert to a single type (cont …) • It does not make much sense to convert a nominal attribute or an asymmetric binary attribute to an interval-scaled attribute, – but it is still frequently done in practice by assigning some numbers to them according to some hidden ordering, e.g., prices of the fruits • Alternatively, a nominal attribute can be converted to a set of (symmetric) binary attributes, which are then treated as numeric attributes.
  • 67. 67 Combining individual distances • This approach computes individual attribute distances and then combine them.      r f f ij f ij r f f ij j i d dist 1 1 ) , (   x x
  • 68. 68 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 69. 69 How to choose a clustering algorithm • Clustering research has a long history. A vast collection of algorithms are available. – We only introduced several main algorithms. • Choosing the “best” algorithm is a challenge. – Every algorithm has limitations and works well with certain data distributions. – It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any “ideal” structure or distribution required by the algorithms. – One also needs to decide how to standardize the data, to choose a suitable distance function and to select other parameter values.
  • 70. 70 Choose a clustering algorithm (cont …) • Due to these complexities, the common practice is to – run several algorithms using different distance functions and parameter settings, and – then carefully analyze and compare the results. • The interpretation of the results must be based on insight into the meaning of the original data together with knowledge of the algorithms used. • Clustering is highly application dependent and to certain extent subjective (personal preferences).
  • 71. 71 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 72. 72 Cluster Evaluation: hard problem • The quality of a clustering is very hard to evaluate because – We do not know the correct clusters • Some methods are used: – User inspection • Study centroids, and spreads • Rules from a decision tree. • For text documents, one can read some documents in clusters.
  • 73. 73 Cluster evaluation: ground truth • We use some labeled data (for classification) • Assumption: Each class is a cluster. • After clustering, a confusion matrix is constructed. From the matrix, we compute various measurements, entropy, purity, precision, recall and F-score. – Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divides D into k disjoint subsets, D1, D2, …, Dk.
  • 77. 77 A remark about ground truth evaluation • Commonly used to compare different clustering algorithms. • A real-life data set for clustering has no class labels. – Thus although an algorithm may perform very well on some labeled data sets, no guarantee that it will perform well on the actual application data at hand. • The fact that it performs well on some label data sets does give us some confidence of the quality of the algorithm. • This evaluation method is said to be based on external data or information.
  • 78. 78 Evaluation based on internal information • Intra-cluster cohesion (compactness): – Cohesion measures how near the data points in a cluster are to the cluster centroid. – Sum of squared error (SSE) is a commonly used measure. • Inter-cluster separation (isolation): – Separation means that different cluster centroids should be far away from one another. • In most applications, expert judgments are still the key.
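A small sketch of these two internal measures, assuming Euclidean data and hard cluster assignments: SSE is summed over clusters, and separation is taken here as the minimum pairwise centroid distance (one common choice, not the only one).

```python
import numpy as np

def sse_and_separation(X, labels):
    """Intra-cluster SSE (cohesion) and minimum pairwise centroid distance (separation)."""
    centroids = []
    sse = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        centroids.append(centroid)
        sse += ((pts - centroid) ** 2).sum()  # squared distances to the centroid
    centroids = np.array(centroids)
    # Separation: smallest distance between any two centroids
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(centroids) for b in centroids[i + 1:]]
    return sse, min(dists) if dists else float("inf")

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(sse_and_separation(X, labels))  # low SSE, large separation -> good clustering
```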
  • 79. 79 Indirect evaluation • In some applications, clustering is not the primary task but is used to help perform another task. • We can use the performance on the primary task to compare clustering methods. • For instance, in an application the primary task is to provide recommendations on book purchasing to online shoppers. – If we can cluster books according to their features, we might be able to provide better recommendations. – We can evaluate different clustering algorithms based on how well they help with the recommendation task. – Here, we assume that the recommendation quality can be reliably evaluated.
  • 80. 80 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 81. 81 Holes in data space • All the clustering algorithms only group data. • Clusters only represent one aspect of the knowledge in the data. • Another aspect that we have not studied is the holes. – A hole is a region in the data space that contains no or few data points. Reasons: • insufficient data in certain areas, and/or • certain attribute-value combinations are not possible or seldom occur.
  • 82. 82 Holes are useful too • Although clusters are important, holes in the space can be quite useful too. • For example, in a disease database – we may find that certain symptoms and/or test values do not occur together, or – that when a certain medicine is used, some test values never go beyond certain ranges. • Discovery of such information can be important in medical domains because – it could mean the discovery of a cure for a disease or of some biological laws.
  • 83. 83 Data regions and empty regions • Given a data space, separate – data regions (clusters) and – empty regions (holes, with few or no data points). • Use a supervised learning technique, i.e., decision tree induction, to separate the two types of regions. • Due to the use of a supervised learning method for an unsupervised learning task, – an interesting connection is made between the two types of learning paradigms.
  • 84. 84 Supervised learning for unsupervised learning • A decision tree algorithm is not directly applicable: – it needs at least two classes of data, – but a clustering data set has no class label for any data point. • The problem can be dealt with by a simple idea. – Regard each point in the data set as having the class label Y. – Assume that the data space is uniformly filled with another type of points, called non-existing points, and give them the class label N. • With the N points added, the problem of partitioning the data space into data regions and empty regions becomes a supervised classification problem (a naive sketch that physically adds the N points is given below).
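A naive sketch of this idea that physically adds the N points and trains an off-the-shelf decision tree (scikit-learn is assumed here, and the toy blobs are invented); as the next slides explain, the actual method avoids materializing the N points.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Y points: two small blobs standing in for real data regions
Y = np.vstack([rng.normal(loc=(2, 2), scale=0.3, size=(50, 2)),
               rng.normal(loc=(7, 7), scale=0.3, size=(50, 2))])

# N points: the same number of points spread uniformly over the data space
lo, hi = Y.min(axis=0), Y.max(axis=0)
N = rng.uniform(lo, hi, size=Y.shape)

X = np.vstack([Y, N])
y = np.array(["Y"] * len(Y) + ["N"] * len(N))

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
# Leaves predicting "Y" approximate data regions; leaves predicting "N" approximate holes.
print(export_text(tree, feature_names=["x1", "x2"]))
```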
  • 85. 85 An example • (Figure omitted.) A decision tree method is used for the partitioning shown in (B).
  • 86. 86 Can it be done without adding N points? • Yes. • Physically adding N points increases the size of the data and thus the running time. • More importantly, it is unlikely that we can have points truly uniformly distributed in a high-dimensional space, as we would need an exponential number of points. • Fortunately, there is no need to physically add any N points. – We can compute their counts when needed.
  • 87. 87 Characteristics of the approach • It provides representations of the resulting data regions and empty regions in terms of hyper-rectangles, i.e., rules. • It detects outliers automatically: outliers are data points that fall in an empty region. • It may not use all attributes in the data, just as a normal decision tree for supervised learning may not. – It can automatically determine which attributes are useful. Subspace clustering … • Drawback: data regions of irregular shapes are hard to handle, since decision tree learning only generates hyper-rectangles (formed by axis-parallel hyper-planes), which are rules.
  • 88. 88 Building the Tree • The main computation in decision tree building is to evaluate entropy (for information gain):

  $entropy(D) = -\sum_{j=1}^{|C|} \Pr(c_j) \log_2 \Pr(c_j)$

  • Can it be evaluated without physically adding the N points? Yes. • Pr(cj) is the probability of class cj in the data set D, and |C| is the number of classes, here 2 (Y and N). – To compute Pr(cj), we only need the number of Y (data) points and the number of N (non-existing) points. – We already have the Y (data) points, and the number of N points can be computed on the fly, because the N points are assumed to be uniformly distributed in the space.
  • 89. 89 An example • The space has 25 data (Y) points and 25 N points. Assume the system is evaluating a possible cut S. – The left side of S covers 4/10 of the space, so the # of N points on the left of S is 25 × 4/10 = 10; the number of Y points there is 3. – Likewise, the # of N points on the right of S is 15 (= 25 − 10); the number of Y points there is 22. • With these numbers, the entropy of each side (and hence the information gain of S) can be computed.
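The arithmetic in this example can be checked with a short calculation; the helper below simply reproduces the slide's numbers and computes the resulting information gain (the helper name is ours).

```python
import math

def entropy(counts):
    """entropy = -sum_j Pr(c_j) * log2 Pr(c_j) over the Y and N counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Whole node: 25 Y points and 25 (virtual) N points
total_Y, total_N = 25, 25

# Cut S puts 4/10 of the space (and hence of the uniform N points) on the left
left_N  = total_N * 4 / 10        # = 10
right_N = total_N - left_N        # = 15
left_Y, right_Y = 3, 22           # counted from the actual data

n_left, n_right = left_Y + left_N, right_Y + right_N
parent = entropy([total_Y, total_N])
after  = (n_left / (n_left + n_right)) * entropy([left_Y, left_N]) \
       + (n_right / (n_left + n_right)) * entropy([right_Y, right_N])
print("information gain of cut S:", parent - after)
```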
  • 90. 90 How many N points to add? • We add a different number of N points at each different node. – The number of N points for the current node E is determined by the following rule (at the root node, the number of inherited N points is 0): – if the number of N points inherited from the parent of E is less than the number of Y (data) points in E, the number of N points for E is increased to equal the number of Y points; otherwise, the inherited number of N points is used for E.
  • 92. 92 How many N points to add? (cont…) • Basically, for a Y node (a node with at least as many Y points as N points), we increase the N points so that #Y = #N. • The number of N points is never reduced, even if the current node is an N node (a node with more N points than Y points). – A reduction could allow small groups of outlier Y points to form Y nodes, – and then data regions and empty regions would not be separated well. (A small helper implementing this rule is sketched below.)
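A tiny helper capturing this rule, under the description above (the function name is ours): the N-point count is raised to the Y-point count when needed and is never reduced.

```python
def n_points_for_node(inherited_N, num_Y):
    """Number of N points to use at the current node.

    Increase N to match Y when the node is Y-dominated; never reduce N, so
    that small groups of outlier Y points do not turn into Y nodes.
    """
    return max(inherited_N, num_Y)

print(n_points_for_node(inherited_N=0, num_Y=25))   # root node: use 25 N points
print(n_points_for_node(inherited_N=40, num_Y=10))  # N node: keep the 40 inherited N points
```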
  • 93. 93 Building the decision tree • Using the above ideas, a decision tree can be built to separate data regions and empty regions. • The actual method is more sophisticated as a few other tricky issues need to be handled in – tree building and – tree pruning.
  • 94. 94 Road map • Basic concepts • K-means algorithm • Representation of clusters • Hierarchical clustering • Distance functions • Data standardization • Handling mixed attributes • Which clustering algorithm to use? • Cluster evaluation • Discovering holes and data regions • Summary
  • 95. 95 Summary • Clustering has a long history and is still an active research area. – There are a huge number of clustering algorithms, – and more are still coming every year. • We only introduced several main algorithms. There are many others, e.g., – density-based algorithms, subspace clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc. • Clustering is hard to evaluate, but very useful in practice. This partially explains why there are still a large number of clustering algorithms being devised every year. • Clustering is highly application dependent and to some extent subjective.