2. Introduction
• Clustering refers to the process of arranging or
organizing objects according to specific criteria.
• Partitioning of data
o Grouping of data in database applications improves data access
• Data reorganization
• Data compression
3. Introduction
• Summarization
o Statistical measures like mean, mode, and median can provide summarized information.
• Matrix factorization
o Let there be n data points in an l-dimensional space. We can represent them as a matrix X of size n × l.
o It is possible to approximate X as a product of two matrices B (of size n × K) and C (of size K × l).
o So, X ≈ BC, where B is the cluster assignment matrix and C is the representatives matrix.
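As an illustration, here is a minimal NumPy sketch (the data and hard assignment are hypothetical) of how a one-hot assignment matrix B and a representatives matrix C built from cluster centroids approximate X as BC:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, K = 6, 2, 2                       # n points in l dimensions, K clusters
X = rng.normal(size=(n, l))

# Hypothetical hard assignment: point i belongs to cluster labels[i].
labels = np.array([0, 0, 0, 1, 1, 1])

# B (n x K): one-hot cluster assignment matrix.
B = np.zeros((n, K))
B[np.arange(n), labels] = 1.0

# C (K x l): representatives matrix; row k is the centroid of cluster k.
C = np.vstack([X[labels == k].mean(axis=0) for k in range(K)])

# X is approximated by BC: each point is replaced by its representative.
print(np.linalg.norm(X - B @ C))        # reconstruction error ||X - BC||
```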
4. Clustering process
• The clustering process aims to ensure that the distance between any two points within the same cluster (the intra-cluster distance), as measured by a dissimilarity measure such as the Euclidean distance, is smaller than the distance between any two points belonging to different clusters (the inter-cluster distance).
• Any two points are placed in the same cluster if the distance between them is below a given threshold (an input to the algorithm). The squared Euclidean distance is one commonly used measure for computing the distance between points.
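A minimal sketch of this idea; the complete-link style check (a point joins a cluster only if it is within the threshold of every member) and the threshold value are illustrative assumptions:

```python
import numpy as np

def squared_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    d = np.asarray(a) - np.asarray(b)
    return float(d @ d)

def threshold_clusters(points, threshold):
    """Place a point in an existing cluster when its squared Euclidean
    distance to every member is below the (user-supplied) threshold."""
    clusters = []                       # each cluster is a list of indices
    for i, p in enumerate(points):
        for cluster in clusters:
            if all(squared_euclidean(p, points[j]) < threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])        # start a new cluster
    return clusters

points = [(0.0, 0.0), (0.5, 0.1), (5.0, 5.0), (5.2, 4.9)]
print(threshold_clusters(points, threshold=1.0))   # [[0, 1], [2, 3]]
```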
6. Data abstraction
• Clustering is a useful method for data abstraction; it can be applied to generate clusters of data points, each represented by its centroid, medoid, leader, or some other suitable entity.
• The centroid is computed as the sample mean of the
data points in a cluster.
• The medoid is the point that minimizes the sum of
distances to all other points in the cluster.
• Note: The centroid can shift dramatically depending on the position of an outlier, while the medoid remains stable within the boundaries of the original cluster.
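A short sketch contrasting the two representatives on a hypothetical cluster with and without an outlier:

```python
import numpy as np

def centroid(points):
    """Sample mean of the data points in a cluster."""
    return points.mean(axis=0)

def medoid(points):
    """The cluster member minimizing the sum of distances to all others."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9], [1.2, 1.1]])
with_outlier = np.vstack([cluster, [[10.0, 10.0]]])

print(centroid(cluster), centroid(with_outlier))   # centroid shifts sharply
print(medoid(cluster), medoid(with_outlier))       # medoid stays in the cluster
```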
8. Divisive clustering
• Divisive algorithms are either polythetic, where the division is based on more than one feature, or monothetic, where only one feature is considered at a time.
• The polythetic scheme is based on finding all possible 2-partitions of the data and choosing the best among them. If there are n patterns, the number of distinct 2-partitions is given by (2^n − 2)/2 = 2^(n−1) − 1.
9. Divisive clustering
• Among all possible 2-partitions, the partition with the
least sum of the sample variances of the two clusters is
chosen as the best.
• From the resulting partition, the cluster with the maximum sample variance is selected and split into an optimal 2-partition.
• This process is repeated until we get singleton clusters.
• If a collection of patterns (data points) is split into two clusters, with p patterns x1, · · · , xp in one cluster and q patterns y1, · · · , yq in the other, and the centroids of the two clusters are C1 and C2 respectively, then the sum of the sample variances will be
Σ_{i=1}^{p} ||x_i − C1||² + Σ_{j=1}^{q} ||y_j − C2||²
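A brute-force sketch of the polythetic scheme under these definitions: enumerate all 2^(n−1) − 1 distinct 2-partitions and keep the one with the least sum of squared deviations from the two centroids. This is feasible only for very small n, which is exactly why the count above matters:

```python
import numpy as np
from itertools import combinations

def variance_sum(X, idx_a, idx_b):
    """Sum of squared deviations of each part from its own centroid."""
    a, b = X[list(idx_a)], X[list(idx_b)]
    return (((a - a.mean(axis=0)) ** 2).sum()
            + ((b - b.mean(axis=0)) ** 2).sum())

def best_2partition(X):
    """Brute force over all 2^(n-1) - 1 distinct 2-partitions."""
    n = len(X)
    rest = range(1, n)
    best, best_split = np.inf, None
    # Fixing point 0 in part A avoids counting each partition twice.
    for r in range(0, n - 1):
        for extra in combinations(rest, r):
            A = (0,) + extra
            B = tuple(j for j in rest if j not in extra)
            s = variance_sum(X, A, B)
            if s < best:
                best, best_split = s, (A, B)
    return best_split, best

X = np.array([[0.0], [0.2], [5.0], [5.1]])
print(best_2partition(X))   # ((0, 1), (2, 3)) with the least variance sum
```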
11. Monothetic clustering
• Monothetic clustering involves considering each feature direction individually and dividing the data into two clusters based on the gap in projected values along that feature direction.
• Specifically, the dataset is split into two parts at the midpoint of the maximum gap observed among the sorted feature values.
• This process is then repeated sequentially for the
remaining features, further partitioning each
cluster.
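A minimal sketch of a single monothetic split along one (hypothetical) feature:

```python
import numpy as np

def monothetic_split(X, feature):
    """Split X into two clusters at the midpoint of the largest gap
    between consecutive sorted values of one feature."""
    values = np.sort(X[:, feature])
    gaps = np.diff(values)
    k = gaps.argmax()                           # widest gap
    cut = (values[k] + values[k + 1]) / 2.0     # midpoint of that gap
    left = X[X[:, feature] <= cut]
    right = X[X[:, feature] > cut]
    return left, right, cut

X = np.array([[0.1, 3.0], [0.3, 2.8], [4.0, 0.2], [4.2, 0.1]])
left, right, cut = monothetic_split(X, feature=0)
print(cut)   # 2.15: midpoint of the gap between 0.3 and 4.0
```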
13. Agglomerative clustering
• An agglomerative clustering algorithm generally proceeds as follows (a compact sketch follows the steps):
1. Compute the proximity matrix for all pairs of patterns
in the dataset.
2. Find the closest pair of clusters based on the
computed proximity measure and merge them into a
single cluster. Update the proximity matrix to reflect
the merge, adjusting the distances between the
newly formed cluster and the remaining clusters.
3. If all patterns belong to a single cluster, terminate the
algorithm. Otherwise, go back to Step 2 and repeat
the process until all patterns are in one cluster.
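A compact sketch of these steps, assuming single-link (minimum) distance as the proximity measure between clusters; the slides do not fix a linkage, so this is one common choice:

```python
import numpy as np

def single_link_agglomerative(X):
    """Merge the closest pair of clusters (single-link distance)
    until all patterns are in one cluster; returns the merge history."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # proximity matrix
    history = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under single-link distance.
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history

X = np.array([[0.0], [0.1], [5.0], [5.2]])
for step in single_link_agglomerative(X):
    print(step)   # which clusters merged, and at what distance
```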
19. Soft partitioning
1. Fuzzy clustering: Each data point may be assigned to more than one cluster, based on a membership value. The value is computed using the data point and the corresponding cluster centroid (a membership sketch follows this list).
2. Rough clustering: Each cluster is assumed to have
both a non-overlapping part and an overlapping part.
Data points in the non-overlapping portion
exclusively belong to that cluster, while data points in
the overlapping part may belong to multiple clusters.
3. Neural network-based clustering: In this method,
varying weights associated with the data points are
used to obtain a soft partition.
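For item 1, a minimal sketch of how membership values can be computed from point-to-centroid distances, assuming a fuzzy c-means style rule with fuzzifier m = 2 (an illustrative choice):

```python
import numpy as np

def fuzzy_memberships(X, centroids, m=2.0):
    """Fuzzy c-means style memberships: u[i, k] is the degree to which
    point i belongs to cluster k, computed from point-to-centroid
    distances with fuzzifier m (m = 2 is a common default)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    d = np.maximum(d, 1e-12)                   # avoid division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
centroids = np.array([[0.5, 0.0], [5.0, 5.0]])
print(fuzzy_memberships(X, centroids))         # each row sums to 1
```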
20. Soft partitioning – Contd.
1. Simulated annealing: The current solution is randomly perturbed to produce a new solution. If the resulting solution is better than the current one, it is accepted; otherwise, it is accepted with a probability between 0 and 1 (a minimal acceptance-rule sketch follows this list).
2. Tabu search: Unlike simulated annealing, multiple
solutions are stored, and the current solution is perturbed
in various ways to determine the next configuration.
3. Evolutionary algorithms: This method maintains a population of solutions. In addition to using the fitness values of individuals, a random search based on interaction among solutions and on mutation is employed to generate the next population.
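For item 1, a minimal sketch of the simulated annealing acceptance rule applied to cluster assignments; the perturbation (reassigning one random point) and the squared-error cost are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sse(X, labels, K):
    """Sum of squared errors of points to their cluster centroids."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(K) if np.any(labels == k))

def anneal_step(X, labels, K, T):
    """Perturb the assignment; accept by the annealing rule."""
    candidate = labels.copy()
    i = rng.integers(len(X))
    candidate[i] = rng.integers(K)        # move one random point
    delta = sse(X, candidate, K) - sse(X, labels, K)
    # A better solution is always accepted; a worse one is accepted
    # with probability exp(-delta / T), a value between 0 and 1.
    if delta <= 0 or rng.random() < np.exp(-delta / T):
        return candidate
    return labels

X = rng.normal(size=(20, 2))
labels = rng.integers(2, size=20)
T = 1.0
for _ in range(200):
    labels = anneal_step(X, labels, K=2, T=T)
    T *= 0.98                             # geometric cooling schedule
```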
26. Clustering large datasets
• Issues:
o Number of dataset scans
o Incremental changes in the dataset
• Solutions:
o Single dataset scan clustering algorithms
o Incremental clustering algorithms
o Abstraction-based clustering
• Examples:
o PC-Clustering algorithm
o Leader clustering algorithm (sketched below)
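A minimal sketch of the leader clustering algorithm, which needs only a single scan of the dataset; assigning each point to its nearest leader (rather than the first one found) and the example threshold are illustrative choices:

```python
import numpy as np

def leader_clustering(X, threshold):
    """Leader algorithm: a single scan of the dataset. A point joins
    the nearest leader within the distance threshold; otherwise it
    becomes a new leader."""
    leaders = [X[0]]
    assignment = [0]
    for x in X[1:]:
        dists = [np.linalg.norm(x - ld) for ld in leaders]
        k = int(np.argmin(dists))
        if dists[k] <= threshold:
            assignment.append(k)
        else:
            leaders.append(x)             # x starts a new cluster
            assignment.append(len(leaders) - 1)
    return np.array(leaders), assignment

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
leaders, assignment = leader_clustering(X, threshold=1.0)
print(assignment)   # [0, 0, 1, 1]
```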
27. Divide-and-conquer method
• The divide-and-conquer approach is an effective
strategy for addressing the challenge of clustering
large datasets that cannot be stored entirely in
main memory.
• To overcome this limitation, a common solution is
to process a portion of the dataset at a time and
store the relevant cluster representatives in
memory.
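A minimal sketch of this strategy (the helper names and parameters are illustrative): cluster one chunk at a time, keep only the chunk representatives in memory, then cluster the representatives themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, K, iters=20):
    """A tiny k-means used here only to produce cluster representatives."""
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers

def divide_and_conquer(X, chunk_size, K_per_chunk, K_final):
    """Cluster one chunk at a time, keep only the chunk representatives
    in memory, then cluster the representatives themselves."""
    reps = []
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        reps.append(kmeans(chunk, min(K_per_chunk, len(chunk))))
    reps = np.vstack(reps)
    return kmeans(reps, K_final)          # final cluster representatives

X = rng.normal(size=(1000, 2))
print(divide_and_conquer(X, chunk_size=200, K_per_chunk=10, K_final=3))
```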