Automated Clustering Project
Miklos Vasarhelyi, Paul Byrnes, and Yunsen Wang
Presented by Deniz Appelbaum
Motivation
 The goal is a program that automatically performs clustering and outlier
detection for a wide variety of numerically represented data.
Outline of program features
 Normalizes all data to be clustered
 Creates normalized principal components from the normalized data
 Automatically selects the necessary normalized principal components for use in actual
clustering and outlier detection
 Compares a variety of algorithms based upon the selected set of normalized principal
components
 Adopts the top performing model based upon silhouette coefficient values to perform
the final clustering and outlier detection procedures
 Produces relevant information and outputs throughout the process
Data normalization
 Data normalization
 Converts each numerically represented dimension to be clustered into the range [0,1].
 A desirable procedure for preparing numeric attributes for clustering
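The normalization step described above is a standard min-max rescaling. The following is an illustrative sketch in Python/NumPy, not the project's actual code (the function name and sample values are assumptions):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each numeric column of X into the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns, whose range is zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Hypothetical two-attribute sample (account age, credit limit).
data = np.array([[1.0, 2500.0],
                 [3.0, 8500.0],
                 [2.0, 2200.0]])
normalized = min_max_normalize(data)
```

Each column is rescaled independently, so attributes measured on very different scales (months vs. dollars) contribute comparably to distance computations.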
Principal component analysis
 Principal component analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal
components.
 In this way, PCA can both reduce dimensionality and mitigate the problems
associated with clustering data whose attributes are correlated.
 In the following slides, a random sample of 5,000 credit card customers is used to
demonstrate the automated clustering and outlier detection program
Principal component analysis
 PCA initially results in four principal
components being generated from
the original data
 Using a cumulative data variability
threshold of 80% (default
specification), three principal
components are automatically
selected for analysis – they explain
the vast majority of data variability
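The component-selection rule described above (keep the smallest number of principal components whose cumulative explained variance meets an 80% threshold) can be sketched as follows. This is a minimal NumPy reimplementation for illustration, assuming the standard SVD formulation of PCA; it is not the project's code:

```python
import numpy as np

def pca_select(X, threshold=0.80):
    """Compute principal components and keep the smallest number whose
    cumulative explained variance reaches the threshold (default 80%)."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    # SVD of the centered data yields the principal axes in Vt.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)         # variance share per component
    # Index of the first component at which the cumulative share
    # meets the threshold, converted to a count.
    k = int(np.searchsorted(np.cumsum(explained), threshold) + 1)
    scores = Xc @ Vt[:k].T                  # the selected PC scores
    return scores, explained[:k]
```

With four original dimensions and an 80% threshold, this rule would retain three components when, as on the slide, those three jointly explain the vast majority of variability.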
Principal component analysis
 Scatter plot of PC1 and PC2
 In this view, the top 2 principal
components are plotted for each object in
two-dimensional space.
 As can be seen, a small subset of records
appear significantly more distant/different
from the vast majority of objects.
Clustering exploration/simulation process - examples
 Ward method
 Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for
choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.
 Complete link method
 This method is also known as farthest-neighbor clustering. The result of the clustering can be visualized
as a dendrogram, which shows the sequence of cluster fusions and the distance at which each fusion took
place.
 PAM (partitioning around medoids)
 The k-medoids algorithm is a clustering method related to the k-means algorithm and the medoid shift
algorithm; it is considered more robust than k-means because it uses actual data points (medoids) as
cluster centers rather than means.
 K-means
 k-means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a prototype of the cluster.
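Of the four candidates, k-means has the simplest mechanics. A minimal sketch of Lloyd's algorithm (assignment to the nearest mean, then mean recomputation, repeated until stable) in NumPy, for illustration only:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to the nearest center,
    recompute centers as cluster means, repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Recompute each center; keep the old one if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Each final center serves as the prototype of its cluster, matching the description above.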
Clustering exploration results
 The result shown below is based upon a simulation exercise, whereby all four
algorithms are automatically compared on the data set (i.e., a random sample of 5,000
records from the credit card customer data). In this particular case, the best model is
found to be a two-cluster solution using the complete link hierarchical method. This is
the final model and is used for subsequent clustering and outlier detection.
 Best clustering result:
 The silhouette value can theoretically range from -1 to +1, with higher values indicative
of better cluster quality in terms of both cohesion and separation.
Best Method                  Number of Clusters   Silhouette Value
complete link hierarchical   2                    0.753754205720575
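The silhouette coefficient used for model selection combines cohesion and separation per point: a(i) is the mean distance to the point's own cluster, b(i) the lowest mean distance to any other cluster, and s(i) = (b - a) / max(a, b). A small NumPy sketch (illustrative, not the project's implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient over all points in X."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: s(i) = 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()                # cohesion
        b = min(D[i, labels == c].mean()     # separation: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values near +1 indicate tight, well-separated clusters; values near 0.75, as in the result above, still indicate a clearly structured solution.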
Complete-link hierarchical clustering (1/2)
 The 5,000 instances are on the
x-axis. In moving vertically from
the x-axis, one can begin to see
how the actual clusters are
formed.
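The complete-link merging process behind this dendrogram can be sketched naively: start from singleton clusters and repeatedly merge the pair whose farthest cross-pair distance is smallest. This is an illustrative, unoptimized NumPy version (real implementations use efficient linkage updates):

```python
import numpy as np

def complete_link(X, k):
    """Naive complete-link agglomeration down to k clusters.
    Inter-cluster distance = distance of the farthest cross-pair."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: the maximum pairwise distance.
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```

Recording the distance at each merge yields exactly the fusion heights shown on a dendrogram's vertical axis.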
Plot of PCs with cluster assignment labels (1/3)
 In this view, the top two principal
components (i.e., PC1 and PC2) are
plotted for each object in two-
dimensional space.
 In the graph, there are two clusters, one
dark blue and the other light blue.
 The small subset of three records appears
substantially more different from the
majority of objects.
Plot of PCs with cluster assignment labels (2/3)
 In this view, PC1 and PC3 are plotted for
each object in two-dimensional space.
 In the graph, the two clusters are again
shown.
 It is once again evident that the small
subset of three records appears more
different from the majority of other
objects.
Plot of PCs with cluster assignment labels (3/3)
 In this view, PC2 and PC3 are
plotted for each object in two-
dimensional space.
 Cluster differences appear less
prominent from this perspective.
Principal components 3D scatterplot
 Cluster one represents the majority
class (black) while cluster two
represents the rare class (red).
 In this view, one can clearly see the
subset of three records (in red)
appearing more isolated from the other
objects.
Cluster 1 outlier plot
 In this view, an arbitrary cutoff is
inserted at the 99.9th percentile (red
horizontal line) to allow efficient
identification of very irregular
records.
 Objects further from the x-axis are
more questionable.
 While all objects distant from the x-
axis might be worth investigating,
points above the cutoff should be
viewed as particularly suspicious.
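The outlier scores plotted here are Mahalanobis distances within a cluster, with the 99.9th-percentile cutoff applied to the scores themselves. A minimal sketch of that logic in NumPy (illustrative; assumes the cluster's covariance matrix is invertible):

```python
import numpy as np

def mahalanobis_outliers(X, percentile=99.9):
    """Flag records whose squared Mahalanobis distance from the
    cluster centroid exceeds the given percentile of all distances."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv = np.linalg.inv(cov)
    diff = X - mu
    # Squared Mahalanobis distance per row: diff_i^T * inv * diff_i.
    md = np.einsum('ij,jk,ik->i', diff, inv, diff)
    cutoff = np.percentile(md, percentile)
    return md, md > cutoff
```

Unlike raw Euclidean distance, the Mahalanobis distance accounts for the spread and correlation of the cluster's attributes, so a record is flagged only when it is unusual relative to the cluster's own shape.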
Conclusion of Process
 At the conclusion of outlier detection, an output file for each cluster containing the unique
record identifier, original variables, normalized variables, principal components, normalized
principal components, cluster assignments, and Mahalanobis distance information can be
exported to facilitate further analyses and investigations.
 Cluster 2 – final output file of a subset of fields:
 Distinguishing features of cluster 2 records: 1) New accounts (age = 1 month), 2)
Very high incidence of late payments, and 3) Relatively high credit limits,
particularly given the account age and late payment issues.
Record   AccountAge   CreditLimit   AdditionalAssets   LatePayments   model.cluster   md
32430    1            2500          1                  3              2               5.83E-05
65470    1            8500          1                  4              2               0.002371778
78772    1            2200          0                  3              2               0.000442305
