Automated Clustering Project
Miklos Vasarhelyi, Paul Byrnes, and Yunsen Wang
Presented by Deniz Appelbaum
Motivation
 The goal is a program that automatically performs clustering and outlier
detection for a wide variety of numerically represented data.
Outline of program features
 Normalizes all data to be clustered
 Creates normalized principal components from the normalized data
 Automatically selects the necessary normalized principal components for use in actual
clustering and outlier detection
 Compares a variety of algorithms based upon the selected set of normalized principal
components
 Adopts the top performing model based upon silhouette coefficient values to perform
the final clustering and outlier detection procedures
 Produces relevant information and outputs throughout the process
Data normalization
 Data normalization
 Converts each numerically represented dimension to be clustered into the range [0,1].
 A desirable procedure for preparing numeric attributes for clustering
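The normalization step described above is a standard min-max rescaling. The following is an illustrative sketch in Python/NumPy, not the project's actual code (the function name and sample values are assumptions):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each numeric column of X into the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns, whose range is zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Hypothetical two-attribute sample (account age, credit limit).
data = np.array([[1.0, 2500.0],
                 [3.0, 8500.0],
                 [2.0, 2200.0]])
normalized = min_max_normalize(data)
```

Each column is rescaled independently, so attributes measured on very different scales (months vs. dollars) contribute comparably to distance computations.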
Principal component analysis
 Principal component analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal
components.
 In this way, PCA can both reduce dimensionality and mitigate the problems
associated with clustering data whose attributes are correlated.
 In the following slides, a random sample of 5,000 credit card customers is used to
demonstrate the automated clustering and outlier detection program
Principal component analysis
 PCA initially results in four principal
components being generated from
the original data
 Using a cumulative data variability
threshold of 80% (default
specification), three principal
components are automatically
selected for analysis – they explain
the vast majority of data variability
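The component-selection rule described above (keep the smallest number of principal components whose cumulative explained variance meets an 80% threshold) can be sketched as follows. This is a minimal NumPy reimplementation for illustration, assuming the standard SVD formulation of PCA; it is not the project's code:

```python
import numpy as np

def pca_select(X, threshold=0.80):
    """Compute principal components and keep the smallest number whose
    cumulative explained variance reaches the threshold (default 80%)."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    # SVD of the centered data yields the principal axes in Vt.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)         # variance share per component
    # Index of the first component at which the cumulative share
    # meets the threshold, converted to a count.
    k = int(np.searchsorted(np.cumsum(explained), threshold) + 1)
    scores = Xc @ Vt[:k].T                  # the selected PC scores
    return scores, explained[:k]
```

With four original dimensions and an 80% threshold, this rule would retain three components when, as on the slide, those three jointly explain the vast majority of variability.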
Principal component analysis
 Scatter plot of PC1 and PC2
 In this view, the top 2 principal
components are plotted for each object in
two-dimensional space.
 As can be seen, a small subset of records
appear significantly more distant/different
from the vast majority of objects.
Clustering exploration/simulation process - examples
 Ward method
 Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for
choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.
 Complete link method
 This method is also known as farthest-neighbor clustering. The result of the clustering can be visualized
as a dendrogram, which shows the sequence of cluster fusions and the distance at which each fusion took
place.
 PAM (partitioning around medoids)
 The k-medoids algorithm is a clustering method related to the k-means algorithm and the medoid shift
algorithm; it is considered more robust than k-means because it uses actual data points (medoids) as
cluster centers rather than means.
 K-means
 k-means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a prototype of the cluster.
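Of the four candidates, k-means has the simplest mechanics. A minimal sketch of Lloyd's algorithm (assignment to the nearest mean, then mean recomputation, repeated until stable) in NumPy, for illustration only:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to the nearest center,
    recompute centers as cluster means, repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Recompute each center; keep the old one if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Each final center serves as the prototype of its cluster, matching the description above.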
Clustering exploration results
 The result shown below is based upon a simulation exercise, whereby all four
algorithms are automatically compared on the data set (i.e., a random sample of 5,000
records from the credit card customer data). In this particular case, the best model is
found to be a two-cluster solution using the complete link hierarchical method. This is
the final model and is used for subsequent clustering and outlier detection.
 Best clustering result:
 The silhouette value can theoretically range from -1 to +1, with higher values indicative
of better cluster quality in terms of both cohesion and separation.
Best Method                  Number of Clusters   Silhouette Value
complete link hierarchical   2                    0.753754205720575
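The silhouette coefficient used for model selection combines cohesion and separation per point: a(i) is the mean distance to the point's own cluster, b(i) the lowest mean distance to any other cluster, and s(i) = (b - a) / max(a, b). A small NumPy sketch (illustrative, not the project's implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient over all points in X."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: s(i) = 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()                # cohesion
        b = min(D[i, labels == c].mean()     # separation: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values near +1 indicate tight, well-separated clusters; values near 0.75, as in the result above, still indicate a clearly structured solution.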
Complete-link hierarchical clustering (1/2)
 The 5,000 instances are on the
x-axis. In moving vertically from
the x-axis, one can begin to see
how the actual clusters are
formed.
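The complete-link merging process behind this dendrogram can be sketched naively: start from singleton clusters and repeatedly merge the pair whose farthest cross-pair distance is smallest. This is an illustrative, unoptimized NumPy version (real implementations use efficient linkage updates):

```python
import numpy as np

def complete_link(X, k):
    """Naive complete-link agglomeration down to k clusters.
    Inter-cluster distance = distance of the farthest cross-pair."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: the maximum pairwise distance.
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```

Recording the distance at each merge yields exactly the fusion heights shown on a dendrogram's vertical axis.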
Plot of PCs with cluster assignment labels (1/3)
 In this view, the top two principal
components (i.e., PC1 and PC2) are
plotted for each object in two-
dimensional space.
 In the graph, there are two clusters, one
dark blue and the other light blue.
 The small subset of three records appears
substantially more different from the
majority of objects.
Plot of PCs with cluster assignment labels (2/3)
 In this view, PC1 and PC3 are plotted for
each object in two-dimensional space.
 In the graph, the two clusters are again
shown.
 It is once again evident that the small
subset of three records appears more
different from the majority of other
objects.
Plot of PCs with cluster assignment labels (3/3)
 In this view, PC2 and PC3 are
plotted for each object in two-
dimensional space.
 Cluster differences appear less
prominent from this perspective.
Principal components 3D scatterplot
 Cluster one represents the majority
class (black) while cluster two
represents the rare class (red).
 In this view, one can clearly see the
subset of three records (in red)
appearing more isolated from the other
objects.
Cluster 1 outlier plot
 In this view, an arbitrary cutoff is
inserted at the 99.9th percentile (red
horizontal line) to allow efficient
identification of very irregular
records.
 Objects further from the x-axis are
more questionable.
 While all objects distant from the x-
axis might be worth investigating,
points above the cutoff should be
viewed as particularly suspicious.
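The outlier scores plotted here are Mahalanobis distances within a cluster, with the 99.9th-percentile cutoff applied to the scores themselves. A minimal sketch of that logic in NumPy (illustrative; assumes the cluster's covariance matrix is invertible):

```python
import numpy as np

def mahalanobis_outliers(X, percentile=99.9):
    """Flag records whose squared Mahalanobis distance from the
    cluster centroid exceeds the given percentile of all distances."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv = np.linalg.inv(cov)
    diff = X - mu
    # Squared Mahalanobis distance per row: diff_i^T * inv * diff_i.
    md = np.einsum('ij,jk,ik->i', diff, inv, diff)
    cutoff = np.percentile(md, percentile)
    return md, md > cutoff
```

Unlike raw Euclidean distance, the Mahalanobis distance accounts for the spread and correlation of the cluster's attributes, so a record is flagged only when it is unusual relative to the cluster's own shape.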
Conclusion of Process
 At the conclusion of outlier detection, an output file for each cluster containing the unique
record identifier, original variables, normalized variables, principal components, normalized
principal components, cluster assignments, and Mahalanobis distance information can be
exported to facilitate further analyses and investigations.
 Cluster 2 – final output file of a subset of fields:
 Distinguishing features of cluster 2 records: 1) New accounts (age = 1 month), 2)
Very high incidence of late payments, and 3) Relatively high credit limits,
particularly given the account age and late payment issues.
Record   AccountAge   CreditLimit   AdditionalAssets   LatePayments   model.cluster   md
32430    1            2500          1                  3              2               5.83E-05
65470    1            8500          1                  4              2               0.002371778
78772    1            2200          0                  3              2               0.000442305
