Machine
Learning
Source: Introduction to Machine Learning with Python
Authors: Andreas C. Muller and Sarah Guido
Unit - III
Unsupervised Learning
Agenda
Introduction
Types of Unsupervised
Learning
Challenges in Unsupervised
Learning
Preprocessing and Scaling
Clustering
Introduction
Introduction to Unsupervised
Learning
Introduction
§ Unsupervised learning includes all kinds of machine
learning where there is no known output
§ No teacher to instruct the learning algorithm
§ The learning algorithm is just shown the input data and
asked to extract knowledge from this data
Types of
Unsupervised
learning
Types of Unsupervised learning
§ Two kinds of Unsupervised learning
§ Transformations of the dataset
§ Clustering
Types of
Unsupervised
learning
Unsupervised transformations of a dataset
§ Algorithms that create a new representation of the
data which might be easier for humans or other
machine learning algorithms to understand
compared to the original representation of the data.
Types of
Unsupervised
learning
§ Application of unsupervised transformations is
dimensionality reduction
§ which takes a high-dimensional representation of the
data, consisting of many features, and finds a new way
to represent this data that summarizes the essential
characteristics with fewer features.
Example:
§ Application for dimensionality reduction is reduction
to two dimensions for visualization purposes
Types of
Unsupervised
learning
Unsupervised transformations of a dataset
§ Another application for unsupervised transformations is
finding the parts or components that “make up” the
data
Example: Topic Extraction
§ The task is
§ to find the unknown topics that are talked about in each
document
§ to learn what topics appear in each document
§ tracking the discussion of themes like elections, gun
control, or pop stars on social media
Types of
Unsupervised
learning
Clustering Algorithms
§ Partition data into distinct groups of similar items
EXAMPLE:
§ Uploading photos to a social media site
Challenges in
Unsupervised
Learning
• Evaluating whether the algorithm learned something
useful
• Unsupervised ML algorithms are applied to data that does
not contain any label information ---> we don’t know
what the right output should be
• Very hard to say whether a model “did well”
• There is no way for us to tell the algorithm what we are
looking for and often the only way to evaluate the result of
an unsupervised algorithm is to inspect it manually
• Unsupervised algorithms are often used in an
exploratory setting --> when the data scientist wants to
understand the data better, rather than as part of a larger
automatic system
• Common application for unsupervised algorithms is as a
preprocessing step for supervised algorithms
Preprocessing
and Scaling
Different Kinds of
Preprocessing
Applying Data
Transformations
Scaling Training and Test
Data the Same Way
The Effect of Preprocessing
on Supervised Learning
Preprocessing
and Scaling
Dimensionality Reduction,
Feature Extraction and
Manifold Learning
Non-Negative Matrix
Factorization
Manifold Learning with
t-SNE
Preprocessing
and Scaling
§ Neural networks and SVMs, are very sensitive to the
scaling of the data
§ A common practice is to adjust the features so that the
data representation is more suitable for these
algorithms
Different
Kinds of
Preprocessing
Different
Kinds of
Preprocessing
§ StandardScaler
§ The StandardScaler in scikit-learn ensures that for each
feature the mean is 0 and the variance is 1, bringing all
features to the same magnitude.
§ Disadvantage:
§ This scaling does not ensure any particular
minimum and maximum values for the features
§ RobustScaler
§ It ensures statistical properties for each feature that
guarantee that they are on the same scale.
§ It uses the median and quartiles, instead of mean and
variance
§ Advantage:
§ RobustScaler ignores data points that are very different from
the rest (like measurement errors)
§ These odd data points are also called outliers, and can lead
to trouble for other scaling techniques
Different
Kinds of
Preprocessing
§ MinMaxScaler
§ It shifts the data such that all features are exactly
between 0 and 1
§ For a two-dimensional dataset this means all of the
data is contained within the rectangle created by
X-axis between 0 and 1 and the Y-axis between 0
and 1
§ Normalizer
§ Scales each data point such that the feature vector
has a Euclidean length of 1
§ It projects a data point on the circle (or sphere, in
the case of higher dimensions) with a radius of 1.
§ This normalization is often used when only the
direction of the data matters, not the length of the
feature vector.
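A minimal sketch (not from the slides) comparing the four scalers on a tiny made-up array; the values are chosen only to illustrate how each scaler reacts, including to an outlier in the second feature:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# made-up data: four samples, two features; 100.0 acts like an outlier
X = np.array([[1.0, -10.0],
              [2.0,   0.0],
              [3.0,  10.0],
              [4.0, 100.0]])

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()):
    X_scaled = scaler.fit_transform(X)
    print(type(scaler).__name__)
    print(X_scaled.round(2))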
Applying Data
Transformations
Applying Data
Transformations
§ Transformations on Cancer dataset
§ Preprocessing methods like the scalers are usually applied
before applying a supervised machine learning algorithm
§ Example: Apply kernel SVM (SVC) to the cancer dataset,
and use MinMaxScaler for preprocessing the data
STEP 1: Loading the dataset and splitting it into train and test set
Applying Data
Transformations
STEP 2: Import the class and then instantiate it
STEP 3: Fit the scaler using the fit method, applied to the
training data
STEP 4: Apply the transformation
§ i.e., scale the training data — we use the transform
method of the scaler
§ The transform method is used in scikit-learn whenever a
model returns a new representation of the data.
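A hedged sketch of Steps 1-4 above, using scikit-learn's standard fit/transform interface (the random_state and the printed summaries are illustrative choices, not taken from the slides):

# STEP 1: load the cancer dataset and split it into train and test sets
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

# STEP 2: import the class and instantiate it
scaler = MinMaxScaler()

# STEP 3: fit the scaler on the training data (learns per-feature min and range)
scaler.fit(X_train)

# STEP 4: apply the transformation, i.e. actually rescale the training data
X_train_scaled = scaler.transform(X_train)
print("per-feature minimum after scaling:", X_train_scaled.min(axis=0)[:3])
print("per-feature maximum after scaling:", X_train_scaled.max(axis=0)[:3])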
Applying Data
Transformations
Applying Data
Transformations
STEP 5: Apply SVM to the scaled data
§ To apply the SVM to the scaled data, we also need to transform
the test set
Scaling
Training and
Test Data the
Same Way
• After scaling, the minimum and maximum are
not 0 and 1
• Some of the features are even outside the 0–1
• MinMaxScaler (and all the other scalers) always
applies exactly the same transformation to
the training and the test set
• i.e., the transform method always subtracts the
training set minimum and divides by the training set
range, which might be different from the minimum
and range for the test set
Scaling
Training and
Test Data the
Same Way
• It is important to apply exactly the same
transformation to the training set and the test set
for the supervised model to work on the test set
Scaling
Training and
Test Data the
Same Way
Scaling
Training and
Test Data the
Same Way
Scaling
Training and
Test Data the
Same Way
• First panel: Unscaled two-dimensional dataset
• The training set shown as circles and the test set shown as
triangles
• Second panel: Data is same but scaled using the
MinMaxScaler
• We called fit on the training set, and then called transform on
the training and test sets.
• The dataset in the second panel looks identical to the first; only
the ticks on the axes have changed.
• The features are between 0 and 1
• The minimum and maximum feature values for the test data (the
triangles) are not 0 and 1.
• Third panel: Scaling the training set and test set separately
• The minimum and maximum feature values for both the
training and the test set are 0 and 1
• The test points moved incongruously to the training set, as they
were scaled differently
• The arrangement of the data is changed in an arbitrary way
Scaling
Training and
Test Data the
Same Way
Note:
Shortcuts and efficient alternatives:
§ Often, you want to fit a model on some dataset, and then
transform it.
§ All models that have a transform method also have a
fit_transform method.
§ While fit_transform is not necessarily more efficient for all
models, it is still good practice to use this method when
trying to transform the training set
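A small sketch of the shortcut, assuming the same cancer data split as before; both calls produce the same scaled training set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()
# calling fit and then transform ...
X_scaled = scaler.fit(X_train).transform(X_train)
# ... gives the same result as the fit_transform shortcut
X_scaled_shortcut = scaler.fit_transform(X_train)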
The Effect of
Preprocessing
on Supervised
Learning
The Effect of
Preprocessing
on Supervised
Learning
§ STEP 1: Before scaling
The Effect of
Preprocessing
on Supervised
Learning
§ STEP 2: After applying scaling
The Effect of
Preprocessing
on Supervised
Learning
§ The effect of scaling the data is quite significant
§ Scaling the data doesn’t involve any complicated
math, but don’t try to reimplement them
yourself
§ It is always a good practice to use the scaling
mechanisms provided by scikit-learn
The Effect of
Preprocessing
on Supervised
Learning
We can easily replace one preprocessing algorithm with
another by changing the class we use ---> because all the
preprocessing classes have the same interface
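A hedged sketch of Steps 1 and 2 above and of swapping scalers; SVC(C=100) and the random_state are illustrative choices, and the exact accuracies depend on the scikit-learn version:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# STEP 1: before scaling
svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test accuracy (unscaled): {:.2f}".format(svm.score(X_test, y_test)))

# STEP 2: after scaling -- fit the scaler on the training set only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train)
print("Test accuracy (MinMaxScaler): {:.2f}".format(svm.score(X_test_scaled, y_test)))

# Because all preprocessing classes share the same interface,
# swapping in another scaler is a one-line change
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train)
print("Test accuracy (StandardScaler): {:.2f}".format(svm.score(X_test_scaled, y_test)))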
Dimensionality
Reduction, Feature
Extraction and
Manifold Learning
Principal Component
Analysis
Eigenfaces for feature
extraction
Dimensionality
Reduction,
Feature
Extraction and
Manifold
Learning
§ PCA is a statistical process
§ It converts correlated features into a set of linearly
uncorrelated features -> with the help of an
orthogonal transformation
§ PCA is used for exploratory data analysis and
predictive modeling
§ Applications of PCA:
§ Image processing
§ Movie recommendation system
§ Dimensionality reduction technique in various AI
applications such as computer vision, image
compression, etc
§ Finding hidden patterns when data has high dimensions
Dimensionality
Reduction,
Feature
Extraction and
Manifold
Learning
§ Transforming data using unsupervised learning can
have many motivations
§ Compressing data
§ Finding a representation that is more informative
for further processing (Feature Extraction)
§ Visualization
§ One of the simplest and most widely used algorithms
for all of these is Principal Component Analysis for
dimensionality reduction, feature extraction, feature
selection, data compression, and data visualization
§ Non-negative matrix factorization (NMF) - for
feature extraction
§ t-SNE - for visualization using two dimensional
scatter plots
Principal
Component
Analysis
Principal Component Analysis
§ It is a method that rotates the dataset in a way
such that
§ the rotated features are statistically
uncorrelated
§ This rotation is often followed by selecting
only a subset of the new features, according to
how important they are for explaining the data
Principal
Component
Analysis
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
§ First Plot:
§ The first plot (top left) shows the original data points colored to distinguish
among them
§ Step 1:
§ The algorithm proceeds by first finding the direction of maximum variance, labeled
“Component 1”.
§ This is the direction (or vector) in the data that contains most of the information
§ i.e., the direction along which the features are most correlated with each other
§ Step 2:
§ The algorithm finds the direction that contains the most information while being
orthogonal (at a right angle) to the first direction
§ Note:
§ In two dimensions, there is only one possible orientation that is at a right angle,
§ In higher-dimensional spaces there would be (infinitely) many orthogonal
directions
§ We could have drawn the first component from the center up to the top left
instead of down to the bottom right
§ Principal Components:
§ The directions found using this process are called principal components
§ They are the main directions of variance in the data
§ Note:
§ There are as many principal components as original features
Principal
Component
Analysis
Principal Component Analysis
§ The second plot
§ Step 3:
§ The mean was subtracted from the data, so that the
transformed data is centered around zero
§ Step 4:
§ The first plot is rotated so that the first principal
component aligns with the x-axis
§ The second principal component aligns with the y-
axis
§ Note:
§ In the rotated representation, the two axes are
uncorrelated
§ i.e., Correlation matrix of the data in this representation is zero
except for the diagonal
Principal
Component
Analysis
Principal Component Analysis
§ PCA for Dimensionality Reduction:
§ We can use PCA for dimensionality reduction by retaining
only some of the principal components
§ In this example, we might keep only the first principal
component
§ The Third Plot:
§ Step 5:
§ Reduces the data from a two-dimensional dataset to a one-dimensional
dataset
§ The Fourth Plot:
§ Step 6:
§ Undo the rotation and add the mean back to the data
§ These points are in the original feature space, but we kept only the
information contained in the first principal component
§ Note:
§ This transformation is sometimes used to remove noise effects from the
data or visualize what part of the information is retained using the
principal components
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
§ One of the most common applications of PCA is visualizing high-
dimensional datasets
§ Disadvantage of Scatter Plot:
§ It is hard to create scatter plots of data that has more than two features
§ Pair Plot:
§ A 2D scatter plot that represents the pairwise relationship
between the numerical variables
§ The Iris dataset ---> able to create a pair plot that gave us a partial picture of
the data by showing us all the possible combinations of two features
§ Breast Cancer Dataset:
§ For the Breast Cancer dataset, even using a pair plot is tricky
§ This dataset has 30 features, which would result in 30 * 29 / 2 = 435 scatter plots
§ Histograms:
§ Computing histograms of each of the features for the two classes, benign and
malignant cancer
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
EXAMPLE:
Principal
Component
Analysis
Principal Component Analysis (Cancer Dataset)
Principal
Component
Analysis
§ Created a histogram for each feature -
§ Counting how often a data point appears with a feature in a certain
range (called a bin)
§ Each plot overlays two histograms, one for all of the points in the
benign class (blue) and one for all the points in the malignant class
(red).
§ This gives us some idea of how each feature is
distributed across the two classes
§ Allows us to guess as to which features are better at distinguishing
malignant and benign samples
§ Example:
§ The feature “smoothness error” seems quite uninformative,
because the two histograms mostly overlap
§ The feature “worst concave points” seems quite informative,
because the histograms are quite disjoint
§ NOTE:
• Histogram doesn’t show us anything about the interactions
between variables and how these relate to the classes
• Using PCA, we can capture the main interactions
Principal Component Analysis (Cancer Dataset)
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
Scaling before applying PCA
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
• Learning the PCA transformation and applying it is as
simple as applying a preprocessing transformation
• We instantiate the PCA object, find the principal
components by calling the fit method
• Then apply the rotation and dimensionality reduction by
calling transform
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
• By default, PCA only rotates and shifts the data and
keeps all the principal components
• To reduce the dimensionality of the data, we
need to specify how many components we
want to keep when creating a PCA Object
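A minimal sketch of PCA on the scaled cancer data, keeping the first two components (StandardScaler and n_components=2 follow the description above; the printed shapes are what scikit-learn returns for this dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
# scale before PCA so no feature dominates just because of its units
X_scaled = StandardScaler().fit_transform(cancer.data)

# keep only the first two principal components
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Original shape:", X_scaled.shape)            # (569, 30)
print("Reduced shape:", X_pca.shape)                # (569, 2)
# rows of components_ are the principal components, columns the original features
print("components_ shape:", pca.components_.shape)  # (2, 30)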
Principal
Component
Analysis
Principal Component Analysis
(Breast Cancer Dataset)
Principal
Component
Analysis
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
§ Note:
• PCA is an unsupervised method
• does not use any class information when finding
the rotation
• A linear classifier (that would learn a line in this
space) could do a reasonably good job at
distinguishing the two classes.
Principal
Component
Analysis
Principal Component Analysis
DRAWBACKS:
§ A downside of PCA is that the two axes in the plot are
often not very easy to interpret
§ The principal components correspond to directions in
the original data
§ The Principal components are combinations of the
original features. Hence, these combinations are
usually very complex
Principal
Component
Analysis
§ The principal components themselves are stored in the
components_ attribute
§ Each row in components_ corresponds to one principal
component
§ They are sorted by their importance
§ The columns correspond to the original features
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
§ Visualization using Heatmap
Eigenfaces for
Feature Extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Another application of PCA that we mentioned earlier is
feature extraction
§ The idea behind feature extraction is that it is possible to find a
representation of your data that is better suited to analysis
than the raw representation you were given
§ An application where feature extraction is helpful is with
images
§ Images are made up of pixels, usually stored as red, green, and
blue (RGB) intensities
§ Objects in images are usually made up of thousands of pixels,
and only together are they meaningful
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
EXAMPLE: LFW (Labeled Faces in the Wild) dataset
§ This dataset contains face images of celebrities
downloaded from the Internet
§ It includes faces of politicians, singers, actors, and athletes
from the early 2000s
§ We use grayscale versions of these images, and scale them
down for faster processing
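A hedged sketch of loading the dataset; min_faces_per_person=20 and resize=0.7 are common choices for this example (the download happens on first use and the exact image count can vary with the scikit-learn version):

from sklearn.datasets import fetch_lfw_people

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
print("people.images.shape:", people.images.shape)   # roughly (3023, 87, 65)
print("Number of classes:", len(people.target_names))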
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
• There are 3,023 images, each 87×65 pixels large, belonging to
62 different people
• The dataset is a bit skewed, containing a lot of images of
George W. Bush and Colin Powell
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
EXAMPLE:
Eigenfaces for feature extraction
• To make the data less skewed, we will only take up to 50
images of each person
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
• Face Recognition -
• Ask if a previously unseen face belongs to a known person from a
database
• An eigenface is the name given to a set of eigenvectors when used
in the computer vision problem of human face recognition
• The eigenface approach searches for a low-dimensional
representation of face images
• Applications of Face Recognition:
§ Photo collection
§ Social media
§ Security applications
§ Solution for Face Recognition:
§ To build a classifier
§ where each person is a separate class
§ Usually many different people in face databases, and very few images of the same
person
§ That makes it hard to train most classifiers
§ Simple solution is to use a one-nearest-neighbor classifier
§ looks for the most similar face image to the face you are classifying
§ This classifier could in principle work with only a single training example per class
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
• We obtain an accuracy of 26.6%, which is not actually that
bad for a 62-class classification problem
• Random guessing would give you around 1/62 = 1.6%
accuracy
• But here, we only correctly identify a person about every
fourth time
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Use of PCA -
§ Computing distances in the original pixel space is quite a bad way to measure similarity
between faces
§ When using a pixel representation to compare two images, we compare the grayscale
value of each individual pixel to the value of the pixel in the corresponding position in
the other image
§ This representation is quite different from how humans would interpret the image of a
face and it is hard to capture the facial features using this raw representation
§ Using pixel distances means that shifting a face by one pixel to the right corresponds to a
drastic change, with a completely different representation
§ Using distances along principal components can improve our accuracy
§ Whitening option of PCA:
§ Whitening = Rotation + Rescaling
§ Rescales the principal components to have the same scale
§ Whitening corresponds to not only rotating the data, but also rescaling it so that the center
panel is a circle instead of an ellipse
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ We fit the PCA object to the training data and extract the first 100
principal components.
§ The new data has 100 features, the first 100 principal components.
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Use the new representation to classify our images using a one-nearest-neighbors
classifier
§ Accuracy improved quite significantly, from 26.6% to 35.7%
§ For Image data, Components correspond to directions in the input space
§ The input space here is 87×65-pixel grayscale images, so directions within this
space are also 87×65-pixel grayscale images.
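A hedged sketch of the whole pipeline described above (capping at 50 images per person, a 1-NN baseline on raw pixels, then 1-NN on 100 whitened principal components); the random_state values are illustrative and the exact accuracies will differ slightly from the quoted 26.6% and 35.7%:

import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# keep at most 50 images per person to make the data less skewed
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = True
X_people = people.data[mask] / 255.0   # scale grayscale values to [0, 1]
y_people = people.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)

# baseline: one-nearest-neighbor on raw pixels
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Raw-pixel accuracy: {:.2f}".format(knn.score(X_test, y_test)))

# PCA with whitening: rotate and rescale to the first 100 components
pca = PCA(n_components=100, whiten=True, random_state=0)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn.fit(X_train_pca, y_train)
print("Eigenface accuracy: {:.2f}".format(knn.score(X_test_pca, y_test)))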
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Eigenfaces:
Eigenfaces for
Feature Extraction
§ We cannot understand all aspects of these
components in the images
§ First Component -
§ seems to mostly encode the contrast between the
face and the background.
§ Second Component -
§ encodes differences in lighting between the right
and the left half of the face, and so on
Eigenfaces for
Feature Extraction
• As the PCA model is based on pixels
• the alignment of the face and the lighting both have a strong
influence on how similar two images are in their pixel
representation
• These properties i.e., Alignment and lighting are probably
not what a human would perceive first
• When asking people to rate similarity of faces, they are more
likely to use attributes like age, gender, facial expression,
and hair style, which are attributes that are hard to infer
from the pixel intensities
• Algorithms often interpret data (particularly visual data,
such as images, which humans are very familiar with)
quite differently from how a human would
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ PCA transformation = rotating the data + dropping the
components with low variance
§ Another Trick for PCA Transformation - Express the test
points as a weighted sum of the principal components
§ Try to find some numbers (coefficients x0, x1, etc.) (the new
feature values after the PCA rotation) and express the test
points as a weighted sum of the principal components
§ the reconstructions of the original data using only some
components
§ A similar transformation for the faces by reducing the data to
only some principal components and then rotating back into
the original space.
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Reconstruction of the original data -
§ The reconstructions of the original data using only some
components (Example Fig 3.3)
§ Reducing the data to only some principal components and
then rotating back into the original space
§ This return of the original feature space can be done using
the inverse_transform method
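A short sketch of the reconstruction idea, assuming the same LFW setup as above; inverse_transform rotates the 100-component representation back into pixel space:

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
X = people.data / 255.0

pca = PCA(n_components=100, whiten=True, random_state=0).fit(X)
X_reduced = pca.transform(X)               # keep only 100 components per face
X_back = pca.inverse_transform(X_reduced)  # rotate back into the 87x65-pixel space

image_shape = people.images[0].shape
reconstruction = X_back[0].reshape(image_shape)  # ready to display with plt.imshow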
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
• Reconstructing three face images using increasing
numbers of principal components
• First 10 principal components -
• only the essence of the picture, like the face orientation
and lighting, is captured
• Using more and more principal components, more and
more details in the image are preserved
• Using as many components as there are pixels would
mean that we would not discard any information after
the rotation, and we would reconstruct the image
perfectly
Non-Negative
Matrix
Factorization
Non-Negative
Matrix
Factorization
§ Non-negative matrix factorization is another unsupervised
learning algorithm that aims to extract useful features
§ It works similarly to PCA
§ It can also be used for dimensionality reduction
§ Write each data point as a weighted sum of some
components
§ PCA -
§ wants components that were orthogonal and that explained
as much variance of the data as possible
§ NMF -
§ wants the components and the coefficients to be non-
negative
§ i.e., both the components and the coefficients to be greater
than or equal to zero
§ This method can only be applied to data where each
feature is non-negative --> as a non-negative sum of non-
negative components cannot become negative
Non-Negative
Matrix
Factorization
§ Process of decomposing data into a non-negative
weighted sum is particularly helpful for - data that is
created as the addition (or overlay) of several
independent sources
§ audio track of multiple people speaking
§ music with many instruments
§ In these situations, NMF can identify the original
components that make up the combined data
§ NMF leads to more interpretable components than PCA
§ as negative components and coefficients can lead to hard-to-
interpret cancellation effects
Non-Negative
Matrix
Factorization
Applying NMF to synthetic data
§ we need to ensure that our data is positive for NMF to be
able to operate on the data
§ Where the data lies relative to the origin (0, 0) actually
matters for NMF.
Non-Negative
Matrix
Factorization
Applying NMF to synthetic data
§ Left plot (two components)
§ It is clear that all points in the data can be written as a
positive combination of the two components.
§ If there are enough components to perfectly reconstruct
the data (as many components as there are features)
§ The algorithm will choose directions that point toward
the extremes of the data.
§ Right plot (one component)
§ NMF creates a component that points toward the mean, as
pointing there best explains the data
§ reducing the number of components not only removes
some directions, but creates an entirely different set of
components
Non-Negative
Matrix
Factorization
Applying NMF to synthetic data
§ Components extracted by NMF are
§ not ordered in any specific way
§ all components play an equal part
§ Randomness:
§ NMF uses a random initialization, which might
lead to different results depending on the
random seed
§ data with two components, where all the data
can be explained perfectly, the randomness
has little effect
§ In more complex situations, there might be
more drastic changes
Non-Negative
Matrix
Factorization
Applying NMF to face images
§ LFW dataset
§ Main parameter of NMF is how many components
we want to extract
§ Usually this is lower than the number of input features
§ Number of components impacts how well the data can be
reconstructed using NMF
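A hedged sketch of NMF on the face data; n_components=15 and max_iter=1000 are illustrative settings (pixel intensities are already non-negative, which is what NMF requires):

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import NMF

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
X = people.data / 255.0   # non-negative pixel intensities

nmf = NMF(n_components=15, random_state=0, max_iter=1000)
W = nmf.fit_transform(X)    # coefficients: 15 non-negative weights per image
H = nmf.components_         # components: 15 prototype images of length 87*65
print("W shape:", W.shape)
print("H shape:", H.shape)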
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
• The quality of the back-transformed data is similar to when
using PCA, but slightly worse
• PCA -
• finds the optimum directions in terms of reconstruction
• NMF -
• is usually not used for its ability to reconstruct or encode data,
but rather for finding interesting patterns within the data
• These components are all positive, and so resemble
prototypes of faces much more so than the components
shown for PCA
• Component 3 -
• shows a face rotated somewhat to the right
• Component 7 -
• shows a face somewhat rotated to the left.
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
• Faces that have a high coefficient for component 3 are faces
looking to the right (Figure 3-16)
• Faces with a high coefficient for component 7 are looking
to the left (Figure 3-17)
• Extracting patterns like these works best for data with
additive structure
• audio
• gene expression
• text data
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Recovering mixed signals with NMF
We can use NMF to recover the three signals
Non-Negative
Matrix
Factorization
Recovering mixed signals with NMF
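A hedged sketch of the signal-recovery idea with made-up non-negative sources (the actual example uses a helper to generate its signals; everything below is only an illustration):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
t = np.linspace(0, 8 * np.pi, 2000)

# three made-up non-negative source signals, one per column
S = np.column_stack([np.abs(np.sin(t)),
                     np.abs(np.sin(3 * t + 1)),
                     rng.uniform(0, 1, size=t.shape)])   # shape (2000, 3)

# mix the three sources into 100 observed measurements
A = rng.uniform(size=(3, 100))
X = S.dot(A)                                             # shape (2000, 100)

# NMF tries to recover three non-negative components from the mixture
nmf = NMF(n_components=3, random_state=42, max_iter=1000)
S_recovered = nmf.fit_transform(X)
print("Recovered signal shape:", S_recovered.shape)      # (2000, 3)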
Manifold
Learning
with t-SNE
Manifold
Learning
with t-SNE
§ Advantages of PCA;
§ PCA is often a good first approach for transforming the data so
that we might be able to visualize it using a scatter plot
§ Disadvantages of PCA:
§ The nature of the method (applying a rotation and then dropping
directions) limits its usefulness
Manifold
Learning
with t-SNE
§ Manifold learning -
§ reduces the dimensionality of high-dimensional data by assuming
that the data is embedded in a lower-dimensional nonlinear manifold
§ Class of algorithms for visualization called manifold learning
algorithms
§ allow for much more complex mappings
§ often provide better visualizations
§ many algorithms exist -
§ LLE (Locally Linear Embedding), Isomap, SE (Spectral Embedding), t-SNE
(t-distributed Stochastic Neighbor Embedding)
§ useful one is the t-SNE algorithm
§ Manifold learning algorithms are mainly aimed at visualization
§ rarely used to generate more than two new features
§ t-SNE computes a new representation of the training data, but doesn’t
allow transformations of new data
§ these algorithms cannot be applied to a test set
§ Manifold learning can be useful for exploratory data analysis
Manifold
Learning
with t-SNE
§ Idea behind t-SNE -
§ Machine learning algorithm that is used to visualize high
dimensional data in two or three dimensions
§ embeds high dimensional points into lower dimensions
§ find a two-dimensional representation of the data that
preserves the distances between points as best as possible
§ t-SNE
§ Starts with a random two-dimensional representation for each data
point
§ Tries to place points that are close in the original feature space
closer together
§ Tries to place data points that are far apart in the original feature space
farther apart
Manifold
Learning
with t-SNE
§ t-SNE
§ puts more emphasis on points that are close by rather than
preserving distances between far-apart points
§ i.e., it tries to preserve the information indicating which points are
neighbors to each other
§ EXAMPLE:
§ Handwritten digits
§ Each data point in this dataset is an 8×8 grayscale image of a handwritten
digit between 0 and 9.
Manifold
Learning
with t-SNE
Applying PCA to Handwritten Digits
§ we actually used the true digit classes as characters, to show which
class is where.
§ The digits zero, six, and four are relatively well separated using the
first two principal components, though they still overlap.
§ Most of the other digits overlap significantly.
Manifold
Learning
with t-SNE
Applying PCA to Handwritten Digits
§ PCA to visualize the data reduced to two dimensions.
§ We plot the first two principal components, and represent each
sample with a digit corresponding to its class.
Manifold
Learning
with t-SNE
Scatter Plot using PCA on Handwritten Digits
Manifold
Learning
with t-SNE
Applying t-SNE to Handwritten Digits
§ Because t-SNE does not support transforming new data, the TSNE
class has no transform method
§ we can call the fit_transform method
§ build the model and immediately return the transformed data.
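A minimal sketch on the scikit-learn digits dataset; TSNE's defaults are used, with random_state fixed only for reproducibility:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# TSNE has no transform method; fit_transform builds the model and
# returns the embedded training data in one step
tsne = TSNE(random_state=42)
digits_tsne = tsne.fit_transform(digits.data)
print(digits_tsne.shape)   # (1797, 2): one 2D point per digit image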
Manifold
Learning
with t-SNE
Applying t-SNE to Handwritten Digits
Manifold
Learning
with t-SNE
Applying t-SNE to Handwritten Digits
§ The result of t-SNE is quite remarkable
§ All the classes are quite clearly separated
§ The ones and nines are somewhat split up, but most of the classes
form a single dense group
§ This method has no knowledge of the class labels: it is
completely unsupervised
§ It can find a representation of the data in two dimensions that clearly
separates the classes, based solely on how close points are in the
original space.
§ The t-SNE algorithm has some tuning parameters
§ though it often works well with the default settings.
§ perplexity -
§ controls the effective number of neighbors that each point considers during
the dimensionality reduction process
§ early_exaggeration -
§ Controls how tight natural clusters in the original space are in the embedded
space and how much space will be between them
§ learning_rate
§ max_iter etc.,
Agenda
Clustering
§ Clustering is the task of partitioning the dataset
into groups, called clusters
§ GOAL: The goal is to split up the data in such a way
that points within a single cluster are very similar
and points in different clusters are different
§ Similarly to classification algorithms, clustering
algorithms assign (or predict) a number to each
data point, indicating which cluster a particular
point belongs to
Clustering
Clustering
K-Means Clustering
Agglomerative Clustering
DBSCAN
Clustering K-Means Clustering
Clustering
K-Means Clustering
§ k-means clustering is one of the simplest and
most commonly used clustering algorithms
§ It tries to find cluster centers that are
representative of certain regions of the data
§ The algorithm alternates between two steps:
§ Step 1: Assigning data point to cluster
§ Assigning each data point to the closest cluster
center
§ Step 2: Recalculation of cluster center
§ Setting each cluster center as the mean of the data
points that are assigned to it
§ The algorithm is finished when the assignment of
instances to clusters no longer changes
Clustering
K-Means Clustering
Clustering
K-Means Clustering
§ Cluster centers are shown as triangles
§ Data points are shown as circles
§ Colors indicate cluster membership
§ Three clusters - so the algorithm was initialized by
declaring three data points randomly as cluster centers
(Initialization)
§ Then the iterative algorithm starts
§ First, each data point is assigned to the cluster center it is
closest to (Assign Points (1))
§ The cluster centers are updated to be the mean of the
assigned points (Recompute Centers (1))
§ Then the process is repeated two more times. After the third
iteration, the assignment of points to cluster centers remained
unchanged, so the algorithm stops.
Clustering
K-Means Clustering
§ Given new data points, k-means will assign each to the
closest cluster center.
Clustering
K-Means Clustering
K-Means
Clustering
§ Each training data point in X is assigned a cluster label
§ Find these labels in the kmeans.labels_ attribute
§ Because we asked for three clusters, the clusters are numbered 0
to 2
K-Means
Clustering
§ We can also assign cluster labels to new points, using
the predict method
§ Each new point is assigned to the closest cluster center
when predicting, but the existing model is not changed
§ Running predict on the training set returns the same
result as labels_.
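A minimal sketch on a synthetic blob dataset (make_blobs, n_clusters=3 and the random_state values are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(random_state=1)   # synthetic two-dimensional data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_[:10])       # cluster label (0, 1, or 2) for each training point
print(kmeans.predict(X)[:10])    # identical to labels_ on the training data
print(kmeans.cluster_centers_)   # the three cluster centers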
Clustering
K-Means Clustering
§ Clustering is somewhat similar to Classification
§ The labels themselves have no a priori meaning
§ Example 1 - Two dimensional toy dataset
§ we should not assign any significance to the fact that one group was
labeled 0 and another one was labeled 1.
§ Note 1 -
§ Running the algorithm again might result in a different numbering
of clusters because of the random nature of the initialization
§ Note 2 -
§ The cluster centers are stored in the cluster_centers_ attribute
§ Example 2 - Clustering face images
§ It might be that the cluster 3 found by the algorithm
contains only faces of Bela. You can only know that after you
look at the pictures
§ The number 3 is arbitrary
§ The only information the algorithm gives you is that all
faces labeled as 3 are similar
K-Means
Clustering
K-Means
Clustering
§ We can also use more or fewer cluster centers
K-Means
Clustering
Failure cases of k-means
§ Even if you know the “right” number of clusters for a
given dataset, k-means might not always be able to
recover them
§ Each cluster is defined solely by its center, which
means that each cluster is a convex shape
§ k-means can only capture relatively simple shapes
§ k-means also assumes that all clusters have the same
“diameter” -
§ It always draws the boundary between clusters to be
exactly in the middle between the cluster centers
K-Means
Clustering
K-Means
Clustering
§ Three clusters - Cluster 0, cluster 1, cluster 2
§ cluster 0 and cluster 1 have some points that are far
away from all the other points in these clusters that
“reach” toward the center
K-Means
Clustering
§ k-means also assumes that all directions are equally
important for each cluster
§ The following plot (Figure 3-28) shows a two-
dimensional dataset where there are three clearly
separated parts in the data
§ However, these groups are stretched toward the
diagonal
§ As k-means only considers the distance to the nearest
cluster center, it can’t handle this kind of data
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
• k-means also performs poorly if the clusters have more
complex shapes, like the two_moons
• Here, we would hope that the clustering algorithm can
discover the two half-moon shapes
• However, this is not possible using the k-means
algorithm
K-Means
Clustering
Vector quantization
§ Even though k-means is a clustering algorithm, there
are interesting parallels between k-means and the
decomposition methods like PCA and NMF
§ PCA tries to find directions of maximum variance
in the data
§ NMF tries to find additive components, which often
correspond to “extremes” or “parts” of the data
§ Both methods tried to express the data points as a
sum over some components
§ k-means, on the other hand, tries to represent
each data point using a cluster center
§ In k-means each point being represented using
only a single component, which is given by the
cluster center
K-Means
Clustering
Vector quantization
§ This view of k-means as a decomposition method,
where each point is represented using a single
component, is called vector quantization
§ Comparison of PCA, NMF, and k-means
§ showing the components extracted , as well as
reconstructions of faces from the test set using 100
components
§ For k-means, the reconstruction is the closest cluster
center found on the training set
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
§ One interesting advantage of k-means -
§ An interesting aspect of vector quantization using k-
means is that we can use many more clusters than
input dimensions to encode our data
§ Example: two_moons data
§ Using PCA or NMF, there is nothing much we can do to this
data, as it lives in only two dimensions
§ Reducing it to one dimension with PCA or NMF would
completely destroy the structure of the data
§ But we can find a more expressive representation with
k-means, by using more cluster centers
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
§ We used 10 cluster centers - means each point is now
assigned a number between 0 and 9
§ We can see this as the data being represented using 10
components (that is, we have 10 new features)
§ Using this 10-dimensional representation, it would now be
possible to separate the two half-moon shapes using a
linear model, which would not have been possible using
the original two features
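A hedged sketch of this idea on two_moons; the noise level and the use of kmeans.transform (distances to the 10 centers) as the new features are illustrative choices:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# use many more cluster centers (10) than input dimensions (2)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
kmeans.fit(X)
print("Cluster memberships:", kmeans.labels_[:10])

# the distance to each of the 10 centers gives a 10-dimensional
# representation of the 2D data (vector quantization as features)
distance_features = kmeans.transform(X)
print("Distance feature shape:", distance_features.shape)   # (200, 10)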
K-Means
Clustering
K-Means
Clustering
Advantages
§ k-means is a very popular algorithm for clustering
§ Relatively easy to understand and implement
§ It runs relatively quickly
§ The k-means clustering algorithm is guaranteed to
give results (Convergence)
§ It is not specific to particular problems. (i.e., can be
applied for numerical data to text) (Generalization)
§ k-means scales easily to large datasets
§ NOTE:
§ MiniBatchKMeans class - can handle very large
datasets
K-Means
Clustering
Disadvantages
§ It relies on a random initialization, which means the outcome
of the algorithm depends on a random seed
§ Deciding on the number of clusters to start is difficult (can
use elbow method)
§ Choice of initial centroids is difficult
§ Effect of outliers
§ Curse of dimensionality
§ Preprocessing is mandatory
§ Restrictive assumptions made on the shape of clusters
§ The requirement to specify the number of clusters you are
looking for (which might not be known in a real-world
application)
Clustering Agglomerative Clustering
Agglomerative
Clustering
§ Agglomerative clustering refers to a collection of
clustering algorithms that all build upon the same
principles:
§ The algorithm starts by declaring each point its own
cluster
§ then merges the two most similar clusters until some
stopping criterion is satisfied
§ The stopping criterion -
§ Number of clusters
§ so similar clusters are merged until only the specified
number of clusters are left
§ Most similar cluster is identified by considering several
linkage criteria
§ This measure is always defined between two existing
clusters
Agglomerative
Clustering
§ The following three choices (Linkages) are implemented in scikit-learn:
§ Ward
§ The default choice
§ Ward picks the two clusters to merge such that the variance within
all clusters increases the least
§ This often leads to clusters that are relatively equally sized
§ Average
§ Merges the two clusters that have the smallest average distance
between all their points
§ Complete
§ Also known as maximum linkage
§ Merges the two clusters that have the smallest maximum distance
between their points
§ Note 1-
§ Ward works on most datasets
§ Note 2 -
§ If the clusters have very dissimilar numbers of members, average or complete
might work better
Agglomerative
Clustering
§ This plot illustrates the progression of agglomerative clustering on a two-
dimensional dataset, looking for three clusters.
Agglomerative
Clustering
§ Initially, each point is its own cluster
§ Then, in each step, the two clusters that are closest
are merged
§ In the first four steps, two single-point clusters are
picked and these are joined into two-point
clusters
§ In step 5, one of the two-point clusters is extended
to a third point, and so on
§ In step 9, there are only three clusters remaining
§ As we specified that we are looking for three
clusters, the algorithm then stops
Agglomerative
Clustering
§ Because of the way the algorithm works,
agglomerative clustering cannot make
predictions for new data points
§ AgglomerativeClustering has no predict method
§ To build the model and get the cluster
memberships on the training set, use the
fit_predict method instead
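A minimal sketch on synthetic blobs; since AgglomerativeClustering has no predict method, fit_predict is used instead:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)

# build the model and return the cluster membership of each training point
agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)
print(assignment[:10])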
Agglomerative
Clustering
Agglomerative
Clustering
§ While the scikit-learn implementation of
agglomerative clustering requires you to
specify the number of clusters
§ Agglomerative clustering methods
provide some help with choosing the right
number of clusters
Agglomerative
Clustering
Hierarchical clustering and
dendrograms
§ Agglomerative clustering produces what is
known as a hierarchical clustering
§ The clustering proceeds iteratively
§ Every point makes a journey from being a single
point cluster to belonging to some final
cluster
§ Each intermediate step provides a clustering of
the data (with a different number of clusters)
Agglomerative
Clustering
The following figure shows an overlay of all the possible
clusterings shown in Figure 3-33, providing some insight into
how each cluster breaks up into smaller clusters.
Agglomerative
Clustering
Hierarchical Clustering and Dendrograms
§ The overlay visualization above relies on the two-
dimensional nature of the data
§ It therefore cannot be used on datasets that have
more than two features
§ Dendrograms -
§ Another tool to visualize hierarchical
clustering, called a dendrogram
§ can handle multidimensional datasets
Agglomerative
Clustering
Hierarchical Clustering and
Dendrograms
§ The dendrogram is a tree-like structure
that records each merge step of the clustering
§ scikit-learn currently does not have the
functionality to draw dendrograms
§ Dendrograms can be generated easily
using SciPy
Agglomerative
Clustering
SciPy vs scikit-learn
§ SciPy clustering algorithms have a
slightly different interface to the scikit-
learn clustering algorithms
§ SciPy provides a function that
§ Takes a data array X
§ Computes a linkage array, which encodes
hierarchical cluster similarities
Agglomerative
Clustering
§ We can then feed this linkage array into the scipy
dendrogram function to plot the dendrogram
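A minimal sketch using SciPy's ward and dendrogram functions on a 12-point synthetic dataset (the dataset itself is an illustrative stand-in):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)

# ward returns a linkage array encoding the hierarchical cluster merges
linkage_array = ward(X)

# plot the dendrogram for the linkage array of the 12 points
dendrogram(linkage_array)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()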
Agglomerative
Clustering
Agglomerative
Clustering
§ The dendrogram
§ shows data points as points on the bottom (i.e., X-axis)
(numbered from 0 to 11)
§ shows Cluster Distance on Y- axis
§ Then, a tree is plotted with these points
(representing single-point clusters) as the leaves,
and a new parent node is added for each two
clusters that are joined
§ Reading from bottom to top, the data points 1 and
4 are joined first (as you could see in Figure 3-33).
§ Next, points 6 and 9 are joined into a cluster, and
so on. At the top level, there are two branches,
one consisting of points 11, 0, 5, 10, 7, 6, and 9, and
the other consisting of points 1, 4, 3, 2, and 8.
§ These correspond to the two largest clusters
Agglomerative
Clustering
§ The y-axis in the dendrogram
§ specifies when two clusters get merged
§ The length of each branch also shows
how far apart the merged clusters are
§ The longest branches in this dendrogram
are the three lines that are marked by the
dashed line labeled “three clusters”
§ Going from three to two clusters meant
merging some very far-apart points
§ We see this again at the top of the chart, where merging
the two remaining clusters into a single
cluster bridges a relatively large
distance
Agglomerative
Clustering
Drawbacks of Agglomerative Clustering
§ Fails at separating complex shapes
(Example: two_moons )
Clustering DBSCAN
DBSCAN
DBSCAN
§ DBSCAN - Density-Based Spatial Clustering of
Applications with Noise
§ Another very useful clustering algorithm
§ Benefits:
§ It does not require the user to set the number of
clusters a priori
§ It can capture clusters of complex shapes
§ It can identify points that are not part of any cluster
§ Drawbacks:
§ Somewhat slower than agglomerative clustering and k-
means, but still scales to relatively large datasets
DBSCAN
DBSCAN
§ Functionality:
§ DBSCAN works by identifying points that are in
“crowded” regions of the feature space, where
many data points are close together
§ These regions are referred to as dense regions in
feature space
§ The idea behind DBSCAN is that clusters form
dense regions of data, separated by regions that
are relatively empty
DBSCAN
DBSCAN
§ Core Samples:
§ Points that are within a dense region are called
core samples
§ Also called as core points
§ Parameters to identify core samples:
§ min_samples
§ eps
§ If there are at least min_samples many data points
within a distance of eps to a given data point, that
data point is classified as a core sample
§ Core samples that are closer to each other than the
distance eps are put into the same cluster by
DBSCAN
DBSCAN
DBSCAN Algorithm
§ Step 1: Picks an arbitrary point to start with
§ Step 2: Finds all points with distance eps or less from that point
§ Step 3: If there are fewer than min_samples points within
distance eps of the starting point - this point is labeled as noise
(i.e., it doesn’t belong to any cluster)
§ Step 4: If there are more than min_samples points within a
distance of eps, the point is labeled a core sample - assigned
a new cluster label
§ Step 5: All neighbors (within eps) of the point are visited
§ Step 5.1: If they have not been assigned a cluster yet, they are
assigned the new cluster label that was just created
§ Step 5.2: If they are core samples, their neighbors are
visited in turn, and so on.
§ Step 5.3: The cluster grows until there are no more core samples
within distance eps of the cluster
DBSCAN
§ Step 6: Another point that hasn’t yet been visited is
picked, and the same procedure is repeated
§ Finally we end up with three kinds of points
§ Core points
§ Boundary Points - Points that are within distance eps of
core points (called boundary points)
§ Noise - Points that do not belong to any cluster
DBSCAN
§ Note 1:
§ When the DBSCAN algorithm is run on a particular
dataset multiple times, there will not be any change
in Core points and Noise
§ (i.e., the clustering of the core points is always the same,
and the same points will always be labeled as noise)
§ Note 2:
§ When the DBSCAN algorithm is run on a
particular dataset multiple times, the
boundary points may change
§ i.e., A boundary point might be neighbor to
core samples of more than one cluster.
§ Note 3:
§ The cluster membership - of boundary points
depends on the order in which points are visited
DBSCAN
DBSCAN on the synthetic dataset
§ DBSCAN does not allow predictions on new test data,
so we will use the fit_predict method to perform
clustering and return the cluster labels in one step
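A minimal sketch on a small synthetic dataset; with the default eps=0.5 and min_samples=5 on only 12 points, every point typically comes out as noise (label -1):

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, y = make_blobs(random_state=0, n_samples=12)

# DBSCAN cannot predict on new data, so fit_predict clusters and
# returns the labels in one step; -1 marks noise points
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Cluster memberships:", clusters)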
DBSCAN
DBSCAN
§ Points that belong to clusters are solid
§ Noise points are shown in white
§ Core samples are shown as large markers
§ Boundary points are displayed as smaller markers
§ Increasing eps (going from left to right in the figure)
§ means that more points will be included in a cluster
§ This makes clusters grow, but might also lead to
multiple clusters joining into one
§ Increasing min_samples (going from top to bottom
in the figure)
§ means that fewer points will be core points, and more
points will be labeled as noise
DBSCAN
§ Parameter eps:
§ most important parameter
§ it determines what it means for points to be “close”
§ Very small eps -
§ means - NO points are core samples
§ Leads to - all points being labeled as noise
§ Very large eps -
§ All points forming a single cluster
§ Parameter min_samples:
§ The min_samples - mostly determines whether points
in less dense regions will be labeled as outliers or as
their own clusters
§ Large min_samples - many samples will now be
labeled as noise
§ determines the minimum cluster size
DBSCAN
§ Note 1:
§ While DBSCAN doesn’t require setting the number of
clusters explicitly, setting eps implicitly controls how
many clusters will be found
§ Note 2:
§ Finding a good setting for eps is sometimes easier after
scaling the data using StandardScaler or MinMaxScaler
DBSCAN
DBSCAN on the two_moons dataset
§ The algorithm actually finds the two half-circles and
separates them using the default settings.
DBSCAN
DBSCAN
§ As the algorithm produced the desired number of
clusters (two)
§ Default parameter (eps=0.5) settings seem to work well
§ If we decrease eps to 0.2 we will get eight clusters
§ Increasing eps to 0.7 results in a single cluster
§ When using DBSCAN, you need to be careful about
handling the returned cluster assignments
DBSCAN
Comparing and Evaluating Clustering
Algorithms
§ Challenges in clustering algorithms -
§ Very hard to assess how well an algorithm
worked
§ To compare outcomes between different
algorithms
DBSCAN
Evaluating clustering with ground truth
§ Metrics to assess the outcome of a clustering algorithm
§ Adjusted Rand Index (ARI)
§ Normalized Mutual Information (NMI)
§ Both provides a quantitative measure
§ Clustering - 1
§ Unrelated Clusterings - 0
§ ARI can become negative
§ Compare the k-means, agglomerative clustering, and
DBSCAN algorithms using ARI
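A hedged sketch of the comparison on the scaled two_moons data; the dataset parameters and n_clusters=2 are illustrative choices:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

algorithms = [KMeans(n_clusters=2, n_init=10, random_state=0),
              AgglomerativeClustering(n_clusters=2),
              DBSCAN()]

for algorithm in algorithms:
    clusters = algorithm.fit_predict(X_scaled)
    # ARI compares the found clusters against the known two-moon labels
    print("{}: ARI = {:.2f}".format(type(algorithm).__name__,
                                    adjusted_rand_score(y, clusters)))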
DBSCAN
DBSCAN
Common Mistake when evaluating clustering
§ Use of accuracy_score instead of adjusted_rand_score and
normalized_mutual_info_score,
DBSCAN
Evaluating clustering without ground truth (O/P)
§ In practice, there is a big problem with using measures like ARI
§ In Clustering algorithms -
§ there is usually no ground truth to which to compare the results
§ Metrics like ARI and NMI -
§ only helps in developing algorithms
§ NOT in assessing success in an application
§ Silhouette coefficient -
§ Another metric for clustering
§ Doesn’t require ground truth
§ Computes the compactness of a cluster
§ Note 1:
§ Compactness doesn’t allow for complex shapes
§ Note 2:
§ These output metrics often don’t work well in practice
DBSCAN
Comparison using the silhouette score
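A hedged sketch of the silhouette comparison on the same scaled two_moons data; for simplicity DBSCAN's noise label (-1) is scored as if it were just another cluster:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for algorithm in [KMeans(n_clusters=2, n_init=10, random_state=0),
                  AgglomerativeClustering(n_clusters=2),
                  DBSCAN()]:
    clusters = algorithm.fit_predict(X_scaled)
    # silhouette measures cluster compactness, without using the true labels y
    print("{}: silhouette = {:.2f}".format(type(algorithm).__name__,
                                           silhouette_score(X_scaled, clusters)))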
DBSCAN
Observations
§ k-means gets the highest silhouette score
§ We might prefer the result produced by DBSCAN
§ Better strategy:
§ for evaluating clusters - use robustness-based
clustering metrics
§ These run an algorithm
§ after adding some noise to the data
§ using different parameter settings
§ Then compare the outcomes
DBSCAN
Face Images Example
§ Note:
§ Even if we get very high silhouette score -
§ Still don’t know if there is any semantic meaning in the clustering
§ Whether the clustering reflects an aspect of the data that we are
interested in
§ Face Images Example:
§ Goal is to find groups of similar faces — men and women, or old people and
young people, or people with beards and without
§ Target:
§ Cluster the data into two clusters
§ Drawbacks:
§ We still don’t know if the clusters that are found correspond in any way to the
concepts we are interested in
§ The clusters may find side views versus front views, or pictures taken at
night versus pictures taken during the day, or pictures taken with
iPhones versus pictures taken with Android phones
§ The only way to know whether the clustering corresponds to anything we are
interested in is to analyze the clusters manually
DBSCAN
Comparing algorithms on the faces dataset
§ Use Eigenface representation of the data, as produced
by PCA(whiten=True), with 100 components
§ The output has more semantic representation of the
face images than the raw pixels
§ It will also make computation faster
DBSCAN
Analyzing the faces dataset with DBSCAN
§ All the returned labels are –1
§ All of the data was labeled as “noise” by DBSCAN.
§ Solution:
§ eps = Higher
§ expand the neighborhood of each point
§ min_samples = Lower
§ to consider smaller groups of points as clusters
DBSCAN
§ Lowering min_samples (e.g., to 3)
§ Result:
§ Everything is labeled as noise
DBSCAN
§ Increasing eps (e.g., to 15)
§ Result:
§ Only One Cluster (0) is formed along with noise (-1)
DBSCAN
§ Use this result to find out what the “noise” looks like
compared to the rest of the data
§ 27 points of noise and 2036 points are inside the cluster
DBSCAN
§ Noise Points
DBSCAN
§ Why they are considered as noise?
§ the fifth image in the first row - person drinking from a
glass
§ Images of people wearing hats
§ Last image - hand in front of the person’s face
§ other images - contain odd angles or crops that are too
close or too wide
§ We can do little about people in photos who are wearing
hats, drinking, or holding something in front of their faces
§ Outlier Detection:
§ This kind of analysis — trying to find “the odd one out” —
is called outlier detection
§ Solution:
§ do a better job of cropping images
DBSCAN
§ For more clusters:
§ Need to set a smaller eps, somewhere between 0.5 (the default) and 15
DBSCAN
DBSCAN
DBSCAN
Analyzing the faces dataset
§ Some of the clusters correspond to people with very
distinct faces (within this dataset), such as Sharon or
Koizumi
§ Within each cluster, the orientation of the face is also quite
fixed, as well as the facial expression
§ Some of the clusters contain faces of multiple people, but
they share a similar orientation and expression
§ Note:
§ We are doing a manual analysis here
§ Different from the supervised learning based on R2 score or
accuracy
DBSCAN
Analyzing the faces dataset with k-means
§ Disadvantage of DBSCAN on Face Dataset -
§ Not possible to create more than one big cluster using
DBSCAN
§ Pros and Cons of Agglomerative clustering and k-
means -
§ Pros -
§ Can create clusters of even size
§ Cons -
§ Need to set a target number of clusters a priori
§ Number of clusters = Number of people in the dataset
§ Still cannot recover all the clusters correctly
§ Solution -
§ Start with a low number of clusters (eg., 10) - Analyze
each of the clusters manually
§ Increase the number of clusters if necessary
DBSCAN
§ K-Means -
§ Partitioned the data into relatively similarly sized clusters
from 64 to 386
§ This is quite different from the result of DBSCAN
DBSCAN
§ Visualization of outcome of k-means
§ As we clustered in the representation produced by PCA,
we need to rotate the cluster centers back into the
original space to visualize them, using
pca.inverse_transform.
DBSCAN
§ The cluster centers found by k-means are very smooth
versions of faces
§ Each center is an average of 64 to 386 face images
§ The clustering seems to pick up on
§ different orientations of the face
§ different expressions (the third cluster center seems to
show a smiling face)
§ the presence of shirt collars (see the second-to-last
cluster center).
DBSCAN
§ More detailed view -
§ In Figure 3-44
§ Each cluster center shows -
§ The five most typical images in the cluster -
§ the images assigned to the cluster that are closest to
the cluster center
§ The five most atypical images in the cluster -
§ the images assigned to the cluster that are furthest
from the cluster center
DBSCAN
DBSCAN
§ Third Cluster - Smiling Faces
§ Other clusters - Orientation
§ Atypical points -
§ are not very similar to the cluster centers
§ Their assignment seems somewhat arbitrary
§ k-means partitions doesn’t have a concept of “noise”
points
§ Using a larger number of clusters, the algorithm could
find finer distinctions
§ Note:
§ Adding more clusters makes manual inspection even
harder
DBSCAN
Analyzing the faces dataset with agglomerative
clustering
§ Agglomerative clustering also produces
§ relatively equally sized clusters
§ with cluster sizes between 26 and 623
§ More uneven than those produced by k-means
§ Much more even than the ones produced by DBSCAN
DBSCAN
§ Compute ARI -
§ to measure the similarity of two partitions by
Agglomerative and K-Means
§ ARI = 0.13
§ means that the two clusterings labels_agg and labels_km
have little in common
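A sketch of that comparison; both clusterings are recomputed here so the snippet is self-contained, and the exact ARI may differ slightly from the 0.13 on the slide depending on random initialization and dataset version.

import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Eigenface setup as before
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask] / 255.0
X_pca = PCA(n_components=100, whiten=True, random_state=0).fit_transform(X_people)

labels_km = KMeans(n_clusters=10, random_state=0).fit_predict(X_pca)
labels_agg = AgglomerativeClustering(n_clusters=10).fit_predict(X_pca)

# An ARI close to 0 means the two partitions have little in common
print("ARI: {:.2f}".format(adjusted_rand_score(labels_agg, labels_km)))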
DBSCAN
§ Dendrogram -
§ We’ll limit the depth of the tree in the plot, as branching
down to the individual 2,063 data points would result in an
unreadably dense plot.
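scikit-learn does not plot dendrograms itself, so a sketch using SciPy's ward and dendrogram functions, truncated to a limited depth, might look like this (same eigenface setup as in the earlier snippets).

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# Eigenface setup as before
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask] / 255.0
X_pca = PCA(n_components=100, whiten=True, random_state=0).fit_transform(X_people)

# Linkage array encoding the full hierarchy; truncate the plot to 7 levels
linkage_array = ward(X_pca)
plt.figure(figsize=(20, 5))
dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()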
DBSCAN
Agglomerative with ten clusters
§ 10 clusters (Figure 3-46)
§ There is no notion of a cluster center in agglomerative clustering, so arbitrary data points from each cluster are shown instead
§ The number of points in each cluster is shown to the left of the first image
DBSCAN
§ While some of the clusters seem to have a semantic
theme, many of them are too large to be actually
homogeneous
§ To get more homogeneous clusters - run the algorithm
again, this time with 40 clusters
DBSCAN
§ Agglomerative clustering with 40 clusters (Figure 3-47) - some clusters show a clear theme, such as:
§ dark skinned and smiling
§ collared shirt
§ smiling woman
§ Hussein
§ high forehead
§ We could also find these highly similar clusters using dendrograms
DBSCAN
Summary of Clustering Methods
§ Applying and evaluating clustering is a highly qualitative procedure
§ Most helpful in the exploratory phase of data analysis
§ Three clustering algorithms:
§ k-means
§ DBSCAN,
§ Agglomerative
§ All three have a way of controlling the granularity of clustering
§ k-means and agglomerative clustering allow you to specify the number of desired clusters
§ DBSCAN lets you define proximity using the eps parameter, which indirectly influences the number and size of clusters
§ All three methods
§ can be used on large, real-world datasets
§ are relatively easy to understand
§ allow for clustering into many clusters
DBSCAN
Summary of Clustering Methods
§ Strengths -
§ k-means -
§ k-means allows for a characterization of the clusters using the
cluster means
§ It can also be viewed as a decomposition method, where each
data point is represented by its cluster center
§ DBSCAN -
§ Allows for the detection of “noise points” (i.e., data points that are not assigned to any cluster)
§ It can help automatically determine the number of clusters
§ Allows for complex cluster shape
§ Sometimes produces clusters of very differing size, which can be
a strength or a weakness
§ Agglomerative clustering -
§ Provides a whole hierarchy of possible partitions of the data
§ which can be easily inspected via dendrograms
Thank you


Machine Learning - Implementation with Python - 3.pdf

  • 1. Machine Learning Source: Introduction to Machine Learning with Python Authors: Andreas C. Muller and Sarah Guido
  • 3. Agenda Introduction Types of Unsupervised Learning? Challenges in Unsupervised Learning Preprocessing and Scaling Clustering
  • 5. Introduction § Unsupervised learning includes all kinds of machine learning where there is no known output § No teacher to instruct the learning algorithm § The learning algorithm is just shown the input data and asked to extract knowledge from this data
  • 6. Types of Unsupervised learning Types of Unsupervised learning § Two kinds of Unsupervised learning § Transformations of the dataset § Clustering
  • 7. Types of Unsupervised learning Unsupervised transformations of a dataset § Algorithms that create a new representation of the data which might be easier for humans or other machine learning algorithms to understand compared to the original representation of the data.
  • 8. Types of Unsupervised learning § Application of unsupervised transformations is dimensionality reduction § which takes a high-dimensional representation of the data, consisting of many features, and finds a new way to represent this data that summarizes the essential characteristics with fewer features. Example: § Application for dimensionality reduction is reduction to two dimensions for visualization purposes
  • 9. Types of Unsupervised learning Unsupervised transformations of a dataset § Another application for unsupervised transformations is finding the parts or components that “make up” the data Example: Topic Extraction § The task is § to find the unknown topics that are talked about in each document § to learn what topics appear in each document § tracking the discussion of themes like elections, gun control, or pop stars on social media
  • 10. Types of Unsupervised learning Clustering Algorithms § Partition data into distinct groups of similar items EXAMPLE: § Uploading photos to a social media site
  • 11. Challenges in Unsupervised Learning • Evaluating whether the algorithm learned something useful • Unsupervised ML algorithms are applied to data that does not contain any label information ---> we don’t know what the right output should be • Very hard to say whether a model “did well” • There is no way for us to tell the algorithm what we are looking for and often the only way to evaluate the result of an unsupervised algorithm is to inspect it manually • Unsupervised algor ithms are used often in an exploratory setting --> when the data scientist wants to understand the data better, rather than as part of a larger automatic system • Common application for unsupervised algorithms is as a preprocessing step for supervised algorithms
  • 12. Preprocessing and Scaling Different Kinds of Preprocessing Applying Data Transformations Scaling Training and Test Data the Same Way The Effect of Preprocessing on Supervised Learning
  • 13. Preprocessing and Scaling Dimensionality Reduction, Feature Extraction and Manifold Learning Non-Negative Matrix Factorization Manifold Learning with t-SNE
  • 14. Preprocessing and Scaling § Neural networks and SVMs, are very sensitive to the scaling of the data § A common practice is to adjust the features so that the data representation is more suitable for these algorithms
  • 16. Different Kinds of Preprocessing § StandardScaler § The StandardScaler in scikit-learn ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. § Disadvantage: § This scaling does not ensure any particular minimum and maximum values for the features § RobustScaler § It ensures statistical properties for each feature that guarantee that they are on the same scale. § It uses the median and quartiles, instead of mean and variance § Advantage: § RobustScaler ignore data points that are very different from the rest (like measurement errors) § These odd data points are also called outliers, and can lead to trouble for other scaling techniques
  • 17. Different Kinds of Preprocessing § MinMaxScaler § It shifts the data such that all features are exactly between 0 and 1 § For a two-dimensional dataset this means all of the data is contained within the rectangle created by X-axis between 0 and 1 and the Y-axis between 0 and 1 § Normalizer § Scales each data point such that the feature vector has a Euclidean length of 1 § It projects a data point on the circle (or sphere, in the case of higher dimensions) with a radius of 1. § This normalization is often used when only the direction of the data matters, not the length of the feature vector.
  • 19. Applying Data Transformations § Transformations on Cancer dataset § Preprocessing methods like the scalers are usually applied before applying a supervised machine learning algorithm § Example: Apply kernel SVM (SVC) to the cancer dataset, and use MinMaxScaler for preprocessing the data STEP 1: Loading the dataset and splitting it into train and test set
  • 20. Applying Data Transformations STEP 2: Import the class and then instantiate it STEP 3: Fit the scaler using the fit method, applied to the training data
  • 21. STEP 4: Apply the transformation § i.e., scale the training data — we use the transform method of the scaler § The transform method is used in scikit-learn whenever a model returns a new representation of the data. Applying Data Transformations
  • 22. Applying Data Transformations STEP 5: Apply SVM to the scaled data § To apply the SVM to the scaled data, we also need to transform the test set
  • 23. Scaling Training and Test Data the SameWay • After scaling, the minimum and maximum are not 0 and 1 • Some of the features are even outside the 0–1 • MinMaxScaler (and all the other scalers) always applies exactly the same transformation to the training and the test set • i.e., the transform method always subtracts the training set minimum and divides by the training set range, which might be different from the minimum and range for the test set
  • 24. Scaling Training and Test Data the SameWay • It is important to apply exactly the same transformation to the training set and the test set for the supervised model to work on the test set
  • 27. Scaling Training and Test Data the SameWay • First panel: Unscaled two-dimensional dataset • The training set shown as circles and the test set shown as triangles • Second panel: Data is same but scaled using the MinMaxScaler • We called fit on the training set, and then called transform on the training and test sets. • The dataset in the second panel looks identical to the first only the ticks on the axes have changed. • The features are between 0 and 1 • The minimum and maximum feature values for the test data (the triangles) are not 0 and 1. • Third panel: Scaling the training set and test set separately • The minimum and maximum feature values for both the training and the test set are 0 and 1 • The test points moved incongruously to the training set, as they were scaled differently • The arrangement of the data is changed in an arbitrary way
  • 28. Scaling Training and Test Data the SameWay Note: Shortcuts and efficient alternatives: § Often, you want to fit a model on some dataset, and then transform it. § All models that have a transform method also have a fit_transform method. § While fit_transform is not necessarily more efficient for all models, it is still good practice to use this method when trying to transform the training set
  • 29. The Effect of Preprocessing on Supervised Learning
  • 30. The Effect of Preprocessing on Supervised Learning § STEP 1: Before scaling
  • 31. The Effect of Preprocessing on Supervised Learning § STEP 2: After applying scaling
  • 32. The Effect of Preprocessing on Supervised Learning § The effect of scaling the data is quite significant § Scaling the data doesn’t involve any complicated math but don’t tr y to reimplement them yourself § It is always a good practice to use the scaling mechanisms provided by scikit-learn
  • 33. The Effect of Preprocessing on Supervised Learning We can easily replace one preprocessing algorithm with another by changing the class we use ---> because all the preprocessing classes have the same interface
  • 35. Dimensionality Reduction, Feature Extraction and Manifold Learning § It is a statistical process § converts correlated features into a set of linearly uncorrelated features -> with the help of orthogonal transformation § PCA is used for exploratory data analysis and predictive modeling § Applications of PCA: § Image processing § Movie recommendation system § Dimensionality reduction technique in various AI applications such as computer vision, image compression, etc § finding hidden patterns if data has high dimensions
  • 36. Dimensionality Reduction, Feature Extraction and Manifold Learning § Transforming data using unsupervised learning can have many motivations § Compressing data § Finding a representation that is more informative for further processing (Feature Extraction) § Visualization § One of the simplest and most widely used algorithms for all of these is Principal Component Analysis for dimensionality reduction, feature extraction, feature selection, data compression, and data visualization § Non-negative matrix factorization (NMF) - for feature extraction § t-SNE - for visualization using two dimensional scatter plots
  • 37. Principal Component Analysis Principal Component Analysis § It is a method that rotates the dataset in a way such that § the rotated features are statistically uncorrelated § This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data
  • 39. Principal Component Analysis Principal Component Analysis § First Plot: § The first plot (top left) shows the original data points colored to distinguish among them § Step 1: § The algorithm proceeds by first finding the direction of maximum variance, labeled “Component 1”. § This direction (or vector) in the data that contains most of the information § i.e., the direction along which the features are most correlated with each other § Step 2: § The algorithm finds the direction that contains the most information while being orthogonal (at a right angle) to the first direction § Note: § In two dimensions, there is only one possible orientation that is at a right angle, § In higher-dimensional spaces there would be (infinitely) many orthogonal directions § We could have drawn the first component from the center up to the top left instead of down to the bottom right § Principal Components: § The directions found using this process are called principal components § They are the main directions of variance in the data § Note: § There are as many principal components as original features
  • 40. Principal Component Analysis Principal Component Analysis § The second plot § Step 3: § The mean was subtracted from the data, so that the transformed data is centered around zero § Step 4: § The first plot is rotated so that the first principal component aligns with the x-axis § The second principal component aligns with the y- axis § Note: § In the rotated representation, the two axes are uncorrelated § i.e., Correlation matrix of the data in this representation is zero except for the diagonal
  • 41. Principal Component Analysis Principal Component Analysis § PCA for Dimensionality Reduction: § We can use PCA for dimensionality reduction by retaining only some of the principal components § In this example, we might keep only the first principal component § The Third Plot: § Step 5: § Reduces the data from a two-dimensional dataset to a one-dimensional dataset § The Fourth Plot: § Step 6: § Undo the rotation and add the mean back to the data § These points are in the original feature space, but we kept only the information contained in the first principal component § Note: § This transformation is sometimes used to remove noise effects from the data or visualize what part of the information is retained using the principal components
  • 42. Principal Component Analysis Principal Component Analysis (Cancer Dataset) § One of the most common applications of PCA is visualizing high- dimensional datasets § Disadvantage of Scatter Plot: § It is hard to create scatter plots of data that has more than two features § Pair Plot: § A 2D categorical scatter plot that represents the pair wise relationship between the numerical variables § The Iris dataset ---> able to create a pair plot that gave us a partial picture of the data by showing us all the possible combinations of two features § Breast Cancer Dataset: § The Breast Cancer dataset, even using a pair plot is tricky § This dataset has 30 features, which would result in 30 * 14 = 420 scatter plots § Histograms: § Computing histograms of each of the features for the two classes, benign and malignant cancer
  • 45. Principal Component Analysis § Created Histogam for each feature - § Counting how often a data point appears with a feature in a certain range (called a bin) § Each plot overlays two histograms, one for all of the points in the benign class (blue) and one for all the points in the malignant class (red). § This gives us some idea of how each feature is distributed across the two classes § Allows us to guess as to which features are better at distinguishing malignant and benign samples § Example: § The feature “smoothness error” seems quite uninformative, because the two histograms mostly overlap § The feature “worst concave points” seems quite informative, because the histograms are quite disjoint § NOTE: • Histogram doesn’t show us anything about the interactions between variables and how these relate to the classes • Using PCA, we can capture the main interactions Principal Component Analysis (Cancer Dataset)
  • 47. Principal Component Analysis Principal Component Analysis (Cancer Dataset) • Learning the PCA transformation and applying it is as simple as applying a preprocessing transformation • We instantiate the PCA object, find the principal components by calling the fit method • Then apply the rotation and dimensionality reduction by calling transform
  • 48. Principal Component Analysis Principal Component Analysis (Cancer Dataset) • PCA only rotates and shifts the data, but keeps all the principal components • To reduce the dimensionality of the data, we need to specify how many components we want to keep when creating a PCA Object
  • 51. Principal Component Analysis Principal Component Analysis § Note: • PCA is an unsupervised method • does not use any class information when finding the rotation • A linear classifier (that would learn a line in this space) could do a reasonably good job at distinguishing the two classes.
  • 52. Principal Component Analysis Principal Component Analysis DRAWBACKS: § A downside of PCA is that the two axes in the plot are often not very easy to interpret § The principal components correspond to directions in the original data § The Principal components are combinations of the original features. Hence, these combinations are usually very complex
  • 53. Principal Component Analysis § The principal components themselves are stored in the components_ attribute § Rows in components_ corresponds to one principal component § they are sorted by their importance § The columns correspond to the original features attribute of the PCA Principal Component Analysis
  • 57. Eigenfaces for FeatureExtraction Eigenfaces for feature extraction § Another application of PCA that we mentioned earlier is feature extraction § The idea behind feature extraction is that it is possible to find a representation of your data that is better suited to analysis than the raw representation you were given § An application where feature extraction is helpful is with images § Images are made up of pixels, usually stored as red, green, and blue (RGB) intensities § Objects in images are usually made up of thousands of pixels, and only together are they meaningful
  • 58. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction EXAMPLE: LFW (Labeled Faces in the Wild) dataset § This dataset contains face images of celebrities downloaded from the Internet § It includes faces of politicians, singers, actors, and athletes from the early 2000s § We use grayscale versions of these images, and scale them down for faster processing
  • 60. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction • There are 3,023 images, each 87×65 pixels large, belonging to 62 different people • The dataset is a bit skewed, containing a lot of images of George W. Bush and Colin Powell
  • 62. Eigenfacesfor FeatureExtraction EXAMPLE: Eigenfaces for feature extraction • To make the data less skewed, we will only take up to 50 images of each person
  • 63. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction • Face Recognition - • Ask if a previously unseen face belongs to a known person from a database • An eigenfac is the name given to a set of eigenvectors when used in the computer vision problem of human face recognition • The eigenface approach searches for a low-dimensional representation of face images • Applications of Face Recognition: § Photo collection § Social media § Security applications § Solution for Face Recognition: § To build a classifier § where each person is a separate class § Usually many different people in face databases, and very few images of the same person § That makes it hard to train most classifiers § Simple solution is to use a one-nearest-neighbor classifier § looks for the most similar face image to the face you are classifying § This classifier could in principle work with only a single training example per class
  • 64. Eigenfaces for FeatureExtraction Eigenfaces for feature extraction • We obtain an accuracy of 26.6%, which is not actually that bad for a 62-class classification problem • Random guessing would give you around 1/62 = 1.6% accuracy • But here, we only correctly identify a person every fourth time
  • 65. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § Use of PCA - § Computing distances in the original pixel space is Quite a bad way to measure similarity between faces § When using a pixel representation to compare two images, we compare the grayscale value of each individual pixel to the value of the pixel in the corresponding position in the other image § This representation is quite different from how humans would interpret the image of a face and it is hard to capture the facial features using this raw representation § Using pixel distances means that shifting a face by one pixel to the right corresponds to a drastic change, with a completely different representation § Using distances along principal components can improve our accuracy § Whitening option of PCA: § Whitening = Rotation + Rescaling § Rescales the principal components to have the same scale § Whitening corresponds to not only rotating the data, but also rescaling it so that the center panel is a circle instead of an ellipse
  • 67. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § We fit the PCA object to the training data and extract the first 100 principal components. § The new data has 100 features, the first 100 principal components.
  • 68. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § Use the new representation to classify our images using a one-nearest-neighbors classifier § Accuracy improved quite significantly, from 26.6% to 35.7% § For Image data, Components correspond to directions in the input space § The input space here is 87×65-pixel grayscale images, so directions within this space are also 87×65-pixel grayscale images.
  • 71. Eigenfacesfor FeatureExtraction § We cannot understand all aspects of these components in the images § First Component - § seems to mostly encode the contrast between the face and the background. § Second Component - § encodes differences in lighting between the right and the left half of the face, and so on
  • 72. Eigenfacesfor FeatureExtraction • As the PCA model is based on pixels • the alignment of the face and the lighting both have a strong influence on how similar two images are in their pixel representation • These properties i.e., Alignment and lighting are probably not what a human would perceive first • When asking people to rate similarity of faces, they are more likely to use attributes like age, gender, facial expression, and hair style, which are attributes that are hard to infer from the pixel intensities • Algorithms often interpret data (particularly visual data, such as images, which humans are very familiar with) quite differently from how a human would
  • 73. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § PCA transformation = rotating the data + dropping the components with low variance § Another Trick for PCA Transformation - Express the test points as a weighted sum of the principal components § Try to find some numbers (Coefficients x0 , x1 , etc.,) (the new feature values after the PCA rotation) and express the test points as a weighted sum of the principal components § the reconstructions of the original data using only some components § A similar transformation for the faces by reducing the data to only some principal components and then rotating back into the original space.
  • 74. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § Reconstruction of the original data - § The reconstructions of the original data using only some components (Example Fig 3.3) § Reducing the data to only some principal components and then rotating back into the original space § This return of the original feature space can be done using the inverse_transform method
  • 76. Eigenfacesfor featureextraction • Reconstructing three face images using increasing numbers of principal components • First 10 principal components - • only the essence of the picture, like the face orientation and lighting, is captured • Using more and more principal components, more and more details in the image are preserved • Using as many components as there are pixels would mean that we would not discard any information after the rotation, and we would reconstruct the image perfectly
  • 78. Non-Negative Matrix Factorization § Non-negative matrix factorization is another unsupervised learning algorithm that aims to extract useful features § It works similarly to PCA § It can also be used for dimensionality reduction § Write each data point as a weighted sum of some components § PCA - § wants components that were orthogonal and that explained as much variance of the data as possible § NMF - § wants the components and the coefficients to be non- negative § i.e., both the components and the coefficients to be greater than or equal to zero § This method can only be applied to data where each feature is non-negative --> as a non-negative sum of non- negative components cannot become negative
  • 79. Non-Negative Matrix Factorization § Process of decomposing data into a non-negative weighted sum is particularly helpful for - data that is created as the addition (or overlay) of several independent sources § audio track of multiple people speaking § music with many instruments § In these situations, NMF can identify the original components that make up the combined data § NMF leads to more interpretable components than PCA § as negative components and coefficients can lead to hard-to- interpret cancellation effects
  • 80. Non-Negative Matrix Factorization Applying NMF to synthetic data § we need to ensure that our data is positive for NMF to be able to operate on the data § Data lies relative to the origin (0, 0) actually matters for NMF.
  • 81. Non-Negative Matrix Factorization Applying NMF to synthetic data § Left Component § It is clear that all points in the data can be written as a positive combination of the two components. § If there are enough components to perfectly reconstruct the data (as many components as there are features) § The algorithm will choose directions that point toward the extremes of the data. § Right Component § NMF creates a component that points toward the mean, as pointing there best explains the data § reducing the number of components not only removes some directions, but creates an entirely different set of components
  • 82. Non-Negative Matrix Factorization Applying NMF to synthetic data § NMF are § not ordered in any specific way § all components play an equal part § Randomness: § NMF uses a random initialization, which might lead to different results depending on the random seed § data with two components, where all the data can be explained perfectly, the randomness has little effect § In more complex situations, there might be more drastic changes
  • 83. Non-Negative Matrix Factorization Applying NMF to face images § LFW dataset § Main parameter of NMF is how many components we want to extract § Usually this is lower than the number of input features § Number of components impacts how well the data can be reconstructed using NMF
  • 86. Non-Negative Matrix Factorization Applying NMF to face images • The quality of the back-transformed data is similar to when using PCA, but slightly worse • PCA - • finds the optimum directions in terms of reconstruction • NMF - • is usually not used for its ability to reconstruct or encode data, but rather for finding interesting patterns within the data • These components are all positive, and so resemble prototypes of faces much more so than the components shown for PCA • Component 3 - • shows a face rotated somewhat to the right • Component 7 - • shows a face somewhat rotated to the left.
  • 89. Non-Negative Matrix Factorization Applying NMF to face images • Faces that have a high coefficient for component 3 are faces looking to the right (Figure 3-16) • Faces with a high coefficient for component 7 are looking to the left (Figure 3-17) • Extracting patterns like these works best for data with additive structure • audio • gene expression • text data
  • 91. Non-Negative Matrix Factorization Applying NMF to face images We can use NMF to recover the three signals
  • 94. Manifold Learning with t-SNE § Advantages of PCA; § PCA is often a good first approach for transforming the data so that we might be able to visualize it using a scatter plot § Disadvantages of PCA: § The nature of the method (applying a rotation and then dropping directions) limits its usefulness
  • 95. Manifold Learning with t-SNE § Manifold learning - § reduces the dimensinality of high-dimensional data by assuming that the data is embedded in a lower dimentional nonlinear manifold § Class of algorithms for visualization called manifold learning algorithms § allow for much more complex mappings § often provide better visualizations § many algorithms exist - § LLE (Locally Linear Embedding), Isomap, SE(Spectral Embedding), t-SNE (T-distributed Stochastic Neighbor Embedding § useful one is the t-SNE algorithm § Manifold learning algorithms are mainly aimed at visualization § rarely used to generate more than two new features § t-SNE compute a new representation of the training data, but don’t allow transformations of new data § these algorithms cannot be applied to a test set § Manifold learning can be useful for exploratory data analysis
  • 96. Manifold Learning with t-SNE § Idea behind t-SNE - § Machine learning algorithm that is used to visualize high dimensional data in two or three dimensions § embeds high dimensional points into lower dimensions § find a two-dimensional representation of the data that preserves the distances between points as best as possible § t-SNE § Starts with a random two dimensional representation for each data point § Tries to make points that are close in the original feature space closer § Tries data points that are far apart in the original feature space farther apart
  • 97. Manifold Learning with t-SNE § t-SNE § puts more emphasis on points that are close by rather than preserving distances between far-apart points § i.e., it tries to preserve the information indicating which points are neighbors to each other § EXAMPLE: § Handwritten digits § data point in this dataset is an 8×8 grayscale image of a handwritten digit between 0 and 9.
  • 98. Manifold Learning with t-SNE Applying PCA to HandWritten Digits § we actually used the true digit classes as characters, to show which class is where. § The digits zero, six, and four are relatively well separated using the first two principal components, though they still overlap. § Most of the other digits overlap significantly.
  • 99. Manifold Learning with t-SNE Applying PCA to HandWritten Digits § PCA to visualize the data reduced to two dimensions. § We plot the first two principal components, and represent each sample with a digit corresponding to its class.
  • 100. Manifold Learning with t-SNE Applying NMF to HandWritten Digits § we actually used the true digit classes as glyphs, to show which class is where. § The digits zero, six, and four are relatively well separated using the first two principal components, though they still overlap § Most of the other digits overlap significantly
  • 101. Manifold Learning with t-SNE Scatter Plot using PCA to HandWritten Digits
  • 102. Manifold Learning with t-SNE Applying t-SNE to HandWritten Digits § t-SNE does not support transforming new data, the TSNE class has no transform method § we can call the fit_transform method § build the model and immediately return the transformed data.
  • 104. Manifold Learning with t-SNE Applying t-SNE to HandWritten Digits § The result of t-SNE is quite remarkable § All the classes are quite clearly separated § The ones and nines are somewhat split up, but most of the classes form a single dense group § This method has no knowledge of the class labels: it is completely unsupervised § It can find a representation of the data in two dimensions that clearly separates the classes, based solely on how close points are in the original space. § The t-SNE algorithm has some tuning parameters § though it often works well with the default settings. § perplexity - § controls the effective number of neighbors that each point considers during the dimensionality reduction process § early_exaggeration - § Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them § learning_rate § max_iter etc.,
  • 106. § Clustering is the task of partitioning the dataset into groups, called clusters § GOAL:The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different § Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to Clustering
  • 109. Clustering K-Means Clustering § k-means clustering is one of the simplest and most commonly used clustering algorithms § It tr ies to f ind cluster centers that are representative of certain regions of the data § The algorithm alternates between two steps: § Step 1: Assigning data point to cluster § Assigning each data point to the closest cluster center § Step 2: Recalculation of cluster center § Setting each cluster center as the mean of the data points that are assigned to it § The algorithm is finished when the assignment of instances to clusters no longer changes
  • 111. Clustering K-Means Clustering § Cluster centers are shown as triangles § Data points are shown as circles § Colors indicate cluster membership § Three clusters - so the algorithm was initialized by declaring three data points randomly as cluster centers (Initialization) § Then the iterative algorithm starts § First, each data point is assigned to the cluster center it is closest to. (Assignpoints1) § The cluster centers are updated to be the mean of the assigned points (Recompute Centers (1)) § Then the process is repeated two more times. After the third iteration, the assignment of points to cluster centers remained unchanged, so the algorithm stops.
  • 112. Clustering K-Means Clustering § Given new data points, k-means will assign each to the closest cluster center.
  • 114. K-Means Clustering § Each training data point in X is assigned a cluster label § Find these labels in the kmeans.labels_ attribute § we asked for three clusters, the clusters are numbered 0 to 2
  • 115. K-Means Clustering § We can also assign cluster labels to new points, using the predict method § Each new point is assigned to the closest cluster center when predicting, but the existing model is not changed § Running predict on the training set returns the same result as labels_.
  • 116. Clustering K-Means Clustering § Clustering is somewhat similar to Classification § The labels themselves have no a priori meaning § Example 1 - Two dimensional toy dataset § we should not assign any significance to the fact that one group was labeled 0 and another one was labeled 1. § Note 1 - § Running the algorithm again might result in a different numbering of clusters because of the random nature of the initialization § Note 2 - § The cluster centers are stored in the cluster_centers_ attribute § Example 2 - Clustering face images § It might be that the cluster 3 found by the algorithm contains only faces of Bela.You can only know that after you look at the pictures § The number 3 is arbitrary § The only information the algorithm gives you is that all faces labeled as 3 are similar
  • 118. K-Means Clustering § We can also use more or fewer cluster centers
  • 119. K-Means Clustering Failure cases of k-means § Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to recover them § Each cluster is defined solely by its center, which means that each cluster is a convex shape § k-means can only capture relatively simple shapes § k-means also assumes that all clusters have the same “diameter” - § It always draws the boundary between clusters to be exactly in the middle between the cluster centers
  • 121. K-Means Clustering § Three clusters - Cluster 0, cluster 1, cluster 2 § cluster 0 and cluster 1 have some points that are far away from all the other points in these clusters that “reach” toward the center
  • 122. K-Means Clustering § k-means also assumes that all directions are equally important for each cluster § The following plot (Figure 3-28) shows a two- dimensional dataset where there are three clearly separated parts in the data § However, these groups are stretched toward the diagonal § As k-means only considers the distance to the nearest cluster center, it can’t handle this kind of data
  • 125. K-Means Clustering • k-means also performs poorly if the clusters have more complex shapes, like the two_moons • Here, we would hope that the clustering algorithm can discover the two halfmoon shapes • However, this is not possible using the k-means algorithm
  • 126. K-Means Clustering Vector quantization § Even though k-means is a clustering algorithm, there are interesting parallels between k-means and the decomposition methods like PCA and NMF § PCA tries to find directions of maximum variance in the data § NMF tries to find additive components, which often correspond to “extremes” or “parts” of the data § Both methods tried to express the data points as a sum over some components § k-means, on the other hand, tries to represent each data point using a cluster center § In k-means each point being represented using only a single component, which is given by the cluster center
  • 127. K-Means Clustering Vector quantization § This view of k-means as a decomposition method, where each point is represented using a single component, is called vector quantization § Comparison of PCA, NMF, and k-means § showing the components extracted , as well as reconstructions of faces from the test set using 100 components § For k-means, the reconstruction is the closest cluster center found on the training set
  • 131. K-Means Clustering § One interesting advantage of k-means - § An interesting aspect of vector quantization using k- means is that we can use many more clusters than input dimensions to encode our data § Example: two_moons data § Using PCA or NMF, there is nothing much we can do to this data, as it lives in only two dimensions § Reducing it to one dimension with PCA or NMF would completely destroy the structure of the data § But we can find a more expressive representation with k-means, by using more cluster centers
  • 134. K-Means Clustering § We used 10 cluster centers - means each point is now assigned a number between 0 and 9 § We can see this as the data being represented using 10 components (that is, we have 10 new features) § Using this 10-dimensional representation, it would now be possible to separate the two half-moon shapes using a linear model, which would not have been possible using the original two features
  • 136. K-Means Clustering Advantages § k-means is a very popular algorithm for clustering § Relatively easy to understand and implement § It runs relatively quickly § The k-means clustering algorithm is guaranteed to give results (Convergence) § It is not specific to particular problems. (i.e., can be applied for numerical data to text) (Generalization) § k-means scales easily to large datasets § NOTE: § MiniBatchKMeans class - can handle very large datasets
  • 137. K-Means Clustering Disadvantages § It relies on a random initialization, which means the outcome of the algorithm depends on a random seed § Deciding on the number of clusters to start is difficult (can use elbow method) § Choice of initial centroids is difficult § Effect of outliers § Curse of dimentionality § Preprocessing is mandatory § Restrictive assumptions made on the shape of clusters § The requirement to specify the number of clusters you are looking for (which might not be known in a real-world application)
  • 139. Agglomerative Clustering § Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same principles: § The algorithm starts by declaring each point its own cluster § then merges the two most similar clusters until some stopping criterion is satisfied § The stopping criterion - § Number of clusters § so similar clusters are merged until only the specified number of clusters are left § Most similar cluster is identified by considering several linkage criteria § This measure is always defined between two existing clusters
  • 140. Agglomerative Clustering § The following three choices (Linkages) are implemented in scikit-learn: § Ward § The default choice § Ward picks the two clusters to merge such that the variance within all clusters increases the least § This often leads to clusters that are relatively equally sized § Average § Merges the two clusters that have the smallest average distance between all their points § Complete § Also known as maximum linkage § Merges the two clusters that have the smallest maximum distance between their points § Note 1- § Ward works on most datasets § Note 2 - § If the clusters have very dissimilar numbers of members, average or complete might work better
  • 141. Agglomerative Clustering § This plot illustrates the progression of agglomerative clustering on a two- dimensional dataset, looking for three clusters.
  • 142. Agglomerative Clustering § Initially, each point is its own cluster § Then, in each step, the two clusters that are closest are merged § In the first four steps, two single-point clusters are picked and these are joined into two-point clusters § In step 5, one of the two-point clusters is extended to a third point, and so on § In step 9, there are only three clusters remaining § As we specified that we are looking for three clusters, the algorithm then stops
  • 143. Agglomerative Clustering § Because of the way the algorithm works, agglomerative cluster ing cannot make predictions for new data points § AgglomerativeClustering has no predict method § To bu i l d t h e m o d e l a n d ge t t h e c l u s t e r memberships on the training set, use the fit_predict method instead
  • 145. Agglomerative Clustering § While the scikit-learn implementation of agglomerative clustering requires to specify the number of clusters § Agglomerative clustering methods provide some help with choosing the right number of clusters
  • 146. Agglomerative Clustering Hierarchical clustering and dendrograms § Agglomerative clustering produces what is known as a hierarchical clustering § The clustering proceeds iteratively § Every point makes a journey from being a single point cluster to belonging to some final cluster § Each intermediate step provides a clustering of the data (with a different number of clusters)
  • 147. Agglomerative Clustering The following figure shows an overlay of all the possible clusterings shown in Figure 3-33, providing some insight into how each cluster breaks up into smaller clusters.
  • 148. Agglomerative Clustering Hierarchical Clustering and Dendograms § Hierarchical clustering relies on the two- dimensional nature of the data § Hierarchical clustering cannot be used on datasets that have more than two features § Dendograms - § Another tool to visualize hierarchical clustering, called a dendrogram § can handle multidimensional datasets
  • 149. Agglomerative Clustering Hierarchical Clustering and Dendograms § The dendrogram is a tree-like structure that is mainly used to store each step § scikit-learn currently does not have the functionality to draw dendrograms § Dendograms can be generated easily using SciPy
  • 150. Agglomerative Clustering SciPy vs scikitlearn § SciPy clustering algorithms have a slightly different interface to the scikit- learn clustering algorithms § SciPy provides a function that § Takes a data array X § Computes a linkage array, which encodes hierarchical cluster similarities
  • 151. Agglomerative Clustering § We can then feed this linkage array into the scipy dendrogram function to plot the dendrogram
  • 153. Agglomerative Clustering § The dendrogram § shows data points as points on the bottom (i.e.,X- axis) (numbered from 0 to 11) § shows Cluster Distance on Y- axis § Then, a tree is plotted with these points (representing single-point clusters) as the leaves, and a new node parent is added for each two clusters that are joined § Reading from bottom to top, the data points 1 and 4 are joined first (as you could see in Figure 3-33). § Next, points 6 and 9 are joined into a cluster, and so on. At the top level, there are two branches, one consisting of points 11, 0, 5, 10, 7, 6, and 9, and the other consisting of points 1, 4, 3, 2, and 8. § These correspond to the two largest clusters
  • 154. Agglomerative Clustering § The y-axis in the dendrogram § specifies when two clusters get merged? § The length of each branch also shows how far apart the merged clusters are § The longest branches in this dendrogram are the three lines that are marked by the dashed line labeled “three clusters” § Going from three to two clusters meant merging some very far-apart points § At the top of the chart, where merging the two remaining clusters into a single cluster again bridges a relatively large distance
  • 155. Agglomerative Clustering Drawbacks of Agglomerative Clustering § Fails at separating complex shapes (Example: two_moons )
  • 157. DBSCAN DBSCAN § DBSCAN - Density-Based Spatial Clustering of Applications with Noise § Another very useful clustering algorithm § Benefits: § It does not require the user to set the number of clusters a priori § It can capture clusters of complex shapes § It can identify points that are not part of any cluster § Drawbacks: § Somewhat slower than agglomerative clustering and k- means, but still scales to relatively large datasets
  • 158. DBSCAN DBSCAN § Functionality: § DBSCAN works by identifying points that are in “crowded” regions of the feature space, where many data points are close together § These regions are referred to as dense regions in feature space § The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty
  • 159. DBSCAN DBSCAN § Core Samples: § Points that are within a dense region are called core samples § Also called as core points § Parameters to identify core samples: § min_samples § eps § If there are at least min_samples many data points within a distance of eps to a given data point, that data point is classified as a core sample § Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN
  • 160. DBSCAN DBSCAN Algorithm § Step 1: Picks an arbitrary point to start with § Step 2: Finds all points with distance eps or less from that point § Step 3: If there are less than min_samples points within distance eps of the starting point - this point is labeled as noise (i.e., it doesn’t belong to any cluster) § Step 4: If there are more than min_samples points within a distance of eps, the point is labeled a core sample - assigned a new cluster label § Step 5: All neighbors (within eps) of the point are visited § Step 5.1: If they have not been assigned a cluster yet, they are assigned the new cluster label that was just created § Step 5.2: If they are core samples, their neighbors are visited in turn, and so on. § Step 5.3: The cluster grows until there are no more core samples within distance eps of the cluster
  • 161. DBSCAN § Step 6: Another point that hasn’t yet been visited is picked, and the same procedure is repeated § Finally we end up with three kinds of points § Core points § Boundary Points - Points that are within distance eps of core points (called boundary points) § Noise - Points that do not belong to any cluster §
  • 162. DBSCAN § Note 1: § When the DBSCAN algorithm is run on a particular dataset multiple times, there will not be any change in Core points and Noise § (i.e., the clustering of the core points is always the same, and the same points will always be labeled as noise) § Note 2: § When the DBSCAN algorithm is run on a par ticular dataset multiple times, the boundary points may change § i.e., A boundary point might be neighbor to core samples of more than one cluster. § Note 3: § The cluster membership - of boundary points depends on the order in which points are visited
  • 163. DBSCAN DBSCAN on the synthetic dataset § DBSCAN does not allow predictions on new test data, so we will use the fit_predict method to perform clustering and return the cluster labels in one step
  • 164. DBSCAN
  • 165. DBSCAN § Points that belong to clusters are solid § Noise points are shown in white § Core samples are shown as large markers § Boundary points are displayed as smaller markers § Increasing eps (going from left to right in the figure) § means that more points will be included in a cluster § This makes clusters grow, but might also lead to multiple clusters joining into one § Increasing min_samples (going from top to bottom in the figure) § means that fewer points will be core points, and more points will be labeled as noise
  • 166. DBSCAN § Parameter eps: § most important parameter § it determines what it means for points to be “close” § Very small eps - § means - NO points are core samples § Leads to - all points being labeled as noise § Very large eps - § All points forming a single cluster § Parameter min_samples: § The min_samples - mostly determines whether points in less dense regions will be labeled as outliers or as their own clusters § Large min_samples - many samples will now be labeled as noise § determines the minimum cluster size
  • 167. DBSCAN § Note 1: § While DBSCAN doesn’t require setting the number of clusters explicitly, setting eps implicitly controls how many clusters will be found § Note 2: § Finding a good setting for eps is sometimes easier after scaling the data using StandardScaler or MinMaxScaler
  • 168. DBSCAN DBSCAN on the two_moons dataset § The algorithm actually finds the two half-circles and separates them using the default settings.
  • 169. DBSCAN
  • 170. DBSCAN § The algorithm produces the desired number of clusters (two) § The default parameter settings (eps=0.5) therefore seem to work well § If we decrease eps to 0.2 we get eight clusters § Increasing eps to 0.7 results in a single cluster § When using DBSCAN, you need to be careful about handling the returned cluster assignments - the noise label –1 may cause unexpected effects if the labels are used to index another array
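  A sketch of the two_moons experiment described above; rescaling the data first mirrors the earlier note that a good eps is easier to find after scaling:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler

    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance

    clusters = DBSCAN().fit_predict(X_scaled)      # default eps=0.5 finds both half-moons
    # DBSCAN(eps=0.2) splits the data into many small clusters, while
    # DBSCAN(eps=0.7) merges everything into a single cluster.
    print("Cluster memberships:", clusters)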
  • 171. DBSCAN Comparing and Evaluating Clustering Algorithms § Challenges in applying clustering algorithms - § it is very hard to assess how well an algorithm worked § it is very hard to compare outcomes between different algorithms
  • 172. DBSCAN Evaluating clustering with ground truth § Metrics to assess the outcome of a clustering algorithm relative to a ground truth clustering § Adjusted Rand Index (ARI) § Normalized Mutual Information (NMI) § Both provide a quantitative measure with an optimum of 1 § Unrelated clusterings score 0 § The ARI can also become negative § Compare the k-means, agglomerative clustering, and DBSCAN algorithms using ARI
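  A hedged sketch of such a comparison on the scaled two_moons data from the previous sketch (an ARI of 1 means the clustering matches the ground truth exactly):

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.metrics.cluster import adjusted_rand_score

    # X_scaled and y come from the two_moons sketch above
    for algorithm in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
        clusters = algorithm.fit_predict(X_scaled)
        print("{}: ARI = {:.2f}".format(type(algorithm).__name__,
                                        adjusted_rand_score(y, clusters)))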
  • 173. DBSCAN
  • 174. DBSCAN Common mistake when evaluating clustering § Using accuracy_score instead of adjusted_rand_score or normalized_mutual_info_score § The problem: accuracy requires the assigned cluster labels to exactly match the ground truth, but the cluster label values themselves are arbitrary - all that matters is which points end up in the same cluster
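  A small sketch of why accuracy is misleading here: the two label vectors below describe exactly the same grouping, only with the label names swapped:

    from sklearn.metrics import accuracy_score
    from sklearn.metrics.cluster import adjusted_rand_score

    clusters1 = [0, 0, 1, 1, 0]
    clusters2 = [1, 1, 0, 0, 1]            # same grouping, swapped label names

    print("Accuracy:", accuracy_score(clusters1, clusters2))   # 0.0 - misleading
    print("ARI:", adjusted_rand_score(clusters1, clusters2))   # 1.0 - identical clusterings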
  • 175. DBSCAN Evaluating clustering without ground truth § In practice, there is a big problem with using measures like ARI § In clustering applications there is usually no ground truth to which to compare the results § Metrics like ARI and NMI therefore § only help when developing algorithms § NOT when assessing success in an application § Silhouette coefficient - § another metric for clustering § does not require ground truth § computes the compactness of a cluster, where higher is better § Note 1: § compactness doesn’t allow for complex shapes § Note 2: § such metrics often don’t work well in practice
  • 176. DBSCAN Comparison using the silhouette score
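  A sketch of the comparison: silhouette_score needs the data itself in addition to the labels (and it raises an error if an algorithm produces only a single cluster):

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.metrics.cluster import silhouette_score

    # X_scaled comes from the two_moons sketch above
    for algorithm in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
        clusters = algorithm.fit_predict(X_scaled)
        print("{}: silhouette = {:.2f}".format(type(algorithm).__name__,
                                               silhouette_score(X_scaled, clusters)))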
  • 177. DBSCAN Observations § k-means gets the highest silhouette score, even though we might prefer the result produced by DBSCAN § Better strategy for evaluating clusters: § use robustness-based clustering metrics § These run an algorithm § after adding some noise to the data, or § using different parameter settings § and then compare the outcomes § If many perturbations and parameter settings return the same result, it is likely to be trustworthy
  • 178. DBSCAN Face Images Example § Note: § Even with a very high silhouette score § we still don’t know if there is any semantic meaning in the clustering § or whether the clustering reflects an aspect of the data that we are interested in § Face images example: § Goal: find groups of similar faces, say men and women, old people and young people, or people with beards and without § Target: § cluster the data into two clusters § Drawbacks: § we still don’t know if the clusters that are found correspond in any way to the concepts we are interested in § the clustering might instead separate side views from front views, pictures taken at night from pictures taken during the day, or pictures taken with iPhones from pictures taken with Android phones § The only way to know whether the clustering corresponds to anything we are interested in is to analyze the clusters manually
  • 179. DBSCAN Comparing algorithms on the faces dataset § Use the eigenface representation of the data, as produced by PCA(whiten=True), with 100 components § This gives a more semantic representation of the face images than the raw pixels § It also makes computation faster
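  A sketch of this preprocessing step, assuming the Labeled Faces in the Wild data (fetch_lfw_people) used earlier in the book; the subsampling to at most 50 images per person follows that example:

    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA

    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

    # keep at most 50 images of each person so frequent people don't dominate
    mask = np.zeros(people.target.shape, dtype=bool)
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    X_people = people.data[mask] / 255.        # scale pixel values to [0, 1]

    # 100-component eigenface representation with whitening
    pca = PCA(n_components=100, whiten=True, random_state=0)
    X_pca = pca.fit_transform(X_people)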
  • 180. DBSCAN Analyzing the faces dataset with DBSCAN § With default parameters, all the returned labels are –1 § i.e., all of the data was labeled as “noise” by DBSCAN § Possible fixes: § increase eps § to expand the neighborhood of each point § lower min_samples § to consider smaller groups of points as clusters
  • 181. DBSCAN § Lowering min_samples alone § Result: § everything is still labeled as noise
  • 182. DBSCAN § Increasing eps (to 15) § Result: § only one cluster (label 0) is formed, along with noise (–1), as sketched below
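  A sketch of this parameter exploration; X_pca comes from the PCA sketch above, and the values min_samples=3 and eps=15 follow the book's example (eps=15 is also quoted later in these slides):

    import numpy as np
    from sklearn.cluster import DBSCAN

    labels = DBSCAN().fit_predict(X_pca)                       # defaults
    print(np.unique(labels))                                   # [-1]: everything is noise

    labels = DBSCAN(min_samples=3).fit_predict(X_pca)          # lower min_samples alone
    print(np.unique(labels))                                   # still [-1]

    labels = DBSCAN(min_samples=3, eps=15).fit_predict(X_pca)
    print(np.unique(labels))                                   # [-1  0]: one cluster plus noise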
  • 183. DBSCAN § We can use this result to find out what the “noise” looks like compared to the rest of the data § 27 points are noise and 2,036 points are inside the cluster
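  Counting cluster sizes with a small sketch; the labels are shifted by one so that the noise label –1 can be counted with bincount:

    import numpy as np

    # bincount cannot handle negative values, so add 1 to the labels;
    # the first entry of the result then counts the noise points
    print("Points per label (noise first):", np.bincount(labels + 1))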
  • 185. DBSCAN § Why are these points considered noise? § the fifth image in the first row shows a person drinking from a glass § several images show people wearing hats § the last image shows a hand in front of the person’s face § the other images contain odd angles, or crops that are too close or too wide § We can do little about people in photos who are wearing hats, drinking, or holding something in front of their faces § Outlier detection: § this kind of analysis, trying to find “the odd one out”, is called outlier detection § Possible improvement: § do a better job of cropping the images
  • 186. DBSCAN § For more clusters: § we need to set eps to something smaller, somewhere between 15 and 0.5 (the default)
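  A sketch of scanning intermediate eps values to see how the number and size of the clusters change (X_pca and min_samples=3 as above; the particular grid of values is illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    for eps in [1, 3, 5, 7, 9, 11, 13]:
        labels = DBSCAN(min_samples=3, eps=eps).fit_predict(X_pca)
        n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
        print("eps={}: {} clusters, sizes {}".format(eps, n_clusters, np.bincount(labels + 1)))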
  • 187. DBSCAN
  • 188. DBSCAN
  • 189. DBSCAN Analyzing the faces dataset § Some of the clusters correspond to people with very distinct faces (within this dataset), such as Sharon or Koizumi § Within each cluster, the orientation of the face and the facial expression are also quite fixed § Some of the clusters contain faces of multiple people, but these share a similar orientation and expression § Note: § we are doing a manual analysis here § this is different from supervised learning, where we can rely on the R2 score or accuracy
  • 190. DBSCAN Analyzing the faces dataset with k-means § Disadvantage of DBSCAN on the faces dataset - § it is not possible to create more than one big cluster § Pros and cons of agglomerative clustering and k-means - § Pros - § can create more evenly sized clusters § Cons - § need to set a target number of clusters a priori § even setting the number of clusters to the number of people in the dataset is unlikely to recover all of them correctly § Solution - § start with a low number of clusters (e.g., 10) and analyze each of the clusters manually § increase the number of clusters if necessary
  • 191. DBSCAN § k-means - § partitioned the data into relatively similarly sized clusters, ranging from 64 to 386 points § This is quite different from the result of DBSCAN
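  A sketch of this k-means run (n_clusters=10 as suggested above, X_pca from the PCA sketch):

    import numpy as np
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=10, random_state=0)
    labels_km = km.fit_predict(X_pca)
    print("Cluster sizes k-means:", np.bincount(labels_km))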
  • 192. DBSCAN § Visualization of the outcome of k-means § As we clustered in the representation produced by PCA, we need to rotate the cluster centers back into the original space to visualize them, using pca.inverse_transform.
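  Roughly as follows (a sketch; the plotting details are illustrative, and image_shape is assumed to be the shape of one face image, e.g. people.images[0].shape):

    import matplotlib.pyplot as plt

    # cluster centers live in the 100-dimensional PCA space;
    # inverse_transform maps them back into pixel space
    centers_pixels = pca.inverse_transform(km.cluster_centers_)

    image_shape = people.images[0].shape
    fig, axes = plt.subplots(2, 5, figsize=(12, 4),
                             subplot_kw={'xticks': (), 'yticks': ()})
    for center, ax in zip(centers_pixels, axes.ravel()):
        ax.imshow(center.reshape(image_shape), vmin=0, vmax=1)
    plt.show()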
  • 193. DBSCAN § The cluster centers found by k-means are very smooth versions of faces § Each center is an average of 64 to 386 face images § The clustering seems to pick up on § different orientations of the face § different expressions (the third cluster center seems to show a smiling face) § the presence of shirt collars (see the second-to-last cluster center).
  • 194. DBSCAN § A more detailed view - § Figure 3-44 shows, for each cluster center, § the five most typical images in the cluster - § the images assigned to the cluster that are closest to the cluster center § the five most atypical images in the cluster - § the images assigned to the cluster that are furthest from the cluster center
  • 195. DBSCAN
  • 196. DBSCAN § The third cluster picks up smiling faces § The other clusters mostly reflect face orientation § Atypical points - § are not very similar to the cluster centers § their assignment seems somewhat arbitrary § k-means partitions all the data points and has no concept of “noise” points § With a larger number of clusters, the algorithm could find finer distinctions § Note: § adding more clusters makes manual inspection even harder
  • 197. DBSCAN Analyzing the faces dataset with agglomerative clustering § Agglomerative clustering also produces § relatively equally sized clusters § with cluster sizes between 26 and 623 § More uneven than those produced by k-means § Much more even than the ones produced by DBSCAN
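  A sketch of the corresponding agglomerative run (ward linkage is scikit-learn's default):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    agglomerative = AgglomerativeClustering(n_clusters=10)
    labels_agg = agglomerative.fit_predict(X_pca)
    print("Cluster sizes agglomerative:", np.bincount(labels_agg))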
  • 198. DBSCAN § Compute the ARI - § to measure the similarity of the two partitions produced by agglomerative clustering and k-means § ARI = 0.13 § means that the two clusterings labels_agg and labels_km have little in common
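  A one-line sketch, using the label vectors from the two runs above:

    from sklearn.metrics.cluster import adjusted_rand_score

    print("ARI: {:.2f}".format(adjusted_rand_score(labels_agg, labels_km)))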
  • 199. DBSCAN § Dendrogram - § We’ll limit the depth of the tree in the plot, as branching down to the individual 2,063 data points would result in an unreadably dense plot.
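  A sketch of the truncated dendrogram, using SciPy's ward linkage and limiting the plotted depth with truncate_mode:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, ward

    linkage_array = ward(X_pca)            # hierarchical (ward) linkage on the PCA features
    plt.figure(figsize=(20, 5))
    # show only the top levels of the tree instead of all individual leaves
    dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
    plt.xlabel("Sample index")
    plt.ylabel("Cluster distance")
    plt.show()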
  • 200. DBSCAN Agglomerative with ten clusters § 10 clusters (Figure 3-46) § There is no notion of cluster center in agglomerative clustering, so some arbitrary data points from each cluster are shown instead § The number of points in each cluster is shown to the left of the first image
  • 201. DBSCAN
  • 202. DBSCAN § While some of the clusters seem to have a semantic theme, many of them are too large to be actually homogeneous § To get more homogeneous clusters - run the algorithm again, this time with 40 clusters
  • 203. DBSCAN
  • 204. DBSCAN § Some of the 40 clusters found by agglomerative clustering (Figure 3-47) pick up themes such as: § dark skinned and smiling § collared shirt § smiling woman § Hussein § high forehead § We could also find these highly similar clusters using the dendrogram
  • 205. DBSCAN Summary of Clustering Methods § Applying and evaluating clustering is a highly qualitative procedure § Most helpful in the exploratory phase of data analysis § Three clustering algorithms: § k-means § DBSCAN § Agglomerative clustering § All three have a way of controlling the granularity of clustering § k-means and agglomerative clustering allow us to specify the number of desired clusters § DBSCAN lets us define proximity using the eps parameter, which indirectly influences cluster size § All three methods § can be used on large, real-world datasets § are relatively easy to understand § allow for clustering into many clusters
  • 206. DBSCAN Summary of Clustering Methods § Strengths - § k-means - § allows for a characterization of the clusters using the cluster means § can also be viewed as a decomposition method, where each data point is represented by its cluster center § DBSCAN - § allows for the detection of “noise points” (i.e., data points that are not assigned to any cluster) § can help automatically determine the number of clusters § allows for complex cluster shapes § sometimes produces clusters of very different sizes, which can be a strength or a weakness § Agglomerative clustering - § provides a whole hierarchy of possible partitions of the data § easily inspected via dendrograms