Machine
Learning
Source: Introduction to Machine Learning with Python
Authors: Andreas C. Muller and Sarah Guido
Unit - III
Unsupervised Learning
Agenda
Introduction
Types of Unsupervised
Learning
Challenges in Unsupervised
Learning
Preprocessing and Scaling
Clustering
Introduction
Introduction to Unsupervised
Learning
Introduction
§ Unsupervised learning includes all kinds of machine
learning where there is no known output
§ No teacher to instruct the learning algorithm
§ The learning algorithm is just shown the input data and
asked to extract knowledge from this data
Types of
Unsupervised
learning
Types of Unsupervised learning
§ Two kinds of Unsupervised learning
§ Transformations of the dataset
§ Clustering
Types of
Unsupervised
learning
Unsupervised transformations of a dataset
§ Algorithms that create a new representation of the
data which might be easier for humans or other
machine learning algorithms to understand
compared to the original representation of the data.
Types of
Unsupervised
learning
§ Application of unsupervised transformations is
dimensionality reduction
§ which takes a high-dimensional representation of the
data, consisting of many features, and finds a new way
to represent this data that summarizes the essential
characteristics with fewer features.
Example:
§ Application for dimensionality reduction is reduction
to two dimensions for visualization purposes
Types of
Unsupervised
learning
Unsupervised transformations of a dataset
§ Another application for unsupervised transformations is
finding the parts or components that “make up” the
data
Example: Topic Extraction
§ The task is
§ to find the unknown topics that are talked about in each
document
§ to learn what topics appear in each document
§ tracking the discussion of themes like elections, gun
control, or pop stars on social media
Types of
Unsupervised
learning
Clustering Algorithms
§ Partition data into distinct groups of similar items
EXAMPLE:
§ Uploading photos to a social media site
Challenges in
Unsupervised
Learning
• Evaluating whether the algorithm learned something
useful
• Unsupervised ML algorithms are applied to data that does
not contain any label information ---> we don’t know
what the right output should be
• Very hard to say whether a model “did well”
• There is no way for us to tell the algorithm what we are
looking for and often the only way to evaluate the result of
an unsupervised algorithm is to inspect it manually
• Unsupervised algorithms are often used in an
exploratory setting --> when the data scientist wants to
understand the data better, rather than as part of a larger
automatic system
• Common application for unsupervised algorithms is as a
preprocessing step for supervised algorithms
Preprocessing
and Scaling
Different Kinds of
Preprocessing
Applying Data
Transformations
Scaling Training and Test
Data the Same Way
The Effect of Preprocessing
on Supervised Learning
Preprocessing
and Scaling
Dimensionality Reduction,
Feature Extraction and
Manifold Learning
Non-Negative Matrix
Factorization
Manifold Learning with
t-SNE
Preprocessing
and Scaling
§ Neural networks and SVMs, are very sensitive to the
scaling of the data
§ A common practice is to adjust the features so that the
data representation is more suitable for these
algorithms
Different
Kinds of
Preprocessing
Different
Kinds of
Preprocessing
§ StandardScaler
§ The StandardScaler in scikit-learn ensures that for each
feature the mean is 0 and the variance is 1, bringing all
features to the same magnitude.
§ Disadvantage:
§ This scaling does not ensure any particular
minimum and maximum values for the features
§ RobustScaler
§ It ensures statistical properties for each feature that
guarantee that they are on the same scale.
§ It uses the median and quartiles, instead of mean and
variance
§ Advantage:
§ RobustScaler ignores data points that are very different from
the rest (like measurement errors)
§ These odd data points are also called outliers, and can lead
to trouble for other scaling techniques
Different
Kinds of
Preprocessing
§ MinMaxScaler
§ It shifts the data such that all features are exactly
between 0 and 1
§ For a two-dimensional dataset this means all of the
data is contained within the rectangle created by
X-axis between 0 and 1 and the Y-axis between 0
and 1
§ Normalizer
§ Scales each data point such that the feature vector
has a Euclidean length of 1
§ It projects a data point on the circle (or sphere, in
the case of higher dimensions) with a radius of 1.
§ This normalization is often used when only the
direction of the data matters, not the length of the
feature vector.
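A minimal sketch (not from the slides) comparing the four scalers on a tiny made-up array; the values are chosen only to illustrate how each scaler reacts, including to an outlier in the second feature:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

# made-up data: four samples, two features; 100.0 acts like an outlier
X = np.array([[1.0, -10.0],
              [2.0,   0.0],
              [3.0,  10.0],
              [4.0, 100.0]])

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler(), Normalizer()):
    X_scaled = scaler.fit_transform(X)
    print(type(scaler).__name__)
    print(X_scaled.round(2))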
Applying Data
Transformations
Applying Data
Transformations
§ Transformations on Cancer dataset
§ Preprocessing methods like the scalers are usually applied
before applying a supervised machine learning algorithm
§ Example: Apply kernel SVM (SVC) to the cancer dataset,
and use MinMaxScaler for preprocessing the data
STEP 1: Loading the dataset and splitting it into train and test set
Applying Data
Transformations
STEP 2: Import the class and then instantiate it
STEP 3: Fit the scaler using the fit method, applied to the
training data
STEP 4: Apply the transformation
§ i.e., scale the training data — we use the transform
method of the scaler
§ The transform method is used in scikit-learn whenever a
model returns a new representation of the data.
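A hedged sketch of Steps 1-4 above, using scikit-learn's standard fit/transform interface (the random_state and the printed summaries are illustrative choices, not taken from the slides):

# STEP 1: load the cancer dataset and split it into train and test sets
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

# STEP 2: import the class and instantiate it
scaler = MinMaxScaler()

# STEP 3: fit the scaler on the training data (learns per-feature min and range)
scaler.fit(X_train)

# STEP 4: apply the transformation, i.e. actually rescale the training data
X_train_scaled = scaler.transform(X_train)
print("per-feature minimum after scaling:", X_train_scaled.min(axis=0)[:3])
print("per-feature maximum after scaling:", X_train_scaled.max(axis=0)[:3])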
Applying Data
Transformations
Applying Data
Transformations
STEP 5: Apply SVM to the scaled data
§ To apply the SVM to the scaled data, we also need to transform
the test set
Scaling
Training and
Test Data the
Same Way
• After scaling, the minimum and maximum are
not 0 and 1
• Some of the features are even outside the 0–1
• MinMaxScaler (and all the other scalers) always
applies exactly the same transformation to
the training and the test set
• i.e., the transform method always subtracts the
training set minimum and divides by the training set
range, which might be different from the minimum
and range for the test set
Scaling
Training and
Test Data the
Same Way
• It is important to apply exactly the same
transformation to the training set and the test set
for the supervised model to work on the test set
Scaling
Training and
Test Data the
Same Way
Scaling
Training and
Test Data the
Same Way
Scaling
Training and
Test Data the
Same Way
• First panel: Unscaled two-dimensional dataset
• The training set shown as circles and the test set shown as
triangles
• Second panel: Data is same but scaled using the
MinMaxScaler
• We called fit on the training set, and then called transform on
the training and test sets.
• The dataset in the second panel looks identical to the first; only
the ticks on the axes have changed.
• The features are between 0 and 1
• The minimum and maximum feature values for the test data (the
triangles) are not 0 and 1.
• Third panel: Scaling the training set and test set separately
• The minimum and maximum feature values for both the
training and the test set are 0 and 1
• The test points moved incongruously to the training set, as they
were scaled differently
• The arrangement of the data is changed in an arbitrary way
Scaling
Training and
Test Data the
Same Way
Note:
Shortcuts and efficient alternatives:
§ Often, you want to fit a model on some dataset, and then
transform it.
§ All models that have a transform method also have a
fit_transform method.
§ While fit_transform is not necessarily more efficient for all
models, it is still good practice to use this method when
trying to transform the training set
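A small sketch of the shortcut, assuming the same cancer data split as before; both calls produce the same scaled training set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

scaler = MinMaxScaler()
# calling fit and then transform ...
X_scaled = scaler.fit(X_train).transform(X_train)
# ... gives the same result as the fit_transform shortcut
X_scaled_shortcut = scaler.fit_transform(X_train)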
The Effect of
Preprocessing
on Supervised
Learning
The Effect of
Preprocessing
on Supervised
Learning
§ STEP 1: Before scaling
The Effect of
Preprocessing
on Supervised
Learning
§ STEP 2: After applying scaling
The Effect of
Preprocessing
on Supervised
Learning
§ The effect of scaling the data is quite significant
§ Scaling the data doesn’t involve any complicated
math, but don’t try to reimplement them
yourself
§ It is always a good practice to use the scaling
mechanisms provided by scikit-learn
The Effect of
Preprocessing
on Supervised
Learning
We can easily replace one preprocessing algorithm with
another by changing the class we use ---> because all the
preprocessing classes have the same interface
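A hedged sketch of Steps 1 and 2 above and of swapping scalers; SVC(C=100) and the random_state are illustrative choices, and the exact accuracies depend on the scikit-learn version:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# STEP 1: before scaling
svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test accuracy (unscaled): {:.2f}".format(svm.score(X_test, y_test)))

# STEP 2: after scaling -- fit the scaler on the training set only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train)
print("Test accuracy (MinMaxScaler): {:.2f}".format(svm.score(X_test_scaled, y_test)))

# Because all preprocessing classes share the same interface,
# swapping in another scaler is a one-line change
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm.fit(X_train_scaled, y_train)
print("Test accuracy (StandardScaler): {:.2f}".format(svm.score(X_test_scaled, y_test)))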
Dimensionality
Reduction, Feature
Extraction and
Manifold Learning
Principal Component
Analysis
Eigenfaces for feature
extraction
Dimensionality
Reduction,
Feature
Extraction and
Manifold
Learning
§ PCA is a statistical process
§ It converts correlated features into a set of linearly
uncorrelated features -> with the help of an
orthogonal transformation
§ PCA is used for exploratory data analysis and
predictive modeling
§ Applications of PCA:
§ Image processing
§ Movie recommendation system
§ Dimensionality reduction technique in various AI
applications such as computer vision, image
compression, etc
§ Finding hidden patterns when data has high dimensions
Dimensionality
Reduction,
Feature
Extraction and
Manifold
Learning
§ Transforming data using unsupervised learning can
have many motivations
§ Compressing data
§ Finding a representation that is more informative
for further processing (Feature Extraction)
§ Visualization
§ One of the simplest and most widely used algorithms
for all of these is Principal Component Analysis for
dimensionality reduction, feature extraction, feature
selection, data compression, and data visualization
§ Non-negative matrix factorization (NMF) - for
feature extraction
§ t-SNE - for visualization using two dimensional
scatter plots
Principal
Component
Analysis
Principal Component Analysis
§ It is a method that rotates the dataset in a way
such that
§ the rotated features are statistically
uncorrelated
§ This rotation is often followed by selecting
only a subset of the new features, according to
how important they are for explaining the data
Principal
Component
Analysis
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
§ First Plot:
§ The first plot (top left) shows the original data points colored to distinguish
among them
§ Step 1:
§ The algorithm proceeds by first finding the direction of maximum variance, labeled
“Component 1”.
§ This is the direction (or vector) in the data that contains most of the information
§ i.e., the direction along which the features are most correlated with each other
§ Step 2:
§ The algorithm finds the direction that contains the most information while being
orthogonal (at a right angle) to the first direction
§ Note:
§ In two dimensions, there is only one possible orientation that is at a right angle,
§ In higher-dimensional spaces there would be (infinitely) many orthogonal
directions
§ We could have drawn the first component from the center up to the top left
instead of down to the bottom right
§ Principal Components:
§ The directions found using this process are called principal components
§ They are the main directions of variance in the data
§ Note:
§ There are as many principal components as original features
Principal
Component
Analysis
Principal Component Analysis
§ The second plot
§ Step 3:
§ The mean was subtracted from the data, so that the
transformed data is centered around zero
§ Step 4:
§ The first plot is rotated so that the first principal
component aligns with the x-axis
§ The second principal component aligns with the y-
axis
§ Note:
§ In the rotated representation, the two axes are
uncorrelated
§ i.e., Correlation matrix of the data in this representation is zero
except for the diagonal
Principal
Component
Analysis
Principal Component Analysis
§ PCA for Dimensionality Reduction:
§ We can use PCA for dimensionality reduction by retaining
only some of the principal components
§ In this example, we might keep only the first principal
component
§ The Third Plot:
§ Step 5:
§ Reduces the data from a two-dimensional dataset to a one-dimensional
dataset
§ The Fourth Plot:
§ Step 6:
§ Undo the rotation and add the mean back to the data
§ These points are in the original feature space, but we kept only the
information contained in the first principal component
§ Note:
§ This transformation is sometimes used to remove noise effects from the
data or visualize what part of the information is retained using the
principal components
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
§ One of the most common applications of PCA is visualizing high-
dimensional datasets
§ Disadvantage of Scatter Plot:
§ It is hard to create scatter plots of data that has more than two features
§ Pair Plot:
§ A 2D scatter plot that represents the pairwise relationship
between the numerical variables
§ The Iris dataset ---> able to create a pair plot that gave us a partial picture of
the data by showing us all the possible combinations of two features
§ Breast Cancer Dataset:
§ For the Breast Cancer dataset, even using a pair plot is tricky
§ This dataset has 30 features, which would result in 30 * 29 / 2 = 435 scatter plots
§ Histograms:
§ Computing histograms of each of the features for the two classes, benign and
malignant cancer
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
EXAMPLE:
Principal
Component
Analysis
Principal Component Analysis (Cancer Dataset)
Principal
Component
Analysis
§ Created a histogram for each feature -
§ Counting how often a data point appears with a feature in a certain
range (called a bin)
§ Each plot overlays two histograms, one for all of the points in the
benign class (blue) and one for all the points in the malignant class
(red).
§ This gives us some idea of how each feature is
distributed across the two classes
§ Allows us to guess as to which features are better at distinguishing
malignant and benign samples
§ Example:
§ The feature “smoothness error” seems quite uninformative,
because the two histograms mostly overlap
§ The feature “worst concave points” seems quite informative,
because the histograms are quite disjoint
§ NOTE:
• Histogram doesn’t show us anything about the interactions
between variables and how these relate to the classes
• Using PCA, we can capture the main interactions
Principal Component Analysis (Cancer Dataset)
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
Scaling before applying PCA
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
• Learning the PCA transformation and applying it is as
simple as applying a preprocessing transformation
• We instantiate the PCA object, find the principal
components by calling the fit method
• Then apply the rotation and dimensionality reduction by
calling transform
Principal
Component
Analysis
Principal Component Analysis
(Cancer Dataset)
• By default, PCA only rotates and shifts the data and
keeps all the principal components
• To reduce the dimensionality of the data, we
need to specify how many components we
want to keep when creating a PCA Object
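A minimal sketch of PCA on the scaled cancer data, keeping the first two components (StandardScaler and n_components=2 follow the description above; the printed shapes are what scikit-learn returns for this dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
# scale before PCA so no feature dominates just because of its units
X_scaled = StandardScaler().fit_transform(cancer.data)

# keep only the first two principal components
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

print("Original shape:", X_scaled.shape)            # (569, 30)
print("Reduced shape:", X_pca.shape)                # (569, 2)
# rows of components_ are the principal components, columns the original features
print("components_ shape:", pca.components_.shape)  # (2, 30)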
Principal
Component
Analysis
Principal Component Analysis
(Breast Cancer Dataset)
Principal
Component
Analysis
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
§ Note:
• PCA is an unsupervised method
• does not use any class information when finding
the rotation
• A linear classifier (that would learn a line in this
space) could do a reasonably good job at
distinguishing the two classes.
Principal
Component
Analysis
Principal Component Analysis
DRAWBACKS:
§ A downside of PCA is that the two axes in the plot are
often not very easy to interpret
§ The principal components correspond to directions in
the original data
§ The Principal components are combinations of the
original features. Hence, these combinations are
usually very complex
Principal
Component
Analysis
§ The principal components themselves are stored in the
components_ attribute
§ Each row in components_ corresponds to one principal
component
§ They are sorted by their importance
§ The columns correspond to the original features
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
Principal
Component
Analysis
Principal Component Analysis
§ Visualization using Heatmap
Eigenfaces for
Feature Extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Another application of PCA that we mentioned earlier is
feature extraction
§ The idea behind feature extraction is that it is possible to find a
representation of your data that is better suited to analysis
than the raw representation you were given
§ An application where feature extraction is helpful is with
images
§ Images are made up of pixels, usually stored as red, green, and
blue (RGB) intensities
§ Objects in images are usually made up of thousands of pixels,
and only together are they meaningful
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
EXAMPLE: LFW (Labeled Faces in the Wild) dataset
§ This dataset contains face images of celebrities
downloaded from the Internet
§ It includes faces of politicians, singers, actors, and athletes
from the early 2000s
§ We use grayscale versions of these images, and scale them
down for faster processing
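A hedged sketch of loading the dataset; min_faces_per_person=20 and resize=0.7 are common choices for this example (the download happens on first use and the exact image count can vary with the scikit-learn version):

from sklearn.datasets import fetch_lfw_people

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
print("people.images.shape:", people.images.shape)   # roughly (3023, 87, 65)
print("Number of classes:", len(people.target_names))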
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
• There are 3,023 images, each 87×65 pixels large, belonging to
62 different people
• The dataset is a bit skewed, containing a lot of images of
George W. Bush and Colin Powell
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
EXAMPLE:
Eigenfaces for feature extraction
• To make the data less skewed, we will only take up to 50
images of each person
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
• Face Recognition -
• Ask if a previously unseen face belongs to a known person from a
database
• An eigenface is the name given to a set of eigenvectors when used
in the computer vision problem of human face recognition
• The eigenface approach searches for a low-dimensional
representation of face images
• Applications of Face Recognition:
§ Photo collection
§ Social media
§ Security applications
§ Solution for Face Recognition:
§ To build a classifier
§ where each person is a separate class
§ Usually many different people in face databases, and very few images of the same
person
§ That makes it hard to train most classifiers
§ Simple solution is to use a one-nearest-neighbor classifier
§ looks for the most similar face image to the face you are classifying
§ This classifier could in principle work with only a single training example per class
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
• We obtain an accuracy of 26.6%, which is not actually that
bad for a 62-class classification problem
• Random guessing would give you around 1/62 = 1.6%
accuracy
• But here, we only correctly identify a person about every
fourth time
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Use of PCA -
§ Computing distances in the original pixel space is quite a bad way to measure similarity
between faces
§ When using a pixel representation to compare two images, we compare the grayscale
value of each individual pixel to the value of the pixel in the corresponding position in
the other image
§ This representation is quite different from how humans would interpret the image of a
face and it is hard to capture the facial features using this raw representation
§ Using pixel distances means that shifting a face by one pixel to the right corresponds to a
drastic change, with a completely different representation
§ Using distances along principal components can improve our accuracy
§ Whitening option of PCA:
§ Whitening = Rotation + Rescaling
§ Rescales the principal components to have the same scale
§ Whitening corresponds to not only rotating the data, but also rescaling it so that the center
panel is a circle instead of an ellipse
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ We fit the PCA object to the training data and extract the first 100
principal components.
§ The new data has 100 features, the first 100 principal components.
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Use the new representation to classify our images using a one-nearest-neighbors
classifier
§ Accuracy improved quite significantly, from 26.6% to 35.7%
§ For Image data, Components correspond to directions in the input space
§ The input space here is 87×65-pixel grayscale images, so directions within this
space are also 87×65-pixel grayscale images.
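A hedged sketch of the whole pipeline described above (capping at 50 images per person, a 1-NN baseline on raw pixels, then 1-NN on 100 whitened principal components); the random_state values are illustrative and the exact accuracies will differ slightly from the quoted 26.6% and 35.7%:

import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

# keep at most 50 images per person to make the data less skewed
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = True
X_people = people.data[mask] / 255.0   # scale grayscale values to [0, 1]
y_people = people.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)

# baseline: one-nearest-neighbor on raw pixels
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Raw-pixel accuracy: {:.2f}".format(knn.score(X_test, y_test)))

# PCA with whitening: rotate and rescale to the first 100 components
pca = PCA(n_components=100, whiten=True, random_state=0)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn.fit(X_train_pca, y_train)
print("Eigenface accuracy: {:.2f}".format(knn.score(X_test_pca, y_test)))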
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Eigenfaces:
Eigenfaces for
Feature Extraction
§ We cannot understand all aspects of these
components in the images
§ First Component -
§ seems to mostly encode the contrast between the
face and the background.
§ Second Component -
§ encodes differences in lighting between the right
and the left half of the face, and so on
Eigenfaces for
Feature Extraction
• As the PCA model is based on pixels
• the alignment of the face and the lighting both have a strong
influence on how similar two images are in their pixel
representation
• These properties i.e., Alignment and lighting are probably
not what a human would perceive first
• When asking people to rate similarity of faces, they are more
likely to use attributes like age, gender, facial expression,
and hair style, which are attributes that are hard to infer
from the pixel intensities
• Algorithms often interpret data (particularly visual data,
such as images, which humans are very familiar with)
quite differently from how a human would
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ PCA transformation = rotating the data + dropping the
components with low variance
§ Another Trick for PCA Transformation - Express the test
points as a weighted sum of the principal components
§ Try to find some numbers (coefficients x0, x1, etc.) (the new
feature values after the PCA rotation) and express the test
points as a weighted sum of the principal components
§ the reconstructions of the original data using only some
components
§ A similar transformation for the faces by reducing the data to
only some principal components and then rotating back into
the original space.
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
§ Reconstruction of the original data -
§ The reconstructions of the original data using only some
components (Example Fig 3.3)
§ Reducing the data to only some principal components and
then rotating back into the original space
§ This return of the original feature space can be done using
the inverse_transform method
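A short sketch of the reconstruction idea, assuming the same LFW setup as above; inverse_transform rotates the 100-component representation back into pixel space:

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
X = people.data / 255.0

pca = PCA(n_components=100, whiten=True, random_state=0).fit(X)
X_reduced = pca.transform(X)               # keep only 100 components per face
X_back = pca.inverse_transform(X_reduced)  # rotate back into the 87x65-pixel space

image_shape = people.images[0].shape
reconstruction = X_back[0].reshape(image_shape)  # ready to display with plt.imshow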
Eigenfaces for
Feature Extraction
Eigenfaces for feature extraction
Eigenfaces for
Feature Extraction
• Reconstructing three face images using increasing
numbers of principal components
• First 10 principal components -
• only the essence of the picture, like the face orientation
and lighting, is captured
• Using more and more principal components, more and
more details in the image are preserved
• Using as many components as there are pixels would
mean that we would not discard any information after
the rotation, and we would reconstruct the image
perfectly
Non-Negative
Matrix
Factorization
Non-Negative
Matrix
Factorization
§ Non-negative matrix factorization is another unsupervised
learning algorithm that aims to extract useful features
§ It works similarly to PCA
§ It can also be used for dimensionality reduction
§ Write each data point as a weighted sum of some
components
§ PCA -
§ wants components that were orthogonal and that explained
as much variance of the data as possible
§ NMF -
§ wants the components and the coefficients to be non-
negative
§ i.e., both the components and the coefficients to be greater
than or equal to zero
§ This method can only be applied to data where each
feature is non-negative --> as a non-negative sum of non-
negative components cannot become negative
Non-Negative
Matrix
Factorization
§ Process of decomposing data into a non-negative
weighted sum is particularly helpful for - data that is
created as the addition (or overlay) of several
independent sources
§ audio track of multiple people speaking
§ music with many instruments
§ In these situations, NMF can identify the original
components that make up the combined data
§ NMF leads to more interpretable components than PCA
§ as negative components and coefficients can lead to hard-to-
interpret cancellation effects
Non-Negative
Matrix
Factorization
Applying NMF to synthetic data
§ we need to ensure that our data is positive for NMF to be
able to operate on the data
§ Where the data lies relative to the origin (0, 0) actually
matters for NMF.
Non-Negative
Matrix
Factorization
Applying NMF to synthetic data
§ Left plot (two components)
§ It is clear that all points in the data can be written as a
positive combination of the two components.
§ If there are enough components to perfectly reconstruct
the data (as many components as there are features)
§ The algorithm will choose directions that point toward
the extremes of the data.
§ Right plot (one component)
§ NMF creates a component that points toward the mean, as
pointing there best explains the data
§ reducing the number of components not only removes
some directions, but creates an entirely different set of
components
Non-Negative
Matrix
Factorization
Applying NMF to synthetic data
§ Components extracted by NMF are
§ not ordered in any specific way
§ all components play an equal part
§ Randomness:
§ NMF uses a random initialization, which might
lead to different results depending on the
random seed
§ data with two components, where all the data
can be explained perfectly, the randomness
has little effect
§ In more complex situations, there might be
more drastic changes
Non-Negative
Matrix
Factorization
Applying NMF to face images
§ LFW dataset
§ Main parameter of NMF is how many components
we want to extract
§ Usually this is lower than the number of input features
§ Number of components impacts how well the data can be
reconstructed using NMF
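A hedged sketch of NMF on the face data; n_components=15 and max_iter=1000 are illustrative settings (pixel intensities are already non-negative, which is what NMF requires):

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import NMF

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
X = people.data / 255.0   # non-negative pixel intensities

nmf = NMF(n_components=15, random_state=0, max_iter=1000)
W = nmf.fit_transform(X)    # coefficients: 15 non-negative weights per image
H = nmf.components_         # components: 15 prototype images of length 87*65
print("W shape:", W.shape)
print("H shape:", H.shape)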
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
• The quality of the back-transformed data is similar to when
using PCA, but slightly worse
• PCA -
• finds the optimum directions in terms of reconstruction
• NMF -
• is usually not used for its ability to reconstruct or encode data,
but rather for finding interesting patterns within the data
• These components are all positive, and so resemble
prototypes of faces much more so than the components
shown for PCA
• Component 3 -
• shows a face rotated somewhat to the right
• Component 7 -
• shows a face somewhat rotated to the left.
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Applying NMF to face images
• Faces that have a high coefficient for component 3 are faces
looking to the right (Figure 3-16)
• Faces with a high coefficient for component 7 are looking
to the left (Figure 3-17)
• Extracting patterns like these works best for data with
additive structure
• audio
• gene expression
• text data
Non-Negative
Matrix
Factorization
Applying NMF to face images
Non-Negative
Matrix
Factorization
Recovering mixed signals with NMF
We can use NMF to recover the three signals
Non-Negative
Matrix
Factorization
Recovering mixed signals with NMF
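A hedged sketch of the signal-recovery idea with made-up non-negative sources (the actual example uses a helper to generate its signals; everything below is only an illustration):

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
t = np.linspace(0, 8 * np.pi, 2000)

# three made-up non-negative source signals, one per column
S = np.column_stack([np.abs(np.sin(t)),
                     np.abs(np.sin(3 * t + 1)),
                     rng.uniform(0, 1, size=t.shape)])   # shape (2000, 3)

# mix the three sources into 100 observed measurements
A = rng.uniform(size=(3, 100))
X = S.dot(A)                                             # shape (2000, 100)

# NMF tries to recover three non-negative components from the mixture
nmf = NMF(n_components=3, random_state=42, max_iter=1000)
S_recovered = nmf.fit_transform(X)
print("Recovered signal shape:", S_recovered.shape)      # (2000, 3)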
Manifold
Learning
with t-SNE
Manifold
Learning
with t-SNE
§ Advantages of PCA;
§ PCA is often a good first approach for transforming the data so
that we might be able to visualize it using a scatter plot
§ Disadvantages of PCA:
§ The nature of the method (applying a rotation and then dropping
directions) limits its usefulness
Manifold
Learning
with t-SNE
§ Manifold learning -
§ reduces the dimensionality of high-dimensional data by assuming
that the data is embedded in a lower-dimensional nonlinear manifold
§ Class of algorithms for visualization called manifold learning
algorithms
§ allow for much more complex mappings
§ often provide better visualizations
§ many algorithms exist -
§ LLE (Locally Linear Embedding), Isomap, SE (Spectral Embedding), t-SNE
(t-distributed Stochastic Neighbor Embedding)
§ useful one is the t-SNE algorithm
§ Manifold learning algorithms are mainly aimed at visualization
§ rarely used to generate more than two new features
§ t-SNE computes a new representation of the training data, but doesn’t
allow transformations of new data
§ these algorithms cannot be applied to a test set
§ Manifold learning can be useful for exploratory data analysis
Manifold
Learning
with t-SNE
§ Idea behind t-SNE -
§ Machine learning algorithm that is used to visualize high
dimensional data in two or three dimensions
§ embeds high dimensional points into lower dimensions
§ find a two-dimensional representation of the data that
preserves the distances between points as best as possible
§ t-SNE
§ Starts with a random two-dimensional representation for each data
point
§ Tries to place points that are close in the original feature space
closer together
§ Tries to place data points that are far apart in the original feature space
farther apart
Manifold
Learning
with t-SNE
§ t-SNE
§ puts more emphasis on points that are close by rather than
preserving distances between far-apart points
§ i.e., it tries to preserve the information indicating which points are
neighbors to each other
§ EXAMPLE:
§ Handwritten digits
§ Each data point in this dataset is an 8×8 grayscale image of a handwritten
digit between 0 and 9.
Manifold
Learning
with t-SNE
Applying PCA to Handwritten Digits
§ we actually used the true digit classes as characters, to show which
class is where.
§ The digits zero, six, and four are relatively well separated using the
first two principal components, though they still overlap.
§ Most of the other digits overlap significantly.
Manifold
Learning
with t-SNE
Applying PCA to Handwritten Digits
§ PCA to visualize the data reduced to two dimensions.
§ We plot the first two principal components, and represent each
sample with a digit corresponding to its class.
Manifold
Learning
with t-SNE
Scatter Plot using PCA on Handwritten Digits
Manifold
Learning
with t-SNE
Applying t-SNE to Handwritten Digits
§ Because t-SNE does not support transforming new data, the TSNE
class has no transform method
§ we can call the fit_transform method
§ build the model and immediately return the transformed data.
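A minimal sketch on the scikit-learn digits dataset; TSNE's defaults are used, with random_state fixed only for reproducibility:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# TSNE has no transform method; fit_transform builds the model and
# returns the embedded training data in one step
tsne = TSNE(random_state=42)
digits_tsne = tsne.fit_transform(digits.data)
print(digits_tsne.shape)   # (1797, 2): one 2D point per digit image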
Manifold
Learning
with t-SNE
Applying t-SNE to Handwritten Digits
Manifold
Learning
with t-SNE
Applying t-SNE to Handwritten Digits
§ The result of t-SNE is quite remarkable
§ All the classes are quite clearly separated
§ The ones and nines are somewhat split up, but most of the classes
form a single dense group
§ This method has no knowledge of the class labels: it is
completely unsupervised
§ It can find a representation of the data in two dimensions that clearly
separates the classes, based solely on how close points are in the
original space.
§ The t-SNE algorithm has some tuning parameters
§ though it often works well with the default settings.
§ perplexity -
§ controls the effective number of neighbors that each point considers during
the dimensionality reduction process
§ early_exaggeration -
§ Controls how tight natural clusters in the original space are in the embedded
space and how much space will be between them
§ learning_rate
§ max_iter etc.,
Agenda
Clustering
§ Clustering is the task of partitioning the dataset
into groups, called clusters
§ GOAL: The goal is to split up the data in such a way
that points within a single cluster are very similar
and points in different clusters are different
§ Similarly to classification algorithms, clustering
algorithms assign (or predict) a number to each
data point, indicating which cluster a particular
point belongs to
Clustering
Clustering
K-Means Clustering
Agglomerative Clustering
DBSCAN
Clustering K-Means Clustering
Clustering
K-Means Clustering
§ k-means clustering is one of the simplest and
most commonly used clustering algorithms
§ It tries to find cluster centers that are
representative of certain regions of the data
§ The algorithm alternates between two steps:
§ Step 1: Assigning data point to cluster
§ Assigning each data point to the closest cluster
center
§ Step 2: Recalculation of cluster center
§ Setting each cluster center as the mean of the data
points that are assigned to it
§ The algorithm is finished when the assignment of
instances to clusters no longer changes
Clustering
K-Means Clustering
Clustering
K-Means Clustering
§ Cluster centers are shown as triangles
§ Data points are shown as circles
§ Colors indicate cluster membership
§ Three clusters - so the algorithm was initialized by
declaring three data points randomly as cluster centers
(Initialization)
§ Then the iterative algorithm starts
§ First, each data point is assigned to the cluster center it is
closest to (Assign Points (1))
§ The cluster centers are updated to be the mean of the
assigned points (Recompute Centers (1))
§ Then the process is repeated two more times. After the third
iteration, the assignment of points to cluster centers remained
unchanged, so the algorithm stops.
Clustering
K-Means Clustering
§ Given new data points, k-means will assign each to the
closest cluster center.
Clustering
K-Means Clustering
K-Means
Clustering
§ Each training data point in X is assigned a cluster label
§ Find these labels in the kmeans.labels_ attribute
§ Because we asked for three clusters, the clusters are numbered 0
to 2
K-Means
Clustering
§ We can also assign cluster labels to new points, using
the predict method
§ Each new point is assigned to the closest cluster center
when predicting, but the existing model is not changed
§ Running predict on the training set returns the same
result as labels_.
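A minimal sketch on a synthetic blob dataset (make_blobs, n_clusters=3 and the random_state values are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(random_state=1)   # synthetic two-dimensional data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_[:10])       # cluster label (0, 1, or 2) for each training point
print(kmeans.predict(X)[:10])    # identical to labels_ on the training data
print(kmeans.cluster_centers_)   # the three cluster centers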
Clustering
K-Means Clustering
§ Clustering is somewhat similar to Classification
§ The labels themselves have no a priori meaning
§ Example 1 - Two dimensional toy dataset
§ we should not assign any significance to the fact that one group was
labeled 0 and another one was labeled 1.
§ Note 1 -
§ Running the algorithm again might result in a different numbering
of clusters because of the random nature of the initialization
§ Note 2 -
§ The cluster centers are stored in the cluster_centers_ attribute
§ Example 2 - Clustering face images
§ It might be that the cluster 3 found by the algorithm
contains only faces of Bela. You can only know that after you
look at the pictures
§ The number 3 is arbitrary
§ The only information the algorithm gives you is that all
faces labeled as 3 are similar
K-Means
Clustering
K-Means
Clustering
§ We can also use more or fewer cluster centers
K-Means
Clustering
Failure cases of k-means
§ Even if you know the “right” number of clusters for a
given dataset, k-means might not always be able to
recover them
§ Each cluster is defined solely by its center, which
means that each cluster is a convex shape
§ k-means can only capture relatively simple shapes
§ k-means also assumes that all clusters have the same
“diameter” -
§ It always draws the boundary between clusters to be
exactly in the middle between the cluster centers
K-Means
Clustering
K-Means
Clustering
§ Three clusters - Cluster 0, cluster 1, cluster 2
§ cluster 0 and cluster 1 have some points that are far
away from all the other points in these clusters that
“reach” toward the center
K-Means
Clustering
§ k-means also assumes that all directions are equally
important for each cluster
§ The following plot (Figure 3-28) shows a two-
dimensional dataset where there are three clearly
separated parts in the data
§ However, these groups are stretched toward the
diagonal
§ As k-means only considers the distance to the nearest
cluster center, it can’t handle this kind of data
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
• k-means also performs poorly if the clusters have more
complex shapes, like the two_moons
• Here, we would hope that the clustering algorithm can
discover the two half-moon shapes
• However, this is not possible using the k-means
algorithm
K-Means
Clustering
Vector quantization
§ Even though k-means is a clustering algorithm, there
are interesting parallels between k-means and the
decomposition methods like PCA and NMF
§ PCA tries to find directions of maximum variance
in the data
§ NMF tries to find additive components, which often
correspond to “extremes” or “parts” of the data
§ Both methods tried to express the data points as a
sum over some components
§ k-means, on the other hand, tries to represent
each data point using a cluster center
§ In k-means each point being represented using
only a single component, which is given by the
cluster center
K-Means
Clustering
Vector quantization
§ This view of k-means as a decomposition method,
where each point is represented using a single
component, is called vector quantization
§ Comparison of PCA, NMF, and k-means
§ showing the components extracted , as well as
reconstructions of faces from the test set using 100
components
§ For k-means, the reconstruction is the closest cluster
center found on the training set
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
§ One interesting advantage of k-means -
§ An interesting aspect of vector quantization using k-
means is that we can use many more clusters than
input dimensions to encode our data
§ Example: two_moons data
§ Using PCA or NMF, there is nothing much we can do to this
data, as it lives in only two dimensions
§ Reducing it to one dimension with PCA or NMF would
completely destroy the structure of the data
§ But we can find a more expressive representation with
k-means, by using more cluster centers
K-Means
Clustering
K-Means
Clustering
K-Means
Clustering
§ We used 10 cluster centers - means each point is now
assigned a number between 0 and 9
§ We can see this as the data being represented using 10
components (that is, we have 10 new features)
§ Using this 10-dimensional representation, it would now be
possible to separate the two half-moon shapes using a
linear model, which would not have been possible using
the original two features
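A hedged sketch of this idea on two_moons; the noise level and the use of kmeans.transform (distances to the 10 centers) as the new features are illustrative choices:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# use many more cluster centers (10) than input dimensions (2)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
kmeans.fit(X)
print("Cluster memberships:", kmeans.labels_[:10])

# the distance to each of the 10 centers gives a 10-dimensional
# representation of the 2D data (vector quantization as features)
distance_features = kmeans.transform(X)
print("Distance feature shape:", distance_features.shape)   # (200, 10)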
K-Means
Clustering
K-Means
Clustering
Advantages
§ k-means is a very popular algorithm for clustering
§ Relatively easy to understand and implement
§ It runs relatively quickly
§ The k-means clustering algorithm is guaranteed to
give results (Convergence)
§ It is not specific to particular problems. (i.e., can be
applied for numerical data to text) (Generalization)
§ k-means scales easily to large datasets
§ NOTE:
§ MiniBatchKMeans class - can handle very large
datasets
K-Means
Clustering
Disadvantages
§ It relies on a random initialization, which means the outcome
of the algorithm depends on a random seed
§ Deciding on the number of clusters to start is difficult (can
use elbow method)
§ Choice of initial centroids is difficult
§ Effect of outliers
§ Curse of dimensionality
§ Preprocessing is mandatory
§ Restrictive assumptions made on the shape of clusters
§ The requirement to specify the number of clusters you are
looking for (which might not be known in a real-world
application)
Clustering Agglomerative Clustering
Agglomerative
Clustering
§ Agglomerative clustering refers to a collection of
clustering algorithms that all build upon the same
principles:
§ The algorithm starts by declaring each point its own
cluster
§ then merges the two most similar clusters until some
stopping criterion is satisfied
§ The stopping criterion -
§ Number of clusters
§ so similar clusters are merged until only the specified
number of clusters are left
§ Most similar cluster is identified by considering several
linkage criteria
§ This measure is always defined between two existing
clusters
Agglomerative
Clustering
§ The following three choices (Linkages) are implemented in scikit-learn:
§ Ward
§ The default choice
§ Ward picks the two clusters to merge such that the variance within
all clusters increases the least
§ This often leads to clusters that are relatively equally sized
§ Average
§ Merges the two clusters that have the smallest average distance
between all their points
§ Complete
§ Also known as maximum linkage
§ Merges the two clusters that have the smallest maximum distance
between their points
§ Note 1-
§ Ward works on most datasets
§ Note 2 -
§ If the clusters have very dissimilar numbers of members, average or complete
might work better
Agglomerative
Clustering
§ This plot illustrates the progression of agglomerative clustering on a two-
dimensional dataset, looking for three clusters.
Agglomerative
Clustering
§ Initially, each point is its own cluster
§ Then, in each step, the two clusters that are closest
are merged
§ In the first four steps, two single-point clusters are
picked and these are joined into two-point
clusters
§ In step 5, one of the two-point clusters is extended
to a third point, and so on
§ In step 9, there are only three clusters remaining
§ As we specified that we are looking for three
clusters, the algorithm then stops
Agglomerative
Clustering
§ Because of the way the algorithm works,
agglomerative clustering cannot make
predictions for new data points
§ AgglomerativeClustering has no predict method
§ To build the model and get the cluster
memberships on the training set, use the
fit_predict method instead
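A minimal sketch on synthetic blobs; since AgglomerativeClustering has no predict method, fit_predict is used instead:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, y = make_blobs(random_state=1)

# build the model and return the cluster membership of each training point
agg = AgglomerativeClustering(n_clusters=3)
assignment = agg.fit_predict(X)
print(assignment[:10])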
Agglomerative
Clustering
Agglomerative
Clustering
§ While the scikit-learn implementation of
agglomerative clustering requires you to
specify the number of clusters
§ Agglomerative clustering methods
provide some help with choosing the right
number of clusters
Agglomerative
Clustering
Hierarchical clustering and
dendrograms
§ Agglomerative clustering produces what is
known as a hierarchical clustering
§ The clustering proceeds iteratively
§ Every point makes a journey from being a single
point cluster to belonging to some final
cluster
§ Each intermediate step provides a clustering of
the data (with a different number of clusters)
Agglomerative
Clustering
The following figure shows an overlay of all the possible
clusterings shown in Figure 3-33, providing some insight into
how each cluster breaks up into smaller clusters.
Agglomerative
Clustering
Hierarchical Clustering and Dendrograms
§ The overlay visualization above relies on the two-
dimensional nature of the data
§ It therefore cannot be used on datasets that have
more than two features
§ Dendrograms -
§ Another tool to visualize hierarchical
clustering, called a dendrogram
§ can handle multidimensional datasets
Agglomerative
Clustering
Hierarchical Clustering and
Dendrograms
§ The dendrogram is a tree-like structure
that records each merge step of the clustering
§ scikit-learn currently does not have the
functionality to draw dendrograms
§ Dendrograms can be generated easily
using SciPy
Agglomerative
Clustering
SciPy vs scikit-learn
§ SciPy clustering algorithms have a
slightly different interface to the scikit-
learn clustering algorithms
§ SciPy provides a function that
§ Takes a data array X
§ Computes a linkage array, which encodes
hierarchical cluster similarities
Agglomerative
Clustering
§ We can then feed this linkage array into the scipy
dendrogram function to plot the dendrogram
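A minimal sketch using SciPy's ward and dendrogram functions on a 12-point synthetic dataset (the dataset itself is an illustrative stand-in):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)

# ward returns a linkage array encoding the hierarchical cluster merges
linkage_array = ward(X)

# plot the dendrogram for the linkage array of the 12 points
dendrogram(linkage_array)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()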
Agglomerative
Clustering
Agglomerative
Clustering
§ The dendrogram
§ shows data points as points on the bottom (i.e., X-axis)
(numbered from 0 to 11)
§ shows Cluster Distance on Y- axis
§ Then, a tree is plotted with these points
(representing single-point clusters) as the leaves,
and a new parent node is added for each two
clusters that are joined
§ Reading from bottom to top, the data points 1 and
4 are joined first (as you could see in Figure 3-33).
§ Next, points 6 and 9 are joined into a cluster, and
so on. At the top level, there are two branches,
one consisting of points 11, 0, 5, 10, 7, 6, and 9, and
the other consisting of points 1, 4, 3, 2, and 8.
§ These correspond to the two largest clusters
Agglomerative
Clustering
§ The y-axis in the dendrogram
§ specifies when two clusters get merged
§ The length of each branch also shows
how far apart the merged clusters are
§ The longest branches in this dendrogram
are the three lines that are marked by the
dashed line labeled “three clusters”
§ Going from three to two clusters meant
merging some very far-apart points
§ We see this again at the top of the chart, where merging
the two remaining clusters into a single
cluster bridges a relatively large
distance
Agglomerative
Clustering
Drawbacks of Agglomerative Clustering
§ Fails at separating complex shapes
(Example: two_moons )
Clustering DBSCAN
DBSCAN
DBSCAN
§ DBSCAN - Density-Based Spatial Clustering of
Applications with Noise
§ Another very useful clustering algorithm
§ Benefits:
§ It does not require the user to set the number of
clusters a priori
§ It can capture clusters of complex shapes
§ It can identify points that are not part of any cluster
§ Drawbacks:
§ Somewhat slower than agglomerative clustering and k-
means, but still scales to relatively large datasets
DBSCAN
DBSCAN
§ Functionality:
§ DBSCAN works by identifying points that are in
“crowded” regions of the feature space, where
many data points are close together
§ These regions are referred to as dense regions in
feature space
§ The idea behind DBSCAN is that clusters form
dense regions of data, separated by regions that
are relatively empty
DBSCAN
DBSCAN
§ Core Samples:
§ Points that are within a dense region are called
core samples
§ Also called as core points
§ Parameters to identify core samples:
§ min_samples
§ eps
§ If there are at least min_samples many data points
within a distance of eps to a given data point, that
data point is classified as a core sample
§ Core samples that are closer to each other than the
distance eps are put into the same cluster by
DBSCAN
DBSCAN
DBSCAN Algorithm
§ Step 1: Picks an arbitrary point to start with
§ Step 2: Finds all points with distance eps or less from that point
§ Step 3: If there are fewer than min_samples points within
distance eps of the starting point - this point is labeled as noise
(i.e., it doesn’t belong to any cluster)
§ Step 4: If there are more than min_samples points within a
distance of eps, the point is labeled a core sample - assigned
a new cluster label
§ Step 5: All neighbors (within eps) of the point are visited
§ Step 5.1: If they have not been assigned a cluster yet, they are
assigned the new cluster label that was just created
§ Step 5.2: If they are core samples, their neighbors are
visited in turn, and so on.
§ Step 5.3: The cluster grows until there are no more core samples
within distance eps of the cluster
DBSCAN
§ Step 6: Another point that hasn’t yet been visited is
picked, and the same procedure is repeated
§ Finally we end up with three kinds of points
§ Core points
§ Boundary Points - Points that are within distance eps of
core points (called boundary points)
§ Noise - Points that do not belong to any cluster
DBSCAN
§ Note 1:
§ When the DBSCAN algorithm is run on a particular
dataset multiple times, there will not be any change
in Core points and Noise
§ (i.e., the clustering of the core points is always the same,
and the same points will always be labeled as noise)
§ Note 2:
§ When the DBSCAN algorithm is run on a
particular dataset multiple times, the
boundary points may change
§ i.e., A boundary point might be neighbor to
core samples of more than one cluster.
§ Note 3:
§ The cluster membership - of boundary points
depends on the order in which points are visited
DBSCAN
DBSCAN on the synthetic dataset
§ DBSCAN does not allow predictions on new test data,
so we will use the fit_predict method to perform
clustering and return the cluster labels in one step
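A minimal sketch on a small synthetic dataset; with the default eps=0.5 and min_samples=5 on only 12 points, every point typically comes out as noise (label -1):

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, y = make_blobs(random_state=0, n_samples=12)

# DBSCAN cannot predict on new data, so fit_predict clusters and
# returns the labels in one step; -1 marks noise points
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X)
print("Cluster memberships:", clusters)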
DBSCAN
DBSCAN
§ Points that belong to clusters are solid
§ Noise points are shown in white
§ Core samples are shown as large markers
§ Boundary points are displayed as smaller markers
§ Increasing eps (going from left to right in the figure)
§ means that more points will be included in a cluster
§ This makes clusters grow, but might also lead to
multiple clusters joining into one
§ Increasing min_samples (going from top to bottom
in the figure)
§ means that fewer points will be core points, and more
points will be labeled as noise
DBSCAN
§ Parameter eps:
§ most important parameter
§ it determines what it means for points to be “close”
§ Very small eps -
§ means - NO points are core samples
§ Leads to - all points being labeled as noise
§ Very large eps -
§ All points forming a single cluster
§ Parameter min_samples:
§ The min_samples - mostly determines whether points
in less dense regions will be labeled as outliers or as
their own clusters
§ Large min_samples - many samples will now be
labeled as noise
§ determines the minimum cluster size
DBSCAN
§ Note 1:
§ While DBSCAN doesn’t require setting the number of
clusters explicitly, setting eps implicitly controls how
many clusters will be found
§ Note 2:
§ Finding a good setting for eps is sometimes easier after
scaling the data using StandardScaler or MinMaxScaler
DBSCAN
DBSCAN on the two_moons dataset
§ The algorithm actually finds the two half-circles and
separates them using the default settings.
DBSCAN
DBSCAN
§ As the algorithm produced the desired number of
clusters (two)
§ Default parameter (eps=0.5) settings seem to work well
§ If we decrease eps to 0.2 we will get eight clusters
§ Increasing eps to 0.7 results in a single cluster
§ When using DBSCAN, you need to be careful about
handling the returned cluster assignments
DBSCAN
Comparing and Evaluating Clustering
Algorithms
§ Challenges in clustering algorithms -
§ Very hard to assess how well an algorithm
worked
§ To compare outcomes between different
algorithms
DBSCAN
Evaluating clustering with ground truth
§ Metrics to assess the outcome of a clustering algorithm
§ Adjusted Rand Index (ARI)
§ Normalized Mutual Information (NMI)
§ Both provides a quantitative measure
§ Clustering - 1
§ Unrelated Clusterings - 0
§ ARI can become negative
§ Compare the k-means, agglomerative clustering, and
DBSCAN algorithms using ARI
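A hedged sketch of the comparison on the scaled two_moons data; the dataset parameters and n_clusters=2 are illustrative choices:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

algorithms = [KMeans(n_clusters=2, n_init=10, random_state=0),
              AgglomerativeClustering(n_clusters=2),
              DBSCAN()]

for algorithm in algorithms:
    clusters = algorithm.fit_predict(X_scaled)
    # ARI compares the found clusters against the known two-moon labels
    print("{}: ARI = {:.2f}".format(type(algorithm).__name__,
                                    adjusted_rand_score(y, clusters)))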
DBSCAN
DBSCAN
Common Mistake when evaluating clustering
§ Use of accuracy_score instead of adjusted_rand_score and
normalized_mutual_info_score,
DBSCAN
Evaluating clustering without ground truth (O/P)
§ In practice, there is a big problem with using measures like ARI
§ In Clustering algorithms -
§ there is usually no ground truth to which to compare the results
§ Metrics like ARI and NMI -
§ only helps in developing algorithms
§ NOT in assessing success in an application
§ Silhouette coefficient -
§ Another metric for clustering
§ Doesn’t require ground truth
§ Computes the compactness of a cluster
§ Note 1:
§ Compactness doesn’t allow for complex shapes
§ Note 2:
§ These output metrics often don’t work well in practice
DBSCAN
Comparison using the silhouette score
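A hedged sketch of the silhouette comparison on the same scaled two_moons data; for simplicity DBSCAN's noise label (-1) is scored as if it were just another cluster:

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

for algorithm in [KMeans(n_clusters=2, n_init=10, random_state=0),
                  AgglomerativeClustering(n_clusters=2),
                  DBSCAN()]:
    clusters = algorithm.fit_predict(X_scaled)
    # silhouette measures cluster compactness, without using the true labels y
    print("{}: silhouette = {:.2f}".format(type(algorithm).__name__,
                                           silhouette_score(X_scaled, clusters)))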
DBSCAN
Observations
§ k-means gets the highest silhouette score
§ We might prefer the result produced by DBSCAN
§ Better strategy:
§ for evaluating clusters - use robustness-based
clustering metrics
§ These run an algorithm
§ after adding some noise to the data
§ using different parameter settings
§ Then compare the outcomes
DBSCAN
Face Images Example
§ Note:
§ Even if we get very high silhouette score -
§ Still don’t know if there is any semantic meaning in the clustering
§ Whether the clustering reflects an aspect of the data that we are
interested in
§ Face Images Example:
§ Goal is to find groups of similar faces — men and women, or old people and
young people, or people with beards and without
§ Target:
§ Cluster the data into two clusters
§ Drawbacks:
§ We still don’t know if the clusters that are found correspond in any way to the
concepts we are interested in
§ The clusters may find side views versus front views, or pictures taken at
night versus pictures taken during the day, or pictures taken with
iPhones versus pictures taken with Android phones
§ The only way to know whether the clustering corresponds to anything we are
interested in is to analyze the clusters manually
DBSCAN
Comparing algorithms on the faces dataset
§ Use Eigenface representation of the data, as produced
by PCA(whiten=True), with 100 components
§ The output has more semantic representation of the
face images than the raw pixels
§ It will also make computation faster
DBSCAN
Analyzing the faces dataset with DBSCAN
§ All the returned labels are –1
§ All of the data was labeled as “noise” by DBSCAN.
§ Solution:
§ eps = Higher
§ expand the neighborhood of each point
§ min_samples = Lower
§ to consider smaller groups of points as clusters
DBSCAN
§ Lowering min_samples (e.g., to 3)
§ Result:
§ Everything is labeled as noise
DBSCAN
§ Increasing eps (e.g., to 15)
§ Result:
§ Only One Cluster (0) is formed along with noise (-1)
DBSCAN
§ Use this result to find out what the “noise” looks like
compared to the rest of the data
§ 27 points of noise and 2036 points are inside the cluster
DBSCAN
§ Noise Points
DBSCAN
§ Why they are considered as noise?
§ the fifth image in the first row - person drinking from a
glass
§ Images of people wearing hats
§ Last image - hand in front of the person’s face
§ other images - contain odd angles or crops that are too
close or too wide
§ We can do little about people in photos who are wearing
hats, drinking, or holding something in front of their faces
§ Outlier Detection:
§ This kind of analysis — trying to find “the odd one out” —
is called outlier detection
§ Solution:
§ do a better job of cropping images
DBSCAN
§ For more clusters:
§ Need to set a smaller eps, somewhere between 0.5 (the default) and 15
DBSCAN
DBSCAN
DBSCAN
Analyzing the faces dataset
§ Some of the clusters correspond to people with very
distinct faces (within this dataset), such as Sharon or
Koizumi
§ Within each cluster, the orientation of the face is also quite
fixed, as well as the facial expression
§ Some of the clusters contain faces of multiple people, but
they share a similar orientation and expression
§ Note:
§ We are doing a manual analysis here
§ Different from the supervised learning based on R2 score or
accuracy
DBSCAN
Analyzing the faces dataset with k-means
§ Disadvantage of DBSCAN on Face Dataset -
§ Not possible to create more than one big cluster using
DBSCAN
§ Pros and Cons of Agglomerative clustering and k-
means -
§ Pros -
§ Can create clusters of even size
§ Cons -
§ Need to set a target number of clusters a priori
§ Number of clusters = Number of people in the dataset
§ Still cannot recover all the clusters correctly
§ Solution -
§ Start with a low number of clusters (eg., 10) - Analyze
each of the clusters manually
§ Increase the number of clusters if necessary
DBSCAN
§ K-Means -
§ Partitioned the data into relatively similarly sized clusters
from 64 to 386
§ This is quite different from the result of DBSCAN
DBSCAN
§ Visualization of outcome of k-means
§ As we clustered in the representation produced by PCA,
we need to rotate the cluster centers back into the
original space to visualize them, using
pca.inverse_transform.
DBSCAN
§ The cluster centers found by k-means are very smooth
versions of faces
§ Each center is an average of 64 to 386 face images
§ The clustering seems to pick up on
§ different orientations of the face
§ different expressions (the third cluster center seems to
show a smiling face)
§ the presence of shirt collars (see the second-to-last
cluster center).
DBSCAN
§ More detailed view -
§ In Figure 3-44
§ Each cluster center shows -
§ The five most typical images in the cluster -
§ the images assigned to the cluster that are closest to
the cluster center
§ The five most atypical images in the cluster -
§ the images assigned to the cluster that are furthest
from the cluster center
DBSCAN
DBSCAN
§ Third Cluster - Smiling Faces
§ Other clusters - Orientation
§ Atypical points -
§ are not very similar to the cluster centers
§ Their assignment seems somewhat arbitrary
§ k-means partitions doesn’t have a concept of “noise”
points
§ Using a larger number of clusters, the algorithm could
find finer distinctions
§ Note:
§ Adding more clusters makes manual inspection even
harder
DBSCAN
Analyzing the faces dataset with agglomerative
clustering
§ Agglomerative clustering also produces
§ relatively equally sized clusters
§ with cluster sizes between 26 and 623
§ More uneven than those produced by k-means
§ Much more even than the ones produced by DBSCAN
DBSCAN
§ Compute ARI -
§ to measure the similarity of two partitions by
Agglomerative and K-Means
§ ARI = 0.13
§ means that the two clusterings labels_agg and labels_km
have little in common
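A sketch of that comparison; both clusterings are recomputed here so the snippet is self-contained, and the exact ARI may differ slightly from the 0.13 on the slide depending on random initialization and dataset version.

import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Eigenface setup as before
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask] / 255.0
X_pca = PCA(n_components=100, whiten=True, random_state=0).fit_transform(X_people)

labels_km = KMeans(n_clusters=10, random_state=0).fit_predict(X_pca)
labels_agg = AgglomerativeClustering(n_clusters=10).fit_predict(X_pca)

# An ARI close to 0 means the two partitions have little in common
print("ARI: {:.2f}".format(adjusted_rand_score(labels_agg, labels_km)))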
DBSCAN
§ Dendrogram -
§ We’ll limit the depth of the tree in the plot, as branching
down to the individual 2,063 data points would result in an
unreadably dense plot.
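scikit-learn does not plot dendrograms itself, so a sketch using SciPy's ward and dendrogram functions, truncated to a limited depth, might look like this (same eigenface setup as in the earlier snippets).

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# Eigenface setup as before
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask] / 255.0
X_pca = PCA(n_components=100, whiten=True, random_state=0).fit_transform(X_people)

# Linkage array encoding the full hierarchy; truncate the plot to 7 levels
linkage_array = ward(X_pca)
plt.figure(figsize=(20, 5))
dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()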
DBSCAN
Agglomerative with ten clusters
§ 10 clusters (Figure 3-46)
§ There is no notion of a cluster center in agglomerative clustering, so arbitrary data points from each cluster are shown instead
§ The number of points in each cluster is shown to the left of the first image
DBSCAN
§ While some of the clusters seem to have a semantic
theme, many of them are too large to be actually
homogeneous
§ To get more homogeneous clusters - run the algorithm
again, this time with 40 clusters
DBSCAN
§ Agglomerative clustering with 40 clusters (Figure 3-47) - some clusters show a clear theme, such as:
§ dark skinned and smiling
§ collared shirt
§ smiling woman
§ Hussein
§ high forehead
§ We could also find these highly similar clusters using dendrograms
DBSCAN
Summary of Clustering Methods
§ Applying and evaluating clustering is a highly qualitative procedure
§ Most helpful in the exploratory phase of data analysis
§ Three clustering algorithms:
§ k-means
§ DBSCAN,
§ Agglomerative
§ All three have a way of controlling the granularity of clustering
§ k-means and agglomerative clustering allow you to specify the number of desired clusters
§ DBSCAN lets you define proximity using the eps parameter, which indirectly influences the number and size of clusters
§ All three methods
§ can be used on large, real-world datasets
§ are relatively easy to understand
§ allow for clustering into many clusters
DBSCAN
Summary of Clustering Methods
§ Strengths -
§ k-means -
§ k-means allows for a characterization of the clusters using the
cluster means
§ It can also be viewed as a decomposition method, where each
data point is represented by its cluster center
§ DBSCAN -
§ Allows for the detection of “noise points” (i.e., data points that are not assigned to any cluster)
§ It can help automatically determine the number of clusters
§ Allows for complex cluster shape
§ Sometimes produces clusters of very differing size, which can be
a strength or a weakness
§ Agglomerative clustering -
§ Provides a whole hierarchy of possible partitions of the data
§ which can be easily inspected via dendrograms
Thank you


Machine Learning - Implementation with Python - 3.pdf

  • 1. Machine Learning Source: Introduction to Machine Learning with Python Authors: Andreas C. Muller and Sarah Guido
  • 3. Agenda Introduction Types of Unsupervised Learning? Challenges in Unsupervised Learning Preprocessing and Scaling Clustering
  • 5. Introduction § Unsupervised learning includes all kinds of machine learning where there is no known output § No teacher to instruct the learning algorithm § The learning algorithm is just shown the input data and asked to extract knowledge from this data
  • 6. Types of Unsupervised learning Types of Unsupervised learning § Two kinds of Unsupervised learning § Transformations of the dataset § Clustering
  • 7. Types of Unsupervised learning Unsupervised transformations of a dataset § Algorithms that create a new representation of the data which might be easier for humans or other machine learning algorithms to understand compared to the original representation of the data.
  • 8. Types of Unsupervised learning § Application of unsupervised transformations is dimensionality reduction § which takes a high-dimensional representation of the data, consisting of many features, and finds a new way to represent this data that summarizes the essential characteristics with fewer features. Example: § Application for dimensionality reduction is reduction to two dimensions for visualization purposes
  • 9. Types of Unsupervised learning Unsupervised transformations of a dataset § Another application for unsupervised transformations is finding the parts or components that “make up” the data Example: Topic Extraction § The task is § to find the unknown topics that are talked about in each document § to learn what topics appear in each document § tracking the discussion of themes like elections, gun control, or pop stars on social media
  • 10. Types of Unsupervised learning Clustering Algorithms § Partition data into distinct groups of similar items EXAMPLE: § Uploading photos to a social media site
  • 11. Challenges in Unsupervised Learning • Evaluating whether the algorithm learned something useful • Unsupervised ML algorithms are applied to data that does not contain any label information ---> we don’t know what the right output should be • Very hard to say whether a model “did well” • There is no way for us to tell the algorithm what we are looking for and often the only way to evaluate the result of an unsupervised algorithm is to inspect it manually • Unsupervised algor ithms are used often in an exploratory setting --> when the data scientist wants to understand the data better, rather than as part of a larger automatic system • Common application for unsupervised algorithms is as a preprocessing step for supervised algorithms
  • 12. Preprocessing and Scaling Different Kinds of Preprocessing Applying Data Transformations Scaling Training and Test Data the Same Way The Effect of Preprocessing on Supervised Learning
  • 13. Preprocessing and Scaling Dimensionality Reduction, Feature Extraction and Manifold Learning Non-Negative Matrix Factorization Manifold Learning with t-SNE
  • 14. Preprocessing and Scaling § Neural networks and SVMs, are very sensitive to the scaling of the data § A common practice is to adjust the features so that the data representation is more suitable for these algorithms
  • 16. Different Kinds of Preprocessing § StandardScaler § The StandardScaler in scikit-learn ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. § Disadvantage: § This scaling does not ensure any particular minimum and maximum values for the features § RobustScaler § It ensures statistical properties for each feature that guarantee that they are on the same scale. § It uses the median and quartiles, instead of mean and variance § Advantage: § RobustScaler ignore data points that are very different from the rest (like measurement errors) § These odd data points are also called outliers, and can lead to trouble for other scaling techniques
  • 17. Different Kinds of Preprocessing § MinMaxScaler § It shifts the data such that all features are exactly between 0 and 1 § For a two-dimensional dataset this means all of the data is contained within the rectangle created by X-axis between 0 and 1 and the Y-axis between 0 and 1 § Normalizer § Scales each data point such that the feature vector has a Euclidean length of 1 § It projects a data point on the circle (or sphere, in the case of higher dimensions) with a radius of 1. § This normalization is often used when only the direction of the data matters, not the length of the feature vector.
  • 19. Applying Data Transformations § Transformations on Cancer dataset § Preprocessing methods like the scalers are usually applied before applying a supervised machine learning algorithm § Example: Apply kernel SVM (SVC) to the cancer dataset, and use MinMaxScaler for preprocessing the data STEP 1: Loading the dataset and splitting it into train and test set
  • 20. Applying Data Transformations STEP 2: Import the class and then instantiate it STEP 3: Fit the scaler using the fit method, applied to the training data
  • 21. STEP 4: Apply the transformation § i.e., scale the training data — we use the transform method of the scaler § The transform method is used in scikit-learn whenever a model returns a new representation of the data. Applying Data Transformations
  • 22. Applying Data Transformations STEP 5: Apply SVM to the scaled data § To apply the SVM to the scaled data, we also need to transform the test set
  • 23. Scaling Training and Test Data the SameWay • After scaling, the minimum and maximum are not 0 and 1 • Some of the features are even outside the 0–1 • MinMaxScaler (and all the other scalers) always applies exactly the same transformation to the training and the test set • i.e., the transform method always subtracts the training set minimum and divides by the training set range, which might be different from the minimum and range for the test set
  • 24. Scaling Training and Test Data the SameWay • It is important to apply exactly the same transformation to the training set and the test set for the supervised model to work on the test set
  • 27. Scaling Training and Test Data the SameWay • First panel: Unscaled two-dimensional dataset • The training set shown as circles and the test set shown as triangles • Second panel: Data is same but scaled using the MinMaxScaler • We called fit on the training set, and then called transform on the training and test sets. • The dataset in the second panel looks identical to the first only the ticks on the axes have changed. • The features are between 0 and 1 • The minimum and maximum feature values for the test data (the triangles) are not 0 and 1. • Third panel: Scaling the training set and test set separately • The minimum and maximum feature values for both the training and the test set are 0 and 1 • The test points moved incongruously to the training set, as they were scaled differently • The arrangement of the data is changed in an arbitrary way
  • 28. Scaling Training and Test Data the SameWay Note: Shortcuts and efficient alternatives: § Often, you want to fit a model on some dataset, and then transform it. § All models that have a transform method also have a fit_transform method. § While fit_transform is not necessarily more efficient for all models, it is still good practice to use this method when trying to transform the training set
  • 29. The Effect of Preprocessing on Supervised Learning
  • 30. The Effect of Preprocessing on Supervised Learning § STEP 1: Before scaling
  • 31. The Effect of Preprocessing on Supervised Learning § STEP 2: After applying scaling
  • 32. The Effect of Preprocessing on Supervised Learning § The effect of scaling the data is quite significant § Scaling the data doesn’t involve any complicated math but don’t tr y to reimplement them yourself § It is always a good practice to use the scaling mechanisms provided by scikit-learn
  • 33. The Effect of Preprocessing on Supervised Learning We can easily replace one preprocessing algorithm with another by changing the class we use ---> because all the preprocessing classes have the same interface
  • 35. Dimensionality Reduction, Feature Extraction and Manifold Learning § It is a statistical process § converts correlated features into a set of linearly uncorrelated features -> with the help of orthogonal transformation § PCA is used for exploratory data analysis and predictive modeling § Applications of PCA: § Image processing § Movie recommendation system § Dimensionality reduction technique in various AI applications such as computer vision, image compression, etc § finding hidden patterns if data has high dimensions
  • 36. Dimensionality Reduction, Feature Extraction and Manifold Learning § Transforming data using unsupervised learning can have many motivations § Compressing data § Finding a representation that is more informative for further processing (Feature Extraction) § Visualization § One of the simplest and most widely used algorithms for all of these is Principal Component Analysis for dimensionality reduction, feature extraction, feature selection, data compression, and data visualization § Non-negative matrix factorization (NMF) - for feature extraction § t-SNE - for visualization using two dimensional scatter plots
  • 37. Principal Component Analysis Principal Component Analysis § It is a method that rotates the dataset in a way such that § the rotated features are statistically uncorrelated § This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data
  • 39. Principal Component Analysis Principal Component Analysis § First Plot: § The first plot (top left) shows the original data points colored to distinguish among them § Step 1: § The algorithm proceeds by first finding the direction of maximum variance, labeled “Component 1”. § This direction (or vector) in the data that contains most of the information § i.e., the direction along which the features are most correlated with each other § Step 2: § The algorithm finds the direction that contains the most information while being orthogonal (at a right angle) to the first direction § Note: § In two dimensions, there is only one possible orientation that is at a right angle, § In higher-dimensional spaces there would be (infinitely) many orthogonal directions § We could have drawn the first component from the center up to the top left instead of down to the bottom right § Principal Components: § The directions found using this process are called principal components § They are the main directions of variance in the data § Note: § There are as many principal components as original features
  • 40. Principal Component Analysis Principal Component Analysis § The second plot § Step 3: § The mean was subtracted from the data, so that the transformed data is centered around zero § Step 4: § The first plot is rotated so that the first principal component aligns with the x-axis § The second principal component aligns with the y- axis § Note: § In the rotated representation, the two axes are uncorrelated § i.e., Correlation matrix of the data in this representation is zero except for the diagonal
  • 41. Principal Component Analysis Principal Component Analysis § PCA for Dimensionality Reduction: § We can use PCA for dimensionality reduction by retaining only some of the principal components § In this example, we might keep only the first principal component § The Third Plot: § Step 5: § Reduces the data from a two-dimensional dataset to a one-dimensional dataset § The Fourth Plot: § Step 6: § Undo the rotation and add the mean back to the data § These points are in the original feature space, but we kept only the information contained in the first principal component § Note: § This transformation is sometimes used to remove noise effects from the data or visualize what part of the information is retained using the principal components
  • 42. Principal Component Analysis Principal Component Analysis (Cancer Dataset) § One of the most common applications of PCA is visualizing high- dimensional datasets § Disadvantage of Scatter Plot: § It is hard to create scatter plots of data that has more than two features § Pair Plot: § A 2D categorical scatter plot that represents the pair wise relationship between the numerical variables § The Iris dataset ---> able to create a pair plot that gave us a partial picture of the data by showing us all the possible combinations of two features § Breast Cancer Dataset: § The Breast Cancer dataset, even using a pair plot is tricky § This dataset has 30 features, which would result in 30 * 14 = 420 scatter plots § Histograms: § Computing histograms of each of the features for the two classes, benign and malignant cancer
  • 45. Principal Component Analysis § Created Histogam for each feature - § Counting how often a data point appears with a feature in a certain range (called a bin) § Each plot overlays two histograms, one for all of the points in the benign class (blue) and one for all the points in the malignant class (red). § This gives us some idea of how each feature is distributed across the two classes § Allows us to guess as to which features are better at distinguishing malignant and benign samples § Example: § The feature “smoothness error” seems quite uninformative, because the two histograms mostly overlap § The feature “worst concave points” seems quite informative, because the histograms are quite disjoint § NOTE: • Histogram doesn’t show us anything about the interactions between variables and how these relate to the classes • Using PCA, we can capture the main interactions Principal Component Analysis (Cancer Dataset)
  • 47. Principal Component Analysis Principal Component Analysis (Cancer Dataset) • Learning the PCA transformation and applying it is as simple as applying a preprocessing transformation • We instantiate the PCA object, find the principal components by calling the fit method • Then apply the rotation and dimensionality reduction by calling transform
  • 48. Principal Component Analysis Principal Component Analysis (Cancer Dataset) • PCA only rotates and shifts the data, but keeps all the principal components • To reduce the dimensionality of the data, we need to specify how many components we want to keep when creating a PCA Object
  • 51. Principal Component Analysis Principal Component Analysis § Note: • PCA is an unsupervised method • does not use any class information when finding the rotation • A linear classifier (that would learn a line in this space) could do a reasonably good job at distinguishing the two classes.
  • 52. Principal Component Analysis Principal Component Analysis DRAWBACKS: § A downside of PCA is that the two axes in the plot are often not very easy to interpret § The principal components correspond to directions in the original data § The Principal components are combinations of the original features. Hence, these combinations are usually very complex
  • 53. Principal Component Analysis § The principal components themselves are stored in the components_ attribute § Rows in components_ corresponds to one principal component § they are sorted by their importance § The columns correspond to the original features attribute of the PCA Principal Component Analysis
  • 57. Eigenfaces for FeatureExtraction Eigenfaces for feature extraction § Another application of PCA that we mentioned earlier is feature extraction § The idea behind feature extraction is that it is possible to find a representation of your data that is better suited to analysis than the raw representation you were given § An application where feature extraction is helpful is with images § Images are made up of pixels, usually stored as red, green, and blue (RGB) intensities § Objects in images are usually made up of thousands of pixels, and only together are they meaningful
  • 58. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction EXAMPLE: LFW (Labeled Faces in the Wild) dataset § This dataset contains face images of celebrities downloaded from the Internet § It includes faces of politicians, singers, actors, and athletes from the early 2000s § We use grayscale versions of these images, and scale them down for faster processing
  • 60. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction • There are 3,023 images, each 87×65 pixels large, belonging to 62 different people • The dataset is a bit skewed, containing a lot of images of George W. Bush and Colin Powell
  • 62. Eigenfacesfor FeatureExtraction EXAMPLE: Eigenfaces for feature extraction • To make the data less skewed, we will only take up to 50 images of each person
  • 63. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction • Face Recognition - • Ask if a previously unseen face belongs to a known person from a database • An eigenfac is the name given to a set of eigenvectors when used in the computer vision problem of human face recognition • The eigenface approach searches for a low-dimensional representation of face images • Applications of Face Recognition: § Photo collection § Social media § Security applications § Solution for Face Recognition: § To build a classifier § where each person is a separate class § Usually many different people in face databases, and very few images of the same person § That makes it hard to train most classifiers § Simple solution is to use a one-nearest-neighbor classifier § looks for the most similar face image to the face you are classifying § This classifier could in principle work with only a single training example per class
  • 64. Eigenfaces for FeatureExtraction Eigenfaces for feature extraction • We obtain an accuracy of 26.6%, which is not actually that bad for a 62-class classification problem • Random guessing would give you around 1/62 = 1.6% accuracy • But here, we only correctly identify a person every fourth time
  • 65. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § Use of PCA - § Computing distances in the original pixel space is Quite a bad way to measure similarity between faces § When using a pixel representation to compare two images, we compare the grayscale value of each individual pixel to the value of the pixel in the corresponding position in the other image § This representation is quite different from how humans would interpret the image of a face and it is hard to capture the facial features using this raw representation § Using pixel distances means that shifting a face by one pixel to the right corresponds to a drastic change, with a completely different representation § Using distances along principal components can improve our accuracy § Whitening option of PCA: § Whitening = Rotation + Rescaling § Rescales the principal components to have the same scale § Whitening corresponds to not only rotating the data, but also rescaling it so that the center panel is a circle instead of an ellipse
  • 67. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § We fit the PCA object to the training data and extract the first 100 principal components. § The new data has 100 features, the first 100 principal components.
  • 68. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § Use the new representation to classify our images using a one-nearest-neighbors classifier § Accuracy improved quite significantly, from 26.6% to 35.7% § For Image data, Components correspond to directions in the input space § The input space here is 87×65-pixel grayscale images, so directions within this space are also 87×65-pixel grayscale images.
  • 71. Eigenfacesfor FeatureExtraction § We cannot understand all aspects of these components in the images § First Component - § seems to mostly encode the contrast between the face and the background. § Second Component - § encodes differences in lighting between the right and the left half of the face, and so on
  • 72. Eigenfacesfor FeatureExtraction • As the PCA model is based on pixels • the alignment of the face and the lighting both have a strong influence on how similar two images are in their pixel representation • These properties i.e., Alignment and lighting are probably not what a human would perceive first • When asking people to rate similarity of faces, they are more likely to use attributes like age, gender, facial expression, and hair style, which are attributes that are hard to infer from the pixel intensities • Algorithms often interpret data (particularly visual data, such as images, which humans are very familiar with) quite differently from how a human would
  • 73. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § PCA transformation = rotating the data + dropping the components with low variance § Another Trick for PCA Transformation - Express the test points as a weighted sum of the principal components § Try to find some numbers (Coefficients x0 , x1 , etc.,) (the new feature values after the PCA rotation) and express the test points as a weighted sum of the principal components § the reconstructions of the original data using only some components § A similar transformation for the faces by reducing the data to only some principal components and then rotating back into the original space.
  • 74. Eigenfacesfor FeatureExtraction Eigenfaces for feature extraction § Reconstruction of the original data - § The reconstructions of the original data using only some components (Example Fig 3.3) § Reducing the data to only some principal components and then rotating back into the original space § This return of the original feature space can be done using the inverse_transform method
  • 76. Eigenfacesfor featureextraction • Reconstructing three face images using increasing numbers of principal components • First 10 principal components - • only the essence of the picture, like the face orientation and lighting, is captured • Using more and more principal components, more and more details in the image are preserved • Using as many components as there are pixels would mean that we would not discard any information after the rotation, and we would reconstruct the image perfectly
  • 78. Non-Negative Matrix Factorization § Non-negative matrix factorization is another unsupervised learning algorithm that aims to extract useful features § It works similarly to PCA § It can also be used for dimensionality reduction § Write each data point as a weighted sum of some components § PCA - § wants components that were orthogonal and that explained as much variance of the data as possible § NMF - § wants the components and the coefficients to be non- negative § i.e., both the components and the coefficients to be greater than or equal to zero § This method can only be applied to data where each feature is non-negative --> as a non-negative sum of non- negative components cannot become negative
  • 79. Non-Negative Matrix Factorization § Process of decomposing data into a non-negative weighted sum is particularly helpful for - data that is created as the addition (or overlay) of several independent sources § audio track of multiple people speaking § music with many instruments § In these situations, NMF can identify the original components that make up the combined data § NMF leads to more interpretable components than PCA § as negative components and coefficients can lead to hard-to- interpret cancellation effects
  • 80. Non-Negative Matrix Factorization Applying NMF to synthetic data § we need to ensure that our data is positive for NMF to be able to operate on the data § Data lies relative to the origin (0, 0) actually matters for NMF.
  • 81. Non-Negative Matrix Factorization Applying NMF to synthetic data § Left Component § It is clear that all points in the data can be written as a positive combination of the two components. § If there are enough components to perfectly reconstruct the data (as many components as there are features) § The algorithm will choose directions that point toward the extremes of the data. § Right Component § NMF creates a component that points toward the mean, as pointing there best explains the data § reducing the number of components not only removes some directions, but creates an entirely different set of components
  • 82. Non-Negative Matrix Factorization Applying NMF to synthetic data § NMF are § not ordered in any specific way § all components play an equal part § Randomness: § NMF uses a random initialization, which might lead to different results depending on the random seed § data with two components, where all the data can be explained perfectly, the randomness has little effect § In more complex situations, there might be more drastic changes
  • 83. Non-Negative Matrix Factorization Applying NMF to face images § LFW dataset § Main parameter of NMF is how many components we want to extract § Usually this is lower than the number of input features § Number of components impacts how well the data can be reconstructed using NMF
  • 86. Non-Negative Matrix Factorization Applying NMF to face images • The quality of the back-transformed data is similar to when using PCA, but slightly worse • PCA - • finds the optimum directions in terms of reconstruction • NMF - • is usually not used for its ability to reconstruct or encode data, but rather for finding interesting patterns within the data • These components are all positive, and so resemble prototypes of faces much more so than the components shown for PCA • Component 3 - • shows a face rotated somewhat to the right • Component 7 - • shows a face somewhat rotated to the left.
  • 89. Non-Negative Matrix Factorization Applying NMF to face images • Faces that have a high coefficient for component 3 are faces looking to the right (Figure 3-16) • Faces with a high coefficient for component 7 are looking to the left (Figure 3-17) • Extracting patterns like these works best for data with additive structure • audio • gene expression • text data
  • 91. Non-Negative Matrix Factorization Applying NMF to face images We can use NMF to recover the three signals
  • 94. Manifold Learning with t-SNE § Advantages of PCA; § PCA is often a good first approach for transforming the data so that we might be able to visualize it using a scatter plot § Disadvantages of PCA: § The nature of the method (applying a rotation and then dropping directions) limits its usefulness
  • 95. Manifold Learning with t-SNE § Manifold learning - § reduces the dimensinality of high-dimensional data by assuming that the data is embedded in a lower dimentional nonlinear manifold § Class of algorithms for visualization called manifold learning algorithms § allow for much more complex mappings § often provide better visualizations § many algorithms exist - § LLE (Locally Linear Embedding), Isomap, SE(Spectral Embedding), t-SNE (T-distributed Stochastic Neighbor Embedding § useful one is the t-SNE algorithm § Manifold learning algorithms are mainly aimed at visualization § rarely used to generate more than two new features § t-SNE compute a new representation of the training data, but don’t allow transformations of new data § these algorithms cannot be applied to a test set § Manifold learning can be useful for exploratory data analysis
  • 96. Manifold Learning with t-SNE § Idea behind t-SNE - § Machine learning algorithm that is used to visualize high dimensional data in two or three dimensions § embeds high dimensional points into lower dimensions § find a two-dimensional representation of the data that preserves the distances between points as best as possible § t-SNE § Starts with a random two dimensional representation for each data point § Tries to make points that are close in the original feature space closer § Tries data points that are far apart in the original feature space farther apart
  • 97. Manifold Learning with t-SNE § t-SNE § puts more emphasis on points that are close by rather than preserving distances between far-apart points § i.e., it tries to preserve the information indicating which points are neighbors to each other § EXAMPLE: § Handwritten digits § data point in this dataset is an 8×8 grayscale image of a handwritten digit between 0 and 9.
  • 98. Manifold Learning with t-SNE Applying PCA to HandWritten Digits § we actually used the true digit classes as characters, to show which class is where. § The digits zero, six, and four are relatively well separated using the first two principal components, though they still overlap. § Most of the other digits overlap significantly.
  • 99. Manifold Learning with t-SNE Applying PCA to HandWritten Digits § PCA to visualize the data reduced to two dimensions. § We plot the first two principal components, and represent each sample with a digit corresponding to its class.
  • 100. Manifold Learning with t-SNE Applying NMF to HandWritten Digits § we actually used the true digit classes as glyphs, to show which class is where. § The digits zero, six, and four are relatively well separated using the first two principal components, though they still overlap § Most of the other digits overlap significantly
  • 101. Manifold Learning with t-SNE Scatter Plot using PCA to HandWritten Digits
  • 102. Manifold Learning with t-SNE Applying t-SNE to HandWritten Digits § t-SNE does not support transforming new data, the TSNE class has no transform method § we can call the fit_transform method § build the model and immediately return the transformed data.
  • 104. Manifold Learning with t-SNE Applying t-SNE to HandWritten Digits § The result of t-SNE is quite remarkable § All the classes are quite clearly separated § The ones and nines are somewhat split up, but most of the classes form a single dense group § This method has no knowledge of the class labels: it is completely unsupervised § It can find a representation of the data in two dimensions that clearly separates the classes, based solely on how close points are in the original space. § The t-SNE algorithm has some tuning parameters § though it often works well with the default settings. § perplexity - § controls the effective number of neighbors that each point considers during the dimensionality reduction process § early_exaggeration - § Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them § learning_rate § max_iter etc.,
  • 106. § Clustering is the task of partitioning the dataset into groups, called clusters § GOAL:The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different § Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to Clustering
  • 109. Clustering K-Means Clustering § k-means clustering is one of the simplest and most commonly used clustering algorithms § It tr ies to f ind cluster centers that are representative of certain regions of the data § The algorithm alternates between two steps: § Step 1: Assigning data point to cluster § Assigning each data point to the closest cluster center § Step 2: Recalculation of cluster center § Setting each cluster center as the mean of the data points that are assigned to it § The algorithm is finished when the assignment of instances to clusters no longer changes
  • 111. Clustering K-Means Clustering § Cluster centers are shown as triangles § Data points are shown as circles § Colors indicate cluster membership § Three clusters - so the algorithm was initialized by declaring three data points randomly as cluster centers (Initialization) § Then the iterative algorithm starts § First, each data point is assigned to the cluster center it is closest to. (Assignpoints1) § The cluster centers are updated to be the mean of the assigned points (Recompute Centers (1)) § Then the process is repeated two more times. After the third iteration, the assignment of points to cluster centers remained unchanged, so the algorithm stops.
  • 112. Clustering K-Means Clustering § Given new data points, k-means will assign each to the closest cluster center.
  • 114. K-Means Clustering § Each training data point in X is assigned a cluster label § Find these labels in the kmeans.labels_ attribute § we asked for three clusters, the clusters are numbered 0 to 2
  • 115. K-Means Clustering § We can also assign cluster labels to new points, using the predict method § Each new point is assigned to the closest cluster center when predicting, but the existing model is not changed § Running predict on the training set returns the same result as labels_.
  • 116. Clustering K-Means Clustering § Clustering is somewhat similar to Classification § The labels themselves have no a priori meaning § Example 1 - Two dimensional toy dataset § we should not assign any significance to the fact that one group was labeled 0 and another one was labeled 1. § Note 1 - § Running the algorithm again might result in a different numbering of clusters because of the random nature of the initialization § Note 2 - § The cluster centers are stored in the cluster_centers_ attribute § Example 2 - Clustering face images § It might be that the cluster 3 found by the algorithm contains only faces of Bela.You can only know that after you look at the pictures § The number 3 is arbitrary § The only information the algorithm gives you is that all faces labeled as 3 are similar
  • 118. K-Means Clustering § We can also use more or fewer cluster centers
  • 119. K-Means Clustering Failure cases of k-means § Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to recover them § Each cluster is defined solely by its center, which means that each cluster is a convex shape § k-means can only capture relatively simple shapes § k-means also assumes that all clusters have the same “diameter” - § It always draws the boundary between clusters to be exactly in the middle between the cluster centers
  • 121. K-Means Clustering § Three clusters - Cluster 0, cluster 1, cluster 2 § cluster 0 and cluster 1 have some points that are far away from all the other points in these clusters that “reach” toward the center
  • 122. K-Means Clustering § k-means also assumes that all directions are equally important for each cluster § The following plot (Figure 3-28) shows a two- dimensional dataset where there are three clearly separated parts in the data § However, these groups are stretched toward the diagonal § As k-means only considers the distance to the nearest cluster center, it can’t handle this kind of data
  • 125. K-Means Clustering • k-means also performs poorly if the clusters have more complex shapes, like the two_moons • Here, we would hope that the clustering algorithm can discover the two halfmoon shapes • However, this is not possible using the k-means algorithm
  • 126. K-Means Clustering Vector quantization § Even though k-means is a clustering algorithm, there are interesting parallels between k-means and the decomposition methods like PCA and NMF § PCA tries to find directions of maximum variance in the data § NMF tries to find additive components, which often correspond to “extremes” or “parts” of the data § Both methods tried to express the data points as a sum over some components § k-means, on the other hand, tries to represent each data point using a cluster center § In k-means each point being represented using only a single component, which is given by the cluster center
  • 127. K-Means Clustering Vector quantization § This view of k-means as a decomposition method, where each point is represented using a single component, is called vector quantization § Comparison of PCA, NMF, and k-means § showing the components extracted , as well as reconstructions of faces from the test set using 100 components § For k-means, the reconstruction is the closest cluster center found on the training set
  • 131. K-Means Clustering § One interesting advantage of k-means - § An interesting aspect of vector quantization using k- means is that we can use many more clusters than input dimensions to encode our data § Example: two_moons data § Using PCA or NMF, there is nothing much we can do to this data, as it lives in only two dimensions § Reducing it to one dimension with PCA or NMF would completely destroy the structure of the data § But we can find a more expressive representation with k-means, by using more cluster centers
  • 134. K-Means Clustering § We used 10 cluster centers - means each point is now assigned a number between 0 and 9 § We can see this as the data being represented using 10 components (that is, we have 10 new features) § Using this 10-dimensional representation, it would now be possible to separate the two half-moon shapes using a linear model, which would not have been possible using the original two features
  • 136. K-Means Clustering Advantages § k-means is a very popular algorithm for clustering § Relatively easy to understand and implement § It runs relatively quickly § The k-means clustering algorithm is guaranteed to give results (Convergence) § It is not specific to particular problems. (i.e., can be applied for numerical data to text) (Generalization) § k-means scales easily to large datasets § NOTE: § MiniBatchKMeans class - can handle very large datasets
  • 137. K-Means Clustering Disadvantages § It relies on a random initialization, which means the outcome of the algorithm depends on a random seed § Deciding on the number of clusters to start is difficult (can use elbow method) § Choice of initial centroids is difficult § Effect of outliers § Curse of dimentionality § Preprocessing is mandatory § Restrictive assumptions made on the shape of clusters § The requirement to specify the number of clusters you are looking for (which might not be known in a real-world application)
  • 139. Agglomerative Clustering § Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same principles: § The algorithm starts by declaring each point its own cluster § then merges the two most similar clusters until some stopping criterion is satisfied § The stopping criterion - § Number of clusters § so similar clusters are merged until only the specified number of clusters are left § Most similar cluster is identified by considering several linkage criteria § This measure is always defined between two existing clusters
  • 140. Agglomerative Clustering § The following three choices (Linkages) are implemented in scikit-learn: § Ward § The default choice § Ward picks the two clusters to merge such that the variance within all clusters increases the least § This often leads to clusters that are relatively equally sized § Average § Merges the two clusters that have the smallest average distance between all their points § Complete § Also known as maximum linkage § Merges the two clusters that have the smallest maximum distance between their points § Note 1- § Ward works on most datasets § Note 2 - § If the clusters have very dissimilar numbers of members, average or complete might work better
  • 141. Agglomerative Clustering § This plot illustrates the progression of agglomerative clustering on a two- dimensional dataset, looking for three clusters.
  • 142. Agglomerative Clustering § Initially, each point is its own cluster § Then, in each step, the two clusters that are closest are merged § In the first four steps, two single-point clusters are picked and these are joined into two-point clusters § In step 5, one of the two-point clusters is extended to a third point, and so on § In step 9, there are only three clusters remaining § As we specified that we are looking for three clusters, the algorithm then stops
  • 143. Agglomerative Clustering § Because of the way the algorithm works, agglomerative cluster ing cannot make predictions for new data points § AgglomerativeClustering has no predict method § To bu i l d t h e m o d e l a n d ge t t h e c l u s t e r memberships on the training set, use the fit_predict method instead
  • 145. Agglomerative Clustering § While the scikit-learn implementation of agglomerative clustering requires to specify the number of clusters § Agglomerative clustering methods provide some help with choosing the right number of clusters
  • 146. Agglomerative Clustering Hierarchical clustering and dendrograms § Agglomerative clustering produces what is known as a hierarchical clustering § The clustering proceeds iteratively § Every point makes a journey from being a single point cluster to belonging to some final cluster § Each intermediate step provides a clustering of the data (with a different number of clusters)
  • 147. Agglomerative Clustering The following figure shows an overlay of all the possible clusterings shown in Figure 3-33, providing some insight into how each cluster breaks up into smaller clusters.
  • 148. Agglomerative Clustering Hierarchical Clustering and Dendograms § Hierarchical clustering relies on the two- dimensional nature of the data § Hierarchical clustering cannot be used on datasets that have more than two features § Dendograms - § Another tool to visualize hierarchical clustering, called a dendrogram § can handle multidimensional datasets
  • 149. Agglomerative Clustering Hierarchical Clustering and Dendograms § The dendrogram is a tree-like structure that is mainly used to store each step § scikit-learn currently does not have the functionality to draw dendrograms § Dendograms can be generated easily using SciPy
  • 150. Agglomerative Clustering SciPy vs scikitlearn § SciPy clustering algorithms have a slightly different interface to the scikit- learn clustering algorithms § SciPy provides a function that § Takes a data array X § Computes a linkage array, which encodes hierarchical cluster similarities
  • 151. Agglomerative Clustering § We can then feed this linkage array into the scipy dendrogram function to plot the dendrogram
  • 153. Agglomerative Clustering § The dendrogram § shows data points as points on the bottom (i.e.,X- axis) (numbered from 0 to 11) § shows Cluster Distance on Y- axis § Then, a tree is plotted with these points (representing single-point clusters) as the leaves, and a new node parent is added for each two clusters that are joined § Reading from bottom to top, the data points 1 and 4 are joined first (as you could see in Figure 3-33). § Next, points 6 and 9 are joined into a cluster, and so on. At the top level, there are two branches, one consisting of points 11, 0, 5, 10, 7, 6, and 9, and the other consisting of points 1, 4, 3, 2, and 8. § These correspond to the two largest clusters
  • 154. Agglomerative Clustering § The y-axis in the dendrogram § specifies when two clusters get merged? § The length of each branch also shows how far apart the merged clusters are § The longest branches in this dendrogram are the three lines that are marked by the dashed line labeled “three clusters” § Going from three to two clusters meant merging some very far-apart points § At the top of the chart, where merging the two remaining clusters into a single cluster again bridges a relatively large distance
  • 155. Agglomerative Clustering Drawbacks of Agglomerative Clustering § Fails at separating complex shapes (Example: two_moons )
  • 157. DBSCAN DBSCAN § DBSCAN - Density-Based Spatial Clustering of Applications with Noise § Another very useful clustering algorithm § Benefits: § It does not require the user to set the number of clusters a priori § It can capture clusters of complex shapes § It can identify points that are not part of any cluster § Drawbacks: § Somewhat slower than agglomerative clustering and k- means, but still scales to relatively large datasets
  • 158. DBSCAN DBSCAN § Functionality: § DBSCAN works by identifying points that are in “crowded” regions of the feature space, where many data points are close together § These regions are referred to as dense regions in feature space § The idea behind DBSCAN is that clusters form dense regions of data, separated by regions that are relatively empty
  • 159. DBSCAN DBSCAN § Core Samples: § Points that are within a dense region are called core samples § Also called as core points § Parameters to identify core samples: § min_samples § eps § If there are at least min_samples many data points within a distance of eps to a given data point, that data point is classified as a core sample § Core samples that are closer to each other than the distance eps are put into the same cluster by DBSCAN
  • 160. DBSCAN DBSCAN Algorithm § Step 1: Picks an arbitrary point to start with § Step 2: Finds all points with distance eps or less from that point § Step 3: If there are less than min_samples points within distance eps of the starting point - this point is labeled as noise (i.e., it doesn’t belong to any cluster) § Step 4: If there are more than min_samples points within a distance of eps, the point is labeled a core sample - assigned a new cluster label § Step 5: All neighbors (within eps) of the point are visited § Step 5.1: If they have not been assigned a cluster yet, they are assigned the new cluster label that was just created § Step 5.2: If they are core samples, their neighbors are visited in turn, and so on. § Step 5.3: The cluster grows until there are no more core samples within distance eps of the cluster
  • 161. DBSCAN § Step 6: Another point that hasn’t yet been visited is picked, and the same procedure is repeated § Finally we end up with three kinds of points § Core points § Boundary Points - Points that are within distance eps of core points (called boundary points) § Noise - Points that do not belong to any cluster §
  • 162. DBSCAN § Note 1: § When the DBSCAN algorithm is run on a particular dataset multiple times, there will not be any change in Core points and Noise § (i.e., the clustering of the core points is always the same, and the same points will always be labeled as noise) § Note 2: § When the DBSCAN algorithm is run on a par ticular dataset multiple times, the boundary points may change § i.e., A boundary point might be neighbor to core samples of more than one cluster. § Note 3: § The cluster membership - of boundary points depends on the order in which points are visited
  • 163. DBSCAN DBSCAN on the synthetic dataset § DBSCAN does not allow predictions on new test data, so we will use the fit_predict method to perform clustering and return the cluster labels in one step
  • 164. DBSCAN
  • 165. DBSCAN § Points that belong to clusters are solid § Noise points are shown in white § Core samples are shown as large markers § Boundary points are displayed as smaller markers § Increasing eps (going from left to right in the figure) § means that more points will be included in a cluster § This makes clusters grow, but might also lead to multiple clusters joining into one § Increasing min_samples (going from top to bottom in the figure) § means that fewer points will be core points, and more points will be labeled as noise
  • 166. DBSCAN § Parameter eps: § most important parameter § it determines what it means for points to be “close” § Very small eps - § means - NO points are core samples § Leads to - all points being labeled as noise § Very large eps - § All points forming a single cluster § Parameter min_samples: § The min_samples - mostly determines whether points in less dense regions will be labeled as outliers or as their own clusters § Large min_samples - many samples will now be labeled as noise § determines the minimum cluster size
  • 167. DBSCAN § Note 1: § While DBSCAN doesn’t require setting the number of clusters explicitly, setting eps implicitly controls how many clusters will be found § Note 2: § Finding a good setting for eps is sometimes easier after scaling the data using StandardScaler or MinMaxScaler
  • 168. DBSCAN DBSCAN on the two_moons dataset § The algorithm actually finds the two half-circles and separates them using the default settings.
  • 169. DBSCAN
  • 170. DBSCAN § The algorithm produces the desired number of clusters (two) § The default parameter settings (eps=0.5) therefore seem to work well § If we decrease eps to 0.2 we get eight clusters § Increasing eps to 0.7 results in a single cluster § When using DBSCAN, you need to be careful about handling the returned cluster assignments - the noise label –1 may cause unexpected effects if the labels are used to index another array
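  A sketch of the two_moons experiment described above; rescaling the data first mirrors the earlier note that a good eps is easier to find after scaling:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler

    X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance

    clusters = DBSCAN().fit_predict(X_scaled)      # default eps=0.5 finds both half-moons
    # DBSCAN(eps=0.2) splits the data into many small clusters, while
    # DBSCAN(eps=0.7) merges everything into a single cluster.
    print("Cluster memberships:", clusters)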
  • 171. DBSCAN Comparing and Evaluating Clustering Algorithms § Challenges in applying clustering algorithms - § it is very hard to assess how well an algorithm worked § it is very hard to compare outcomes between different algorithms
  • 172. DBSCAN Evaluating clustering with ground truth § Metrics to assess the outcome of a clustering algorithm relative to a ground truth clustering § Adjusted Rand Index (ARI) § Normalized Mutual Information (NMI) § Both provide a quantitative measure with an optimum of 1 § Unrelated clusterings score 0 § The ARI can also become negative § Compare the k-means, agglomerative clustering, and DBSCAN algorithms using ARI
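  A hedged sketch of such a comparison on the scaled two_moons data from the previous sketch (an ARI of 1 means the clustering matches the ground truth exactly):

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.metrics.cluster import adjusted_rand_score

    # X_scaled and y come from the two_moons sketch above
    for algorithm in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
        clusters = algorithm.fit_predict(X_scaled)
        print("{}: ARI = {:.2f}".format(type(algorithm).__name__,
                                        adjusted_rand_score(y, clusters)))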
  • 173. DBSCAN
  • 174. DBSCAN Common mistake when evaluating clustering § Using accuracy_score instead of adjusted_rand_score or normalized_mutual_info_score § The problem: accuracy requires the assigned cluster labels to exactly match the ground truth, but the cluster label values themselves are arbitrary - all that matters is which points end up in the same cluster
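  A small sketch of why accuracy is misleading here: the two label vectors below describe exactly the same grouping, only with the label names swapped:

    from sklearn.metrics import accuracy_score
    from sklearn.metrics.cluster import adjusted_rand_score

    clusters1 = [0, 0, 1, 1, 0]
    clusters2 = [1, 1, 0, 0, 1]            # same grouping, swapped label names

    print("Accuracy:", accuracy_score(clusters1, clusters2))   # 0.0 - misleading
    print("ARI:", adjusted_rand_score(clusters1, clusters2))   # 1.0 - identical clusterings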
  • 175. DBSCAN Evaluating clustering without ground truth § In practice, there is a big problem with using measures like ARI § In clustering applications there is usually no ground truth to which to compare the results § Metrics like ARI and NMI therefore § only help when developing algorithms § NOT when assessing success in an application § Silhouette coefficient - § another metric for clustering § does not require ground truth § computes the compactness of a cluster, where higher is better § Note 1: § compactness doesn’t allow for complex shapes § Note 2: § such metrics often don’t work well in practice
  • 176. DBSCAN Comparison using the silhouette score
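  A sketch of the comparison: silhouette_score needs the data itself in addition to the labels (and it raises an error if an algorithm produces only a single cluster):

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.metrics.cluster import silhouette_score

    # X_scaled comes from the two_moons sketch above
    for algorithm in [KMeans(n_clusters=2), AgglomerativeClustering(n_clusters=2), DBSCAN()]:
        clusters = algorithm.fit_predict(X_scaled)
        print("{}: silhouette = {:.2f}".format(type(algorithm).__name__,
                                               silhouette_score(X_scaled, clusters)))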
  • 177. DBSCAN Observations § k-means gets the highest silhouette score, even though we might prefer the result produced by DBSCAN § Better strategy for evaluating clusters: § use robustness-based clustering metrics § These run an algorithm § after adding some noise to the data, or § using different parameter settings § and then compare the outcomes § If many perturbations and parameter settings return the same result, it is likely to be trustworthy
  • 178. DBSCAN Face Images Example § Note: § Even with a very high silhouette score § we still don’t know if there is any semantic meaning in the clustering § or whether the clustering reflects an aspect of the data that we are interested in § Face images example: § Goal: find groups of similar faces, say men and women, old people and young people, or people with beards and without § Target: § cluster the data into two clusters § Drawbacks: § we still don’t know if the clusters that are found correspond in any way to the concepts we are interested in § the clustering might instead separate side views from front views, pictures taken at night from pictures taken during the day, or pictures taken with iPhones from pictures taken with Android phones § The only way to know whether the clustering corresponds to anything we are interested in is to analyze the clusters manually
  • 179. DBSCAN Comparing algorithms on the faces dataset § Use the eigenface representation of the data, as produced by PCA(whiten=True), with 100 components § This gives a more semantic representation of the face images than the raw pixels § It also makes computation faster
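  A sketch of this preprocessing step, assuming the Labeled Faces in the Wild data (fetch_lfw_people) used earlier in the book; the subsampling to at most 50 images per person follows that example:

    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA

    people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

    # keep at most 50 images of each person so frequent people don't dominate
    mask = np.zeros(people.target.shape, dtype=bool)
    for target in np.unique(people.target):
        mask[np.where(people.target == target)[0][:50]] = 1
    X_people = people.data[mask] / 255.        # scale pixel values to [0, 1]

    # 100-component eigenface representation with whitening
    pca = PCA(n_components=100, whiten=True, random_state=0)
    X_pca = pca.fit_transform(X_people)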
  • 180. DBSCAN Analyzing the faces dataset with DBSCAN § With default parameters, all the returned labels are –1 § i.e., all of the data was labeled as “noise” by DBSCAN § Possible fixes: § increase eps § to expand the neighborhood of each point § lower min_samples § to consider smaller groups of points as clusters
  • 181. DBSCAN § Lowering min_samples alone § Result: § everything is still labeled as noise
  • 182. DBSCAN § Increasing eps (to 15) § Result: § only one cluster (label 0) is formed, along with noise (–1), as sketched below
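  A sketch of this parameter exploration; X_pca comes from the PCA sketch above, and the values min_samples=3 and eps=15 follow the book's example (eps=15 is also quoted later in these slides):

    import numpy as np
    from sklearn.cluster import DBSCAN

    labels = DBSCAN().fit_predict(X_pca)                       # defaults
    print(np.unique(labels))                                   # [-1]: everything is noise

    labels = DBSCAN(min_samples=3).fit_predict(X_pca)          # lower min_samples alone
    print(np.unique(labels))                                   # still [-1]

    labels = DBSCAN(min_samples=3, eps=15).fit_predict(X_pca)
    print(np.unique(labels))                                   # [-1  0]: one cluster plus noise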
  • 183. DBSCAN § We can use this result to find out what the “noise” looks like compared to the rest of the data § 27 points are noise and 2,036 points are inside the cluster
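  Counting cluster sizes with a small sketch; the labels are shifted by one so that the noise label –1 can be counted with bincount:

    import numpy as np

    # bincount cannot handle negative values, so add 1 to the labels;
    # the first entry of the result then counts the noise points
    print("Points per label (noise first):", np.bincount(labels + 1))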
  • 185. DBSCAN § Why are these points considered noise? § the fifth image in the first row shows a person drinking from a glass § several images show people wearing hats § the last image shows a hand in front of the person’s face § the other images contain odd angles, or crops that are too close or too wide § We can do little about people in photos who are wearing hats, drinking, or holding something in front of their faces § Outlier detection: § this kind of analysis, trying to find “the odd one out”, is called outlier detection § Possible improvement: § do a better job of cropping the images
  • 186. DBSCAN § For more clusters: § we need to set eps to something smaller, somewhere between 15 and 0.5 (the default)
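  A sketch of scanning intermediate eps values to see how the number and size of the clusters change (X_pca and min_samples=3 as above; the particular grid of values is illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    for eps in [1, 3, 5, 7, 9, 11, 13]:
        labels = DBSCAN(min_samples=3, eps=eps).fit_predict(X_pca)
        n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
        print("eps={}: {} clusters, sizes {}".format(eps, n_clusters, np.bincount(labels + 1)))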
  • 187. DBSCAN
  • 188. DBSCAN
  • 189. DBSCAN Analyzing the faces dataset § Some of the clusters correspond to people with very distinct faces (within this dataset), such as Sharon or Koizumi § Within each cluster, the orientation of the face and the facial expression are also quite fixed § Some of the clusters contain faces of multiple people, but these share a similar orientation and expression § Note: § we are doing a manual analysis here § this is different from supervised learning, where we can rely on the R2 score or accuracy
  • 190. DBSCAN Analyzing the faces dataset with k-means § Disadvantage of DBSCAN on the faces dataset - § it is not possible to create more than one big cluster § Pros and cons of agglomerative clustering and k-means - § Pros - § can create more evenly sized clusters § Cons - § need to set a target number of clusters a priori § even setting the number of clusters to the number of people in the dataset is unlikely to recover all of them correctly § Solution - § start with a low number of clusters (e.g., 10) and analyze each of the clusters manually § increase the number of clusters if necessary
  • 191. DBSCAN § k-means - § partitioned the data into relatively similarly sized clusters, ranging from 64 to 386 points § This is quite different from the result of DBSCAN
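  A sketch of this k-means run (n_clusters=10 as suggested above, X_pca from the PCA sketch):

    import numpy as np
    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=10, random_state=0)
    labels_km = km.fit_predict(X_pca)
    print("Cluster sizes k-means:", np.bincount(labels_km))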
  • 192. DBSCAN § Visualization of the outcome of k-means § As we clustered in the representation produced by PCA, we need to rotate the cluster centers back into the original space to visualize them, using pca.inverse_transform.
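  Roughly as follows (a sketch; the plotting details are illustrative, and image_shape is assumed to be the shape of one face image, e.g. people.images[0].shape):

    import matplotlib.pyplot as plt

    # cluster centers live in the 100-dimensional PCA space;
    # inverse_transform maps them back into pixel space
    centers_pixels = pca.inverse_transform(km.cluster_centers_)

    image_shape = people.images[0].shape
    fig, axes = plt.subplots(2, 5, figsize=(12, 4),
                             subplot_kw={'xticks': (), 'yticks': ()})
    for center, ax in zip(centers_pixels, axes.ravel()):
        ax.imshow(center.reshape(image_shape), vmin=0, vmax=1)
    plt.show()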
  • 193. DBSCAN § The cluster centers found by k-means are very smooth versions of faces § Each center is an average of 64 to 386 face images § The clustering seems to pick up on § different orientations of the face § different expressions (the third cluster center seems to show a smiling face) § the presence of shirt collars (see the second-to-last cluster center).
  • 194. DBSCAN § A more detailed view - § Figure 3-44 shows, for each cluster center, § the five most typical images in the cluster - § the images assigned to the cluster that are closest to the cluster center § the five most atypical images in the cluster - § the images assigned to the cluster that are furthest from the cluster center
  • 195. DBSCAN
  • 196. DBSCAN § The third cluster picks up smiling faces § The other clusters mostly reflect face orientation § Atypical points - § are not very similar to the cluster centers § their assignment seems somewhat arbitrary § k-means partitions all the data points and has no concept of “noise” points § With a larger number of clusters, the algorithm could find finer distinctions § Note: § adding more clusters makes manual inspection even harder
  • 197. DBSCAN Analyzing the faces dataset with agglomerative clustering § Agglomerative clustering also produces § relatively equally sized clusters § with cluster sizes between 26 and 623 § More uneven than those produced by k-means § Much more even than the ones produced by DBSCAN
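  A sketch of the corresponding agglomerative run (ward linkage is scikit-learn's default):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    agglomerative = AgglomerativeClustering(n_clusters=10)
    labels_agg = agglomerative.fit_predict(X_pca)
    print("Cluster sizes agglomerative:", np.bincount(labels_agg))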
  • 198. DBSCAN § Compute the ARI - § to measure the similarity of the two partitions produced by agglomerative clustering and k-means § ARI = 0.13 § means that the two clusterings labels_agg and labels_km have little in common
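  A one-line sketch, using the label vectors from the two runs above:

    from sklearn.metrics.cluster import adjusted_rand_score

    print("ARI: {:.2f}".format(adjusted_rand_score(labels_agg, labels_km)))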
  • 199. DBSCAN § Dendrogram - § We’ll limit the depth of the tree in the plot, as branching down to the individual 2,063 data points would result in an unreadably dense plot.
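  A sketch of the truncated dendrogram, using SciPy's ward linkage and limiting the plotted depth with truncate_mode:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, ward

    linkage_array = ward(X_pca)            # hierarchical (ward) linkage on the PCA features
    plt.figure(figsize=(20, 5))
    # show only the top levels of the tree instead of all individual leaves
    dendrogram(linkage_array, p=7, truncate_mode='level', no_labels=True)
    plt.xlabel("Sample index")
    plt.ylabel("Cluster distance")
    plt.show()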
  • 200. DBSCAN Agglomerative with ten clusters § 10 clusters (Figure 3-46) § There is no notion of cluster center in agglomerative clustering, so some arbitrary data points from each cluster are shown instead § The number of points in each cluster is shown to the left of the first image
  • 201. DBSCAN
  • 202. DBSCAN § While some of the clusters seem to have a semantic theme, many of them are too large to be actually homogeneous § To get more homogeneous clusters - run the algorithm again, this time with 40 clusters
  • 203. DBSCAN
  • 204. DBSCAN § Some of the 40 clusters found by agglomerative clustering (Figure 3-47) pick up themes such as: § dark skinned and smiling § collared shirt § smiling woman § Hussein § high forehead § We could also find these highly similar clusters using the dendrogram
  • 205. DBSCAN Summary of Clustering Methods § Applying and evaluating clustering is a highly qualitative procedure § Most helpful in the exploratory phase of data analysis § Three clustering algorithms: § k-means § DBSCAN § Agglomerative clustering § All three have a way of controlling the granularity of clustering § k-means and agglomerative clustering allow us to specify the number of desired clusters § DBSCAN lets us define proximity using the eps parameter, which indirectly influences cluster size § All three methods § can be used on large, real-world datasets § are relatively easy to understand § allow for clustering into many clusters
  • 206. DBSCAN Summary of Clustering Methods § Strengths - § k-means - § allows for a characterization of the clusters using the cluster means § can also be viewed as a decomposition method, where each data point is represented by its cluster center § DBSCAN - § allows for the detection of “noise points” (i.e., data points that are not assigned to any cluster) § can help automatically determine the number of clusters § allows for complex cluster shapes § sometimes produces clusters of very different sizes, which can be a strength or a weakness § Agglomerative clustering - § provides a whole hierarchy of possible partitions of the data § easily inspected via dendrograms