2. Similarity-based Learning
• supervised learning technique
• predicts the class label of a test instance by gauging the similarity of this test
instance with training instances.
• refers to a family of instance-based learning methods used to solve both classification and regression problems.
Instance-based learning
• makes predictions by computing distances or similarities between the test instance and a specific set of training instances local to it, in an incremental process.
• considers only the nearest instance or instances to predict the class of unseen instances.
Similarity-based classification is useful in various fields such as image processing, text classification, pattern recognition, bioinformatics, data mining, information retrieval, and natural language processing.
3. Similarity-based learning
• also called Instance-based learning or Just-in-time learning, since it does not build an abstract model of the training instances and performs lazy learning when classifying a new instance.
• This learning mechanism simply stores all data and uses it only when it needs
to classify an unseen instance.
• The advantage of using this learning is that processing occurs only when a
request to classify a new instance is given.
• The drawback of this learning is that it requires a large memory to store the
data since a global abstract model is not constructed initially with the training
data.
Classification of instances is done based on the measure of similarity in the form of
distance functions over data instances.
Several distance metrics are used to estimate the similarity or dissimilarity between
instances required for clustering, nearest neighbor classification, anomaly detection,
and so on.
Popular distance metrics used are Hamming distance, Euclidean distance,
Manhattan distance, Minkowski distance, Cosine similarity, Mahalanobis
distance, Pearson’s correlation
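A minimal sketch of how a few of these metrics can be computed, assuming NumPy and SciPy are available; the vectors u and v are made-up example instances, not data from these slides.

```python
# Sketch: computing some of the distance metrics listed above with NumPy/SciPy.
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])   # illustrative instance
v = np.array([2.0, 0.0, 4.0])   # another illustrative instance

print(distance.euclidean(u, v))            # sqrt(sum((u - v)^2))
print(distance.cityblock(u, v))            # Manhattan: sum(|u - v|)
print(distance.minkowski(u, v, p=3))       # Minkowski distance of order p
print(distance.cosine(u, v))               # 1 - cosine similarity
print(distance.hamming([1, 0, 1], [1, 1, 1]))  # fraction of differing positions
```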
4. Differences Between Instance- and Model-based Learning
Instance-based Learning | Model-based Learning
Lazy learners | Eager learners
Processing of training instances is done only during the testing phase | Processing of training instances is done during the training phase
No model is built with the training instances before it receives a test instance | Generalizes a model with the training instances before it receives a test instance
Predicts the class of the test instance directly from the training data | Predicts the class of the test instance from the model built
Slow in the testing phase | Fast in the testing phase
Learns by making many local approximations | Learns by creating a global approximation
5. NEAREST-NEIGHBOR Algorithm
The K-Nearest Neighbours (KNN) algorithm is one of the simplest supervised machine learning
algorithms that is used to solve both classification and regression problems.
KNN is also known as an instance-based model or a lazy learner because it doesn’t construct an
internal model.
It is a simple and powerful non-parametric algorithm that predicts the category of the test instance from the ‘k’ training samples closest to it, assigning the category that occurs with the largest probability among those neighbors.
For classification problems, it will find the k nearest neighbors and predict the class by the
majority vote of the nearest neighbors.
For regression problems, it will find the k nearest neighbors and predict the value by calculating
the mean value of the nearest neighbors.
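A minimal from-scratch sketch of this idea, assuming NumPy is available; the function name knn_predict and its arguments are illustrative, not from any particular library.

```python
# Minimal k-NN sketch (illustrative, not an optimized implementation).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, task="classification"):
    # Euclidean distances from the test instance to every training instance
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_test, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest instances
    neighbors = [y_train[i] for i in nearest]
    if task == "classification":
        return Counter(neighbors).most_common(1)[0][0]   # majority vote of the neighbors
    return float(np.mean(neighbors))                     # mean of the neighbors' targets
```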
7. Problem 1 : Classification (Continuous Attributes)
Training Data:
Instance | x | y | Class
A | 1 | 2 | Red
B | 2 | 3 | Blue
C | 3 | 3 | Blue
D | 5 | 1 | Red
Test Instance (t): x = 2, y = 2
k = 3
9. Problem 1 : Solution
Step 1: Compute the Euclidean distance from t to each training instance
Instance | Distance | Class
A | 1.00 | Red
B | 1.00 | Blue
C | 1.41 | Blue
D | 3.16 | Red
Step 2: Sort by distance
Step 3: Choose the top k = 3 neighbors: A (Red), B (Blue), C (Blue)
Step 4: Majority vote
• Blue: 2 votes
• Red: 1 vote
Predicted class: Blue
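The same steps can be checked with a few lines of NumPy; this is a sketch using the table above, with illustrative variable names.

```python
# Checking Problem 1: Euclidean distances + majority vote of the 3 nearest neighbors.
import numpy as np
from collections import Counter

X = np.array([[1, 2], [2, 3], [3, 3], [5, 1]], dtype=float)
y = ["Red", "Blue", "Blue", "Red"]
t = np.array([2, 2], dtype=float)

d = np.linalg.norm(X - t, axis=1)     # [1.0, 1.0, 1.41, 3.16]
top3 = np.argsort(d)[:3]              # indices of A, B, C
print(Counter(y[i] for i in top3).most_common(1)[0][0])   # -> "Blue"
```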
10. Problem 2: Regression (Continuous Target)
Training Data:
Instance | x | y | Price
A | 1 | 2 | 200
B | 2 | 3 | 250
C | 3 | 5 | 300
D | 5 | 1 | 400
Test Instance (t): x = 2, y = 2
k = 2
Solution: the distances from t are A = 1.00, B = 1.00, C = 3.16, D = 3.16, so the k = 2 nearest neighbors are A and B, and the predicted price is their mean, (200 + 250) / 2 = 225.
Predicted Price: 225
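Assuming scikit-learn is available, the same prediction can be reproduced with KNeighborsRegressor (with uniform weights it averages the targets of the k nearest neighbors).

```python
# Reproducing Problem 2 with scikit-learn's k-NN regressor.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1, 2], [2, 3], [3, 5], [5, 1]], dtype=float)
y = np.array([200, 250, 300, 400], dtype=float)

model = KNeighborsRegressor(n_neighbors=2)   # k = 2, uniform weights (mean of neighbors)
model.fit(X, y)
print(model.predict([[2, 2]]))               # -> [225.]
```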
11. Problem 3: Classification (Categorical/Binary Features)
Training Data
Instance | Fever | Cough | Class
A | Yes | No | Flu
B | No | Yes | Cold
C | Yes | Yes | Flu
D | No | No | Healthy
Test Instance: Fever = Yes, Cough = Yes
Step 1: Compute Hamming distances (count the differing categorical feature values)
Instance | Hamming Distance | Class
A | 1 | Flu
B | 1 | Cold
C | 0 | Flu
D | 2 | Healthy
Step 2: Select the k = 3 nearest: C (0), A (1), B (1)
Step 3: Majority vote
•Flu: 2
•Cold: 1
Prediction: Flu
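A small sketch of the same computation with Hamming distance on the categorical features; the code is illustrative and the data is copied from the table above.

```python
# Problem 3 with a Hamming-distance k-NN over categorical features.
from collections import Counter

train = [
    (("Yes", "No"),  "Flu"),
    (("No",  "Yes"), "Cold"),
    (("Yes", "Yes"), "Flu"),
    (("No",  "No"),  "Healthy"),
]
test = ("Yes", "Yes")

def hamming(a, b):
    # number of positions where the categorical values differ
    return sum(x != y for x, y in zip(a, b))

neighbors = sorted(train, key=lambda item: hamming(item[0], test))[:3]   # C, A, B
print(Counter(label for _, label in neighbors).most_common(1)[0][0])     # -> "Flu"
```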
12. Problem 4
Training Data
Test Instance (t): height = 150 cm, weight = 61
k = 3
Using k-NN, classify the given test instance.
13. Problem 5
Consider the student performance training dataset. Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, classify the test instance considering k = 3.
14. WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM
Weighted k-NN is an extension of k-NN in which the selected neighbors contribute to the decision according to a weight derived from their distance.
The k-Nearest Neighbor (k-NN) algorithm has some serious limitations, as its performance depends entirely on the choice of the k nearest neighbors, the distance metric used and the decision rule.
The principal idea of Weighted k-NN is that neighbors close to the test instance are assigned a higher weight in the decision than neighbors that are farther away from the test instance.
The weights are inversely proportional to the distances.
The selected k nearest neighbors can be assigned uniform weights, meaning all instances in the neighborhood are weighted equally, or weights can be assigned by the inverse of their distance.
In the second case, closer neighbors of a query point have a greater influence than neighbors that are farther away.
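A minimal sketch of inverse-distance vote weighting; the function name and arguments are illustrative. In scikit-learn, the same behaviour can be requested with KNeighborsClassifier(weights="distance").

```python
# Weighted k-NN sketch: each neighbor's vote is weighted by 1/distance.
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_test, k=3, eps=1e-9):
    d = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_test, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                    # indices of the k closest instances
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] + eps)    # closer neighbors get larger weight
    return max(votes, key=votes.get)               # class with the highest total weight
```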
16. Problem 1 : Classification (Continuous Attributes)
Training Data:
Instance | x | y | Class
A | 1 | 2 | Red
B | 2 | 3 | Blue
C | 3 | 3 | Blue
D | 5 | 1 | Red
Test Instance (t): x = 2, y = 2
k = 3
19. Problem 1 : Solution
Step 1: Euclidean distances from t: A = 1.00, B = 1.00, C = 1.41, D = 3.16
Step 2: Choose the k = 3 nearest neighbors: A, B, C
Step 3: Weight each neighbor by the inverse of its distance: A = 1.00 (Red), B = 1.00 (Blue), C = 0.71 (Blue)
Step 4: Total weight per class: Red = 1.00, Blue = 1.71
Predicted Class: Blue (higher total weight)
20. Problem 2: Regression (Continuous Target)
Training Data:
Instance | x | y | Price
A | 1 | 2 | 200
B | 2 | 3 | 250
C | 3 | 5 | 300
D | 5 | 1 | 400
Test Instance (t): x = 2, y = 2
k = 3
21. Problem 2 : Solution
Step 1: Euclidean distances from t: A = 1.00, B = 1.00, C = 3.16, D = 3.16
Step 2: Choose the k = 3 nearest neighbors: A, B and C (C and D are equidistant; C is taken)
Step 3: Weight each neighbor by the inverse of its distance: A = 1.00, B = 1.00, C = 0.316
Step 4: Weighted mean price = (1.00 × 200 + 1.00 × 250 + 0.316 × 300) / (1.00 + 1.00 + 0.316) ≈ 235.23
Predicted Price: ≈ 235.23
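A short NumPy check of this calculation; this is a sketch in which the stable sort simply keeps C ahead of D on the distance tie.

```python
# Checking Problem 2 (weighted k-NN regression) with inverse-distance weights.
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 5], [5, 1]], dtype=float)
prices = np.array([200, 250, 300, 400], dtype=float)
t = np.array([2, 2], dtype=float)

d = np.linalg.norm(X - t, axis=1)        # [1.0, 1.0, 3.16, 3.16]
idx = np.argsort(d, kind="stable")[:3]   # A, B, C (C kept ahead of D on the tie)
w = 1.0 / d[idx]                         # inverse-distance weights
print(np.sum(w * prices[idx]) / np.sum(w))   # -> ~235.2
```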
22. Problem 5
Consider the student performance training dataset. Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, classify the test instance considering k = 3 using Weighted k-NN.
23. NEAREST CENTROID CLASSIFIER
It is a simple classifier, also called the Mean Difference classifier. The idea of this classifier is to assign a test instance to the class whose centroid (mean) is closest to that instance.
Algorithm
• Input: Training dataset T, distance metric d, test instance t
• Output: Predicted class/category
Steps:
1. Compute the mean (centroid) of each class
2. Compute the Euclidean distance between the test instance and each centroid
3. Predict the class with the smallest distance
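A minimal sketch of these steps in plain NumPy; the function name is illustrative. scikit-learn also provides sklearn.neighbors.NearestCentroid, which implements this classifier.

```python
# Nearest centroid classifier sketch: one centroid per class, pick the closest.
import numpy as np

def nearest_centroid_predict(X_train, y_train, x_test):
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train)
    x = np.asarray(x_test, dtype=float)
    # mean (centroid) of the training instances in each class
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    # predict the class whose centroid has the smallest Euclidean distance to x
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - x))
```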
24. Problem 1
Consider the training dataset below. Given a test instance t = (4, 4), classify the test instance using the nearest centroid classifier.
Instance | X1 | X2 | Class
A1 | 1 | 2 | C1
A2 | 2 | 3 | C1
A3 | 3 | 3 | C1
B1 | 6 | 5 | C2
B2 | 7 | 7 | C2
B3 | 8 | 6 | C2
25. Problem 1 : Solution
Centroid of C1 = ((1 + 2 + 3)/3, (2 + 3 + 3)/3) = (2, 2.67); distance to t = √((4 − 2)² + (4 − 2.67)²) ≈ 2.4
Centroid of C2 = ((6 + 7 + 8)/3, (5 + 7 + 6)/3) = (7, 6); distance to t = √((4 − 7)² + (4 − 6)²) ≈ 3.6
Since 2.4 < 3.6, classify test instance t = (4, 4) as Class C1.
26. Problem 2
Consider the training dataset. Given a test instance t = (6, 5), classify the test instance using the nearest centroid classifier.
27. Problem 3
Consider the training dataset below. Given a test instance t = (3, 2.5), classify the test instance using the nearest centroid classifier.
x | y | Class
1 | 1 | Cat
2 | 2 | Cat
6 | 5 | Dog
7 | 6 | Dog
28. Locally Weighted Regression (LWR)
Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs local regression by combining a regression model with a nearest-neighbor model.
LWR is also referred to as a memory-based method, as it requires the training data at prediction time.
The key idea is to fit, around each query point, a linear function of its ‘k’ nearest neighbors that minimizes the error; the resulting prediction is no longer a single straight line but a curve.
Ordinary linear regression, in contrast, finds one global linear relationship between the input x and the output y.
29. Locally Weighted Regression (LWR)
1. Given: a training dataset T
2. Training set: {(xᵢ, yᵢ)}, i = 1, ..., n
3. The standard linear regression hypothesis function is given by: h_β(x) = β₀ + β₁x₁ + ... + βₙxₙ = βᵀx
31. Locally Weighted Regression (LWR)
5. Compute the weighted cost function J(β) = Σᵢ wᵢ (yᵢ − h_β(xᵢ))², where the weight wᵢ is larger for training instances closer to the query point
6. Minimize this cost to find the β specific to the query point (this gives a different model for each test point)
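A minimal sketch of LWR, assuming a Gaussian kernel for the weights and solving the weighted normal equations at each query point; the bandwidth tau and all names are illustrative.

```python
# Locally Weighted Regression sketch: fit a separate weighted linear model per query point.
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xq = np.atleast_1d(np.asarray(x_query, dtype=float))
    Xb = np.column_stack([np.ones(len(X)), X])       # design matrix with a bias column
    xqb = np.concatenate([[1.0], xq])
    # Gaussian kernel: training points near the query get weights close to 1
    w = np.exp(-np.sum((X - xq) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # weighted normal equations: beta = (Xb^T W Xb)^-1 Xb^T W y
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return float(xqb @ beta)
```

Because each call fits its own locally weighted model, the predictions over many query points trace out a curve rather than a single global line.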