1. MACHINE LEARNING
(22ISE62)
Module-2
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering
2. Module 2 - Understanding Data – 2
Bivariate Data
• Bivariate analysis is a statistical analysis in which two variables are observed together.
• One variable is independent (X) while the other is dependent (Y).
• Bivariate data can be used to determine whether or not two variables are related.
• The aim of bivariate analysis is to find relationships among data.
• The relationships can then be used in comparisons, finding causes, and in further explorations.
• To do that, a graphical display of the data is necessary.
• One such graphical method is called a scatter plot.
• Scatter plots are graphs that present the relationship between two variables in a dataset.
• A scatter plot is a 2D graph showing the relationship between two variables.
• It is useful in exploratory data analysis before calculating a correlation coefficient or fitting a regression curve.
3. Conti..
Table 2.1: Temperature in a Shop and Sales Data
Temperature (in centigrade) | Sales of Sweaters (in thousands)
5                           | 300
12                          | 250
15                          | 200
20                          | 110
23                          | 45
27                          | 10
35                          | 5
[Figure 2.11: Scatter plot of Sales of Sweaters (in thousands) vs. Temperature]
Line graphs are similar to scatter plots.
[Figure 2.12: Line chart of Sales of Sweaters (in thousands) vs. Temperature]
4. Bivariate Statistics
• Bivariate analysis is stated to be an analysis of any concurrent relation between two variables or attributes.
• Examples: students' study time vs. their exam scores, ice cream sales vs. temperature, height vs. weight, income vs. years of education, and patients' BMI vs. blood pressure.
• Covariance and correlation are methods of bivariate statistics.
• Covariance is a measure of the joint variability of two random variables, say X and Y.
• It is denoted as covariance(X, Y) or COV(X, Y) and is used to measure the variance between two dimensions.
• The formula for finding the covariance for specific x and y is:
COV(X, Y) = (1/N) Σ_{i=1..N} (xᵢ − E(X))(yᵢ − E(Y))
Here, xᵢ and yᵢ are data values from X and Y, E(X) and E(Y) are the mean values of xᵢ and yᵢ, and N is the number of data points.
Also, COV(X, Y) is the same as COV(Y, X).
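As an illustrative sketch (not from the slides), this formula can be computed directly in Python; the function name covariance and the inline test data are assumptions:

# A minimal sketch of the population covariance COV(X, Y),
# assuming equal-length numeric sequences (illustrative only).
def covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n          # E(X)
    mean_y = sum(ys) / n          # E(Y)
    # (1/N) * sum of (x_i - E(X)) * (y_i - E(Y))
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

print(covariance([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # 12.0, matching Problem 1 below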
24-05-2025 4
Dr. Shivashankar-ISE-GAT
5. Bivariate Statistics
Problem 1: Find the covariance of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution: Mean(X) = E(X) = 15/5 = 3
Mean(Y) = E(Y) = 55/5 = 11
COV(X, Y) = (1/N) Σ_{i=1..N} (xᵢ − E(X))(yᵢ − E(Y))
= [(1 − 3)(1 − 11) + (2 − 3)(4 − 11) + (3 − 3)(9 − 11) + (4 − 3)(16 − 11) + (5 − 3)(25 − 11)] / 5
= (20 + 7 + 0 + 5 + 28) / 5 = 12
The covariance between X and Y is 12.
6. Bivariate Statistics
Problem 2: Find the covariance between X and Y for the following data:
X | 3 4 5 8 7 9 6 2 1
Y | 4 3 4 7 8 7 6 3 2
Solution: Mean(X) = E(X) = 45/9 = 5 and Mean(Y) = E(Y) = 44/9 ≈ 4.89.
COV(X, Y) = (1/N) Σ (xᵢ − E(X))(yᵢ − E(Y)) = 43/9 ≈ 4.78
7. Correlation
Correlation refers to a process for establishing the relationship between two variables.
The correlation coefficient is a statistical measure of the strength of a linear relationship between two variables. Its value can range from -1 to 1.
The sign is more important than the actual value:
1. If the value is positive, it indicates that the dimensions increase together.
2. If the value is negative, it indicates that while one dimension increases, the other dimension decreases.
3. If the value is zero, it indicates that the two dimensions are linearly unrelated (uncorrelated).
If the given attributes are X = (x₁, x₂, x₃, ..., xₙ) and Y = (y₁, y₂, y₃, ..., yₙ), then the Pearson correlation coefficient, denoted as r, is given as:
r = COV(X, Y) / (σ_X σ_Y)
where σ_X and σ_Y are the standard deviations of X and Y.
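Building on the covariance sketch above, a hedged Python illustration of the Pearson formula (the helper name pearson_r is an assumption, not from the slides):

import math

def pearson_r(xs, ys):
    # r = COV(X, Y) / (sigma_X * sigma_Y), using population statistics
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

print(round(pearson_r([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]), 3))  # ~0.981, as in the next problem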
8. Conti..
Problem 1: Find the correlation coefficient of data X = {1, 2, 3, 4, 5} and Y = {1, 4, 9, 16, 25}.
Solution:
Step 1: The mean values of X and Y
Mean(X) = X̄ = 15/5 = 3
Mean(Y) = Ȳ = 55/5 = 11
Step 2: Calculate the squared differences from the mean
For X: (X₁ − X̄)² = (1 − 3)² = 4, (X₂ − X̄)² = (2 − 3)² = 1, (X₃ − X̄)² = (3 − 3)² = 0, (X₄ − X̄)² = (4 − 3)² = 1, (X₅ − X̄)² = (5 − 3)² = 4
For Y: (Y₁ − Ȳ)² = (1 − 11)² = 100, (Y₂ − Ȳ)² = (4 − 11)² = 49, (Y₃ − Ȳ)² = (9 − 11)² = 4, (Y₄ − Ȳ)² = (16 − 11)² = 25, (Y₅ − Ȳ)² = (25 − 11)² = 196
Sum of squared differences for X: 10
Sum of squared differences for Y: 374
9. Conti..
Step 3: Calculate the variance
• The variance for each set is the average of these squared differences.
• For X: Variance of X = 10/5 = 2
• For Y: Variance of Y = 374/5 = 74.8
Step 4: Calculate the standard deviation
• The standard deviation is the square root of the variance.
• For X: σ_X = √2 ≈ 1.414
• For Y: σ_Y = √74.8 ≈ 8.6486
• The covariance, computed as in the earlier problem, is:
COV(X, Y) = (1/N) Σ_{i=1..N} (xᵢ − E(X))(yᵢ − E(Y)) = [(1 − 3)(1 − 11) + (2 − 3)(4 − 11) + (3 − 3)(9 − 11) + (4 − 3)(16 − 11) + (5 − 3)(25 − 11)] / 5 = 12
• Therefore, the correlation coefficient is the ratio of the covariance to the product of the standard deviations:
r = 12 / (1.414 × 8.6486) ≈ 0.981
10. Conti..
Problem 2: Find the correlation coefficient of data X = {5, 9, 10, 3, 5, 7} and Y = {6, 11, 6, 4, 6, 9}.
Solution: Mean(X) = 39/6 = 6.5 and Mean(Y) = 42/6 = 7.
COV(X, Y) = (1/6) Σ (xᵢ − 6.5)(yᵢ − 7) = 21/6 = 3.5
σ_X = √(35.5/6) ≈ 2.432, σ_Y = √(32/6) ≈ 2.309
r = 3.5 / (2.432 × 2.309) ≈ 0.623
11. Multivariate Statistics
• Multivariate statistics refers to methods that examine the simultaneous effect of multiple variables.
• In machine learning, almost all datasets are multivariate.
• Multivariate analysis considers more than two observable variables, and often thousands of measurements need to be conducted for one or more subjects.
• Multivariate data is like bivariate data but may have more than two variables.
• Some of the multivariate analysis methods are regression analysis, principal component analysis, and path analysis.

id | Attribute-1 | Attribute-2 | Attribute-3
1  | 1           | 4           | 1
2  | 2           | 5           | 2
3  | 3           | 6           | 1

• The mean of multivariate data is a mean vector; the mean of the above three attributes is (2, 5, 1.33).
• The variance of multivariate data becomes the covariance matrix.
• The mean vector is called the centroid and the variance matrix is called the dispersion matrix.
• Multivariate data has three or more variables.
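A small sketch (assumed, using NumPy) of how the mean vector (centroid) and covariance (dispersion) matrix can be computed for the three-attribute table above:

import numpy as np

# Rows are the three instances from the table; columns are Attribute-1..3.
data = np.array([[1, 4, 1],
                 [2, 5, 2],
                 [3, 6, 1]])

centroid = data.mean(axis=0)             # mean vector: [2.  5.  1.33]
dispersion = np.cov(data, rowvar=False)  # covariance (dispersion) matrix
print(centroid)
print(dispersion)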
12. Heatmap
• In machine learning, a heatmap is a data visualization technique that uses color-coding to represent the
magnitude of individual values within a dataset, often displayed as a grid or matrix.
• It helps to identify patterns, correlations, and anomalies within complex datasets by highlighting areas of
significance.
• It takes a matrix as input and colours it.
• Darker colours indicate larger values and lighter colours indicate smaller values.
• The advantage of this method is that humans perceive colours well.
• So, by colour shading, larger values can be perceived easily. (See the sketch below.)
• For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through a heatmap.
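A hedged example (assuming matplotlib is available) that colours a small matrix as a heatmap grid; the random data is purely illustrative:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative matrix, e.g., traffic intensity per (region, hour) cell.
rng = np.random.default_rng(0)
grid = rng.random((8, 8))

plt.imshow(grid, cmap="Blues")  # darker cells correspond to larger values
plt.colorbar(label="value")
plt.title("Grid with heatmap pattern")
plt.show()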
Figure 2.3: Grid with Heatmap Pattern
13. Pairplot
• A pairplot (scatterplot matrix) is a data visualization tool that displays pairwise relationships between all variables in a dataset, helping to understand distributions and correlations at a glance.
• Pairplot or scatter matrix is a data visualization technique for multivariate data.
• A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data.
• All the results are presented in a matrix format.
• By visual examination of the chart, one can easily find relationships among the variables, such as correlation between the variables.
• As an example (see the sketch below), a random matrix of three columns is chosen and the relationships of the columns are plotted as a pairplot.
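A sketch of such a pairplot, assuming pandas and seaborn are available (the column names are arbitrary illustrations):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random three-column matrix, as described above.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["col1", "col2", "col3"])

sns.pairplot(df)  # pairwise scatter plots, with distributions on the diagonal
plt.show()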
Figure 1: Pairplot Visualization
14. Essential Mathematics for Multivariate Data
• Machine learning involves many mathematical concepts from the domains of linear algebra, statistics, probability and information theory.
• Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations.
• These are the driving forces of machine learning, and machine learning cannot exist without these data types.
Linear Systems and Gaussian Elimination for Multivariate Data
• A linear system of equations is a group of equations with unknown variables. Let Ax = y; then the solution is given as:
x = A⁻¹y
This holds provided A is invertible (non-singular).
The logic can be extended to a system of N equations with 'n' unknown variables.
• That is, if x = (x₁, x₂, ..., xₙ) and y = (y₁, y₂, ..., yₙ), then the unknown vector is x = A⁻¹y.
15. Conti..
For solving a large system of equations, Gaussian elimination can be used.
The procedure for applying Gaussian elimination is as follows (a code sketch follows the steps):
1. Write the given matrix A.
2. Append the vector y to the matrix A. The result is called the augmented matrix.
3. Keep the element a₁₁ as the pivot and eliminate a₂₁ in the second row using the row operation
R₂ ← R₂ − (a₂₁/a₁₁)R₁,
where R₂ is the second row and a₂₁/a₁₁ is called the multiplier. The same logic is used to remove aᵢ₁ from all the other rows.
4. Repeat the same logic and reduce the matrix to row echelon form. Then, the last unknown variable is obtained as:
xₙ = yₙₙ/aₙₙ
5. The remaining unknown variables can then be found by back-substitution as:
xₙ₋₁ = (yₙ₋₁ − a₍ₙ₋₁₎ₙ xₙ) / a₍ₙ₋₁₎₍ₙ₋₁₎
This part is called backward substitution.
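A compact sketch of this procedure (forward elimination, then backward substitution), written as an assumed illustration rather than production code; it assumes a square, non-singular A with non-zero pivots and does no pivoting:

import numpy as np

def gaussian_elimination(A, y):
    # Build the augmented matrix [A | y].
    aug = np.hstack([A.astype(float), y.reshape(-1, 1).astype(float)])
    n = len(y)
    # Forward elimination: zero out entries below each pivot a_kk.
    for k in range(n):
        for i in range(k + 1, n):
            multiplier = aug[i, k] / aug[k, k]
            aug[i, :] -= multiplier * aug[k, :]
    # Backward substitution: solve from the last row upwards.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i, -1] - aug[i, i + 1:n] @ x[i + 1:]) / aug[i, i]
    return x

print(gaussian_elimination(np.array([[2, 4], [4, 3]]), np.array([6, 7])))  # [1. 1.], as in Problem 4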
16. Gaussian Elimination
Problem 4: Solve the following set of equations using the Gaussian elimination method.
2x₁ + 4x₂ = 6
4x₁ + 3x₂ = 7
Solution: The augmented matrix is:
[2  4 | 6]
[4  3 | 7]
Apply the transformation by dividing row 1 by 2 (R1 ← R1/2):
[1  2 | 3]
[4  3 | 7]
R2 ← R2 − 4R1:
[1  2 | 3]
[0 −5 | −5]
R2 ← R2/−5:
[1  2 | 3]
[0  1 | 1]
R1 ← R1 − 2R2:
[1  0 | 1]
[0  1 | 1]
Therefore, x₁ = 1, x₂ = 1.
17. Gaussian Elimination
Problem 5: Solve the following set of equations using the Gaussian elimination method.
2x + y = −1
3x − 5y = −21
Solution: Reducing the augmented matrix gives:
[1  0 | −2]
[0  1 | 3]
Therefore, x = −2, y = 3.
18. Machine Learning and Importance of Probability and Statistics
• Machine learning is linked with statistics and probability.
• Like linear algebra, statistics is the heart of machine learning.
• The importance of statistics needs to be stressed, as without statistics the analysis of data is difficult.
• Probability is especially important for machine learning.
• In machine learning, probability is a fundamental concept that deals with the likelihood of events or outcomes. It is used to model uncertainty and make predictions, especially in algorithms built on probabilistic models such as Naive Bayes.
Probability Distributions
• A probability distribution is the mathematical function that gives the probabilities of occurrence of the possible outcomes of an experiment.
• In other words, a distribution is a function that describes the relationship between the observations in a sample space.
Probability distributions are of two types (a small illustrative sketch follows the list):
1. Discrete probability distribution
2. Continuous probability distribution
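As a hedged illustration (assuming SciPy is available): a discrete distribution assigns probabilities to countable outcomes via a probability mass function (PMF), while a continuous one uses a probability density function (PDF):

from scipy import stats

# Discrete: probability mass function of a fair six-sided die outcome.
die = stats.randint(1, 7)
print(die.pmf(3))        # P(X = 3) = 1/6

# Continuous: probability density function of a standard normal variable.
normal = stats.norm(loc=0, scale=1)
print(normal.pdf(0.0))   # density at 0 (~0.3989), not a probability
print(normal.cdf(1.0))   # P(X <= 1) (~0.8413)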
19. FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES
• Feature engineering is the process of selecting, transforming, and creating new features (or variables) from raw data to improve the performance of machine learning models.
• It involves carefully preparing the input data so that machine learning algorithms can learn effectively and make accurate predictions.
• Features are attributes.
• Feature engineering is about determining the subset of features that form an important part of the input and improve the performance of the model, be it classification or any other model in machine learning.
• Feature engineering deals with two problems – feature transformation and feature selection.
• Feature transformation is the extraction of features and the creation of new features that may be helpful in increasing performance.
• For example, height and weight may give a new attribute called Body Mass Index (BMI).
• Feature subset selection is another important aspect of feature engineering that focuses on selecting features to reduce training time, but not at the cost of reliability.
Features can be removed based on two aspects:
1. Feature relevancy – Some features contribute more to classification than other features.
For example, a mole on the face can help more in face detection than common features like the nose.
2. Feature redundancy – Some features are redundant.
For example, when a database table has a field called date of birth, the age field is redundant, since age can be computed easily from date of birth; removing the age column reduces the dimensionality by one.
20. Conti..
1 Stepwise Forward Selection:
• This procedure starts with an empty set of attributes.
• At each step, the attribute that is statistically most significant (of best quality) is added to the reduced set. This process continues until a good reduced set of attributes is obtained.
2 Stepwise Backward Elimination:
• This procedure starts with the complete set of attributes.
• At every stage, the procedure removes the worst attribute from the set, leading to the reduced set.
Combined Approach: Both forward and backward methods can be combined so that the procedure adds the best attribute and removes the worst attribute at each step.
3 Principal Component Analysis:
• The idea of principal component analysis (PCA), or the KL transform, is to transform a given set of measurements to a new set of features so that the features exhibit high information-packing properties. This leads to a reduced and compact set of features.
• Basically, this elimination is made possible because of information redundancies.
• This compact representation is of a reduced dimension.
21. PCA
Consider a group of random vectors of the form:
x = (x₁, x₂, x₃, ..., xₙ)ᵀ
The mean vector of the set of random vectors is defined as: m_x = E{x}
The operator E refers to the expected value of the population.
This is calculated theoretically using the probability density functions (PDF) of the elements xᵢ and the joint probability density functions between the elements xᵢ and xⱼ.
From this, the covariance matrix can be calculated as:
C = E{(x − m_x)(x − m_x)ᵀ}
For M random vectors, when M is large enough, the mean vector and covariance matrix can be approximately calculated as:
m_x = (1/M) Σ_{k=1..M} xₖ
C = (1/M) Σ_{k=1..M} xₖxₖᵀ − m_x m_xᵀ
22. Conti..
The mapping of the vectors x to y using the transformation can now be described as:
y = A(x − m_x)
This transform is also called the Karhunen-Loeve or Hotelling transform. The original vector x can be reconstructed as follows:
x = Aᵀy + m_x
The goal of PCA is to reduce the set of attributes to a newer, smaller set that captures most of the variance of the data.
The variance is captured by fewer components, which would give approximately the same result as the original with all the attributes.
If only the K largest eigen values (with their eigen vectors stacked as A_K) are used, the recovered information would be:
x̂ = A_Kᵀy + m_x
The advantages of PCA are immense. It reduces the attribute list by eliminating all irrelevant attributes.
The PCA algorithm is as follows:
1. The target dataset x is obtained.
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is x − m. The objective of this step is to transform the dataset to have zero mean.
3. The covariance of dataset x is obtained. Let it be C.
23. Conti..
4. The eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The eigen values are arranged in descending order, and the feature vector is formed with the corresponding eigen vectors in its columns:
Feature vector = {eigen vector₁, eigen vector₂, eigen vector₃, ..., eigen vectorₙ}
6. Obtain the transpose of the feature vector. Let it be A.
7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector.
The original data can be retrieved using the formula given below:
Original data (f) = {A⁻¹ × y} + m = {Aᵀ × y} + m
The new data is a dimensionally reduced matrix that represents the original data.
Therefore, PCA is effective in removing the attributes that do not contribute. (A code sketch of these steps follows.)
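The seven steps above can be sketched with NumPy as follows; this is an assumed illustration, and pca_transform is not a name from the slides:

import numpy as np

def pca_transform(x):
    # x: dataset with one observation per column (d x n), matching y = A(x - m).
    m = x.mean(axis=1, keepdims=True)         # step 2: mean vector
    adjusted = x - m                          # zero-mean data
    C = np.cov(adjusted)                      # step 3: covariance matrix
    vals, vecs = np.linalg.eigh(C)            # step 4: eigen values/vectors
    order = np.argsort(vals)[::-1]            # step 5: descending eigen values
    A = vecs[:, order].T                      # step 6: eigen vectors as rows (transpose)
    y = A @ adjusted                          # step 7: PCA transform
    return y, A, m

# Data points (2, 6) and (1, 7) from the worked example below.
x = np.array([[2.0, 1.0], [6.0, 7.0]])
y, A, m = pca_transform(x)
print(np.allclose(A.T @ y + m, x))  # True: the original is recovered via x = A^T y + m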
24. Conti..
Problem 1: Let the data points be (2, 6)ᵀ and (1, 7)ᵀ. Apply PCA and find the transformed data. Then apply the inverse transform and prove that PCA works (i.e., the data is recovered).
Solution: One can combine the two vectors into a matrix. The mean vector can be computed as follows:
μ = ((2 + 1)/2, (6 + 7)/2)ᵀ = (1.5, 6.5)ᵀ
As part of PCA, the mean must be subtracted from the data to get the adjusted data:
x₁ = (2 − 1.5, 6 − 6.5)ᵀ = (0.5, −0.5)ᵀ
x₂ = (1 − 1.5, 7 − 6.5)ᵀ = (−0.5, 0.5)ᵀ
The covariance can be obtained from the outer products of the adjusted points:
m₁ = x₁x₁ᵀ = (0.5, −0.5)ᵀ (0.5  −0.5) = [0.25  −0.25; −0.25  0.25]
m₂ = x₂x₂ᵀ = (−0.5, 0.5)ᵀ (−0.5  0.5) = [0.25  −0.25; −0.25  0.25]
m = m₁ + m₂ = [0.5  −0.5; −0.5  0.5]
25. Conti..
The final covariance matrix is obtained by adding these two matrices:
C = [0.5  −0.5; −0.5  0.5]
λ is an eigen value of a matrix C if it is a solution of the characteristic equation |C − λI| = 0:
|[0.5  −0.5; −0.5  0.5] − [λ  0; 0  λ]| = 0
|[0.5 − λ  −0.5; −0.5  0.5 − λ]| = 0 → (0.5 − λ)(0.5 − λ) − (−0.5)(−0.5) = 0
λ² − λ = 0, λ(λ − 1) = 0
Therefore, λ = 1, 0.
For λ = 0, solve C(x, y)ᵀ = 0:
0.5x − 0.5y = 0
−0.5x + 0.5y = 0
x = 1, y = 1, giving the eigen vector (1, 1)ᵀ.
For λ = 1, solve (C − I)(x, y)ᵀ = 0:
−0.5x − 0.5y = 0
−0.5x − 0.5y = 0
x = −1, y = 1, giving the eigen vector (−1, 1)ᵀ.
27. Conti..
Normalizing the two eigen vectors to unit length and stacking them as rows (the eigen vector of the larger eigen value first) gives the PCA matrix:
A = [−1/√2  1/√2; 1/√2  1/√2]
One can check that the PCA matrix A is orthogonal. A matrix is orthogonal if Aᵀ = A⁻¹, that is, AAᵀ = I:
AAᵀ = [−1/√2  1/√2; 1/√2  1/√2] [−1/√2  1/√2; 1/√2  1/√2] = [1  0; 0  1]
The transformed matrix y is given as:
y = A(x − m)
Recollect that (x − m) is the adjusted matrix. Hence,
y = A(x − m) = [−1/√2  1/√2; 1/√2  1/√2] [0.5  −0.5; −0.5  0.5] = [−1/√2  1/√2; 0  0]
One can check that the original matrix can be retrieved from this matrix as:
x = Aᵀy + m = [−1/√2  1/√2; 1/√2  1/√2] [−1/√2  1/√2; 0  0] + [1.5; 6.5] = [2  1; 6  7]
Therefore, one can infer that the original data is obtained without any loss of information.
30. Conti..
• Calculate the eigen values and eigen vectors of the covariance matrix.
• λ is an eigen value of a matrix M if it is a solution of the characteristic equation |M − λI| = 0.
So, for the covariance matrix
C = [2.92  3.67; 3.67  5.67],
it is convenient to work with the unscaled (sum-of-squares) matrix 6C = [17.5  22; 22  34], which has the same eigen vectors. Setting |6C − λI| = 0, we have, from here,
(17.5 − λ)(34 − λ) − (22 × 22) = 0
31. Basic Learning Theory
Design of a Learning System
In machine learning, a learning system is a framework that allows machines to learn from data, identify patterns, and make decisions with minimal human intervention, improving their performance and accuracy over time.
A system that is built around a learning algorithm is called a learning system.
The design of such systems focuses on these steps:
1. Choosing a training experience
2. Choosing a target function
3. Representation of the target function
4. Function approximation
As a running example in the following slides, let us consider the design of a chess game.
32. Conti..
Training Experience
• Machine learning algorithms are trained on datasets, which provide examples of inputs and outputs. The algorithm uses these examples to identify patterns and relationships in the data.
• Training experience refers to the process of a machine learning algorithm learning from data to make predictions or decisions. This involves exposing the algorithm to a dataset, allowing it to identify patterns and adjust its parameters to improve its performance on future, unseen data.
• Example: designing a chess game.
• If the training samples and testing samples have the same distribution, the results will be good.
Determine the Target Function
• In machine learning, the "target function" is the relationship a model aims to learn and predict, mapping input variables (features) to an output variable.
• The goal is to approximate this function from training data and use it to make predictions on new data.
• If x and y are variables, the target function is: y = f(x)
• Example: Imagine you want to predict house prices (Y) based on features like size (X1), location (X2), and age (X3). The target function would be the relationship between these features and the house price.
33. Conti..
Determine the Target Function Representation
The representation of knowledge may be a table, a collection of rules, or a neural network.
The linear combination of the board features can be coined as:
V = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ
where x₁, x₂, ..., xₙ represent different board features and w₀, w₁, ..., wₙ represent weights.
Choosing an Approximation Algorithm for the Target Function
The focus is to choose weights that fit the given training samples effectively. The aim is to reduce the error, given as:
E ≡ Σ_{b ∈ training samples} (V_train(b) − V̂(b))²
where b is a training sample and V̂ is the predicted hypothesis.
The approximation is carried out as follows (a code sketch follows):
• Compute the error as the difference between the trained and expected hypothesis values. Let this error be error(b).
• Then, for every board feature xᵢ, the weights are updated as:
wᵢ = wᵢ + μ × error(b) × xᵢ
Here, μ is a constant that moderates the size of the weight update.
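A minimal sketch of this weight-update (LMS-style) rule in Python; the feature values, training value and learning rate μ are assumed toy numbers, not from the slides:

# Illustrative LMS-style weight update for V = w0 + w1*x1 + ... + wn*xn.
def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def update(w, x, v_train, mu=0.01):
    error_b = v_train - predict(w, x)      # error(b)
    w[0] += mu * error_b                    # bias term (its feature is 1)
    for i, xi in enumerate(x, start=1):
        w[i] += mu * error_b * xi           # w_i = w_i + mu * error(b) * x_i
    return w

w = [0.0, 0.0, 0.0]
for _ in range(1000):
    w = update(w, [1.0, 2.0], v_train=5.0)  # single toy training sample
print(round(predict(w, [1.0, 2.0]), 2))     # approaches 5.0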
34. INTRODUCTION TO CONCEPT LEARNING
• Concept learning is the process where a machine learns a general rule or function from a set of specific examples or data points, enabling it to recognize and classify new, unseen instances.
• It is a learning strategy of acquiring abstract knowledge, inferring a general concept, or deriving a category from the given training samples.
• It is a process of abstraction and generalization from the data.
• Concept learning helps to classify an object that has a set of common, relevant features.
• Thus, it helps a learner compare and contrast categories based on the similarity and association of positive and negative instances in the training data to classify an object.
• The learner tries to simplify by observing the common features from the training samples and then applies this simplified model to future samples.
• This task is also known as learning from experience.
• Each concept or category obtained by learning is a Boolean-valued function which takes a true or false value.
• This way of learning categories for objects and recognizing new instances of those categories is called concept learning.
• It is formally defined as inferring a Boolean-valued function by processing training instances.
35. Conti..
Concept learning requires three things:
1. Input: Training dataset, which is a set of training instances, each labeled with the name of a concept or category to which it belongs. This past experience is used to train and build the model.
2. Output: Target concept or target function. It is a mapping function f(x) from input x to output y. Its purpose is to determine the specific or common features that identify an object; in other words, to find the hypothesis that determines the target concept. For example, the specific set of features that identifies an elephant among all animals.
3. Test: New instances to test the learned model.
Formally, concept learning is defined as: "Given a set of hypotheses, the learner searches through the hypothesis space to identify the best hypothesis that matches the target concept."
36. Conti..
Representation of a Hypothesis
• A hypothesis 'h' approximates a target function 'f' to represent the relationship between the independent attributes and the dependent attribute of the training instances.
• The hypothesis is the predicted approximate model that best maps the inputs to outputs.
• Each hypothesis is represented as a conjunction of attribute conditions in the antecedent part, for example, (Tail = Short) ∧ (Color = Black) ...
• A set of such hypotheses, forming the search space, is collectively called the hypotheses.
37. Conti..
Hypothesis Space
• Hypothesis space is the set of all possible hypotheses that approximate the target function f.
• In other words, the set of all possible approximations of the target function can be defined as the hypothesis space.
• From this set of hypotheses, a machine learning algorithm determines the best possible hypothesis that best describes the target function or best fits the outputs.
• For example, a regression algorithm represents the hypothesis space as a linear function, whereas a decision tree algorithm represents it as a tree.
• The set of hypotheses that can be generated by a learning algorithm can be further reduced by specifying a language bias.
• The subset of the hypothesis space that is consistent with all observed training instances is called the Version Space.
• The version space represents the only hypotheses that are used for classification.
• Example attributes and their values:
Horns – Yes, No; Tail – Long, Short; Tusks – Yes, No; Paws – Yes, No;
Fur – Yes, No; Color – Brown, Black, White; Hooves – Yes, No; Size – Medium, Big
38. Conti..
Hypothesis Space Search by Find-S Algorithm
• The Find-S algorithm is a basic concept learning algorithm in machine learning.
• The Find-S algorithm finds the most specific hypothesis that fits all the positive examples.
• Thus, this algorithm considers only the positive instances and eliminates negative instances while generating the hypothesis.
• It initially starts with the most specific hypothesis.
• Input: Positive instances in the training dataset
• Output: Hypothesis 'h'
The algorithm (a code sketch follows below):
1. Initialize 'h' to the most specific hypothesis: h = <Ψ, Ψ, Ψ, ..., Ψ>
2. Generalize the initial hypothesis for the first positive instance [since 'h' is most specific].
3. For each subsequent instance:
If it is a positive instance,
check each attribute value in the instance against the hypothesis 'h':
if the attribute value is the same as the hypothesis value, do nothing;
else if the attribute value is different from the hypothesis value, change it to '?' in 'h'.
Else, if it is a negative instance,
ignore it.
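A sketch of Find-S in Python (an assumed illustration), using the job-offer dataset shown in Table 3.2 on the next slide:

def find_s(instances, labels):
    # Start with the most specific hypothesis <Ψ, Ψ, ..., Ψ> (None here).
    h = None
    for x, label in zip(instances, labels):
        if label != "Yes":          # negative instance: ignore
            continue
        if h is None:               # first positive instance: copy it
            h = list(x)
        else:                       # generalize mismatching attributes to '?'
            h = [hv if hv == xv else "?" for hv, xv in zip(h, x)]
    return h

data = [
    [">=9", "Yes", "Excellent", "Good", "Fast", "Yes"],
    [">=9", "Yes", "Good",      "Good", "Fast", "Yes"],
    [">=8", "No",  "Good",      "Good", "Fast", "No"],
    [">=9", "Yes", "Good",      "Good", "Slow", "No"],
]
labels = ["Yes", "Yes", "No", "Yes"]
print(find_s(data, labels))  # ['>=9', 'Yes', '?', 'Good', '?', '?']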
39. Conti..
3.4: Consider the training dataset of 4 instances shown in Table 3.2. It contains the details of the performance of students and their likelihood of getting a job offer in their final semester. Apply the Find-S algorithm.
Solution:
Step 1: Initialize 'h' to the most specific hypothesis. There are 6 attributes, so for each attribute we initially fill 'Ψ' in the initial hypothesis: h = <Ψ, Ψ, Ψ, Ψ, Ψ, Ψ>
Step 2: Generalize the initial hypothesis for the first positive instance. I1 is a positive instance, so generalize the most specific hypothesis 'h' to include this positive instance. Hence,
I1: ≥9  Yes  Excellent  Good  Fast  Yes   (positive instance)
h = <≥9, Yes, Excellent, Good, Fast, Yes>
Table 3.2: Training dataset
CGPA | Interactiveness | Practical knowledge | Communication skills | Logical thinking | Interest | Job offer
≥9   | Yes             | Excellent           | Good                 | Fast             | Yes      | Yes
≥9   | Yes             | Good                | Good                 | Fast             | Yes      | Yes
≥8   | No              | Good                | Good                 | Fast             | No       | No
≥9   | Yes             | Good                | Good                 | Slow             | No       | Yes
40. Conti..
Step 3: Scan the next instance I2. Since I2 is a positive instance, generalize 'h' to include it. For each non-matching attribute value in 'h', put a '?'. The third attribute value mismatches between 'h' and I2, so put a '?':
I2: ≥9  Yes  Good  Good  Fast  Yes   (positive instance)
h = <≥9, Yes, ?, Good, Fast, Yes>
Now, scan I3. Since it is a negative instance, ignore it. Hence, the hypothesis remains the same without any change after scanning I3.
I3: ≥8  No  Good  Good  Fast  No   (negative instance)
h = <≥9, Yes, ?, Good, Fast, Yes>
Now scan I4. Since it is a positive instance, check for mismatches between the hypothesis 'h' and I4. The 5th and 6th attribute values mismatch, so put '?' for those attributes in 'h':
I4: ≥9  Yes  Good  Good  Slow  No   (positive instance)
h = <≥9, Yes, ?, Good, ?, ?>
Now, the final hypothesis generated by the Find-S algorithm is:
h = <≥9, Yes, ?, Good, ?, ?>
It includes all positive instances and obviously ignores any negative instance.
41. Conti..
3.6: Consider the training dataset of 4 instances shown in Table 3.6. It contains the details of the weather conditions suitable to play football. Apply the Find-S algorithm.
Solution: Initialize h to the most specific hypothesis in H:
h0 = <Ψ, Ψ, Ψ, Ψ, Ψ, Ψ>
I1: <Sunny, Warm, Normal, Strong, Warm, Same>   (positive instance)
Iteration 1: h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
I2: <Sunny, Warm, High, Strong, Warm, Same>   (positive instance; the third attribute mismatches)
Iteration 2: h2 = <Sunny, Warm, ?, Strong, Warm, Same>
I3 is a negative instance, so it is ignored: h3 = h2.
I4: <Sunny, Warm, High, Strong, Cool, Change>   (positive instance; the 5th and 6th attributes mismatch)
Iteration 3: h4 = <Sunny, Warm, ?, Strong, ?, ?>

Table 3.6: EnjoySport training examples
Example | Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
1       | Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
2       | Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
3       | Rainy | Cold    | High     | Strong | Warm  | Change   | No
4       | Sunny | Warm    | High     | Strong | Cool  | Change   | Yes
44. Candidate Elimination Algorithm
Version space learning generates all hypotheses consistent with the training data.
This algorithm computes the version space by combining two cases, namely:
• Specific-to-general learning – generalize S to include the positive example
• General-to-specific learning – specialize G to exclude the negative example
Candidate Elimination Algorithm:
Input: Set of instances in the training dataset
Output: Hypothesis boundaries G and S
1. Initialize G to the maximally general hypothesis.
2. Initialize S to the maximally specific hypothesis.
• Generalize the initial hypothesis for the first positive instance.
3. For each subsequent new training instance:
• If the instance is positive,
➢ Generalize S to include the positive instance:
➢ check each attribute value of the positive instance against S;
➢ if the attribute values of the positive instance and S are different, fill that field value with '?';
➢ if they are the same, make no change.
45. Candidate Elimination Algorithm
➢ Prune G to exclude all hypotheses in G that are inconsistent with the positive instance.
• If the instance is negative,
➢ Specialize G to exclude the negative instance:
➢ add to G all minimal specializations that exclude the negative example and are consistent with S;
➢ if the attribute value of S differs from that of the negative instance, fill that attribute value with the S value;
➢ if the attribute value of S and the negative instance are the same, there is no need to specialize on it, so that attribute value stays '?'.
➢ Remove from S all hypotheses inconsistent with the negative instance.
(A code sketch of the algorithm follows.)
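The algorithm can be sketched in Python under simplifying assumptions (conjunctive hypotheses, S kept as a single maximally specific hypothesis, first instance positive; candidate_elimination is an illustrative name). It reproduces the S4 and G4 boundaries derived in the following slides:

def candidate_elimination(instances, labels):
    n = len(instances[0])
    S = None                       # maximally specific boundary (single hypothesis)
    G = [["?"] * n]                # maximally general boundary

    def consistent(g, x):
        return all(gv == "?" or gv == xv for gv, xv in zip(g, x))

    for x, label in zip(instances, labels):
        if label == "Yes":
            if S is None:
                S = list(x)        # generalize from Ψ to the first positive instance
            S = [sv if sv == xv else "?" for sv, xv in zip(S, x)]
            G = [g for g in G if consistent(g, x)]   # prune inconsistent G members
        else:
            new_G = []
            for g in G:
                if not consistent(g, x):
                    new_G.append(g)                  # already excludes the negative
                    continue
                for i in range(n):                   # minimal specializations
                    if g[i] == "?" and S[i] != "?" and S[i] != x[i]:
                        spec = list(g)
                        spec[i] = S[i]               # fill with the S value
                        new_G.append(spec)
            G = new_G
    return S, G

data = [
    [">=9", "Yes", "Excellent", "Good", "Fast", "Yes"],
    [">=9", "Yes", "Good",      "Good", "Fast", "Yes"],
    [">=8", "No",  "Good",      "Good", "Fast", "No"],
    [">=9", "Yes", "Good",      "Good", "Slow", "No"],
]
labels = ["Yes", "Yes", "No", "Yes"]
S, G = candidate_elimination(data, labels)
print("S:", S)  # ['>=9', 'Yes', '?', 'Good', '?', '?']
print("G:", G)  # [['>=9', '?', '?', '?', '?', '?'], ['?', 'Yes', '?', '?', '?', '?']]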
46. Conti..
3.4: Consider the training dataset of 4 instances shown in Table 3.2. It contains the details of the performance of students and their likelihood of getting a job offer in their final semester. Apply the Candidate Elimination algorithm.
Solution:
Step 1: Initialize the 'G' boundary to the maximally general hypothesis: G = <?, ?, ?, ?, ?, ?>
Step 2: Initialize the 'S' boundary to the maximally specific hypothesis: S = <Ψ, Ψ, Ψ, Ψ, Ψ, Ψ>
Step 3: Generalize the initial hypothesis for the first positive instance. I1 is a positive instance, so generalize the most specific hypothesis 'S' to include this positive instance. Hence,
I1: ≥9  Yes  Excellent  Good  Fast  Yes   (positive instance)
S1 = <≥9, Yes, Excellent, Good, Fast, Yes>
G1 = <?, ?, ?, ?, ?, ?>
Table 3.2: Training dataset
CGPA | Interactiveness | Practical knowledge | Communication skills | Logical thinking | Interest | Job offer
≥9   | Yes             | Excellent           | Good                 | Fast             | Yes      | Yes
≥9   | Yes             | Good                | Good                 | Fast             | Yes      | Yes
≥8   | No              | Good                | Good                 | Fast             | No       | No
≥9   | Yes             | Good                | Good                 | Slow             | No       | Yes
47. Conti..
Step 4: Iteration 1
The third attribute value mismatches between 'S1' and I2, so put a '?':
I2: ≥9  Yes  Good  Good  Fast  Yes   (positive instance)
S2 = <≥9, Yes, ?, Good, Fast, Yes>
Since G1 is consistent with this positive instance, there is no change. The resulting G2 is:
G2 = <?, ?, ?, ?, ?, ?>
Iteration 2: Now scan I3:
I3: ≥8  No  Good  Good  Fast  No   (negative instance)
Specialize G2 to exclude the negative instance, using the attribute values of S2 that differ from I3:
G3 = {<≥9, ?, ?, ?, ?, ?>, <?, Yes, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Yes>}
There is no hypothesis in S2 inconsistent with the negative instance, hence S3 remains the same:
S3 = <≥9, Yes, ?, Good, Fast, Yes>
Iteration 3: Now scan I4. Since it is a positive instance, check for mismatches between the hypothesis 'S3' and I4. The 5th and 6th attribute values mismatch, so put '?' for those attributes in 'S4':
I4: ≥9  Yes  Good  Good  Slow  No   (positive instance)
S4 = <≥9, Yes, ?, Good, ?, ?>
48. Candidate Elimination Algorithm
Prune G3 to exclude all hypotheses inconsistent with the positive instance I4:
G3 = {<≥9, ?, ?, ?, ?, ?>, <?, Yes, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Yes> (inconsistent)}
Since the third hypothesis in G3 is inconsistent with this positive instance, remove it. The resulting G4 is:
G4 = {<≥9, ?, ?, ?, ?, ?>, <?, Yes, ?, ?, ?, ?>}
Using the two boundary sets, S4 and G4, the version space converges to the set of consistent hypotheses. The final version space is:
<≥9, Yes, ?, ?, ?, ?>
<≥9, ?, ?, Good, ?, ?>
<?, Yes, ?, Good, ?, ?>
Thus, the algorithm finds the version space bounded by the most general and most specific hypotheses.
49. The diagrammatic representation of deriving the version space is shown
[Figure: Version space derivation — the S boundary evolves from S: <Ψ, Ψ, Ψ, Ψ, Ψ, Ψ> through S1: <≥9, Yes, Excellent, Good, Fast, Yes> and S2 = S3: <≥9, Yes, ?, Good, Fast, Yes> to S4: <≥9, Yes, ?, Good, ?, ?>; the G boundary evolves from G = G1 = G2: <?, ?, ?, ?, ?, ?> through G3: {<≥9, ?, ?, ?, ?, ?>, <?, Yes, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Yes>} to G4: {<≥9, ?, ?, ?, ?, ?>, <?, Yes, ?, ?, ?, ?>}, with the version space in between.]
50. Conti…
Problem 2: Generate consistent hypotheses for the following training dataset using the Candidate Elimination algorithm.
Table 2: "Enjoy Sport"
Example | Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
1       | Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
2       | Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
3       | Rainy | Cold    | High     | Strong | Warm  | Change   | No
4       | Sunny | Warm    | High     | Strong | Cool  | Change   | Yes