Support Vector Machines
Support Vector Machines: Overview, When Data is Linearly Separable, Support
Vector Classifier, When Data is NOT Linearly Separable, Kernel Functions,
Multiclass SVM.
Support Vector Machine (SVM) is a supervised Machine Learning (ML) algorithm. There are plenty of algorithms in ML, but SVM still receives special attention because of its strength when dealing with data.
• This Support Vector Machine (SVM) presentation will help you understand the Support Vector Machine algorithm, a supervised machine learning algorithm which can be used for both classification and regression problems.
• This SVM presentation will help you learn where and when to use the SVM algorithm, how the algorithm works, what hyperplanes and support vectors are in SVM, how the distance margin helps in optimizing the hyperplane, kernel functions in SVM for data transformation, and the advantages of the SVM algorithm.
• At the end, we will also implement the Support Vector Machine algorithm in Python to differentiate crocodiles from alligators for a given dataset.
• SVM is a supervised machine learning
algorithm that helps in
both classification and regression problem
statements.
• It tries to find an optimal boundary (known as
hyperplane) between different classes.
• In simple words, SVM does complex data
transformations depending on the selected kernel
function, and based on those transformations, it
aims to maximize the separation boundaries
between your data points.
Working of SVM:
• In the simplest form, where there is a linear separation, SVM tries to find a line that maximizes the separation between a two-class data set of 2-dimensional space points.
• The objective of SVM: to find a hyperplane that maximizes the separation of the data points into their actual classes in an n-dimensional space.
• The data points which are at the minimum distance to the hyperplane, i.e. the closest points, are called Support Vectors (see the short code sketch below).
• For example, in the given diagram, the three points that lie on the dashed margin lines are the Support Vectors.
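To make this concrete, here is a minimal scikit-learn sketch (not part of the original slides): it fits a linear SVM on a toy 2-class, 2-dimensional dataset and reads off the support vectors. The dataset and parameter values are illustrative assumptions.

```python
# A minimal sketch: fit a linear SVM and inspect its support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of 2-D points (toy data, illustrative only).
X, y = make_blobs(n_samples=60, centers=2, random_state=0, cluster_std=0.8)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The points closest to the hyperplane are the support vectors.
print("Support vectors per class:", clf.n_support_)
print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", clf.coef_[0], ", b =", clf.intercept_[0])
```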
Why learn Machine Learning?
• Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
• The Machine Learning market size is expected to grow from
USD 1.03 Billion in 2016 to USD 10.81 Billion by 2025, at a
Compound Annual Growth Rate (CAGR) of 54.1% during the
forecast period.
AI / ML
Machine Learning: using computer algorithms to uncover insights, determine relationships, and make predictions about future trends.
Artificial Intelligence: enabling computer systems to perform tasks that ordinarily require human intelligence.
We use machine learning methods to create AI systems.
Machine Learning Paradigms
• Unsupervised Learning
• Find structure in data. (Clusters, Density, Patterns)
• Supervised Learning
• Find a mapping from features to labels
Support Vector Machine
• Supervised machine learning Algorithm.
• Can be used for Classification/Regression.
• Works well with small datasets
Classification
• Classification using SVM
• 2-class problem, linearly separable data
The “Best” Separation Boundary
• The best boundary is the widest road that separates the two groups.
• The width of that road is the margin: the distance between the line and the points is as far as possible.
• More precisely, the distance between the support vectors and the line is as far as possible; the support vectors are the points that sit on the edges of the margin.
• The hyperplane that is as far as possible from the support vectors is the optimal hyperplane, and its margin is the maximum margin.
SVM Objective Function

Decision Rule
[The slide figure shows positive (+) and negative (−) samples on either side of the boundary, the normal vector w, and an unknown vector u being projected onto w.]
• w : a normal vector to the separating hyperplane (of any length).
• u : an unknown vector; we want to find which class it belongs to.
• Project u onto w and check which side of the boundary it falls on: if w · u + b ≥ 0, then the unknown vector will be classified as +.
Constraints
• Constraint for positive samples: w · x₊ + b ≥ 1
• Likewise for negative samples: w · x₋ + b ≤ −1

Combining Constraints
• To bring the above inequalities together we introduce another variable yᵢ, with yᵢ = +1 for positive samples and yᵢ = −1 for negative samples.
• Both constraints then become: yᵢ (w · xᵢ + b) − 1 ≥ 0
• For support vectors (points lying exactly on the edge of the margin): yᵢ (w · xᵢ + b) − 1 = 0
Width
• Take a positive support vector x₊ and a negative support vector x₋; both are in the gutter (on the hyperplanes bounding the margin). The width of the road is the projection of (x₊ − x₋) onto the unit normal:
  width = (x₊ − x₋) · w / ‖w‖
• Using the support-vector equalities w · x₊ + b = 1 (positive samples) and w · x₋ + b = −1 (negative samples), this simplifies to:
  width = 2 / ‖w‖
• Maximize 2 / ‖w‖, which is the same as minimizing ‖w‖, or equivalently minimizing ½ ‖w‖².

SVM Objective
OBJECTIVE: minimize ½ ‖w‖²
CONSTRAINT: yᵢ (w · xᵢ + b) − 1 ≥ 0
This is a constrained optimization problem.
Lagrange Multipliers
Lagrangian (the objective plus the constraints weighted by multipliers αᵢ ≥ 0):
  L_P = ½ ‖w‖² − Σᵢ αᵢ [ yᵢ (w · xᵢ + b) − 1 ]
OBJECTIVE: minimize ½ ‖w‖²   CONSTRAINT: yᵢ (w · xᵢ + b) − 1 ≥ 0

Solving the PRIMAL
∂L_P/∂w = 0  ⇒  w = Σᵢ αᵢ yᵢ xᵢ   (the normal vector w is a linear combination of the support vectors)
∂L_P/∂b = 0  ⇒  Σᵢ αᵢ yᵢ = 0

PRIMAL → DUAL
SVM Objective (DUAL)
OBJECTIVE: maximize  L_D = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)   (equivalently, minimize its negative)
CONSTRAINT: αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0
The SVM objective depends only on the dot products of pairs of support vectors.
Decision Rule
Substituting w = Σᵢ αᵢ yᵢ xᵢ into the decision rule, a new sample u is classified as + if
  Σᵢ αᵢ yᵢ (xᵢ · u) + b ≥ 0
So whether a new sample falls on the positive side of the road depends only on the dot products of the support vectors with the unknown sample.
Points to Consider
• The SVM problem is a constrained minimization problem.
• To find the widest road between the different samples we only need to consider dot products of support vectors (verified numerically in the sketch below).
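The following hedged sketch (toy data, illustrative only) checks this numerically with scikit-learn: the decision value for a new sample is reproduced as a weighted sum of dot products between that sample and the fitted support vectors, using the stored dual coefficients αᵢ yᵢ.

```python
# Verify that the decision value is a weighted sum of support-vector dot products.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0, cluster_std=0.8)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

u = X[:1]                                   # one "unknown" sample
# dual_coef_ stores alpha_i * y_i for each support vector.
manual = (clf.dual_coef_ @ (clf.support_vectors_ @ u.T)) + clf.intercept_
print(manual.ravel())                       # decision value from support-vector dot products
print(clf.decision_function(u))             # the library's decision value: should match
```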
Slack Variables
• Separable case: every point can satisfy the margin constraint yᵢ (w · xᵢ + b) ≥ 1.
• Non-separable case: some points unavoidably end up on the wrong side of the margin, so each point is allowed a slack ξᵢ ≥ 0 measuring how far it violates the margin, and the total slack is penalized with a constant C.

PRIMAL Objective
LINEARLY SEPARABLE CASE: minimize ½ ‖w‖² subject to yᵢ (w · xᵢ + b) ≥ 1
LINEARLY NON-SEPARABLE CASE: minimize ½ ‖w‖² + C Σᵢ ξᵢ subject to yᵢ (w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0

DUAL Objective
LINEARLY SEPARABLE CASE: maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ) subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0
LINEARLY NON-SEPARABLE CASE: same objective, with the box constraint 0 ≤ αᵢ ≤ C and Σᵢ αᵢ yᵢ = 0
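A hedged sketch of the effect of the slack penalty (the toy overlapping dataset and the C values are assumptions): a small C tolerates more margin violations, which typically leaves more points inside the margin as support vectors.

```python
# The penalty C trades off margin width against slack.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so the data is not linearly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:6}: {clf.n_support_.sum()} support vectors")
```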
KERNEL TRICK
Increasing Model Complexity
• Non-linear dataset with n features (~n-dimensional).
• Match the complexity of the data with the complexity of the model.
• A linear classifier alone may not be enough; we can improve accuracy by transforming the input feature space.
• For datasets with a lot of features, it becomes next to impossible to try out all possible transformations by hand.
https://www.youtube.com/watch?v=3liCbRZPrZA
Increasing Model Capacity
LINEAR CLASSIFIERS:              y(x) = w₀ + wᵀ x
GENERALIZED LINEAR CLASSIFIERS:  y(x) = w₀ + Σⱼ₌₁ᴹ wⱼ φⱼ(x) = Σⱼ₌₀ᴹ wⱼ φⱼ(x)   (with φ₀(x) = 1)

KERNEL TRICK
With the original features, the dual objective is
  L_D = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)
After transforming the inputs with φ(·), it becomes
  L_D = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ φ(xᵢ) · φ(xⱼ)
Kernel Trick
• For a given pair of vectors (in a lower-dimensional feature space) and a transformation into a higher-dimensional space, there exists a function (the Kernel Function) which can compute the dot product in the higher-dimensional space without explicitly transforming the vectors into the higher-dimensional space first.

  L_D = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ φ(xᵢ) · φ(xⱼ)

KERNEL FUNCTION
  K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)

  L_D = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
Kernel functions
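The kernel-functions slide in the original deck is an image; commonly used choices include the linear, polynomial, RBF (Gaussian), and sigmoid kernels. A hedged scikit-learn sketch comparing them on a toy non-linear dataset (the dataset and resulting scores are illustrative, not from the slides):

```python
# Comparing common kernel functions on a toy non-linear dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel)                 # default C=1.0, gamma="scale"
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>8s} kernel: mean CV accuracy = {score:.3f}")
```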
SVM Hyperparameters
• Parameter C : Penalty parameter
• Large Value of parameter C => small margin
• Small Value of parameter C => Large margin
• Parameter gamma : Specific to Gaussian RBF
• Large Value of parameter gamma => small gaussian
• Small Value of parameter gamma => Large gaussian
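A minimal sketch of tuning C and gamma with a grid search; the dataset and the parameter grid values below are illustrative assumptions, not recommendations from the slides.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # larger C -> smaller margin, less regularization
    "gamma": [0.01, 0.1, 1, 10],   # larger gamma -> narrower Gaussians, more flexible boundary
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```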
Multiclass Classification Using SVM
In its most basic form, SVM doesn't support multiclass classification. For multiclass classification, the same principle is utilized after breaking down the multiclass problem into smaller subproblems, all of which are binary classification problems.
The popular methods used to perform multiclass classification with SVM are as follows:
One vs One (OVO) approach
One vs All (OVA) approach
Directed Acyclic Graph (DAG) approach
One vs One (OVO)
This technique breaks down our multiclass classification problem into subproblems which are binary classification problems. With this strategy, we get one binary classifier for each pair of classes. For the final prediction on any input, we use majority voting across those classifiers, with the distance from the margin as the confidence criterion (a code sketch follows the 3-class example below).
The major problem with this approach is that we have to train too many SVMs.
Consider a multiclass / multi-label problem with L categories. Then:
For the (s, t)-th classifier:
– Positive samples: all the points in class s ({ xᵢ : s ∈ yᵢ })
– Negative samples: all the points in class t ({ xᵢ : t ∈ yᵢ })
– f_s,t(x): the decision value of this classifier
  (a large value of f_s,t(x) ⇒ label s has a higher probability than label t)
– f_t,s(x) = −f_s,t(x)
– Prediction: f(x) = argmax_s ( Σ_t f_s,t(x) )
Let’s have an example of 3 class classification problem: Green, Red, and Blue.
In the One vs One approach, we try to find the hyperplane that separates every pair of classes, neglecting the points of the third class.
For example, here the Red-Blue line tries to maximize the separation only between the blue and red points, while it has nothing to do with the green points.
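A hedged scikit-learn sketch of the OVO strategy (the toy 3-class dataset is an assumption; note that sklearn's SVC also uses a pairwise strategy internally by default):

```python
# One vs One: train L(L-1)/2 binary SVMs, one per pair of classes.
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # classes 0, 1, 2

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print("Number of pairwise SVMs:", len(ovo.estimators_))       # 3 for L = 3 classes
print("Predicted class for the first sample:", ovo.predict(X[:1]))
```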
One vs All (OVA)
In this technique, if we have an N-class problem, then we learn N SVMs:
SVM number 1 learns "class_output = 1" vs "class_output ≠ 1"
SVM number 2 learns "class_output = 2" vs "class_output ≠ 2"
:
SVM number N learns "class_output = N" vs "class_output ≠ N"
Then, to predict the output for a new input, just predict with each of the built SVMs and find which one puts the prediction farthest into the positive region (this behaves as a confidence criterion for a particular SVM).
Now, a very important question comes to mind: "Are there any challenges in training these N SVMs?"
Yes, there are some challenges in training these N SVMs:
1. Too much computation: To implement the OVA strategy, each SVM must be trained on all the training points, which increases our computation.
2. Problems become unbalanced: Suppose you are working on the MNIST dataset, in which there are 10 classes from 0 to 9, with 1000 points per class. Then, for any one of the two-class SVMs, one class will have 9000 points and the other will have only 1000 data points, so our problem becomes unbalanced.
Now, how do we address this unbalanced problem?
You have to take some representatives (a subsample) from the class which has more training samples, i.e. the majority class. You can do this using some of the techniques listed below (a small sketch of random subsampling follows the list):
– Use the 3-sigma rule of the normal distribution: fit the data to a normal distribution and then subsample accordingly so that the class distribution is maintained.
– Pick some data points randomly from the majority class.
– Use a popular resampling technique such as SMOTE (note that SMOTE balances the classes by oversampling the minority class rather than subsampling the majority one).
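A minimal sketch of the second option, random subsampling of the majority class; the class sizes (9000 vs 1000, as in the MNIST example above) and the synthetic features are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_major = rng.normal(size=(9000, 2))   # e.g. the "rest" class in an OVA split
X_minor = rng.normal(size=(1000, 2))   # the single positive class

# Pick 1000 random representatives from the majority class to balance the problem.
idx = rng.choice(len(X_major), size=len(X_minor), replace=False)
X_major_sub = X_major[idx]
print(X_major_sub.shape, X_minor.shape)   # (1000, 2) (1000, 2)
```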
Consider a multiclass / multi-label problem with L categories. Then:
For the t-th classifier:
– Positive samples: all the points in class t ({ xᵢ : t ∈ yᵢ })
– Negative samples: all the points not in class t ({ xᵢ : t ∉ yᵢ })
– fₜ(x): the decision value for the t-th classifier
  (a large value of fₜ(x) ⇒ higher probability that x is in class t)
– Prediction: f(x) = argmax_t fₜ(x)
In the One vs All approach, we try to find a hyperplane to
separate the classes. This means the separation takes all points
into account and then divides them into two groups in which there is
a group for the one class points and the other group for all other
points.
For example, here the green line tries to maximize the gap between the green points and all other points at once.
NOTE: A single SVM does binary
classification and can differentiate between
two classes. So according to the two above
approaches, to classify the data points from
L classes data set:
In the One vs All approach, the classifier
can use L SVMs.
In the One vs One approach, the classifier
can use L(L-1)/2 SVMs.
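A hedged scikit-learn sketch of the OVA (one vs rest) strategy on the same toy 3-class data as above (illustrative only); the predicted class is the one whose SVM pushes the sample farthest into the positive region.

```python
# One vs All: train L binary SVMs, each class against all the others.
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print("Number of SVMs trained:", len(ova.estimators_))      # 3 (= L)
print("Decision values:", ova.decision_function(X[:1]))     # one value per class
print("Predicted class:", ova.predict(X[:1]))               # the largest decision value wins
```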
Directed Acyclic Graph (DAG)
This approach is more hierarchical in nature and tries to address the problems of both the One vs One and One vs All approaches.
This is a graphical approach in which we group the classes based on some logical grouping.
Benefits: This approach requires fewer SVM trainings than the OVA approach, and it reduces the class-imbalance problem that the OVA approach suffers from.
Problem: If the dataset is already given in the form of different groups (e.g. the CIFAR-10 image classification dataset), then we can apply this approach directly; but if the groups are not given, the problem with this approach is finding a logical grouping in the dataset, i.e. we have to pick the logical grouping manually.
What do I really do?
The advantages of support vector machines are:
• Effective in high dimensional spaces.
• Still effective in cases where number of dimensions is greater
than the number of samples.
• Uses a subset of training points in the decision function (called
support vectors), so it is also memory efficient.
• Versatile: different Kernel functions can be specified for the
decision function. Common kernels are provided, but it is also
possible to specify custom kernels.
The disadvantages of support vector machines include:
• If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
• SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the sketch below).
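A minimal sketch of that last point (toy data is an assumption): in scikit-learn, probability estimates require probability=True, which triggers the extra internal cross-validation and makes fitting slower.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

clf = SVC(kernel="rbf", probability=True)   # slower to fit because of the internal CV
clf.fit(X, y)
print(clf.predict_proba(X[:3]))             # class-membership probabilities
print(clf.decision_function(X[:3]))         # raw (non-probabilistic) decision values
```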
Questions
Source
• https://www.quora.com/What-are-C-and-gamma-with-regards-to-a-support-vector-machine
• https://www.quora.com/How-can-I-choose-the-parameter-C-for-SVM
• https://www.youtube.com/watch?v=_PwhiWxHK8o
• https://www.youtube.com/watch?v=N1vOgolbjSc
• https://medium.com/@pushkarmandot/what-is-the-significance-of-c-value-in-support-vector-machine-28224e852c5a
• https://towardsdatascience.com/understanding-support-vector-machine-part-1-lagrange-multipliers-5c24a52ffc5e
• https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d
• http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf
• https://www.quora.com/What-is-the-kernel-trick