4.Support Vector Machines.ppt machine learning and development

1
Machine Learning
Support Vector Machines

2
Perceptron Revisited: Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
  $w^T x + b = 0$ (the separator)
  $w^T x + b < 0$ (one class)
  $w^T x + b > 0$ (the other class)
  $g(x) = \mathrm{sign}(w^T x + b)$
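The decision rule g(x) = sign(w^T x + b) can be sketched in a few lines. The weights `w`, the bias `b` and the sample points are made-up values for illustration, and sending a score of exactly 0 to -1 is an assumption of this sketch, not something the slide specifies.

```python
def g(x, w, b):
    """Linear discriminant: +1 if w.x + b > 0, otherwise -1 (ties go to -1 by assumption)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical separator x1 + x2 - 1 = 0
w, b = [1.0, 1.0], -1.0
labels = [g(x, w, b) for x in [(2.0, 2.0), (0.0, 0.0)]]
print(labels)
```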

3
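Linear Discriminant Function
• g(x) is a linear function: $g(x) = w^T x + b$
• The boundary $w^T x + b = 0$ is a hyperplane in the feature space, with $w^T x + b < 0$ on one side and $w^T x + b > 0$ on the other
• (Unit-length) normal vector of the hyperplane: $n = \frac{w}{\|w\|}$
[Figure: hyperplane in the (x1, x2) plane with its normal vector n]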

4
• How would you classify these points using a linear discriminant function in order to minimize the error rate?
[Figure: points labelled +1 and -1 in the (x1, x2) plane]
• Infinite number of answers!

5
[Figure: points labelled +1 and -1 in the (x1, x2) plane]

6
[Figure: points labelled +1 and -1 in the (x1, x2) plane]

7
[Figure: points labelled +1 and -1 in the (x1, x2) plane]
• Which one is the best?

8
Large Margin Linear Classifier
• The linear discriminant function (classifier) with the maximum margin is the best
• The margin is defined as the width by which the boundary could be widened before hitting a data point (the "safe zone")
• Why is it the best?
  - Robust to outliers, and thus strong generalization ability
[Figure: maximum-margin boundary and its "safe zone" between points labelled +1 and -1 in the (x1, x2) plane]

9
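Classification Margin
• Distance from example $x_i$ to the separator is $r = \frac{w^T x_i + b}{\|w\|}$
• Examples closest to the hyperplane are support vectors.
• Margin $\rho$ of the separator is the distance between the support vectors of the two classes, measured perpendicular to the hyperplane.
[Figure: distance r from a point to the separator, and the margin ρ]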
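As a small illustration of the distance formula, the sketch below computes r for a few points and picks the one closest to the hyperplane. The separator and the points are invented for the example. The value returned is signed, so its absolute value is the geometric distance.

```python
import math

def distance(x, w, b):
    """Signed distance r = (w.x + b) / ||w|| from point x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical separator 3*x1 + 4*x2 - 5 = 0 and three sample points
w, b = [3.0, 4.0], -5.0
points = [(3.0, 4.0), (1.0, 1.0), (0.0, 0.0)]
dists = [distance(p, w, b) for p in points]

# The point with the smallest |r| is the candidate support vector
closest = min(points, key=lambda p: abs(distance(p, w, b)))
```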

10
Maximum Margin Classification
 Maximizing the margin is good according to intuition and
PAC theory.
 Implies that only support vectors matter; other training
examples are ignorable.

11
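Large Margin Linear Classifier
• We know that, for support vectors $x^+$ and $x^-$ on the two margin boundaries:
  $w^T x^+ + b = 1$
  $w^T x^- + b = -1$
• The margin width is:
  $M = (x^+ - x^-) \cdot n = (x^+ - x^-) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}$
[Figure: margin between the support vectors x+ and x- in the (x1, x2) plane; points labelled +1 and -1]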
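The margin width can be checked numerically: projecting x+ - x- onto the unit normal should give the same number as 2/||w||. The weights and the two support vectors below are assumed values chosen so that w·x+ + b = 1 and w·x- + b = -1 hold.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Assumed example in canonical form: w.x_plus + b = 1 and w.x_minus + b = -1
w, b = [0.5, 0.5], 0.0
x_plus, x_minus = (1.0, 1.0), (-1.0, -1.0)

norm_w = math.sqrt(dot(w, w))
n = [wi / norm_w for wi in w]                      # unit normal of the hyperplane

diff = [p - m for p, m in zip(x_plus, x_minus)]
M_projection = dot(diff, n)                        # (x+ - x-) . n
M_formula = 2 / norm_w                             # 2 / ||w||
```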

12
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:
  Find w and b such that
  $\rho = \frac{2}{\|w\|}$ is maximized
  and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$
• Which can be reformulated as:
  Find w and b such that
  $\Phi(w) = \|w\|^2 = w^T w$ is minimized
  and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$

13
Solving the Optimization Problem
• Need to optimize a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
• The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every inequality constraint in the primal (original) problem:
  Primal: find w and b such that
  $\Phi(w) = w^T w$ is minimized
  and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$
  Dual: find $\alpha_1 \dots \alpha_n$ such that
  $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and
  (1) $\sum_i \alpha_i y_i = 0$
  (2) $\alpha_i \ge 0$ for all $\alpha_i$
  (The dual is written for the equivalent objective $\frac{1}{2} w^T w$, which has the same minimizer.)
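To make the dual concrete, the sketch below evaluates Q(α) and checks constraints (1) and (2) on a two-point toy problem. The data set and the candidate α vectors are assumptions for illustration. For this symmetric problem Q(a, a) = 2a - 4a², so the maximum is at a = 0.25.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j x_i.x_j"""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, y, tol=1e-9):
    """Constraint (1): sum_i alpha_i y_i = 0, and constraint (2): alpha_i >= 0."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return balanced and all(a >= 0 for a in alpha)

# Toy problem: one positive and one negative example
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]

best = dual_objective([0.25, 0.25], X, y)    # the maximizer for this toy problem
lower = dual_objective([0.10, 0.10], X, y)   # feasible but not optimal
```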

14
The Optimization Problem Solution
• Given a solution $\alpha_1 \dots \alpha_n$ to the dual problem, the solution to the primal is:
  $w = \sum_i \alpha_i y_i x_i$, $b = y_k - \sum_i \alpha_i y_i x_i^T x_k$ for any $\alpha_k > 0$
• Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector.
• Then the classifying function is (note that we don't need w explicitly):
  $f(x) = \sum_i \alpha_i y_i x_i^T x + b$
• Notice that it relies on an inner product between the test point x and the support vectors $x_i$.
• Also keep in mind that solving the optimization problem involved computing the inner products $x_i^T x_j$ between all training points.
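A short sketch of going from a dual solution back to w, b and the classifying function. The two-point data set and the multipliers α = (0.25, 0.25) are assumed toy values; note that `f` only uses inner products with the training points, never w itself.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_from_dual(alpha, X, y):
    """w = sum_i alpha_i y_i x_i ; b = y_k - sum_i alpha_i y_i x_i.x_k for some alpha_k > 0."""
    dim = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(dim)]
    k = next(i for i, a in enumerate(alpha) if a > 0)      # index of any support vector
    b = y[k] - sum(alpha[i] * y[i] * dot(X[i], X[k]) for i in range(len(X)))
    return w, b

def f(x, alpha, X, y, b):
    """Classifying function written purely in terms of inner products."""
    return sum(alpha[i] * y[i] * dot(X[i], x) for i in range(len(X))) + b

# Toy dual solution (assumed)
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]

w, b = primal_from_dual(alpha, X, y)
score_pos = f((2.0, 0.0), alpha, X, y, b)
score_neg = f((-3.0, 1.0), alpha, X, y, b)
```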

15
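Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• What should our quadratic optimization criterion be? Minimize
  $\frac{1}{2} w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$
  (here $\varepsilon_k$ is the slack of the k-th of the R training examples, i.e. $\xi_k$ in the notation above)
[Figure: soft-margin separator with the slack distances ξi marked]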

16
Soft Margin Classification Mathematically
• The old formulation:
  $\Phi(w) = w^T w$ is minimized
  and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$
• Modified formulation incorporates slack variables:
  $\Phi(w) = w^T w + C \sum_i \xi_i$ is minimized
  and for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
• Parameter C can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.
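For fixed w and b, the smallest slack that satisfies the constraints is ξ_i = max(0, 1 - y_i(w·x_i + b)). The sketch below uses that to evaluate the soft-margin objective; the weights, data and C are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """Return (w.w + C * sum(xi), slacks), using the smallest feasible slack per example."""
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return dot(w, w) + C * sum(slacks), slacks

# Hypothetical separator and four points; the last two violate the margin
w, b = (1.0, 0.0), 0.0
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0), (-0.5, 0.0)]
y = [1, -1, 1, 1]

objective, slacks = soft_margin_objective(w, b, X, y, C=10.0)
```

A slack between 0 and 1 means the point is inside the margin but still on the correct side; a slack above 1 means it is misclassified. Raising C makes those violations more expensive relative to the margin term.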

17
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great:
  [Figure: 1-D data along the x axis]
• But what are we going to do if the dataset is just too hard?
  [Figure: 1-D data along the x axis]
• How about… mapping data to a higher-dimensional space:
  [Figure: the data plotted with axes x and x²]
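The mapping idea can be tried on a tiny 1-D data set: no single threshold on x separates it, but after mapping x to (x, x²) a straight line does. The data and the hand-picked separator in the mapped space are assumptions made for this sketch.

```python
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]            # outer points positive, inner points negative

def separable_1d(xs, ys):
    """True if some threshold t (in either orientation) classifies every point correctly."""
    cuts = sorted(xs)
    candidates = ([cuts[0] - 1]
                  + [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
                  + [cuts[-1] + 1])
    return any(all((s if x > t else -s) == y for x, y in zip(xs, ys))
               for t in candidates for s in (1, -1))

def phi(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)

# In the mapped space the line x^2 = 2.5 separates the classes (chosen by hand)
w, b = (0.0, 1.0), -2.5
predictions = [1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1 for p in map(phi, xs)]
```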

18
Non-linear SVMs: Feature spaces
 General idea: the original feature space can always be
mapped to some higher-dimensional feature space
where the training set is separable:
Φ: x → φ(x)

19
The “Kernel Trick”
• The linear classifier relies on the inner product between vectors: $K(x_i, x_j) = x_i^T x_j$
• If every datapoint is mapped into a high-dimensional space via some transformation $\Phi: x \to \varphi(x)$, the inner product becomes:
  $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
• A kernel function is a function that is equivalent to an inner product in some feature space.
• Example: 2-dimensional vectors $x = [x_1 \; x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$.
  Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:
  $K(x_i, x_j) = (1 + x_i^T x_j)^2$
  $= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
  $= [1 \;\; x_{i1}^2 \;\; \sqrt{2} x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2} x_{i1} \;\; \sqrt{2} x_{i2}]^T \, [1 \;\; x_{j1}^2 \;\; \sqrt{2} x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2} x_{j1} \;\; \sqrt{2} x_{j2}]$
  $= \varphi(x_i)^T \varphi(x_j)$, where $\varphi(x) = [1 \;\; x_1^2 \;\; \sqrt{2} x_1 x_2 \;\; x_2^2 \;\; \sqrt{2} x_1 \;\; \sqrt{2} x_2]$
• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
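The identity in the example can be spot-checked numerically. The two input vectors are arbitrary values picked for the check; `K` works in the original 2-D space while `phi` builds the 6-D feature vector explicitly.

```python
import math

def K(xi, xj):
    """Polynomial kernel (1 + xi.xj)^2, computed in the original 2-D space."""
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2

def phi(x):
    """Explicit 6-D feature map matching the kernel above."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

xi, xj = (1.0, 2.0), (3.0, -1.0)
lhs = K(xi, xj)                                           # kernel trick
rhs = sum(a * b for a, b in zip(phi(xi), phi(xj)))        # explicit inner product
```

The kernel needs one 2-D dot product and a square, while the explicit route builds two 6-D vectors first; the gap widens quickly with higher polynomial degree or input dimension.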

20
What Functions are Kernels?
• For some functions $K(x_i, x_j)$, checking that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ can be cumbersome.
• Mercer's theorem: every positive semi-definite symmetric function is a kernel.
• Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
  $K = \begin{bmatrix} K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \dots & K(x_1,x_n) \\ K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \dots & K(x_2,x_n) \\ \dots & \dots & \dots & \dots & \dots \\ K(x_n,x_1) & K(x_n,x_2) & K(x_n,x_3) & \dots & K(x_n,x_n) \end{bmatrix}$
  For any vector $z$, $z^T K z \ge 0$.
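A rough numerical spot check of the Gram-matrix condition: build the matrix for a Gaussian kernel on a few points and evaluate the quadratic form for random vectors. The points, the kernel width and the number of trials are assumptions. Passing this check is only a necessary sign, not a proof that a function is a kernel.

```python
import math
import random

def rbf(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * sigma ** 2))

def gram(points, kernel):
    """Gram matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

def quad_form(K, z):
    """z^T K z"""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
K = gram(points, rbf)

rng = random.Random(0)                       # fixed seed so the check is repeatable
values = [quad_form(K, [rng.uniform(-1, 1) for _ in points]) for _ in range(1000)]
symmetric = all(abs(K[i][j] - K[j][i]) < 1e-12
                for i in range(len(K)) for j in range(len(K)))
```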

21
Examples of Kernel Functions
• Linear: $K(x_i, x_j) = x_i^T x_j$
• Polynomial of power p: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
• Gaussian (radial-basis function network): $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
• Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$
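The four kernels listed on this slide are short enough to write out directly. The default parameter values (p, sigma, beta0, beta1) are placeholders chosen for the sketch, not recommendations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    return dot(u, v)

def polynomial(u, v, p=2):
    return (1 + dot(u, v)) ** p

def gaussian(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid(u, v, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(u, v) + beta1)

u, v = (1.0, 2.0), (3.0, 4.0)
values = {
    "linear": linear(u, v),
    "polynomial": polynomial(u, v),
    "gaussian": gaussian(u, v),
    "sigmoid": sigmoid(u, v),
}
```

One caveat worth keeping in mind: the sigmoid function is not positive semi-definite for every choice of β0 and β1, so it is a valid kernel only for some parameter settings.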

22
Support Vector Machine: Algorithm
• 1. Choose a kernel function
• 2. Choose a value for C
• 3. Solve the quadratic programming problem (many software packages available)
• 4. Construct the discriminant function from the support vectors
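The four steps can be walked through end to end on a toy problem. This is a sketch under several assumptions: the QP in step 3 is solved with a simplified SMO variant (random choice of the second multiplier) swapped in for a real solver, the data is an XOR-style set, and sigma = 0.5 and C = 10 are arbitrary picks. In practice one of the existing software packages would be used instead.

```python
import math
import random

def rbf(u, v, sigma=0.5):                          # step 1: choose a kernel
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * sigma ** 2))

def train_smo(X, y, C, kernel, tol=1e-3, max_passes=10, max_steps=10000, seed=0):
    """Step 3 (simplified SMO): optimize pairs of alphas of the soft-margin dual."""
    rng = random.Random(seed)
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0
    f = lambda i: sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b
    passes = steps = 0
    while passes < max_passes and steps < max_steps:
        changed = 0
        for i in range(n):
            steps += 1
            Ei = f(i) - y[i]
            if not ((y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0)):
                continue                           # alpha_i already meets the KKT conditions
            j = rng.choice([k for k in range(n) if k != i])
            Ej = f(j) - y[j]
            ai, aj = alpha[i], alpha[j]
            if y[i] != y[j]:                       # box bounds for the new alpha_j
                lo, hi = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                lo, hi = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if lo == hi or eta >= 0:
                continue
            new_aj = min(hi, max(lo, aj - y[j] * (Ei - Ej) / eta))
            if abs(new_aj - aj) < 1e-7:
                continue
            new_ai = ai + y[i] * y[j] * (aj - new_aj)   # keeps sum_i alpha_i y_i = 0
            b1 = b - Ei - y[i] * (new_ai - ai) * K[i][i] - y[j] * (new_aj - aj) * K[i][j]
            b2 = b - Ej - y[i] * (new_ai - ai) * K[i][j] - y[j] * (new_aj - aj) * K[j][j]
            alpha[i], alpha[j] = new_ai, new_aj
            b = b1 if 0 < new_ai < C else b2 if 0 < new_aj < C else (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

def decision(x, X, y, alpha, b, kernel):
    """Step 4: discriminant built only from the support vectors (alpha_i > 0)."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X) if a > 0) + b

# XOR-style toy data: not linearly separable in the input space
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [-1, -1, 1, 1]
alpha, b = train_smo(X, y, C=10.0, kernel=rbf)     # step 2: C = 10 (assumed)
predictions = [1 if decision(x, X, y, alpha, b, rbf) > 0 else -1 for x in X]
```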

23
Some Issues
• Choice of kernel
  - Gaussian or polynomial kernels are the most widely used non-linear kernels
  - if ineffective, more elaborate kernels are needed
  - domain experts can give assistance in formulating appropriate similarity measures
• Choice of kernel parameters
  - e.g. σ in the Gaussian kernel
  - one heuristic sets σ to the distance between the closest points with different classifications
  - in the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters
• Optimization criterion: hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested
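Setting a kernel parameter by cross-validation can be sketched as follows. To keep the example short, the classifier is a stand-in (a Gaussian-kernel-weighted vote, not an SVM), and the data set and the two candidate values of σ are invented; only the selection loop is the point here.

```python
import math

def k_fold_splits(n, k):
    """Return k (train_indices, test_indices) pairs with disjoint test folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [([j for j in range(n) if j not in fold], fold) for fold in folds]

def predict(x, train, sigma):
    """Stand-in classifier (not an SVM): sign of a Gaussian-kernel-weighted vote."""
    vote = sum(y * math.exp(-(x - xt) ** 2 / (2 * sigma ** 2)) for xt, y in train)
    return 1 if vote > 0 else -1

def cv_accuracy(data, sigma, k):
    """Mean test accuracy of `predict` over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        train = [data[i] for i in train_idx]
        hits = sum(predict(data[i][0], train, sigma) == data[i][1] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Hypothetical 1-D data: (x, label)
data = [(0.0, -1), (1.0, -1), (2.0, -1), (3.0, -1),
        (10.0, 1), (11.0, 1), (12.0, 1), (13.0, 1)]

scores = {sigma: cv_accuracy(data, sigma, k=4) for sigma in (0.01, 1.0)}
best_sigma = max(scores, key=scores.get)
```

With σ = 0.01 the kernel is so narrow that every held-out point gets a zero vote, which is why its cross-validated accuracy falls to chance level on this data.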

24
Why Is SVM Effective on High Dimensional Data?
 The complexity of trained classifier is characterized by the # of
support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training examples —
they lie closest to the decision boundary
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high

25
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99]
• Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.

26
SVM vs. Neural Network
• SVM
  - Relatively new concept
  - Deterministic algorithm
  - Nice generalization properties
  - Hard to learn: learned in batch mode using quadratic programming techniques, but faster with good optimization methods
  - Using kernels, can learn very complex functions
• Neural Network
  - Relatively old
  - Nondeterministic algorithm
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in incremental fashion
  - To learn complex functions, use a multilayer perceptron (not that trivial)

27
Summary: Support Vector Machine
• 1. Large Margin Classifier
  - Better generalization ability & less over-fitting
• 2. The Kernel Trick
  - Map data points to a higher-dimensional space in order to make them linearly separable.
  - Since only the dot product is used, we do not need to represent the mapping explicitly.

28
SVM resources
• http://www.kernel-machines.org
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/

29
Model Evaluation
 Metrics for Performance Evaluation
 How to evaluate the performance of a model?
 Methods for Performance Evaluation
 How to obtain reliable estimates?
 Methods for Model Comparison
 How to compare the relative performance among
competing models?

30
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
  - Rather than how fast it takes to classify or build models, scalability, etc.
• Confusion Matrix:

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL     Class=Yes    a           b
  CLASS      Class=No     c           d

  a: TP (true positive)
  b: FN (false negative)
  c: FP (false positive)
  d: TN (true negative)

31
Metrics for Performance Evaluation…
• Most widely-used metric:

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL     Class=Yes    a (TP)      b (FN)
  CLASS      Class=No     c (FP)      d (TN)

  Accuracy $= \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$

32
Limitation of Accuracy
 Consider a 2-class problem
 Number of Class 0 examples = 9990
 Number of Class 1 examples = 10
 If model predicts everything to be class 0,
accuracy is 9990/10000 = 99.9 %
 Accuracy is misleading because model does not detect
any class 1 example
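The numbers on this slide can be reproduced directly. The sketch assumes class 1 is treated as the positive class, so a model that always predicts class 0 has no true positives.

```python
def accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN)"""
    return (tp + tn) / (tp + fn + fp + tn)

# Model that predicts class 0 for every example; class 1 is the positive class
tp, fn, fp, tn = 0, 10, 0, 9990

acc = accuracy(tp, fn, fp, tn)
detected = tp / (tp + fn)          # fraction of class 1 examples that were found
print(f"accuracy = {acc:.1%}, class 1 detected = {detected:.0%}")
```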

34
Cost-Sensitive Measures
  Precision (p) $= \frac{a}{a + c}$
  Recall (r) $= \frac{a}{a + b}$
  F-measure (F) $= \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$
• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)
  Weighted Accuracy $= \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
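A minimal sketch of these measures in terms of the confusion-matrix counts a (TP), b (FN), c (FP) and d (TN). The counts and the weights used in the example are hypothetical.

```python
def precision(a, b, c, d):
    return a / (a + c)

def recall(a, b, c, d):
    return a / (a + b)

def f_measure(a, b, c, d):
    return 2 * a / (2 * a + b + c)

def weighted_accuracy(a, b, c, d, weights=(1, 1, 1, 1)):
    w1, w2, w3, w4 = weights
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

# Hypothetical confusion matrix: a=TP, b=FN, c=FP, d=TN
counts = (40, 10, 20, 30)

p = precision(*counts)
r = recall(*counts)
F = f_measure(*counts)
plain = weighted_accuracy(*counts)                          # unit weights: ordinary accuracy
fn_heavy = weighted_accuracy(*counts, weights=(1, 5, 1, 1)) # false negatives cost 5x
```

None of p, r or F uses d, which is why the F-measure ignores C(No|No).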






36
Methods for Performance Evaluation
 How to obtain a reliable estimate of
performance?
 Performance of a model may depend on other
factors besides the learning algorithm:
 Class distribution
 Cost of misclassification
 Size of training and test sets

37
Learning Curve
• Learning curve shows how accuracy changes with varying sample size
• Requires a sampling schedule for creating the learning curve:
  - Arithmetic sampling (Langley et al.)
  - Geometric sampling (Provost et al.)
• Effect of small sample size:
  - Bias in the estimate
  - Variance of estimate
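The two sampling schedules can be sketched as simple generators of sample sizes, under the usual reading that arithmetic sampling grows the sample by a fixed increment and geometric sampling by a fixed factor. The function names and the starting size, step, ratio and maximum are illustrative assumptions.

```python
def arithmetic_schedule(n0, step, n_max):
    """Sample sizes n0, n0 + step, n0 + 2*step, ... not exceeding n_max."""
    return list(range(n0, n_max + 1, step))

def geometric_schedule(n0, ratio, n_max):
    """Sample sizes n0, n0*ratio, n0*ratio^2, ... not exceeding n_max."""
    sizes, n = [], n0
    while n <= n_max:
        sizes.append(n)
        n *= ratio
    return sizes

arith = arithmetic_schedule(100, 100, 500)
geom = geometric_schedule(100, 2, 1600)
print(arith)
print(geom)
```

With the same number of training runs, the geometric schedule covers a much wider range of sample sizes, at the price of coarser resolution at the large end.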

38
Methods of Estimation
• Holdout
  - Reserve 2/3 for training and 1/3 for testing
• Random subsampling
  - Repeated holdout
• Cross validation
  - Partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k=n
• Stratified sampling
  - oversampling vs undersampling
• Bootstrap
  - Sampling with replacement

40
ROC (Receiver Operating Characteristic)
• Characterizes the trade-off between positive hits and false alarms
• ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis)
• Performance of each classifier is represented as a point on the ROC curve
  - changing the threshold of the algorithm, the sample distribution or the cost matrix changes the location of the point

41
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive
At threshold t (as rates within each class):
TP=0.5, FN=0.5, FP=0.12, TN=0.88
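A single ROC point for a threshold classifier on 1-dimensional data can be computed as below. The data values, labels (1 = positive, 0 = negative) and the threshold are made up for the example.

```python
def rates_at_threshold(xs, labels, t):
    """Classify x > t as positive and return (TP rate, FP rate)."""
    tp = sum(1 for x, label in zip(xs, labels) if label == 1 and x > t)
    fp = sum(1 for x, label in zip(xs, labels) if label == 0 and x > t)
    return tp / labels.count(1), fp / labels.count(0)

# Hypothetical 1-D data set
xs     = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
labels = [0,   0,   0,   1,   0,   1,   1,   1]

tpr, fpr = rates_at_threshold(xs, labels, t=4.5)
fnr, tnr = 1 - tpr, 1 - fpr        # the remaining two rates follow from these
```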

42
ROC Curve
(TP,FP):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
  - Random guessing
• Below diagonal line:
  - prediction is opposite of the true class
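Sweeping the threshold from "everything negative" to "everything positive" traces the whole curve, and the trapezoid rule gives the area under it. This is a sketch with assumed scores and labels; points are stored as (FP rate, TP rate) so that the x-coordinate comes first for the integration, and a score is counted as positive when it is at or above the threshold.

```python
def roc_points(scores, labels):
    """(FP rate, TP rate) for each threshold, from (0, 0) up to (1, 1)."""
    pos, neg = labels.count(1), labels.count(0)
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, label in zip(scores, labels) if s >= t and label == 1)
        fp = sum(1 for s, label in zip(scores, labels) if s >= t and label == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under(points):
    """Trapezoid rule over consecutive (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical scores: higher means "more likely positive"
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
labels = [0,   0,   0,   1,   0,   1,   1,   1]

points = roc_points(scores, labels)
auc = area_under(points)
perfect_auc = area_under(roc_points(scores, [0, 0, 0, 0, 1, 1, 1, 1]))
```

The diagonal from (0, 0) to (1, 1) encloses an area of 0.5, which is the random-guessing baseline, while a ranking that puts every positive above every negative reaches 1.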

43
Using ROC for Model Comparison
• No model consistently outperforms the other
  - M1 is better for small FPR
  - M2 is better for large FPR
• Area Under the ROC curve
  - Ideal: Area = 1
  - Random guess: Area = 0.5
