Advanced Computing Seminar  Data Mining and Its Industrial Applications  — Chapter 8 —   Support Vector Machines   Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr  Knowledge and Software Engineering Lab Advanced Computing Research Centre School of Computer and Information Science University of South Australia
Outline Introduction Support Vector Machine Non-linear Classification SVM and PAC Applications Summary
History SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis. SVMs were introduced by Boser, Guyon and Vapnik at COLT-92. Initially popularized in the NIPS community, they are now an important and active area of Machine Learning research, with special issues of the Machine Learning Journal and the Journal of Machine Learning Research devoted to them.
What is SVM? SVMs are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space (kernel functions), are trained with a learning algorithm from optimization theory (Lagrangian methods), and implement a learning bias derived from statistical learning theory (generalisation). SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
Linear Classifiers f(x, w, b) = sign(w · x - b), where points in the figures are labelled +1 or -1. How would you classify this data? (This sequence of slides shows the same data set with several different candidate separating lines.) Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin f(x, w, b) = sign(w · x - b). The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM, or LSVM). Copyright © 2001, 2003, Andrew W. Moore
Model of Linear Classification Binary classification is frequently performed by using a real-valued hypothesis function f(x) = ⟨w, x⟩ + b. The input x is assigned to the positive class if f(x) ≥ 0, and otherwise to the negative class.
The Concept of a Hyperplane For a binary linearly separable training set, we can find at least one hyperplane (w, b) which divides the space into two half-spaces. The definition of a hyperplane is given below.
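In the usual notation, the hyperplane (w, b) is the set of points

\{ x \in \mathbb{R}^m : \langle w, x \rangle + b = 0 \},

with the two half-spaces given by \langle w, x \rangle + b > 0 and \langle w, x \rangle + b < 0.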
Tuning the Hyperplane (w, b) The Perceptron Algorithm was proposed by Frank Rosenblatt in 1958. Preliminary definition: a positive functional margin of an example (x_i, y_i) implies correct classification of (x_i, y_i).
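In the usual notation, the functional margin of an example (x_i, y_i) with respect to (w, b) is

\gamma_i = y_i (\langle w, x_i \rangle + b),

and \gamma_i > 0 implies correct classification of (x_i, y_i).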
The Perceptron Algorithm On a linearly separable training set, the number of mistakes the perceptron makes is bounded, as shown below.
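The bound referred to here is Novikoff's theorem; in its standard form, if \|x_i\| \le R for all i and some unit-norm w achieves geometric margin \gamma > 0 on the whole training set, the perceptron makes at most

\left( \frac{R}{\gamma} \right)^2

mistakes (the version with an explicit bias term, as in Cristianini and Shawe-Taylor, gives (2R/\gamma)^2).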
The Geometric Margin The geometric margin of an example (x_i, y_i) is its Euclidean distance from the decision boundary.
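In the usual notation, the geometric margin is the functional margin normalised by the weight norm:

\gamma_i^{\mathrm{geo}} = \frac{y_i (\langle w, x_i \rangle + b)}{\|w\|}.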
The Geometric Margin The margin of a training set S is the maximum geometric margin over all hyperplanes. The Maximal Margin Hyperplane is a hyperplane realising this maximum geometric margin; a linear classifier is optimal if it forms the Maximal Margin Hyperplane.
How to Find the Optimal Solution? The drawback of the perceptron algorithm is that it may give a different solution depending on the order in which the examples are processed. The superiority of SVMs is that this kind of learning machine tunes the solution based on optimization theory.
The Maximal Margin Classifier The simplest model of SVM: it finds the maximal margin hyperplane in a chosen kernel-induced feature space. This is a convex optimization problem: minimizing a quadratic function under linear inequality constraints.
Support Vector Classifiers Support vector machines (Cortes and Vapnik, 1995) are well suited for high-dimensional data and binary classification. Training set D = {(x_i, y_i), i = 1, …, n}, with x_i ∈ R^m and y_i ∈ {-1, 1}. Linear discriminant classifier: separating hyperplane { x : g(x) = w^T x + w_0 = 0 }, with model parameters w ∈ R^m and w_0 ∈ R.
Formalizing the Geometric Margin Assume that the functional margin of the hyperplane is fixed to 1. The geometric margin is then 1/‖w‖, so in order to find the maximum margin we must find the minimum of ‖w‖.
Minimizing the Norm Because maximizing the geometric margin 1/‖w‖ is equivalent to minimizing ‖w‖, we can re-formalize the optimization problem as a quadratic program, shown below.
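A standard statement of the resulting quadratic program (assuming the functional margin is fixed to 1, as on the previous slide):

\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \ge 1, \; i = 1, \dots, n.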
Minimizing the Norm Use the Lagrangian function; setting its derivatives to zero and resubstituting into the primal, we obtain the dual problem, shown below.
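In the usual derivation, the Lagrangian is

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right], \quad \alpha_i \ge 0;

setting the derivatives with respect to w and b to zero gives w = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0, and resubstituting yields the dual

\max_{\alpha \ge 0} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0.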
Minimizing the Norm Finding the minimum of the primal is equivalent to finding the maximum of the dual. Strategies for solving this differentiable optimization problem include decomposition methods and Sequential Minimal Optimization (SMO); a sketch using an off-the-shelf SMO-based solver follows.
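As a concrete illustration (not part of the original slides), the sketch below trains a linear maximal-margin classifier with scikit-learn, whose SVC implementation uses an SMO-style decomposition internally; the toy data set and the large C value are invented for the example.

```python
# A minimal sketch of training an (approximately hard-margin) linear SVM with scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)           # the x_i with non-zero alpha_i
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("prediction for (0.5, 0.5):", clf.predict([[0.5, 0.5]]))
```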
The Support Vectors The complementarity condition of the optimization problem implies that α_i is non-zero only for inputs x_i whose functional margin is exactly one, i.e. those lying closest to the hyperplane; the corresponding examples x_i are called support vectors.
The Optimal Hypothesis (w, b) The two parameters can be obtained from the optimal dual variables α, and the resulting hypothesis is expressed entirely in terms of inner products with the training examples, as shown below.
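In the usual notation, the solution can be written as

w = \sum_{i} \alpha_i y_i x_i, \qquad b = y_{sv} - \langle w, x_{sv} \rangle \;\; \text{for any support vector } x_{sv},

and the resulting hypothesis is

f(x) = \operatorname{sign}\!\left( \sum_{i} \alpha_i y_i \langle x_i, x \rangle + b \right).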
Soft Margin Optimization The main problem with the maximal margin classifier is that it always produces a perfectly consistent hypothesis, i.e. a hypothesis with no training error, even when the data are noisy. The remedy is to relax the margin constraints, as shown below.
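A standard statement of the soft-margin relaxation, with slack variables \xi_i and a trade-off parameter C:

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.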
Non-linear Classification The problem: the maximal margin classifier is an important concept, but it cannot be used in many real-world problems, because in general there will be no linear separation in the input space. The solution: map the data into another space in which it can be separated linearly.
A Learning Machine A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = ±1, where α is some vector of adjustable parameters.
Some Definitions Given some machine f, and under the assumptions that all training points (x_k, y_k) were drawn i.i.d. from some distribution and that future test points will be drawn from the same distribution, define TESTERR (the probability that f misclassifies a new point) and TRAINERR (the fraction of the training set that f misclassifies); this is the official terminology used below. R = # training set data points.
Vapnik-Chervonenkis Dimension Given some machine f, let h be its VC dimension. h is a measure of f's power (h does not depend on the choice of training set). Vapnik showed that, with probability 1 - η, the test error is bounded by the training error plus a term that depends only on h and R, as shown below. This gives us a way to estimate the error on future data based only on the training error and the VC-dimension of f.
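The bound on this slide is usually written as

\text{TESTERR} \le \text{TRAINERR} + \sqrt{ \frac{h \left( \ln \frac{2R}{h} + 1 \right) - \ln \frac{\eta}{4} }{R} },

holding with probability 1 - \eta over the draw of the R training points.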
Structural Risk Minimization Let φ(f) = the set of functions representable by f. Suppose φ(f_1) ⊆ φ(f_2) ⊆ … ⊆ φ(f_6); then h(f_1) ≤ h(f_2) ≤ … ≤ h(f_6). We're trying to decide which machine to use, so we train each machine and make a table with one row per machine f_i (i = 1, …, 6) and columns for TRAINERR, the VC-confidence term, the probable upper bound on TESTERR, and the resulting choice; we choose the machine with the lowest upper bound.
Kernel-Induced Feature Space Mapping the data from input space X into feature space F.
Implicit Mapping into Feature Space For a non-linearly separable data set, we can modify the hypothesis to implicitly map the data into another feature space.
Kernel Function A kernel is a function K such that K(x, z) = ⟨φ(x), φ(z)⟩ for all x, z in the input space, where φ is a mapping into a feature space. The benefit is that it solves the computational problem of working with many dimensions.
Kernel function
The Polynomial Kernel This kind of kernel represents the inner product of two vectors (points) in a feature space of higher dimension. For example:
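In the usual notation, the polynomial kernel of degree d over \mathbb{R}^m,

K(x, z) = (\langle x, z \rangle + c)^d,

corresponds to a feature space of dimension \binom{m + d}{d}. For example, with m = 2, c = 0 and d = 2,

(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = \langle (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2), (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) \rangle.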
Text Categorization Inductive learning: the input is a feature-vector representation of a document, and the output is f(x) = confidence(class). In the case of text classification, the attributes are words in the document, and the classes are the categories.
PROPERTIES OF TEXT-CLASSIFICATION TASKS High-Dimensional Feature Space. Sparse Document Vectors. High Level of Redundancy.
Text Representation and Feature Selection Documents can be represented by binary features, by term frequency (TF), or by TF weighted with inverse document frequency (IDF), where n is the total number of documents and DF(w) is the number of documents the word w occurs in.
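A standard form of this weighting (the exact formula on the slide is not preserved) is

\text{TFIDF}(w, d) = \text{TF}(w, d) \cdot \log \frac{n}{\text{DF}(w)},

where TF(w, d) is the number of times word w occurs in document d.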
Learning SVMs To learn the vector of feature weights, one can use linear SVMs, polynomial classifiers, or radial basis functions.
Processing Text files are processed to produce a vector of words. Select the 300 words with the highest mutual information with each category (after removing stopwords). A separate classifier is learned for each category.
An example - Reuters (trends & controversies) Category: interest. Weight vector with large positive weights: prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46); large negative weights: group (–.24), year (–.25), sees (–.33), world (–.35), and dlrs (–.71).
Text Categorization Results Dumais et al. (1998)
Applying Kernels to the Linear Classifier Substitute the kernel into the hypothesis and into the margin optimization, as shown below.
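In the usual notation, the kernel K replaces the inner product in both places: the hypothesis becomes

f(x) = \operatorname{sign}\!\left( \sum_i \alpha_i y_i K(x_i, x) + b \right),

and the dual objective becomes \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j).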
SVMs and PAC Learning Theorems connect PAC theory to the size of the margin. Basically, the larger the margin, the better the expected accuracy. See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000.
PAC and the Number of Support Vectors The fewer the support vectors, the better the generalization will be. Recall that non-support vectors are correctly classified and don't change the learned model if left out of the training set. So the expected generalization error is bounded by the expected fraction of training examples that end up as support vectors (a leave-one-out argument).
VC-dimension of an SVM Very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension, where Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and Margin is the smallest margin we'll let the SVM use. This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, the RBF width, etc., but most people just use cross-validation. Copyright © 2001, 2003, Andrew W. Moore
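The bound alluded to here is usually stated as

h \le \min\!\left( \left\lceil \frac{\text{Diameter}^2}{\text{Margin}^2} \right\rceil, \; m \right) + 1,

where m is the dimensionality of the feature space; this is the loose form referred to above.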
Finding Non-Linear Separating Surfaces Map inputs into a new space. Example: features (x1, x2) = (5, 4). Example: features (x1, x2, x1^2, x2^2, x1*x2) = (5, 4, 25, 16, 20). Solve the SVM program in this new space. Computationally complex if there are many features, but a clever trick exists.
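As an illustration of that trick (not part of the original slides), the short sketch below checks numerically that a degree-2 polynomial kernel equals the inner product of explicitly mapped features; the feature map shown uses the usual sqrt(2) scaling on the cross term so the identity holds exactly, which differs slightly from the unscaled mapping listed on the slide.

```python
# Verify that a degree-2 polynomial kernel equals an explicit feature-space inner product.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input (sqrt(2) scaling on the cross term)."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel K(u, v) = <u, v>^2."""
    return np.dot(u, v) ** 2

u = np.array([5.0, 4.0])
v = np.array([1.0, 2.0])

print(np.dot(phi(u), phi(v)))   # inner product computed in the mapped space
print(poly_kernel(u, v))        # same value, computed without the explicit mapping
```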
Summary Maximize the margin between positive and negative examples (connects to PAC theory) Non-linear Classification The support vectors contribute to the solution Kernels map examples into a new, usually non-linear space
References Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. Andrew W. Moore. cmsc726: SVMs (tutorial slides). http://www.cs.cmu.edu/~awm/tutorials C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998. Thorsten Joachims (joachims_01a). A Statistical Learning Model of Text Classification for Support Vector Machines.
www.intsci.ac.cn/shizz/ Questions?!