Text Classification
Using String Kernels
Presented by
Dibyendu Nath & Divya Sambasivan
CS 290D : Spring 2014
Huma Lodhi, Craig Saunders, et al
Department of Computer Science, Royal Holloway, University of London
Intro: Text Classification
• Task of assigning a document to one or more
categories.
• Done manually (library science) or
algorithmically (information science, data
mining, machine learning).
• Learning systems (neural networks or
decision trees) work on feature vectors,
transformed from the input space.
• Text documents cannot readily be described
by explicit feature vectors.
Problem Definition
• Input : A corpus of documents.
• Output : A kernel representing the documents.
• This kernel can then be used to classify, cluster, etc. using
existing algorithms which work on kernels, e.g. SVM or
perceptron.
• Methodology : Find a mapping and a kernel function so
that we can apply any of the standard kernel methods of
classification, clustering etc. to the corpus of documents.
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Motivation
• Text documents cannot readily be described by
explicit feature vectors.
• Feature Extraction
- Requires extensive domain knowledge
- Possible loss of important information.
• Kernel Methods
– an alternative to explicit feature extraction
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
The Kernel Trick
• Map data into feature space via mapping ϕ.
• The mapping is accessed implicitly via a kernel function.
• Construct a linear function in feature space
slide from Huma Lodhi
Kernel Function
slide from Huma Lodhi
Kernel Function – measure of similarity; returns
the inner product between mapped data points:
K(xᵢ, xⱼ) = ⟨Φ(xᵢ), Φ(xⱼ)⟩
Kernels for Sequences
• Word Kernels [WK] - Bag of Words
- Sequence of characters followed by punctuation
or space
• N-Grams Kernel [NGK]
• Substrings of n consecutive characters
• Example: “quick brown”
3-grams: qui, uic, ick, ck_, k_b, _br, bro, row, own (see the sketch below)
• String Subsequence Kernel [SSK]
• All (possibly non-contiguous) subsequences of n symbols
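A minimal Python sketch of contiguous n-gram extraction (illustrative only; the function name char_ngrams is ours, not from the slides):

def char_ngrams(text, n=3):
    """Return the list of contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("quick brown"))
# ['qui', 'uic', 'ick', 'ck ', 'k b', ' br', 'bro', 'row', 'own']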
Word Kernels
• Documents are mapped to a very high-dimensional space where
the dimensionality of the feature space is equal to the
number of distinct words in the corpus.
• Each entry of the vector represents the occurrence or
non-occurrence of a word.
• Kernel – the inner product between mapped sequences gives a
sum over all common (weighted) words
          fish   tank   sea
Doc 1       2      0      1
Doc 2       1      1      0
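A minimal Python sketch of the word-kernel inner product using the counts in the table above (illustrative only; the function name word_kernel and the example documents are ours):

from collections import Counter

def word_kernel(doc1, doc2):
    """Inner product of bag-of-words count vectors."""
    c1, c2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

doc1 = "fish sea fish"   # fish=2, tank=0, sea=1
doc2 = "fish tank"       # fish=1, tank=1, sea=0
print(word_kernel(doc1, doc2))   # 2*1 + 0*1 + 1*0 = 2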
String Subsequence Kernels
Basic Idea
Non-contiguous substrings: the subsequence “c-a-r”
in “card” – length of the occurrence = 3
in “custard” – length of the occurrence = 6
The more subsequences (of length n) two strings have in
common, the more similar they are considered
Decay Factor
Substrings are weighted according to the degree of contiguity
in a string by a decay factor λ ∊ (0,1)
Example (n = 2)
Documents we want to compare: “car” and “cat”

            c-a    c-t    a-t    c-r    a-r
φ(car)       λ²     0      0      λ³     λ²
φ(cat)       λ²     λ³     λ²     0      0

K(car, car) = 2λ⁴ + λ⁶        K(cat, cat) = 2λ⁴ + λ⁶
K(car, cat) = λ⁴
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Algorithm Definitions
• Alphabet
Let Σ be the finite alphabet
• String
A string s is a finite sequence of characters from the alphabet,
with length |s|
• Subsequence
A subsequence of a string s is given by a vector of indices i = [i₁, …, iₙ],
sorted in ascending order, such that the indexed characters spell out the sequence
E.g. the subsequence “car” in ‘lancasters’ has index vector i = [4, 5, 9]
Length of the occurrence = iₙ – i₁ + 1 = 9 – 4 + 1 = 6
Algorithm Definitions
• Feature Spaces
• Feature Mapping
The feature mapping φ for a string s is given by defining
the u coordinate φᵤ(s) for each u ∈ Σⁿ:
φᵤ(s) = Σ_{i : u = s[i]} λ^l(i)
These features measure the number of occurrences of
subsequences in the string s, weighting them according
to their lengths.
String Kernel
• The inner product between two mapped strings is a
sum over all the common weighted subsequences:
Kₙ(s, t) = Σ_{u ∈ Σⁿ} ⟨φᵤ(s), φᵤ(t)⟩ = Σ_{u ∈ Σⁿ} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^(l(i) + l(j))
• Example (n = 2): K(car, cat) = λ⁴, from the single shared feature c-a
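A brute-force Python sketch of the feature map φᵤ and the kernel above, enumerating every subsequence explicitly (exponential in n, so illustrative only; the function names are ours). It reproduces the values from the car/cat example:

from itertools import combinations
from collections import defaultdict

def phi(s, n, lam):
    """Explicit SSK feature map: add lambda**l(i) for every index vector i
    whose characters spell out a subsequence u of length n."""
    features = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        length = idx[-1] - idx[0] + 1          # l(i) = i_n - i_1 + 1
        features[u] += lam ** length
    return features

def ssk_brute(s, t, n, lam):
    """K_n(s, t) as the inner product of the explicit feature vectors."""
    fs, ft = phi(s, n, lam), phi(t, n, lam)
    return sum(fs[u] * ft[u] for u in fs.keys() & ft.keys())

lam = 0.5
print(ssk_brute("car", "cat", 2, lam), lam ** 4)               # both 0.0625
print(ssk_brute("car", "car", 2, lam), 2 * lam**4 + lam**6)    # both 0.140625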
Intermediate Kernel K'
• K' counts the length from the beginning of the matched subsequence
through to the end of the strings s and t:
K'ᵢ(s, t) = Σ_{u ∈ Σⁱ} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^(|s| + |t| − i₁ − j₁ + 2)
• Example (n = 2): the only common subsequence of “car” and “cat” is c-a,
giving K'(car, cat) = λ^(3 + 3 − 1 − 1 + 2) = λ⁶
Recursive Computation of K'
• Null sub-string: K'₀(s, t) = 1, for all s, t
• Target string is shorter than search sub-string: K'ᵢ(s, t) = 0, if min(|s|, |t|) < i
• Recursive step (i = 1, …, n−1):
K'ᵢ(sx, t) = λ K'ᵢ(s, t) + Σ_{j : t[j] = x} K'ᵢ₋₁(s, t[1 : j−1]) λ^(|t| − j + 2)
Worked example for K' (n = 2), with s = car, t = cat and sx = cart:

K'(car, cat)  = λ⁶                  (shared feature c-a)
K'(cart, cat) = λ⁷ + λ⁷ + λ⁵        (shared features c-a, c-t and a-t)
Worked example for K (n = 2), with s = car, t = cat and sx = cart:

K(car, cat)  = λ⁴                   (shared feature c-a)
K(cart, cat) = λ⁴ + λ⁵ + λ⁷         (shared features c-a, a-t and c-t)
Recursive Computation
• Null sub-string: K'₀(s, t) = 1, for all s, t
• Target string is shorter than search sub-string:
Kᵢ(s, t) = 0 and K'ᵢ(s, t) = 0, if min(|s|, |t|) < i
• K'ᵢ(sx, t) = λ K'ᵢ(s, t) + Σ_{j : t[j] = x} K'ᵢ₋₁(s, t[1 : j−1]) λ^(|t| − j + 2)
• Kₙ(sx, t) = Kₙ(s, t) + λ² Σ_{j : t[j] = x} K'ₙ₋₁(s, t[1 : j−1])
• Direct recursion: O(n |s| |t|²); evaluating the inner sums incrementally
(dynamic programming) reduces this to O(n |s| |t|)
Efficiency
• Naive evaluation over all subsequences of length n: O(|Σ|ⁿ)
• Recursion: O(n |s| |t|²)
• Dynamic programming: O(n |s| |t|)
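A minimal Python sketch of the dynamic-programming computation, following the recursions above (illustrative only; the function name ssk is ours). It agrees with the worked examples:

def ssk(s, t, n, lam):
    """String subsequence kernel K_n(s, t) in O(n |s| |t|) time."""
    S, T = len(s), len(t)
    # Kp[i][p][q] = K'_i(s[:p], t[:q]); K'_0 = 1 everywhere, higher orders start at 0.
    Kp = [[[1.0 if i == 0 else 0.0 for _ in range(T + 1)] for _ in range(S + 1)]
          for i in range(n)]
    for i in range(1, n):
        for p in range(1, S + 1):
            Kpp = 0.0                                  # running K''_i(s[:p], t[:q])
            for q in range(1, T + 1):
                if s[p - 1] == t[q - 1]:
                    Kpp = lam * (Kpp + lam * Kp[i - 1][p - 1][q - 1])
                else:
                    Kpp = lam * Kpp
                Kp[i][p][q] = lam * Kp[i][p - 1][q] + Kpp
    # K_n(s, t): each matching character pair closes a common subsequence of length n.
    return sum(lam ** 2 * Kp[n - 1][p - 1][q - 1]
               for p in range(1, S + 1)
               for q in range(1, T + 1)
               if s[p - 1] == t[q - 1])

lam = 0.5
print(ssk("car", "cat", 2, lam), lam ** 4)                        # both 0.0625
print(ssk("cart", "cat", 2, lam), lam**4 + lam**5 + lam**7)       # both 0.1015625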
Kernel Normalization
K̂(s, t) = K(s, t) / √( K(s, s) · K(t, t) )
(removes the bias toward longer documents)
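A one-line Python sketch of the normalisation, reusing the ssk function sketched earlier (illustrative only):

import math

def ssk_normalised(s, t, n, lam):
    """Normalised kernel K(s, t) / sqrt(K(s, s) * K(t, t))."""
    return ssk(s, t, n, lam) / math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))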
Setting Algorithm
Parameters
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Kernel Approximation
Suppose, we have some training points (x
i
, y
i
) X × Y∈ , and some
kernel function K(x,z) corresponding to a feature space mapping φ : X
→ F such that K(x, z) = φ(x), φ(z)⟨ ⟩.
Consider a set S of vectors S = {s
i
X }∈ .
If the cardinality of S is equal to the dimensionality of the space F and
the vectors φ(s
i
) are orthogonal
*
Kernel Approximation
If, instead of forming a complete orthonormal basis, the
cardinality of S̃ ⊆ S is less than the dimensionality of X or the
vectors φ(sᵢ) are not fully orthogonal, then we can construct an
approximation to the kernel K:
K̃(x, z) = Σ_{sᵢ ∈ S̃} K(x, sᵢ) K(sᵢ, z) / K(sᵢ, sᵢ)
If the set S̃ is carefully constructed, then a Gram matrix which is
closely aligned with the true Gram matrix can be produced at a
fraction of the computational cost.
Problem: choose the set S̃ so that the vectors φ(sᵢ) are
orthogonal.
Selecting Feature Subset
The heuristic for obtaining the set S̃ is as follows:
1. We choose a substring size n.
2. We enumerate all possible contiguous strings of length n.
3. We choose the x strings of length n which occur most
frequently in the dataset, and these form our set S̃.
By definition, all such strings of length n are orthogonal (i.e.
K(sᵢ, sⱼ) = C δᵢⱼ for some constant C) when used in conjunction with
the string kernel of degree n.
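A Python sketch of this heuristic and the resulting approximate kernel, reusing the ssk function sketched earlier (illustrative only; the function names and the parameter x are as described above):

from collections import Counter

def top_ngrams(corpus, n, x):
    """The x contiguous n-grams occurring most often in the corpus; this forms the set S~."""
    counts = Counter(doc[i:i + n] for doc in corpus for i in range(len(doc) - n + 1))
    return [g for g, _ in counts.most_common(x)]

def approx_kernel(s, t, basis, n, lam):
    """K~(s, t) = sum_i K(s, s_i) K(s_i, t) / K(s_i, s_i) over the chosen set S~."""
    return sum(ssk(s, b, n, lam) * ssk(b, t, n, lam) / ssk(b, b, n, lam) for b in basis)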
Kernel Approximation
Results
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Evaluation
Dataset: Reuters-21578, ModApte split
Categories selected:
Precision = relevant documents categorized as relevant /
total documents categorized as relevant
Recall = relevant documents categorized as relevant /
total relevant documents
F1 = 2 · Precision · Recall / (Precision + Recall)
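A minimal Python sketch of these metrics for a single category (illustrative only; the counts in the example call are made up):

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F1 from the contingency counts of one category."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(80, 20, 40))   # (0.8, 0.666..., 0.727...)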
Evaluation
Effectiveness of Sequence Length
Best-performing subsequence length per category: k = 5 for most
categories, with k = 6 and k = 7 for the remaining two.
Evaluation
Effectiveness of Decay Factor
Best-performing decay factor per category: λ = 0.03 for two
categories, λ = 0.05 and λ = 0.3 for the others.
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Follow Up
• String kernels using sequences of words rather than
characters: less computationally demanding, no fixed decay
factor, and combinations of string kernels.
Cancedda, Nicola, et al. "Word sequence kernels." The Journal of Machine
Learning Research 3 (2003): 1059-1082.
• Extracting semantic relations between entities in natural
language text, based on a generalization of subsequence
kernels.
Bunescu, Razvan, and Raymond J. Mooney. "Subsequence kernels for relation
extraction." NIPS. 2005.
Follow Up
• Homology – a computational biology method to identify the
ancestry of proteins.
The model should be able to tolerate up to m mismatches. The
kernels used in this method measure sequence similarity
based on shared occurrences of k-length subsequences,
counted with up to m mismatches.
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Conclusion
Key Idea: Use non-contiguous string subsequences to
compute similarity between documents, with a decay factor
that discounts similarity according to the degree of contiguity.
• Highly computationally intensive method – the authors reduced the
time complexity from O(|Σ|ⁿ) to O(n |s| |t|) by a dynamic
programming approach
• An even less intensive method – kernel approximation by feature
subset selection
• Empirical estimation of k and λ from experimental results
• Showed promising results only for small datasets
• Seems to mimic stemming for small datasets
Any Q?
Thank You :)