Space-efficient Feature Maps
for String Alignment Kernels
Yasuo Tabei (RIKEN-AIP)
Joint work with
Yoshihiro Yamanishi (Kyutech)
Rasmus Pagh (IT University of Copenhagen)
ICDM’19@Beijing, Nov. 10th, 2019
Kernel methods
• A kernel is an inner product in some feature space H:
 K(x, x′) = ⟨φ(x), φ(x′)⟩
• Intuitively, a kernel is a measure of similarity between x and x′
• x and x′ can be vectors, trees, or graphs
• x and x′ are strings in this talk
• Kernels are useful for
 - Classification (SVM), regression, feature selection, two-sample
 problems, etc.
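As a toy illustration of the inner-product view (using a degree-2 polynomial kernel rather than a string kernel, purely for concreteness), the kernel value can be computed either directly via the kernel trick or as an inner product of explicit features:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 monomial feature map for a 2-D input:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

k_direct = float(np.dot(x, y)) ** 2        # kernel trick: K(x, y) = (x . y)^2
k_feature = float(np.dot(phi(x), phi(y)))  # inner product in feature space H
```

The two values agree exactly; kernel methods exploit the left-hand computation without ever materializing φ(x).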
String alignment kernels
• Typical string kernels use substring (k-mer) features
• Alignment kernels use string alignment (e.g., edit
 distance) as the similarity measure
• They have a wide variety of applications in string processing,
 e.g., text classification and remote homology detection for
 proteins/DNA [BMC Bioinfo.’06]
• Advantage: high prediction accuracy
• Drawback: large computational complexity
 - Quadratic time in the length of the strings (dynamic programming)
 - Quadratic time in the number of training examples
Feature maps (FMs) for kernel approximations
[A. Rahimi and B. Recht, NIPS, 2007]
• FMs map a d-dimensional vector x ∈ R^d into a D-dimensional
 vector φ(x) ∈ R^D using O(d×D) memory and time
• They approximate a kernel function k(x,y) by the inner
 product of the mapped vectors: k(x,y) ≒ φ(x)・φ(y)
• The linear model f_l(x) = w・φ(x) has approximately the same
 functionality as the nonlinear model f_n(x) = Σ_i α_i k(x, y_i)
• Advantage: can enhance the scalability of kernel methods
 (i) Input vectors → map → (ii) Compact vectors → learn model
 weight w → (iii) Linear model f_l(x) = w・φ(x)
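The Rahimi–Recht construction can be sketched as follows for the RBF kernel (a minimal illustration, not tied to this paper's implementation): sample a random matrix W from the kernel's Fourier transform, then map x to cosine features so that the inner product of mapped vectors approximates the kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4096, 1.0

# Random Fourier features for the RBF kernel exp(-||x-y||^2 / (2 sigma^2)):
# W is sampled from a Gaussian (the kernel's Fourier transform),
# b is uniform on [0, 2*pi).
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def feature_map(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = rng.normal(size=d)
y = rng.normal(size=d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
approx = feature_map(x) @ feature_map(y)   # close to `exact` for large D
```

Note the memory cost that motivates this talk: W alone takes O(D×d) space.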
Existing feature maps (FMs)
• Several FMs with different input formats and kernel
similarities have been proposed
• No previous work has been able to approximate
alignment kernels
Method                      Kernel        Input
b-bit MinHash [Li’11]       Jaccard       Binary vector
Tensor Sketching [Pham’13]  Polynomial    Real vector
0-bit CWS [Li’15]           Min-Max       Real vector
C-Hash [Mu’12]              Cosine        Real vector
Random Feature [Rahimi’07]  RBF           Real vector
RFM [Kar’12]                Dot product   Real vector
PWLSGD [Maji’09]            Intersection  Real vector
Space-efficient feature maps
for string alignment kernels
• Basic idea: use two hash functions, (i) edit-sensitive
 parsing (ESP) and (ii) feature maps (FMs) for the
 Laplacian kernel
• Standard feature maps consume a large amount of memory:
 O(d×D) for input dimension d and output dimension D
• We present space-efficient FMs using only O(d) memory
• High classification accuracy is achieved by training a
 linear SVM on the compact vectors
Example pipeline:
 (i) ESP: S1 = ABRACADA → x1 = (3, 1, 3), S2 = ABRA → x2 = (2, 4, 1),
  S3 = ABRACA → x3 = (5, 1, 2), S4 = ATGCAGA → x4 = (1, 0, 0),
  S5 = BARACR → x5 = (2, 2, 1)
 (ii) FMs: x1 → z1 = (1.2, 0.1, 1, 2), x2 → z2 = (2, 1, 1.2, 3.4),
  x3 → z3 = (-1.2, 0, 2.2, 3), x4 → z4 = (-3.2, 0, 2.2, 1),
  x5 → z5 = (2, 2, -1.2, 0)
 (iii) Learn a linear model F(zi) = w・zi
Edit-sensitive parsing (ESP)
[G. Cormode and S. Muthukrishnan, 2007]
• Builds a single parse tree from an input string S
• The parse tree is built from the bottom (leaves) to the top (root)
• Nodes built from the same pair of symbols receive the same node label
• Can be used to map string S into an integer vector x
 – Each element of x is the number of occurrences of a node label
• Approximates edit distance with moves (EDM) by the L1 distance
 between mapped vectors, i.e., EDM(Si,Sj) ≒ ||xi−xj||1
• Computation time is linear in the length of S
 Example: parsing S = ABBAABABB with node labels X1, …, X6 yields
 x = (4, 5, 2, 1, 1, 1, 1, 1)
ESP for mapping strings
into integer vectors
Step 1
Given S and S’, make vectors V(S) and V(S’), each dimension of
which counts one character in S and S’:
 S = ABABABBAB → V(S) = (4, 5) (counts of A and B)
 S’ = ABBABAB → V(S’) = (3, 4)
Step 2
Assign each pair or triple of adjacent symbols to the same
non-terminal symbol (building level 2 above level 1)
Step 3
Count the number of each node label and update the vectors:
 V(S) = (4, 5, 2, 1, 1, 0) (dimensions A, B, X1, X2, X3, X4)
 V(S’) = (3, 4, 2, 0, 0, 1)
Step 4
Replace the strings by their sequences of level-2 node labels:
 S = X1X2X3X1
 S’ = X1X4X1
Go to Step 2
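The steps above can be sketched in code. Important caveat: real ESP chooses block boundaries by alphabet reduction so that the boundaries are stable under edits; the greedy left-to-right pairing below is only a simplified stand-in to show the label counting and the L1 comparison.

```python
from collections import Counter

def parse_level(seq, labels):
    """Group a sequence into pairs (a final triple when the remainder is odd)
    and replace each block by a shared non-terminal label.
    NOTE: real ESP picks boundaries via alphabet reduction; this greedy
    pairing is an illustrative simplification."""
    out, i, n = [], 0, len(seq)
    while i < n:
        step = 3 if n - i == 3 else 2
        block = tuple(seq[i:i + step])
        i += step
        if block not in labels:
            labels[block] = "X%d" % (len(labels) + 1)
        out.append(labels[block])
    return out

def esp_vector(s, labels):
    """Count every character and non-terminal produced while parsing s
    bottom-up to a single root (Steps 1-4 iterated)."""
    seq = list(s)
    counts = Counter(seq)          # Step 1: character counts
    while len(seq) > 1:
        seq = parse_level(seq, labels)  # Steps 2 and 4
        counts.update(seq)              # Step 3
    return counts

labels = {}                        # shared so equal blocks get equal labels
v1 = esp_vector("ABABABBAB", labels)
v2 = esp_vector("ABBABAB", labels)
keys = set(v1) | set(v2)
l1 = sum(abs(v1[k] - v2[k]) for k in keys)  # proxy for EDM(S, S')
```

Identical strings map to identical vectors (L1 distance 0), and the distance grows with dissimilarity, mimicking the EDM ≒ L1 guarantee.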
Feature maps for string alignment
kernels
• EDM is approximated by the L1 distance between mapped vectors:
 EDM(Si,Sj) ≒ ||xi−xj||1
• The alignment kernel is defined as the Laplacian kernel
 K(Si,Sj) = exp(−||xi−xj||1/β)
• Feature maps (FMs) approximate the Laplacian kernel as
 exp(−||xi−xj||1/β) ≒ ⟨zi, zj⟩
• FMs are space-inefficient, using O(dD) memory for input
 dimension d and output dimension D
 - The Fastfood approach [ICML’13] can approximate feature
 maps for RBF kernels
 (Pipeline example: (i) ESP maps S1 = ABRACADA, S2 = ABRA, S3 = ABRACA
 to x1 = (3, 1, 3), x2 = (2, 4, 1), x3 = (5, 1, 2); (ii) FMs map these to
 z1 = (1.2, 0.1, 1, 2), z2 = (2, 1, 1.2, 3.4), z3 = (-1.2, 0, 2.2, 3);
 (iii) learn the linear model F(zi) = w・zi)
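The Laplacian-kernel feature map above admits the same random-feature recipe as the RBF case, but with Cauchy-distributed projections, since the Fourier transform of exp(−||x−y||₁/β) is a product of Cauchy densities. A minimal sketch (standard random Fourier features with an explicit matrix, i.e., before the space-efficient trick of the next slide):

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, beta = 4, 20000, 2.0

# For the Laplacian kernel exp(-||x-y||_1 / beta), sample W entrywise
# from the Cauchy distribution with scale 1/beta.
W = rng.standard_cauchy(size=(D, d)) / beta
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x1 = np.array([3.0, 1.0, 3.0, 0.0])
x2 = np.array([2.0, 4.0, 1.0, 1.0])
exact = np.exp(-np.abs(x1 - x2).sum() / beta)  # Laplacian kernel value
approx = z(x1) @ z(x2)                          # <z1, z2> estimate
```

Storing W explicitly is exactly the O(dD) cost the next slide removes.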
Space-efficient FMs (Briefly)
• Basic idea: replace the D×d random matrix R of
 standard FMs with a t×d random matrix M
• Approximate each element R[i,j] by a polynomial:
 R[i,j] ≒ M[i,1] + M[i,2]・j + … + M[i,t]・j^(t−1)
 with coefficients drawn from a t-wise independent family
• Theoretical guarantee (concentration bound):
 Pr[|z(x)ᵀz(y) − k(x,y)| ≥ ε] ≤ 2/(ε²D)
 Random matrix R (D×d) for standard FMs: O(D×d) memory
 Random matrix M (t×d) for space-efficient FMs: O(t×d) memory
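The on-demand reconstruction of R[i,j] can be sketched as follows. This is an illustrative sketch, not the paper's exact construction: a polynomial hash modulo a large prime plays the role of the t-wise independent family, and its pseudo-uniform output is pushed through the inverse Cauchy CDF to recover a Laplacian-kernel projection value.

```python
import math

P = (1 << 61) - 1  # a large Mersenne prime for the polynomial hash

def r_entry(M_row, j, beta=1.0):
    """Reconstruct R[i, j] on demand from the t coefficients M[i, :].
    Evaluate the degree-(t-1) polynomial at j modulo P (t-wise
    independent hash), treat the result as uniform in (0, 1), and apply
    the inverse CDF of the Cauchy distribution with scale 1/beta.
    Only the t coefficients per row are ever stored."""
    h = 0
    for c in reversed(M_row):              # Horner's rule
        h = (h * (j + 1) + c) % P
    u = (h + 0.5) / P                      # pseudo-uniform in (0, 1)
    return math.tan(math.pi * (u - 0.5)) / beta  # inverse Cauchy CDF

# t = 4 hypothetical coefficients stand in for an entire row of D entries
M_row = [123456789, 987654321, 192837465, 564738291]
row = [r_entry(M_row, j) for j in range(8)]
```

Any entry can be regenerated deterministically whenever z(x) is computed, so memory drops from O(D×d) to O(t×d) with t a small constant.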
Experiments
• 5 massive real-world string datasets
• Competitors
 - 5 SVMs with string kernels: LAK [Bioinfo’08], GAK
 [ICML’11], ESP+Kernel, CGK+Kernel, stk17 [NIPS’17]
 - FMs for alignment kernels: D2KE [KDD’19]
 - SFMEDM: the proposed method
(Figures: training time in seconds, memory usage, and
classification accuracy in AUC score)
Summary
• Space-efficient feature maps for string alignment kernels
• Use two hash functions
 – ESP: maps strings into integer vectors
 – Feature maps: map integer vectors into feature vectors
• Linear SVMs are trained on the feature vectors
 – A linear SVM behaves like a non-linear SVM with an alignment
 kernel
• Advantage: highly scalable
• Code and datasets are available:
 https://sites.google.com/view/alignmentkernels/home
Editor's Notes

  • #2: Thank you for your kind introduction. Today I’m going to talk about feature maps for string alignment kernels. Our method can solve large-scale machine learning problems on strings. This is joint work with Yoshihiro Yamanishi from Kyushu Institute of Technology and Rasmus Pagh from IT University of Copenhagen.
  • #3: First, I will present a brief introduction of kernel methods.
  • #4: A kernel method can approximate a non-linear function or decision boundary well with enough training data, and can achieve high prediction accuracy.
  • #5: To solve the scalability issue of kernel methods, feature maps for kernel approximations were proposed by A. Rahimi and B. Recht (NIPS 2007). FMs map a d-dimensional vector x into a D-dimensional vector φ(x) ∈ R^D. They approximate the kernel function k(x,y) by the inner product of the mapped vectors, so a linear model has approximately the same functionality as a nonlinear model. The advantage is that this enhances the scalability of kernel methods.
  • #6: Several FMs with different input formats and kernel similarities have been proposed. No previous work has been able to approximate string alignment kernels.
  • #7: That’s why we present large-scale string classification via space-efficient FMs for string alignment kernels. We use two hash functions: edit-sensitive parsing (ESP) and feature maps (FMs) for a Laplacian kernel. We also present space-efficient feature maps reducing the space usage of FMs from O(dD) to O(d) for input dimension d. High classification accuracy can be achieved by training a linear SVM on the mapped vectors.
  • #13: The first result shows the training time of each method in seconds. Kernel methods could not finish within 48 hours on the large sports and compound datasets. Methods using feature maps finished within 48 hours on all datasets. For example, our SFMEDM finished within 9 hours on the compound dataset with D = 16 thousand dimensions.
  • #14: The next result shows memory usage in megabytes. Kernel methods consumed a large amount of memory: 654 GB and 1.3 TB. On the other hand, the space used by methods with FMs is at least one order of magnitude smaller than that of string kernels. Our SFMEDM consumed less than 160 GB on the sports and compound datasets.
  • #15: The final figure shows the classification accuracy of each method in AUC score.