Scalable Global Alignment Graph Kernel Using
Random Features: From Node Embedding to
Graph Embedding
KDD2019
Lingfei Wu, Ian En-Hsu Yen, Zhen Zhang †, Kun Xu, Liang Zhao, Xi
Peng, Yinglong Xia, Charu Aggarwal
Presenter: Hagawa, Nishi, Eugene
2019.11.11
1 / 35
Problem Setup
Goal:
▶ Create a good kernel for measuring graph similarity
▶ Low computational complexity
▶ Takes both global and local graph properties into account
▶ Positive definite
▶ Leads to a good classifier
Application:
▶ Kernel SVM (input: graph, output: binary label)
▶ Kernel PCA
▶ Kernel Ridge Regression
▶ . . .
How similar? k(G1, G2) = 0.5 (illustration: kernel value between two example graphs)
2 / 35
Difficulty : Graph isomorphism
It is difficult to define similarity between graphs
▶ 2 graphs: G1(V1, E1, ℓ1, L1), G2(V2, E2, ℓ2, L2)
▶ A bijection1 f exists if and only if G1 is isomorphic to G2
▶ Bijection f : V1 → V2 s.t. {va, vb} ∈ E1 if and only if {f(va), f(vb)} ∈ E2
▶ Subgraph isomorphism is NP-complete
1 Bijection: a one-to-one and onto mapping
3 / 35
Related Work
Two groups of recent graph kernel methods
Comparing sub-structure:
▶ The major difference is how to define and explore sub-structures
- random walks, shortest paths, cycles, subtree patterns, graphlets...
Geometric node embeddings:
▶ Capture global property
▶ Achieved state-of-the-art performance in the graph classification task
Drawbacks of related work
Comparing sub-structure:
▶ Does not take global graph properties into account
Geometric node embeddings:
▶ Do not necessarily yield a positive definite kernel
▶ Poor scalability
4 / 35
Contribution
▶ Propose a positive definite kernel
▶ Reduce computational complexity
▶ From quadratic to (quasi-)linear 2
▶ Propose an approximation of the kernel with convergence analysis
▶ Take into account global property
▶ Outperforms 12 state-of-the-art graph classification algorithms
- Include graph kernels, deep graph neural networks
2 Quasi-linear: O(n log n) in both time and space.
5 / 35
Common kernel
Compare 2 graphs directly using the kernel
Similarity: k(·, ·)
Figure: calculation of kernel value between 2 graphs
6 / 35
Proposed kernel
Compare 2 graphs indirectly, via a shared set of random graphs
Similarity with k(·, ·): each graph is compared to the random graphs
Figure: calculation of kernel value between 2 graphs via random graphs
7 / 35
Notation : Graph definition
Graph: G = (V, E, ℓ)
Node set: V = {v_i}_{i=1}^n
Edge set: E ⊆ V × V
Label assignment function: ℓ : V → Σ
Number of nodes: n
Number of edges: m
Node label: l
Number of graphs: N
Figure: example graph G with nodes V = {v1, v2, v3} and a label alphabet Σ containing two labels
8 / 35
Notation
Set of graphs: G = {G_i}_{i=1}^N
Set of graph labels: Y = {Y_i}_{i=1}^N
Set of geometric embeddings (one per graph): U = {u_i}_{i=1}^n ∈ R^{n×d}
Latent node embedding (one per node): u ∈ R^d
𝐺" ・・・
𝑁
𝑛Latent node ↑
embedding
Node size→
# of graphs→
𝑌"Graph label→ 𝑌&
𝐺&
u1 2 Rd
<latexit sha1_base64="FiX+xGGr4lrH54q+qBxUWlkIUrA=">AAACDnicbVDLSsNAFJ3UV62vqEs3g6XgqiRafOyKblxWsQ9oYphMJu3QySTMTIQS8gVu/BU3LhRx69qdf2OSBlHrgQuHc+7l3nvciFGpDONTqywsLi2vVFdra+sbm1v69k5PhrHApItDFoqBiyRhlJOuooqRQSQIClxG+u7kIvf7d0RIGvIbNY2IHaARpz7FSGWSozcsN2RegNQ4iVMnMVNoUQ6tXBBBcp3eJl4KoaPXjaZRAM4TsyR1UKLj6B+WF+I4IFxhhqQcmkak7AQJRTEjac2KJYkQnqARGWaUo4BIOyneSWEjUzzohyIrrmCh/pxIUCDlNHCzzvxO+dfLxf+8Yaz8UzuhPIoV4Xi2yI8ZVCHMs4EeFQQrNs0IwoJmt0I8RgJhlSVYK0I4y3H8/fI86R02zaNm66pVb5+XcVTBHtgHB8AEJ6ANLkEHdAEG9+ARPIMX7UF70l61t1lrRStndsEvaO9fjuucjQ==</latexit>
u2<latexit sha1_base64="Pm48/PPv93nEVMYQDi7yld7eDYw=">AAAB+XicbVDLSsNAFJ34rPUVdelmsAiuSlKLj13RjcsK9gFtCJPJpB06mQkzk0IJ/RM3LhRx65+482+cpEHUemDgcM693DMnSBhV2nE+rZXVtfWNzcpWdXtnd2/fPjjsKpFKTDpYMCH7AVKEUU46mmpG+okkKA4Y6QWT29zvTYlUVPAHPUuIF6MRpxHFSBvJt+1hIFgYIz3O0rmfNea+XXPqTgG4TNyS1ECJtm9/DEOB05hwjRlSauA6ifYyJDXFjMyrw1SRBOEJGpGBoRzFRHlZkXwOT40SwkhI87iGhfpzI0OxUrM4MJN5RvXXy8X/vEGqoysvozxJNeF4cShKGdQC5jXAkEqCNZsZgrCkJivEYyQR1qasalHCdY6L7y8vk26j7p7Xm/fNWuumrKMCjsEJOAMuuAQtcAfaoAMwmIJH8AxerMx6sl6tt8XoilXuHIFfsN6/ACNylCA=</latexit>
u3<latexit sha1_base64="w2BS8kqWqIp26xG7B4vB81cBpaY=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRItPnZFNy4r2Ae0IUwmk3boZCbMTAol9E/cuFDErX/izr9xkgZR64GBwzn3cs+cIGFUacf5tCorq2vrG9XN2tb2zu6evX/QVSKVmHSwYEL2A6QIo5x0NNWM9BNJUBww0gsmt7nfmxKpqOAPepYQL0YjTiOKkTaSb9vDQLAwRnqcpXM/O5/7dt1pOAXgMnFLUgcl2r79MQwFTmPCNWZIqYHrJNrLkNQUMzKvDVNFEoQnaEQGhnIUE+VlRfI5PDFKCCMhzeMaFurPjQzFSs3iwEzmGdVfLxf/8wapjq68jPIk1YTjxaEoZVALmNcAQyoJ1mxmCMKSmqwQj5FEWJuyakUJ1zkuvr+8TLpnDfe80bxv1ls3ZR1VcASOwSlwwSVogTvQBh2AwRQ8gmfwYmXWk/VqvS1GK1a5cwh+wXr/AiT3lCE=</latexit>
9 / 35
Geometric Embeddings
Use partial eigendecomposition 3 to extract node embeddings:
1. Create the normalized Laplacian matrix L ∈ R^{n×n}
2. Perform partial eigendecomposition to obtain U
3. Use the d eigenvectors with the smallest eigenvalues
L (n×n) → partial eigendecomposition → U Λ Uᵀ (n×d, d×d, d×n); keep the d eigenvectors with the smallest eigenvalues

Adjacency matrix      Degree matrix      Laplacian matrix (= Degree − Adjacency)
   A  B  C               A  B  C            A   B   C
A  0  1  1            A  2  0  0         A   2  −1  −1
B  1  0  0            B  0  1  0         B  −1   1   0
C  1  0  0            C  0  0  1         C  −1   0   1

Then normalize.
Figure: Example of obtaining U
3 Time complexity: linear in the number of graph edges (the presenters are unsure why).
10 / 35
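To make this step concrete, here is a minimal sketch in Python under our own assumptions (SciPy as the eigensolver, the tiny A-B-C graph from the figure); it is an illustration, not the authors' implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

# Adjacency matrix of the example graph (edges A-B and A-C)
A = csr_matrix(np.array([[0., 1., 1.],
                         [1., 0., 0.],
                         [1., 0., 0.]]))

# Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
L = laplacian(A, normed=True)

# Partial eigendecomposition: keep the d eigenvectors with the
# smallest eigenvalues (shift-invert is the usual trick for large
# sparse graphs; which='SM' suffices for a toy example).
d = 2
eigvals, U = eigsh(L, k=d, which='SM')
print(U.shape)  # (n, d): one d-dimensional embedding per node
```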
Transportation Distance [1]
Earth Mover’s Distance (EMD): measure of dissimilarity
EMD(Gx, Gy) := min_{T ∈ R_+^{n_x × n_y}} ⟨D, T⟩   s.t.  T 1 = t(Gx),  Tᵀ 1 = t(Gy)
▶ Linear programming problem
▶ Flow matrix T
- Tij : how much of vi in Gx travels to vj in Gy
▶ G_x → U_x = {u_1^x, u_2^x, …, u_{n_x}^x}
▶ G_y → U_y = {u_1^y, u_2^y, …, u_{n_y}^y}
▶ Transport cost matrix D
- D_ij = ∥u_i^x − u_j^y∥_2
11 / 35
Transportation Distance [1]
Earth Mover’s Distance (EMD): measure of dissimilarity
EMD(Gx, Gy) := min_{T ∈ R_+^{n_x × n_y}} ⟨D, T⟩   s.t.  T 1 = t(Gx),  Tᵀ 1 = t(Gy)
▶ Node v_i has c_i outgoing edges
▶ Normalized bag-of-words (nBOW) weight: t_i = c_i / Σ_{j=1}^n c_j ∈ R
12 / 35
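Here is a minimal sketch of this EMD as the linear program above, using scipy.optimize.linprog; the helper name emd, the toy embeddings, and the out-degree counts are our own illustration:

```python
import numpy as np
from scipy.optimize import linprog

def emd(Ux, Uy, tx, ty):
    """EMD(Gx, Gy): min <D, T> s.t. T @ 1 = tx, T.T @ 1 = ty, T >= 0."""
    nx, ny = len(tx), len(ty)
    # Ground distances D_ij = ||u_i^x - u_j^y||_2
    D = np.linalg.norm(Ux[:, None, :] - Uy[None, :, :], axis=2)
    # Equality constraints on the flattened flow matrix T (row-major)
    A_eq = np.zeros((nx + ny, nx * ny))
    for i in range(nx):
        A_eq[i, i * ny:(i + 1) * ny] = 1.0   # sum_j T_ij = tx_i
    for j in range(ny):
        A_eq[nx + j, j::ny] = 1.0            # sum_i T_ij = ty_j
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([tx, ty]),
                  bounds=(0, None))
    return res.fun

# nBOW weights t_i = c_i / sum_j c_j from out-degree counts c
cx = np.array([2.0, 1.0, 1.0]); tx = cx / cx.sum()
cy = np.array([1.0, 1.0]);      ty = cy / cy.sum()
Ux = np.random.rand(3, 2)       # node embeddings of Gx (n_x x d)
Uy = np.random.rand(2, 2)       # node embeddings of Gy (n_y x d)
print(emd(Ux, Uy, tx, ty))
```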
Transportation Distance: Example
Figure: EMD example — the nodes A, B, C of Gx are matched to the nodes a, b, c of Gy through the flow matrix T
▶ EMD focuses on the node sizes and outgoing edges of each graph
13 / 35
Straightforward way to define a kernel: high cost
EMD-based kernel: K = −(1/2) J D_emd J, where J = I − (1/N) 1 1ᵀ
▶ Not necessarily positive definite
▶ Time complexity: O(N² n³ log n), space complexity: O(N²)
Distance matrix D_emd over graphs A, B, C:

     A          B          C
A  EMD(A,A)  EMD(A,B)  EMD(A,C)
B  EMD(B,A)  EMD(B,B)  EMD(B,C)
C  EMD(C,A)  EMD(C,B)  EMD(C,C)
Figure: Straightforward kernel based on EMD
14 / 35
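For reference, a short sketch (our own, assuming the pairwise EMD matrix D_emd has already been computed) of this double-centering construction; checking the smallest eigenvalue shows why the result is not necessarily a valid kernel:

```python
import numpy as np

def emd_based_kernel(D_emd):
    """K = -(1/2) * J @ D_emd @ J with J = I - (1/N) 1 1^T."""
    N = D_emd.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    K = -0.5 * J @ D_emd @ J
    # A negative eigenvalue here means K is indefinite,
    # i.e. not a positive definite kernel.
    print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
    return K
```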
Global Alignment Graph Kernel
Using EMD and Random feature (RF)
Proposed kernel: 4

k(Gx, Gy) := ∫ p(Gω) φ_{Gω}(Gx) φ_{Gω}(Gy) dGω,  where φ_{Gω}(Gx) := exp(−γ EMD(Gx, Gω))

▶ Gω: random graph with node embeddings W = {w_i}_{i=1}^D
▶ each w_i is sampled from V ⊆ R^d
▶ p(Gω) is a distribution over the space of all random graphs of variable sizes Ω := ∪_{D=1}^{Dmax} V^D
4 I intended to go into the details of the random graphs, but it got quite involved and Hagawa gave up; if you are curious, see the paper. (I really don't get probability.)
15 / 35
Global Alignment Graph Kernel Using EMD and RF
Approximation5:

k̃(Gx, Gy) = (1/R) Σ_{i=1}^R φ_{Gωi}(Gx) φ_{Gωi}(Gy) → k(Gx, Gy)  as R → ∞
𝜙"#
(𝐺&)
Random Graphs
𝐺(
𝐺&
𝜙"#
(𝐺))
𝐺)
5 The paper proves uniform convergence of the approximate kernel to the exact kernel.
16 / 35
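A minimal sketch of this Monte Carlo estimate, reusing the emd helper from the earlier sketch; graphs here are (U, t) pairs of node embeddings and nBOW weights, and gamma and the random-graph list are placeholders of our own:

```python
import numpy as np

def phi(G, G_omega, gamma):
    """Feature against one random graph: exp(-gamma * EMD(G, G_omega))."""
    (U, t), (W, tw) = G, G_omega
    return np.exp(-gamma * emd(U, W, t, tw))

def k_tilde(Gx, Gy, random_graphs, gamma):
    """(1/R) * sum_i phi_i(Gx) * phi_i(Gy); converges to k(Gx, Gy) as R grows."""
    R = len(random_graphs)
    return sum(phi(Gx, Gw, gamma) * phi(Gy, Gw, gamma)
               for Gw in random_graphs) / R
```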
Algorithm
Set data and hyperparameters
▶ Node embedding size (dimension): d
▶ Max size of random graphs: Dmax
▶ Graph embedding size: R

Figure: data graphs are embedded against R random graphs of up to Dmax nodes in d dimensions
Algorithm 1 Random Graph Embedding
Input: Data graphs {G_i}_{i=1}^N, node embedding size d, maximum size of random graphs Dmax, graph embedding size R.
Output: Feature matrix Z^{N×R} for data graphs
1: Compute nBOW weight vectors {t(G_i)}_{i=1}^N of the normalized Laplacian L of all graphs
2: Obtain node embedding vectors {u_i}_{i=1}^n by computing the d smallest eigenvectors of L
3: for j = 1, …, R do
4:   Draw D_j uniformly from [1, Dmax].
5:   Generate a random graph G_{ωj} with D_j node embeddings W from Algorithm 2.
6:   Compute a feature vector Z_j = φ_{G_{ωj}}({G_i}_{i=1}^N) using EMD or another optimal transportation distance in Equation (3).
7: end for
8: Return feature matrix Z({G_i}_{i=1}^N) = (1/√R) {Z_i}_{i=1}^R
17 / 35
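Putting the pieces together, a sketch of Algorithm 1 under the same assumptions as the earlier snippets (emd from the EMD sketch; the random-graph generator here simply samples node embeddings uniformly from the unit cube, standing in for Algorithm 2, which the slides detail later):

```python
import numpy as np

def random_graph_embedding(graphs, d, D_max, R, gamma, seed=0):
    """Return the N x R feature matrix Z of Algorithm 1.

    Each graph is a pair (U, t): node embeddings (n x d) and nBOW weights (n,).
    """
    rng = np.random.default_rng(seed)
    N = len(graphs)
    Z = np.zeros((N, R))
    for j in range(R):
        Dj = int(rng.integers(1, D_max + 1))   # step 4: D_j ~ Uniform[1, D_max]
        W = rng.random((Dj, d))                # step 5: random node embeddings
        tw = np.full(Dj, 1.0 / Dj)             # uniform weights on the random graph
        for i, (U, t) in enumerate(graphs):    # step 6: feature vector Z_j
            Z[i, j] = np.exp(-gamma * emd(U, W, t, tw))
    return Z / np.sqrt(R)                      # step 8: scale by 1/sqrt(R)
```

With this Z, the approximate kernel matrix is simply Z @ Z.T, which is positive semi-definite by construction.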
Compute {t(G_i)}_{i=1}^N and the Laplacian matrix L

Laplacian matrix
   A   B   C
A   2  −1  −1
B  −1   1   0
C  −1   0   1

t(Gx) = (1/2, 1/4, 1/4)

→ For all graphs
18 / 35
Obtain node embedding vectors
L (n×n) → partial eigendecomposition → U Λ Uᵀ; the d eigenvectors with the smallest eigenvalues give the node embeddings u_1, u_2, u_3, … ∈ R^d

→ For all graphs G_i
19 / 35
Generate random graph 6
Example: D_j = 2 ← Rand(1, Dmax)
W ∈ R^{2×d} ← Generate_random_graph(2, d)

Figure: example of a 2-node random graph with embeddings u_1, u_2 ∈ R^d
6 A later section shows two ways to generate random graphs.
20 / 35
Compute a feature vector Z_j

z_j = (Z_j1, …, Z_jN)ᵀ, where Z_ji = φ_{Gωj}(G_i) := exp(−γ EMD(G_i, G_ωj))
21 / 35
Generate random graphs R times

z_1 = (Z_11, …, Z_1N)ᵀ,  z_2 = (Z_21, …, Z_2N)ᵀ,  …,  z_R = (Z_R1, …, Z_RN)ᵀ
(each column is computed with φ_{Gωj}(G_i) against a fresh random graph)
7 R: number of random graphs
22 / 35
Output: N × R matrix Z

Z = (1/√R) [ z_1  z_2  ⋯  z_R ] ∈ R^{N×R}, where column j is z_j = (Z_j1, …, Z_jN)ᵀ
23 / 35
How to generate Random Graph
Data-independent and Data-dependent Distributions
Data-dependent 8
Random Graph Embedding (Anchor Sub-Graphs (ASG)):
1. Pick a graph Gk from the data set
2. Uniformly draw Dj nodes
3. {w_i}_{i=1}^{D_j} = {u_{n_1}, u_{n_2}, …, u_{n_{D_j}}}
Incorporating label information:
▶ d(u_i, u_j) = max(∥u_i − u_j∥_2, √d) if v_i and v_j have different node labels
▶ Pushes apart nodes with different labels
▶ √d is the largest distance in a d-dimensional unit hypercube
8 For the data-independent variant, see the appendix.
24 / 35
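A sketch (our own reading of the ASG procedure and of the label-aware ground distance above; the data layout matches the earlier snippets):

```python
import numpy as np

def sample_anchor_subgraph(embeddings_list, D_max, rng):
    """Data-dependent random graph: node embeddings drawn from a real data graph."""
    Gk = embeddings_list[rng.integers(len(embeddings_list))]  # 1. pick G_k
    Dj = int(rng.integers(1, D_max + 1))
    idx = rng.choice(len(Gk), size=min(Dj, len(Gk)), replace=False)  # 2. draw D_j nodes
    return Gk[idx]                                            # 3. their embeddings W

def label_aware_distance(ui, uj, label_i, label_j, d):
    """Ground distance that pushes apart nodes with different labels;
    sqrt(d) is the largest distance in the d-dimensional unit hypercube."""
    dist = np.linalg.norm(ui - uj)
    return max(dist, np.sqrt(d)) if label_i != label_j else dist
```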
Complexity comparison (Left: Proposed, Right: Straightforward)
𝜙"#
(𝐺&)
Random Graphs
𝐺(
𝐺&
𝜙"#
(𝐺))
𝐺)
Figure: Proposed kernel
Graph A Graph CGraph B
A B C
A EMD(A,A) EMD(A,B) EMD(A,C)
B EMD(B,A) EMD(B,B) EMD(B,C)
C EMD(C,A) EMD(C,B) EMD(C,C)
Distance
Matrix
Figure: Straitforward kernel
Time complexity 9 (dmz is the partial eigendecomposition cost):
▶ Proposed: O(N R D² n log n + dmz)   ▶ Straightforward: O(N² n³ log n + dmz)
※ R is the number of random graphs, D is the number of random-graph nodes (D < n)
Space complexity:
▶ Proposed: O(NR)   ▶ Straightforward: O(N²)
9 dmz is the partial eigendecomposition cost.
25 / 35
Experiments
Experimental setup
Classifier:
▶ Linear SVM (LIBLINEAR)
Data:
▶ 9 Datasets
Hyperparameters:
▶ γ (kernel) → [1e-3, 1e-2, 1e-1, 1, 10]
▶ Dmax (size of random graphs) → [3:3:30]
▶ SVM parameters (tuned on the training set)
Evaluation:
▶ 10-fold cross-validation
▶ Accuracy averaged over 10 repetitions
26 / 35
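A sketch of this protocol on top of the RGE features, using scikit-learn's LinearSVC (which wraps LIBLINEAR); the placeholder features and labels are our own:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Z: the N x R feature matrix from Algorithm 1; y: graph labels
Z = np.random.rand(100, 32)             # placeholder features
y = np.random.randint(0, 2, size=100)   # placeholder binary labels

# 10-fold cross-validation; in the paper this is repeated 10 times
# and the average accuracy is reported.
scores = cross_val_score(LinearSVC(C=1.0, max_iter=10000), Z, y, cv=10)
print(scores.mean(), scores.std())
```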
# of random graphs (R) vs. testing accuracy:
Figure 2: Test accuracies and runtime of three variants of RGE (RGE(RF), RGE(ASG), RGE(ASG)-NodeLab) with and without node labels when varying R. Accuracy panels: (a) ENZYMES, (b) NCI109, (c) IMDB-BINARY, (d) COLLAB; runtime panels: (e) ENZYMES, (f) NCI109, (g) IMDB-BINARY, (h) COLLAB.
▶ Converges very rapidly as R increases
# of random graphs (R) vs. runtime:
▶ Shows quasi-linear scalability with respect to R
27 / 35
Figure 3: Runtime of RGE (eigendecomposition time, feature-generation time, total runtime) with linear and quadratic reference lines. (a) Varying number of graphs N. (b) Varying size of graph n.
▶ Shows linear scalability with respect to N (panel a)
▶ Shows quasi-linear scalability with respect to n (panel b)
28 / 35
Classification accuracy:
Table 1: Comparison of classification accuracy against graph kernel methods without node labels.
Datasets MUTAG PTC-MR ENZYMES NCI1 NCI109
RGE(RF) 86.33 ± 1.39(1s) 59.82 ± 1.42(1s) 35.98 ± 0.89(38s) 74.70 ± 0.56(727s) 72.50 ± 0.32(865s)
RGE(ASG) 85.56 ± 0.91(2s) 59.97 ± 1.65 (1s) 38.52 ± 0.91(18s) 74.30 ± 0.45(579s) 72.70 ± 0.42(572s)
EMD 84.66 ± 2.69 (7s) 57.65 ± 0.59 (46s) 35.45 ± 0.93 (216s) 72.65 ± 0.34 (8359s) 70.84 ± 0.18 (8281s)
PM 83.83 ± 2.86 59.41 ± 0.68 28.17 ± 0.37 69.73 ± 0.11 68.37 ± 0.14
Lo- 82.58 ± 0.79 55.21 ± 0.72 26.5 ± 0.54 62.28 ± 0.34 62.52 ± 0.29
OA-E (A) 79.89 ± 0.98 56.77 ± 0.85 36.12 ± 0.81 67.99 ± 0.28 67.14 ± 0.26
RW 77.78 ± 0.98 56.18 ± 1.12 20.17 ± 0.83 56.89 ± 0.34 56.13 ± 0.31
GL 66.11 ± 1.31 57.05 ± 0.83 18.16 ± 0.47 47.37 ± 0.15 48.39 ± 0.18
SP 82.22 ± 1.14 56.18 ± 0.56 28.17 ± 0.64 62.02 ± 0.17 61.41 ± 0.32
Table 2: Comparison of classification accuracy against graph kernel methods with node labels or the WL technique.
Datasets PTC-MR ENZYMES PROTEINS NCI1 NCI109
RGE(ASG) 61.5 ± 2.34(1s) 48.27 ± 0.99(28s) 75.98 ± 0.71(20s) 76.46 ± 0.45(379s) 74.42 ± 0.30(526s)
EMD 57.67 ± 2.11 (42s) 42.85 ± 0.72 (296s) 76.03 ± 0.28 (1936s) 75.89 ± 0.16 (7942s) 73.63 ± 0.33 (8073s)
PM 60.38 ± 0.86 40.33 ± 0.34 74.39 ± 0.45 72.91 ± 0.53 71.97 ± 0.15
OA-E (A) 58.76 ± 0.92 43.56 ± 0.66 — 69.83 ± 0.30 68.96 ± 0.35
V-OA 56.4 ± 1.8 35.1 ± 1.1 73.8 ± 0.5 65.6 ± 0.4 65.1 ± 0.4
RW 57.06 ± 0.86 19.33 ± 0.62 71.67 ± 0.78 63.34 ± 0.27 63.51 ± 0.18
GL 59.41 ± 0.94 32.70 ± 1.20 71.63 ± 0.33 66.00 ± 0.07 66.59 ± 0.08
SP 60.00 ± 0.72 41.68 ± 1.79 73.32 ± 0.45 73.47 ± 0.11 73.07 ± 0.11
WL-RGE(ASG) 62.20 ± 1.67(1s) 57.97 ± 1.16(38s) 76.63 ± 0.82(30s) 85.85 ± 0.42(401s) 85.32 ± 0.29(798s)
WL-ST 57.64 ± 0.68 52.22 ± 0.71 72.92 ± 0.67 82.19 ± 0.18 82.46 ± 0.24
▶ RGE is much faster than EMD
29 / 35
Table 2: Comparison of classification accuracy against graph kernel methods with node labels or the WL technique.
Datasets PTC-MR ENZYMES PROTEINS NCI1 NCI109
RGE(ASG) 61.5 ± 2.34(1s) 48.27 ± 0.99(28s) 75.98 ± 0.71(20s) 76.46 ± 0.45(379s) 74.42 ± 0.30(526s)
EMD 57.67 ± 2.11 (42s) 42.85 ± 0.72 (296s) 76.03 ± 0.28 (1936s) 75.89 ± 0.16 (7942s) 73.63 ± 0.33 (8073s)
PM 60.38 ± 0.86 40.33 ± 0.34 74.39 ± 0.45 72.91 ± 0.53 71.97 ± 0.15
OA-E (A) 58.76 ± 0.92 43.56 ± 0.66 — 69.83 ± 0.30 68.96 ± 0.35
V-OA 56.4 ± 1.8 35.1 ± 1.1 73.8 ± 0.5 65.6 ± 0.4 65.1 ± 0.4
RW 57.06 ± 0.86 19.33 ± 0.62 71.67 ± 0.78 63.34 ± 0.27 63.51 ± 0.18
GL 59.41 ± 0.94 32.70 ± 1.20 71.63 ± 0.33 66.00 ± 0.07 66.59 ± 0.08
SP 60.00 ± 0.72 41.68 ± 1.79 73.32 ± 0.45 73.47 ± 0.11 73.07 ± 0.11
WL-RGE(ASG) 62.20 ± 1.67(1s) 57.97 ± 1.16(38s) 76.63 ± 0.82(30s) 85.85 ± 0.42(401s) 85.32 ± 0.29(798s)
WL-ST 57.64 ± 0.68 52.22 ± 0.71 72.92 ± 0.67 82.19 ± 0.18 82.46 ± 0.24
WL-SP 56.76 ± 0.78 59.05 ± 1.05 74.49 ± 0.74 84.55 ± 0.36 83.53 ± 0.30
WL-OA-E (A) 59.72 ± 1.10 53.76 ± 0.82 — 84.75 ± 0.21 84.23 ± 0.19
Table 3: Comparison of classification accuracy against recent deep learning models on graphs.
Datasets PTC-MR PROTEINS NCI1 IMDB-B IMDB-M COLLAB
(WL-)RGE(ASG) 62.20 ± 1.67 76.63 ± 0.82 85.85 ± 0.42 71.48 ± 1.01 47.26 ± 0.89 76.85 ± 0.34
DGCNN 58.59 ± 2.47 75.54 ± 0.94 74.44 ± 0.47 70.03 ± 0.86 47.83 ± 0.85 73.76 ± 0.49
PSCN 62.30 ± 5.70 75.00 ± 2.51 76.34 ± 1.68 71.00 ± 2.29 45.23 ± 2.84 72.60 ± 2.15
DCNN 56.6 ± 1.20 61.29 ± 1.60 56.61 ± 1.04 49.06 ± 1.37 33.49 ± 1.42 52.11 ± 0.53
DGK 57.32 ± 1.13 71.68 ± 0.50 62.48 ±0.25 66.96 ± 0.56 44.55 ± 0.52 73.09 ± 0.25
(Fragment from the paper: when generating random adjacency matrices, the number of edges is set to twice the number of nodes; Fig. 3(a) shows the linear scalability of RGE in the number of graphs N, confirming the complexity analysis. Tables 1, 2, and 3 show that RGE consistently outperforms or matches other state-of-the-art graph kernels and deep learning approaches in classification accuracy.)
▶ Outperforms other graph kernels and deep learning approaches
▶ RGE is much faster than EMD
▶ The WL technique yields good performance
30 / 35
Conclusion
Proposed a good graph kernel!
▶ Scalable
▶ Takes global graph properties into account
Thank you.
31 / 35
Appendix I
▶ If two graphs are isomorphic, the eigenvalues of their adjacency matrices coincide, but the converse does not hold
Normalized Laplacian matrix:

L_{i,j} := 1                          if i = j and deg(v_i) ≠ 0
L_{i,j} := −1 / √(deg(v_i) deg(v_j))  if i ≠ j and v_i is adjacent to v_j
L_{i,j} := 0                          otherwise

deg(v): degree of node (vertex) v
32 / 35
Appendix II
33 / 35
Appendix III Table 4: Properties of the datasets.
Dataset MUTAG PTC ENZYMES PROTEINS NCI1 NCI109 IMDB-B IMDB-M COLLAB
Max # Nodes 28 109 126 620 111 111 136 89 492
Min # Nodes 10 2 2 4 3 4 12 7 32
Ave # Nodes 17.9 25.6 32.6 39.05 29.9 29.7 19.77 13.0 74.49
Max # Edges 33 108 149 1049 119 119 1249 1467 40119
Min # Edges 10 1 1 5 2 3 26 12 60
Ave # Edges 19.8 26.0 62.1 72.81 32.3 32.1 96.53 65.93 2457.34
# Graph 188 344 600 1113 4110 4127 1000 1500 5000
# Graph Labels 2 2 6 2 2 2 2 3 3
# Node Labels 7 19 3 3 37 38 — — —
(Fragments from the paper's experimental details: random adjacency matrices have twice as many edges as nodes; node embedding size d = 6 (4, 6, or 8 depending on the dataset, the same for all RGE variants on a dataset); DMax = 10 and R = 128 for the scalability runs; a linear SVM implemented in LIBLINEAR is used to faithfully measure the effectiveness of the feature representation; experiments are repeated ten times (100 runs per dataset) and average prediction accuracies with standard deviations are reported; the ranges of γ and D_max are [1e-3, 1e-2, 1e-1, 1, 10] and [3:3:30]; all SVM and method hyperparameters are optimized only on the training set; baseline numbers are taken from the papers except EMD, which was rerun for accuracy and runtime comparisons since RGE, EMD, and PM are built on the same node embeddings.)
Terms
WL test:
▶ Technique to improve kernel with node labels
RGE(ASG)-NodeLab:
▶ Data-dependent random graph + Incorporating Label information
WL-RGE:
▶ Data-dependent random graph + WL test
34 / 35
References I
[1] Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Matching node embeddings for graph similarity. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
35 / 35
Ad

Recommended

parameterized complexity for graph Motif
parameterized complexity for graph Motif
AMR koura
 
Information-theoretic clustering with applications
Information-theoretic clustering with applications
Frank Nielsen
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNE
David Khosid
 
High Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNE
Kai-Wen Zhao
 
Triangle counting handout
Triangle counting handout
csedays
 
New Classes of Odd Graceful Graphs
New Classes of Odd Graceful Graphs
graphhoc
 
SIAM - Minisymposium on Guaranteed numerical algorithms
SIAM - Minisymposium on Guaranteed numerical algorithms
Jagadeeswaran Rathinavel
 
Lecture 11 (Digital Image Processing)
Lecture 11 (Digital Image Processing)
VARUN KUMAR
 
Lecture 3 image sampling and quantization
Lecture 3 image sampling and quantization
VARUN KUMAR
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...
Frank Nielsen
 
Litvinenko low-rank kriging +FFT poster
Litvinenko low-rank kriging +FFT poster
Alexander Litvinenko
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
201707 SER332 Lecture 23
201707 SER332 Lecture 23
Javier Gonzalez-Sanchez
 
Graph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & Trends
Luc Brun
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017
Fred J. Hickernell
 
Tucker tensor analysis of Matern functions in spatial statistics
Tucker tensor analysis of Matern functions in spatial statistics
Alexander Litvinenko
 
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
The Statistical and Applied Mathematical Sciences Institute
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Frank Nielsen
 
Presentation 2(power point presentation) dis2016
Presentation 2(power point presentation) dis2016
Daniel Omunting
 
ikh323-05
ikh323-05
Anung Ariwibowo
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Communication Systems & Networks
 
Efficient Technique for Image Stenography Based on coordinates of pixels
Efficient Technique for Image Stenography Based on coordinates of pixels
IOSR Journals
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
Efficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representations
NAVER Engineering
 
Pixelrelationships
Pixelrelationships
Harshavardhan Reddy
 
Clustering lect
Clustering lect
Shadi Nabil Albarqouni
 
Need for Controllers having Integer Coefficients in Homomorphically Encrypted D...
Need for Controllers having Integer Coefficients in Homomorphically Encrypted D...
CDSL_at_SNU
 
A survey on graph kernels
A survey on graph kernels
vincyy
 
Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
tuxette
 

More Related Content

What's hot (20)

Lecture 3 image sampling and quantization
Lecture 3 image sampling and quantization
VARUN KUMAR
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...
Frank Nielsen
 
Litvinenko low-rank kriging +FFT poster
Litvinenko low-rank kriging +FFT poster
Alexander Litvinenko
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
201707 SER332 Lecture 23
201707 SER332 Lecture 23
Javier Gonzalez-Sanchez
 
Graph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & Trends
Luc Brun
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017
Fred J. Hickernell
 
Tucker tensor analysis of Matern functions in spatial statistics
Tucker tensor analysis of Matern functions in spatial statistics
Alexander Litvinenko
 
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
CLIM Fall 2017 Course: Statistics for Climate Research, Spatial Data: Models ...
The Statistical and Applied Mathematical Sciences Institute
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Frank Nielsen
 
Presentation 2(power point presentation) dis2016
Presentation 2(power point presentation) dis2016
Daniel Omunting
 
ikh323-05
ikh323-05
Anung Ariwibowo
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Communication Systems & Networks
 
Efficient Technique for Image Stenography Based on coordinates of pixels
Efficient Technique for Image Stenography Based on coordinates of pixels
IOSR Journals
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
The Statistical and Applied Mathematical Sciences Institute
 
Efficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representations
NAVER Engineering
 
Pixelrelationships
Pixelrelationships
Harshavardhan Reddy
 
Clustering lect
Clustering lect
Shadi Nabil Albarqouni
 
Need for Controllers having Integer Coefficients in Homomorphically Encrypted D...
Need for Controllers having Integer Coefficients in Homomorphically Encrypted D...
CDSL_at_SNU
 
Lecture 3 image sampling and quantization
Lecture 3 image sampling and quantization
VARUN KUMAR
 
Optimal interval clustering: Application to Bregman clustering and statistica...
Optimal interval clustering: Application to Bregman clustering and statistica...
Frank Nielsen
 
Litvinenko low-rank kriging +FFT poster
Litvinenko low-rank kriging +FFT poster
Alexander Litvinenko
 
Graph Edit Distance: Basics & Trends
Graph Edit Distance: Basics & Trends
Luc Brun
 
QMC Error SAMSI Tutorial Aug 2017
QMC Error SAMSI Tutorial Aug 2017
Fred J. Hickernell
 
Tucker tensor analysis of Matern functions in spatial statistics
Tucker tensor analysis of Matern functions in spatial statistics
Alexander Litvinenko
 
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Tailored Bregman Ball Trees for Effective Nearest Neighbors
Frank Nielsen
 
Presentation 2(power point presentation) dis2016
Presentation 2(power point presentation) dis2016
Daniel Omunting
 
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Novel Performance Analysis of Network Coded Communications in Single-Relay Ne...
Communication Systems & Networks
 
Efficient Technique for Image Stenography Based on coordinates of pixels
Efficient Technique for Image Stenography Based on coordinates of pixels
IOSR Journals
 
Efficient end-to-end learning for quantizable representations
Efficient end-to-end learning for quantizable representations
NAVER Engineering
 
Need for Controllers having Integer Coefficients in Homomorphically Encrypted D...
Need for Controllers having Integer Coefficients in Homomorphically Encrypted D...
CDSL_at_SNU
 

Similar to Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding (20)

A survey on graph kernels
A survey on graph kernels
vincyy
 
Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
tuxette
 
Representation Learning in Large Attributed Graphs
Representation Learning in Large Attributed Graphs
Nesreen K. Ahmed
 
Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -
kashipong
 
Graph Neural Networks for Recommendations
Graph Neural Networks for Recommendations
WQ Fan
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
Nesreen K. Ahmed
 
Workshop Tel Aviv - Graph Data Science
Workshop Tel Aviv - Graph Data Science
Neo4j
 
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
thanhdowork
 
An experimental evaluation of similarity-based and embedding-based link predi...
An experimental evaluation of similarity-based and embedding-based link predi...
IJDKP
 
An Experimental Evaluation of Similarity-Based and Embedding-Based Link Predi...
An Experimental Evaluation of Similarity-Based and Embedding-Based Link Predi...
IJDKP
 
kdd_talk.pdf
kdd_talk.pdf
ssuser6d9950
 
kdd_talk.pdf
kdd_talk.pdf
ssuser6d9950
 
19EC4073_PR_CO3 PPdcdfvsfgfvgfdgbtvfT.pptx
19EC4073_PR_CO3 PPdcdfvsfgfvgfdgbtvfT.pptx
lingaswamy16
 
Colloquium.pptx
Colloquium.pptx
Mythili680896
 
A Local Branching Heuristic For Solving A Graph Edit Distance Problem
A Local Branching Heuristic For Solving A Graph Edit Distance Problem
Robin Beregovska
 
Chapter2 NEAREST NEIGHBOURHOOD ALGORITHMS.pdf
Chapter2 NEAREST NEIGHBOURHOOD ALGORITHMS.pdf
PRABHUCECC
 
Grl book
Grl book
HibaRamadan4
 
3a-knn.pptxhggmtdu0lphm0kultkkkkkkkkkkkk
3a-knn.pptxhggmtdu0lphm0kultkkkkkkkkkkkk
Pluto62
 
Health-e-Child CaseReasoner
Health-e-Child CaseReasoner
GaborRendes
 
Presentation on Graph Clustering (vldb 09)
Presentation on Graph Clustering (vldb 09)
Waqas Nawaz
 
A survey on graph kernels
A survey on graph kernels
vincyy
 
Convolutional networks and graph networks through kernels
Convolutional networks and graph networks through kernels
tuxette
 
Representation Learning in Large Attributed Graphs
Representation Learning in Large Attributed Graphs
Nesreen K. Ahmed
 
Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -
kashipong
 
Graph Neural Networks for Recommendations
Graph Neural Networks for Recommendations
WQ Fan
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
Nesreen K. Ahmed
 
Workshop Tel Aviv - Graph Data Science
Workshop Tel Aviv - Graph Data Science
Neo4j
 
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
thanhdowork
 
An experimental evaluation of similarity-based and embedding-based link predi...
An experimental evaluation of similarity-based and embedding-based link predi...
IJDKP
 
An Experimental Evaluation of Similarity-Based and Embedding-Based Link Predi...
An Experimental Evaluation of Similarity-Based and Embedding-Based Link Predi...
IJDKP
 
19EC4073_PR_CO3 PPdcdfvsfgfvgfdgbtvfT.pptx
19EC4073_PR_CO3 PPdcdfvsfgfvgfdgbtvfT.pptx
lingaswamy16
 
A Local Branching Heuristic For Solving A Graph Edit Distance Problem
A Local Branching Heuristic For Solving A Graph Edit Distance Problem
Robin Beregovska
 
Chapter2 NEAREST NEIGHBOURHOOD ALGORITHMS.pdf
Chapter2 NEAREST NEIGHBOURHOOD ALGORITHMS.pdf
PRABHUCECC
 
3a-knn.pptxhggmtdu0lphm0kultkkkkkkkkkkkk
3a-knn.pptxhggmtdu0lphm0kultkkkkkkkkkkkk
Pluto62
 
Health-e-Child CaseReasoner
Health-e-Child CaseReasoner
GaborRendes
 
Presentation on Graph Clustering (vldb 09)
Presentation on Graph Clustering (vldb 09)
Waqas Nawaz
 
Ad

Recently uploaded (20)

PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
2025_06_18 - OpenMetadata Community Meeting.pdf
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
Safe Software
 
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
 
Cluster-Based Multi-Objective Metamorphic Test Case Pair Selection for Deep N...
Cluster-Based Multi-Objective Metamorphic Test Case Pair Selection for Deep N...
janeliewang985
 
AI VIDEO MAGAZINE - June 2025 - r/aivideo
AI VIDEO MAGAZINE - June 2025 - r/aivideo
1pcity Studios, Inc
 
OWASP Barcelona 2025 Threat Model Library
OWASP Barcelona 2025 Threat Model Library
PetraVukmirovic
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
The Future of Technology: 2025-2125 by Saikat Basu.pdf
The Future of Technology: 2025-2125 by Saikat Basu.pdf
Saikat Basu
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
Securing AI - There Is No Try, Only Do!.pdf
Securing AI - There Is No Try, Only Do!.pdf
Priyanka Aash
 
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
WebdriverIO & JavaScript: The Perfect Duo for Web Automation
WebdriverIO & JavaScript: The Perfect Duo for Web Automation
digitaljignect
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
2025_06_18 - OpenMetadata Community Meeting.pdf
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
Safe Software
 
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
 
Cluster-Based Multi-Objective Metamorphic Test Case Pair Selection for Deep N...
Cluster-Based Multi-Objective Metamorphic Test Case Pair Selection for Deep N...
janeliewang985
 
AI VIDEO MAGAZINE - June 2025 - r/aivideo
AI VIDEO MAGAZINE - June 2025 - r/aivideo
1pcity Studios, Inc
 
OWASP Barcelona 2025 Threat Model Library
OWASP Barcelona 2025 Threat Model Library
PetraVukmirovic
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
The Future of Technology: 2025-2125 by Saikat Basu.pdf
The Future of Technology: 2025-2125 by Saikat Basu.pdf
Saikat Basu
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Using the SQLExecutor for Data Quality Management: aka One man's love for the...
Safe Software
 
Securing AI - There Is No Try, Only Do!.pdf
Securing AI - There Is No Try, Only Do!.pdf
Priyanka Aash
 
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
WebdriverIO & JavaScript: The Perfect Duo for Web Automation
WebdriverIO & JavaScript: The Perfect Duo for Web Automation
digitaljignect
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
Ad

Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding

  • 1. Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding KDD2019 Lingfei Wu, Ian En-Hsu Yen, Zhen Zhang †, Kun Xu, Liang Zhao, Xi Peng, Yinglong Xia, Charu Aggarwal Presenter: Hagawa, Nishi, Eugene 2019.11.11 1 / 35
  • 2. Problem Setup Goal: ▶ Create a good kernel to measure Graph similarity ▶ Less computational complexity ▶ Take into account global and local graph property ▶ Have positive definite ▶ Leads to good classifier Application: ▶ Kernel SVM (input: graph, output: binary) ▶ Kernel PCA ▶ Kernel Ridge Regression ▶ . . . How similar? 𝑘( ) = 0.5, 2 / 35
  • 3. Difficulty : Graph isomorphism difficulty to define similarity between graphs ▶ 2 graphs : G1(V1, E1, ℓ1, L1), G2(V2, E2, ℓ2, L2) ▶ Bijection1 f exists, if and only if, G1 is isomorphism with G2 ▶ Bijection f : V1 → V2 s.t {va, vb} ∈ E1, va and vb are adjacent. ▶ Partial isomorphism is NP-complete 1 全単射 3 / 35
  • 4. Related Work 2 groups of recent graph kernel method Comparing sub-structure: ▶ The major difference is how to define and explore sub-structures - random walks, shortest paths, cycles, subtree patterns, graphlets... Geometric node embeddings: ▶ Capture global property ▶ Achieved state-of-the-art performance in the graph classification task Bad points of related works Comparing sub-structure: ▶ Do not take into account the global property Geometric node embeddings: ▶ Do not necessarily use positive definite for Kernel Poor scalability: 4 / 35
  • 5. Contribution ▶ Propose a Positive definite Kernel ▶ Reduce computational complexity ▶ From quadratic to (quasi-)linear 2 ▶ Propose an approximation of the kernel with convergence analysis ▶ Take into account global property ▶ Outperforms 12 state-of-the-art graph classification algorithms - Include graph kernels, deep graph neural networks 2 quasi-linear : n log n. Time and Space. 5 / 35
  • 6. Common kernel Compare directly 2 graphs using kernel Similarity 𝒌(・, ・) Figure: calculation of kernel value between 2 graphs 6 / 35
  • 7. Proposed kernel Compare directly 2 graphs using kernel Similarity 𝒌(・, ・) Random Graphs Similarity with 𝒌(・, ・) Figure: calculation of kernel value between 2 graphs 7 / 35
  • 8. Notation : Graph definition Graph: G = (V , E, ℓ) Node: V = {vi }n i=1 Edge: E = (V × V ) Assign label function: ℓ : V → Σ Size of node: n # of edge: m Node label: l # of graphs: N G<latexit sha1_base64="QLLEFqFGXJzmcwbhRTcNSo8/+r8=">AAAB6HicbVDLSsNAFJ3UV62vqks3g0VwVRItPnZFF7pswT6gDWUyvWnHTiZhZiKU0C9w40IRt36SO//GSRpErQcuHM65l3vv8SLOlLbtT6uwtLyyulZcL21sbm3vlHf32iqMJYUWDXkoux5RwJmAlmaaQzeSQAKPQ8ebXKd+5wGkYqG409MI3ICMBPMZJdpIzZtBuWJX7Qx4kTg5qaAcjUH5oz8MaRyA0JQTpXqOHWk3IVIzymFW6scKIkInZAQ9QwUJQLlJdugMHxlliP1QmhIaZ+rPiYQESk0Dz3QGRI/VXy8V//N6sfYv3ISJKNYg6HyRH3OsQ5x+jYdMAtV8agihkplbMR0TSag22ZSyEC5TnH2/vEjaJ1XntFpr1ir1qzyOIjpAh+gYOegc1dEtaqAWogjQI3pGL9a99WS9Wm/z1oKVz+yjX7DevwC1D40D</latexit> v1<latexit sha1_base64="6r48FeRijmeRwM0ce/9YOgxnVX0=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0m0+HErevFYwbSFNpTNdtMu3WzC7qZQQn+DFw+KePUHefPfuEmDqPXBwOO9GWbm+TFnStv2p1VaWV1b3yhvVra2d3b3qvsHbRUlklCXRDySXR8rypmgrmaa024sKQ59Tjv+5DbzO1MqFYvEg57F1AvxSLCAEayN5E4HqTMfVGt23c6BlolTkBoUaA2qH/1hRJKQCk04Vqrn2LH2Uiw1I5zOK/1E0RiTCR7RnqECh1R5aX7sHJ0YZYiCSJoSGuXqz4kUh0rNQt90hliP1V8vE//zeokOrryUiTjRVJDFoiDhSEco+xwNmaRE85khmEhmbkVkjCUm2uRTyUO4znDx/fIyaZ/VnfN6475Ra94UcZThCI7hFBy4hCbcQQtcIMDgEZ7hxRLWk/VqvS1aS1Yxcwi/YL1/AeZQjuI=</latexit> v2<latexit sha1_base64="HvFip7AjDkPR91+3+J6CugKM0SQ=">AAAB7HicbVBNS8NAEJ34WetX1aOXxSJ4KkktftyKXjxWMG2hDWWz3bRLN5uwuymU0N/gxYMiXv1B3vw3btIgan0w8Hhvhpl5fsyZ0rb9aa2srq1vbJa2yts7u3v7lYPDtooSSahLIh7Jro8V5UxQVzPNaTeWFIc+px1/cpv5nSmVikXiQc9i6oV4JFjACNZGcqeDtD4fVKp2zc6BlolTkCoUaA0qH/1hRJKQCk04Vqrn2LH2Uiw1I5zOy/1E0RiTCR7RnqECh1R5aX7sHJ0aZYiCSJoSGuXqz4kUh0rNQt90hliP1V8vE//zeokOrryUiTjRVJDFoiDhSEco+xwNmaRE85khmEhmbkVkjCUm2uRTzkO4znDx/fIyaddrznmtcd+oNm+KOEpwDCdwBg5cQhPuoAUuEGDwCM/wYgnryXq13hatK1YxcwS/YL1/AefVjuM=</latexit> v3<latexit sha1_base64="+XpoULfOHqCHvyZwfk/DV8G7sg0=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0m0+HErevFYwbSFNpTNdtMu3WzC7qZQQn+DFw+KePUHefPfuEmDqPXBwOO9GWbm+TFnStv2p1VaWV1b3yhvVra2d3b3qvsHbRUlklCXRDySXR8rypmgrmaa024sKQ59Tjv+5DbzO1MqFYvEg57F1AvxSLCAEayN5E4H6fl8UK3ZdTsHWiZOQWpQoDWofvSHEUlCKjThWKmeY8faS7HUjHA6r/QTRWNMJnhEe4YKHFLlpfmxc3RilCEKImlKaJSrPydSHCo1C33TGWI9Vn+9TPzP6yU6uPJSJuJEU0EWi4KEIx2h7HM0ZJISzWeGYCKZuRWRMZaYaJNPJQ/hOsPF98vLpH1Wd87rjftGrXlTxFGGIziGU3DgEppwBy1wgQCDR3iGF0tYT9ar9bZoLVnFzCH8gvX+BelajuQ=</latexit> V = {v1, v2, v3}<latexit sha1_base64="5/LCIMtGZ5h5wVQCMmTMg/5dkCc=">AAAB/3icbVDLSsNAFJ34rPUVFdy4GSyCCylJW3wshKIblxXsA5oQJtNJO3QyCTOTQold+CtuXCji1t9w5984SYuo9cBcDufcy71z/JhRqSzr01hYXFpeWS2sFdc3Nre2zZ3dlowSgUkTRywSHR9JwignTUUVI51YEBT6jLT94XXmt0dESBrxOzWOiRuiPqcBxUhpyTP3W/ASOikcefaJLpWsVJ2JZ5asspUDzhN7RkpghoZnfji9CCch4QozJGXXtmLlpkgoihmZFJ1EkhjhIeqTrqYchUS6aX7/BB5ppQeDSOjHFczVnxMpCqUch77uDJEayL9eJv7ndRMVnLsp5XGiCMfTRUHCoIpgFgbsUUGwYmNNEBZU3wrxAAmElY6smIdwkeH0+8vzpFUp29Vy7bZWql/N4iiAA3AIjoENzkAd3IAGaAIM7sEjeAYvxoPxZLwab9PWBWM2swd+wXj/AprvlA8=</latexit> ⌃ = { , }<latexit sha1_base64="ZY89SR6jHBd25PoJ2nDrWsihEs4=">AAAB/XicbVDLSsNAFJ34rPUVHzs3g0VwISXR4mMhFN24rGgf0IQymU7aoTNJmJkINbT+ihsXirj1P9z5N07SIGo9cOFwzr3ce48XMSqVZX0aM7Nz8wuLhaXi8srq2rq5sdmQYSwwqeOQhaLlIUkYDUhdUcVIKxIEcY+Rpje4TP3mHRGShsGtGkbE5agXUJ9ipLTUMbedG9rjCJ5DJxmPxwe6nFHHLFllKwOcJnZOSiBHrWN+ON0Qx5wECjMkZdu2IuUmSCiKGRkVnViSCOEB6pG2pgHiRLpJdv0I7mmlC/1Q6AoUzNSfEwniUg65pzs5Un3510vF/7x2rPxTN6FBFCsS4MkiP2ZQhTCNAnapIFixoSYIC6pvhbiPBMJKB1bMQjhLcfz98jRpHJbto3LlulKqXuRxFMAO2AX7wAYnoAquQA3UAQb34BE8gxfjwXgyXo23SeuMkc9sgV8w3r8ASdWVRQ==</latexit> 8 / 35
  • 9. Notation Set of graphs: G = {Gi }N i=1 Set of graph lebels: Y = {Yi }N i=1 Set of geometric embeddings (each graph): U = {ui }n i=1 ∈ Rn×d Latent node embedding space (each node): u ∈ Rd 𝐺" ・・・ 𝑁 𝑛Latent node ↑ embedding Node size→ # of graphs→ 𝑌"Graph label→ 𝑌& 𝐺& u1 2 Rd <latexit sha1_base64="FiX+xGGr4lrH54q+qBxUWlkIUrA=">AAACDnicbVDLSsNAFJ3UV62vqEs3g6XgqiRafOyKblxWsQ9oYphMJu3QySTMTIQS8gVu/BU3LhRx69qdf2OSBlHrgQuHc+7l3nvciFGpDONTqywsLi2vVFdra+sbm1v69k5PhrHApItDFoqBiyRhlJOuooqRQSQIClxG+u7kIvf7d0RIGvIbNY2IHaARpz7FSGWSozcsN2RegNQ4iVMnMVNoUQ6tXBBBcp3eJl4KoaPXjaZRAM4TsyR1UKLj6B+WF+I4IFxhhqQcmkak7AQJRTEjac2KJYkQnqARGWaUo4BIOyneSWEjUzzohyIrrmCh/pxIUCDlNHCzzvxO+dfLxf+8Yaz8UzuhPIoV4Xi2yI8ZVCHMs4EeFQQrNs0IwoJmt0I8RgJhlSVYK0I4y3H8/fI86R02zaNm66pVb5+XcVTBHtgHB8AEJ6ANLkEHdAEG9+ARPIMX7UF70l61t1lrRStndsEvaO9fjuucjQ==</latexit> u2<latexit sha1_base64="Pm48/PPv93nEVMYQDi7yld7eDYw=">AAAB+XicbVDLSsNAFJ34rPUVdelmsAiuSlKLj13RjcsK9gFtCJPJpB06mQkzk0IJ/RM3LhRx65+482+cpEHUemDgcM693DMnSBhV2nE+rZXVtfWNzcpWdXtnd2/fPjjsKpFKTDpYMCH7AVKEUU46mmpG+okkKA4Y6QWT29zvTYlUVPAHPUuIF6MRpxHFSBvJt+1hIFgYIz3O0rmfNea+XXPqTgG4TNyS1ECJtm9/DEOB05hwjRlSauA6ifYyJDXFjMyrw1SRBOEJGpGBoRzFRHlZkXwOT40SwkhI87iGhfpzI0OxUrM4MJN5RvXXy8X/vEGqoysvozxJNeF4cShKGdQC5jXAkEqCNZsZgrCkJivEYyQR1qasalHCdY6L7y8vk26j7p7Xm/fNWuumrKMCjsEJOAMuuAQtcAfaoAMwmIJH8AxerMx6sl6tt8XoilXuHIFfsN6/ACNylCA=</latexit> u3<latexit sha1_base64="w2BS8kqWqIp26xG7B4vB81cBpaY=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRItPnZFNy4r2Ae0IUwmk3boZCbMTAol9E/cuFDErX/izr9xkgZR64GBwzn3cs+cIGFUacf5tCorq2vrG9XN2tb2zu6evX/QVSKVmHSwYEL2A6QIo5x0NNWM9BNJUBww0gsmt7nfmxKpqOAPepYQL0YjTiOKkTaSb9vDQLAwRnqcpXM/O5/7dt1pOAXgMnFLUgcl2r79MQwFTmPCNWZIqYHrJNrLkNQUMzKvDVNFEoQnaEQGhnIUE+VlRfI5PDFKCCMhzeMaFurPjQzFSs3iwEzmGdVfLxf/8wapjq68jPIk1YTjxaEoZVALmNcAQyoJ1mxmCMKSmqwQj5FEWJuyakUJ1zkuvr+8TLpnDfe80bxv1ls3ZR1VcASOwSlwwSVogTvQBh2AwRQ8gmfwYmXWk/VqvS1GK1a5cwh+wXr/AiT3lCE=</latexit> 9 / 35
  • 10. Geometric Embeddings Use partial eigendecomposition 3 to extract node embeddings: 1. Create normalized Laplacian matrix L ∈ Rn×n 2. Do partial eigendecomposition and obtaining U 3. Use the smallest d eigenvectors Normalized Laplacian matrix → Partial Eigendecomposition 𝑈Λ𝑈# 𝑛×𝑑 𝑑×𝑑 𝑑×𝑛 The smallest 𝑑 eigenvectors L𝑛×𝑛 A B C A B C A 0 1 1 B 1 0 0 C 1 0 0 Adjacency matrix A B C A 2 0 0 B 0 1 0 C 0 0 1 Degree matrix A B C A 2 -1 -1 B -1 1 0 C -1 0 1 Laplacian matrix -= Normalize Figure: Example obtaining U 3 Time complexity: Linear (# of graph edge) (...I don’t know how.) 10 / 35
  • 11. Transportation Distance [1] Earth Mover’s Distance (EMD): measure of dissimilarity EMD (Gx , Gy ) := min T ∈R nx ×ny + ⟨D, T ⟩ s.t.T 1 = t(Gx ) , T T 1 = t(Gy ) ▶ Linear programming problem ▶ Flow matrix T - Tij : how much of vi in Gx travels to vj in Gy ▶ GX → UX = {ux 1, ux 2, · · · , ux nx } ▶ GY → UY = {uy 1, uy 2, · · · , uy ny } ▶ Transport cost matrix D - Dij = ∥ux i − uy j ∥2 11 / 35
  • 12. Transportation Distance [1] Earth Mover’s Distance (EMD): measure of dissimilarity EMD (Gx , Gy ) := min T ∈R nx ×ny + ⟨D, T ⟩ s.t.T 1 = t(Gx ) , T T 1 = t(Gy ) ▶ Node vi has ci outgoing edges ▶ Normalized bog-of-words (nBOW): ti = ci / ∑n j=1 cj ∈ R 12 / 35
  • 13. Transportation Distance: Example
Figure: EMD example — transport flows between nodes A, B, C of one graph and nodes a, b, c of the other
▶ EMD focuses on the node sizes and the outgoing edges of each graph
13 / 35
  • 14. Straightforward way to define a kernel — it's high cost
EMD-based kernel: K = −(1/2) J D_emd J, with J = I − (1/N) 11ᵀ
▶ Not necessarily positive definite
▶ Time complexity: O(N² n³ log(n)), Space complexity: O(N²)
Figure: straightforward kernel based on EMD — an N×N distance matrix D_emd holding EMD(·, ·) for every pair of data graphs
14 / 35
  • 15. Global Alignment Graph Kernel Using EMD and Random Features (RF)
Proposed kernel: 4
k(Gx, Gy) := ∫ p(Gω) φ_{Gω}(Gx) φ_{Gω}(Gy) dGω, where φ_{Gω}(Gx) := exp(−γ EMD(Gx, Gω))
▶ Gω: random graph with node embeddings W = {wi}_{i=1}^D
▶ wi is sampled from the latent node embedding space V ⊆ R^d
▶ p(Gω) is a distribution over the space of all random graphs of variable sizes Ω := ∪_{D=1}^{Dmax} V^D
4 I meant to dig into the details of the random graphs, but the discussion is quite involved and Hagawa gave up; if you are curious, see the linked reference. I really don't get probability.
15 / 35
  • 16. Global Alignment Graph Kernel Using EMD and RF
Approximation: 5
k̃(Gx, Gy) = (1/R) ∑_{i=1}^R φ_{Gωi}(Gx) φ_{Gωi}(Gy) → k(Gx, Gy), as R → ∞
Figure: both data graphs Gx and Gy are compared against the same set of random graphs, giving feature vectors φ_{Gω}(Gx) and φ_{Gω}(Gy)
(A code sketch of this approximation follows.)
5 Uniform convergence of the approximate kernel to the proposed kernel holds.
16 / 35
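A minimal sketch of the random-feature approximation, assuming some EMD implementation `emd_fn` (e.g. the linear-program sketch above) and a list of pre-generated random graphs; all names here are mine:

```python
import numpy as np

def phi(G, random_graphs, gamma, emd_fn):
    """Feature map: one coordinate phi_{G_omega}(G) = exp(-gamma * EMD(G, G_omega))
    per random graph G_omega."""
    return np.array([np.exp(-gamma * emd_fn(G, G_w)) for G_w in random_graphs])

def approx_kernel(Gx, Gy, random_graphs, gamma, emd_fn):
    """k~(Gx, Gy) = (1/R) sum_i phi_i(Gx) * phi_i(Gy), which converges to
    k(Gx, Gy) as the number of random graphs R grows."""
    zx = phi(Gx, random_graphs, gamma, emd_fn)
    zy = phi(Gy, random_graphs, gamma, emd_fn)
    return zx @ zy / len(random_graphs)
```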
  • 17. Algorithm
Set data and hyperparameters:
▶ Node embedding size (dimension): d
▶ Max size of random graphs: Dmax
▶ Graph embedding size: R
Algorithm 1 Random Graph Embedding
Input: Data graphs {Gi}_{i=1}^N, node embedding size d, maximum size of random graphs Dmax, graph embedding size R.
Output: Feature matrix Z ∈ R^{N×R} for the data graphs
1: Compute nBOW weight vectors {t(Gi)}_{i=1}^N of the normalized Laplacian L of all graphs
2: Obtain node embedding vectors {ui}_{i=1}^n by computing the d smallest eigenvectors of L
3: for j = 1, . . . , R do
4:   Draw Dj uniformly from [1, Dmax].
5:   Generate a random graph Gωj with Dj node embeddings W from Algorithm 2.
6:   Compute a feature vector Zj = φ_{Gωj}({Gi}_{i=1}^N) using EMD or another optimal transportation distance in Equation (3).
7: end for
8: Return feature matrix Z({Gi}_{i=1}^N) = (1/√R) {Zi}_{i=1}^R
17 / 35
  • 18. Compute {t(Gi)}_{i=1}^N and the Laplacian matrix L (Algorithm 1, step 1)
Example (graph with nodes A, B, C):
Laplacian matrix:
  A:  2 −1 −1
  B: −1  1  0
  C: −1  0  1
t(Gx) = (1/2, 1/4, 1/4)
→ For all graphs (a code sketch follows this slide)
18 / 35
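A tiny sketch of the nBOW weights from the slide's example; the helper name is mine:

```python
import numpy as np

def nbow_weights(A):
    """nBOW weight t_i = c_i / sum_j c_j, where c_i is the number of
    outgoing edges of node v_i (row sums of the adjacency matrix A)."""
    c = A.sum(axis=1)
    return c / c.sum()

# The 3-node example above: A has 2 edges, B and C have 1 each.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(nbow_weights(A))  # [0.5, 0.25, 0.25]
```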
  • 19. Obtain node embedding vectors (Algorithm 1, step 2)
Normalized Laplacian matrix L (n×n) → partial eigendecomposition UΛUᵀ (U: n×d, Λ: d×d, Uᵀ: d×n) → keep the eigenvectors of the d smallest eigenvalues
→ For all graphs
Figure: each graph Gi yields node embeddings u1, u2, u3 ∈ R^d
19 / 35
  • 20. Generate a random graph 6 (Algorithm 1, steps 4–5; a code sketch follows this slide)
Dj ← Rand(1, Dmax)   (here Dj = 2)
W (Dj×d) ← Generate_random_graph(Dj, d)
Figure: example of a 2-node random graph with embeddings u1, u2 ∈ R^d
6 In a later section, I show 2 ways to generate random graphs.
20 / 35
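A sketch of random-graph generation; the uniform sampling of node embeddings is an assumption for illustration of the data-independent variant — the paper's Algorithm 2 defines the actual sampling schemes (see also slide 24):

```python
import numpy as np

def random_graph_nodes(d, d_max, rng=None):
    """Draw a random graph: pick its size D uniformly from [1, D_max],
    then sample D node embeddings W (D x d). A random graph is fully
    described by its node embeddings; uniform weights 1/D play the
    role of its nBOW vector in the EMD computation."""
    if rng is None:
        rng = np.random.default_rng()
    D = rng.integers(1, d_max + 1)
    W = rng.uniform(0.0, 1.0, size=(D, d))  # assumed latent space [0, 1]^d
    t = np.full(D, 1.0 / D)
    return W, t
```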
  • 21. Compute a feature vector Zj (Algorithm 1, step 6)
Zji = φ_{Gω}(Gi) := exp(−γ EMD(Gi, Gω))
zj = (Zj1, . . . , ZjN)ᵀ: the j-th column holds the similarity of every data graph to the j-th random graph
21 / 35
  • 22. Generate random graphs R times 7
z1 = (Z11, . . . , Z1N)ᵀ, z2 = (Z21, . . . , Z2N)ᵀ, · · · , zR = (ZR1, . . . , ZRN)ᵀ
Each random graph Gωj contributes one feature φ_{Gωj}(Gi) per data graph.
7 R: number of random graphs
22 / 35
  • 23. Output: N × R matrix Z (Algorithm 1, step 8; a full pipeline sketch follows this slide)
Z = (1/√R) [Zij], with rows i = 1, . . . , N indexing data graphs and columns j = 1, . . . , R indexing random graphs
23 / 35
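Putting the pieces together, here is a minimal sketch of Algorithm 1; it assumes the graphs are already represented as (U, t) pairs of node embeddings and nBOW weights (earlier sketches), and takes any EMD implementation `emd_fn` — all function names are mine:

```python
import numpy as np

def random_graph_embedding(graphs, d, d_max, R, gamma, emd_fn, rng=None):
    """Sketch of Algorithm 1: map N data graphs to an N x R feature
    matrix Z so that Z @ Z.T approximates the proposed kernel matrix.

    graphs: list of (U, t) pairs -- node embeddings U (n x d) and
            nBOW weights t (n,).
    emd_fn((U1, t1), (U2, t2)): any EMD implementation.
    """
    if rng is None:
        rng = np.random.default_rng()
    N = len(graphs)
    Z = np.empty((N, R))
    for j in range(R):
        # Steps 4-5: draw a random graph (see the earlier sketch).
        W, t_w = random_graph_nodes(d, d_max, rng)
        # Step 6: one feature per data graph.
        for i, G in enumerate(graphs):
            Z[i, j] = np.exp(-gamma * emd_fn(G, (W, t_w)))
    # Step 8: scale so that Z @ Z.T averages over the R random graphs.
    return Z / np.sqrt(R)
```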
  • 24. How to generate a Random Graph
Data-independent and data-dependent distributions.
Data-dependent 8 Random Graph Embedding (Anchor Sub-Graphs (ASG)):
1. Pick a graph Gk from the data set
2. Uniformly draw Dj of its nodes
3. {wi}_{i=1}^{Dj} = {u_{n1}, u_{n2}, · · · , u_{nDj}}
Incorporating label information (see the sketch after this list):
▶ d(ui, uj) = max(∥ui − uj∥₂, √d) if vi and vj have different node labels
▶ Enforces a minimum distance between nodes with different labels
▶ √d is the largest distance in a d-dimensional unit hypercube
8 For the data-independent case, see the appendix.
24 / 35
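A minimal sketch of the label-aware distance used to build the transport cost matrix D; the function name is mine:

```python
import numpy as np

def label_aware_distance(u_i, u_j, label_i, label_j):
    """Cost-matrix entry when node labels are available: nodes with
    different labels are pushed at least sqrt(d) apart, the diameter
    of the unit hypercube [0, 1]^d."""
    dist = np.linalg.norm(u_i - u_j)
    if label_i != label_j:
        dist = max(dist, np.sqrt(len(u_i)))
    return dist
```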
  • 25. Complexity comparison (left: proposed, right: straightforward)
Figure: proposed kernel — compare all data graphs against R random graphs
Figure: straightforward kernel — N×N matrix of pairwise EMD values
Time complexity 9:
▶ Proposed: O(NRD²n log(n) + dmz)
▶ Straightforward: O(N²n³ log(n) + dmz)
※ R is the # of random graphs, D is the # of random graph nodes (D < n)
Space complexity:
▶ Proposed: O(NR)
▶ Straightforward: O(N²)
9 dmz is the partial eigendecomposition cost.
25 / 35
  • 26. Experiments
Experimental setup
Classifier:
▶ Linear SVM (LIBLINEAR)
Data:
▶ 9 datasets
Hyperparameters:
▶ γ (kernel) ∈ [1e-3, 1e-2, 1e-1, 1, 10]
▶ Dmax (size of random graphs) ∈ [3:3:30] (3 to 30 in steps of 3)
▶ SVM parameters (tuned on the training data only)
Evaluation (see the sketch after this list):
▶ 10-fold cross-validation
▶ Accuracy averaged over 10 repetitions
26 / 35
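A minimal sketch of the evaluation protocol; scikit-learn's LinearSVC (which wraps LIBLINEAR) stands in for the LIBLINEAR binary, and the helper name is mine:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate(Z, y, C=1.0, folds=10, repeats=10):
    """Z: N x R feature matrix from Algorithm 1; y: graph labels.
    Runs repeated k-fold cross-validation with a linear SVM and
    returns mean accuracy and its standard deviation."""
    accs = []
    for _ in range(repeats):
        clf = LinearSVC(C=C)
        accs.extend(cross_val_score(clf, Z, y, cv=folds))
    return np.mean(accs), np.std(accs)
```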
  • 27. # of Random Graphs (R) and Testing accuracy:
Figure 2 (top row): test accuracies of RGE(RF), RGE(ASG), and RGE(ASG)-NodeLab vs R on (a) ENZYMES, (b) NCI109, (c) IMDB-BINARY, (d) COLLAB
▶ Accuracy converges very rapidly when increasing R
# of Random Graphs (R) and Runtime:
Figure 2 (bottom row): total runtime of the same variants vs R on (e) ENZYMES, (f) NCI109, (g) IMDB-BINARY, (h) COLLAB
▶ Shows quasi-linear scalability with respect to R
27 / 35
  • 28. Scalability in N and n
Figure: (a) runtime vs number of graphs N, (b) runtime vs graph size n — eigendecomposition time, feature-generation time, and total runtime of RGE, plotted against linear and quadratic reference curves
▶ (a) shows linear scalability with respect to N
▶ (b) shows quasi-linear scalability with respect to n
28 / 35
  • 29. Classification accuracy:
Table 1: Comparison of classification accuracy against graph kernel methods without node labels.
Datasets  | MUTAG             | PTC-MR            | ENZYMES            | NCI1                | NCI109
RGE(RF)   | 86.33 ± 1.39 (1s) | 59.82 ± 1.42 (1s) | 35.98 ± 0.89 (38s) | 74.70 ± 0.56 (727s) | 72.50 ± 0.32 (865s)
RGE(ASG)  | 85.56 ± 0.91 (2s) | 59.97 ± 1.65 (1s) | 38.52 ± 0.91 (18s) | 74.30 ± 0.45 (579s) | 72.70 ± 0.42 (572s)
EMD       | 84.66 ± 2.69 (7s) | 57.65 ± 0.59 (46s)| 35.45 ± 0.93 (216s)| 72.65 ± 0.34 (8359s)| 70.84 ± 0.18 (8281s)
PM        | 83.83 ± 2.86      | 59.41 ± 0.68      | 28.17 ± 0.37       | 69.73 ± 0.11        | 68.37 ± 0.14
Lo-       | 82.58 ± 0.79      | 55.21 ± 0.72      | 26.5 ± 0.54        | 62.28 ± 0.34        | 62.52 ± 0.29
OA-E (A)  | 79.89 ± 0.98      | 56.77 ± 0.85      | 36.12 ± 0.81       | 67.99 ± 0.28        | 67.14 ± 0.26
RW        | 77.78 ± 0.98      | 56.18 ± 1.12      | 20.17 ± 0.83       | 56.89 ± 0.34        | 56.13 ± 0.31
GL        | 66.11 ± 1.31      | 57.05 ± 0.83      | 18.16 ± 0.47       | 47.37 ± 0.15        | 48.39 ± 0.18
SP        | 82.22 ± 1.14      | 56.18 ± 0.56      | 28.17 ± 0.64       | 62.02 ± 0.17        | 61.41 ± 0.32
(Table 2, with node labels, is shown in full on the next slide.)
▶ RGE is much faster than EMD
29 / 35
  • 30. Classification accuracy with node labels / WL, and deep learning baselines:
Table 2: Comparison of classification accuracy against graph kernel methods with node labels or the WL technique.
Datasets     | PTC-MR            | ENZYMES            | PROTEINS            | NCI1                | NCI109
RGE(ASG)     | 61.5 ± 2.34 (1s)  | 48.27 ± 0.99 (28s) | 75.98 ± 0.71 (20s)  | 76.46 ± 0.45 (379s) | 74.42 ± 0.30 (526s)
EMD          | 57.67 ± 2.11 (42s)| 42.85 ± 0.72 (296s)| 76.03 ± 0.28 (1936s)| 75.89 ± 0.16 (7942s)| 73.63 ± 0.33 (8073s)
PM           | 60.38 ± 0.86      | 40.33 ± 0.34       | 74.39 ± 0.45        | 72.91 ± 0.53        | 71.97 ± 0.15
OA-E (A)     | 58.76 ± 0.92      | 43.56 ± 0.66       | —                   | 69.83 ± 0.30        | 68.96 ± 0.35
V-OA         | 56.4 ± 1.8        | 35.1 ± 1.1         | 73.8 ± 0.5          | 65.6 ± 0.4          | 65.1 ± 0.4
RW           | 57.06 ± 0.86      | 19.33 ± 0.62       | 71.67 ± 0.78        | 63.34 ± 0.27        | 63.51 ± 0.18
GL           | 59.41 ± 0.94      | 32.70 ± 1.20       | 71.63 ± 0.33        | 66.00 ± 0.07        | 66.59 ± 0.08
SP           | 60.00 ± 0.72      | 41.68 ± 1.79       | 73.32 ± 0.45        | 73.47 ± 0.11        | 73.07 ± 0.11
WL-RGE(ASG)  | 62.20 ± 1.67 (1s) | 57.97 ± 1.16 (38s) | 76.63 ± 0.82 (30s)  | 85.85 ± 0.42 (401s) | 85.32 ± 0.29 (798s)
WL-ST        | 57.64 ± 0.68      | 52.22 ± 0.71       | 72.92 ± 0.67        | 82.19 ± 0.18        | 82.46 ± 0.24
WL-SP        | 56.76 ± 0.78      | 59.05 ± 1.05       | 74.49 ± 0.74        | 84.55 ± 0.36        | 83.53 ± 0.30
WL-OA-E (A)  | 59.72 ± 1.10      | 53.76 ± 0.82       | —                   | 84.75 ± 0.21        | 84.23 ± 0.19
Table 3: Comparison of classification accuracy against recent deep learning models on graphs.
Datasets      | PTC-MR       | PROTEINS     | NCI1         | IMDB-B       | IMDB-M       | COLLAB
(WL-)RGE(ASG) | 62.20 ± 1.67 | 76.63 ± 0.82 | 85.85 ± 0.42 | 71.48 ± 1.01 | 47.26 ± 0.89 | 76.85 ± 0.34
DGCNN         | 58.59 ± 2.47 | 75.54 ± 0.94 | 74.44 ± 0.47 | 70.03 ± 0.86 | 47.83 ± 0.85 | 73.76 ± 0.49
PSCN          | 62.30 ± 5.70 | 75.00 ± 2.51 | 76.34 ± 1.68 | 71.00 ± 2.29 | 45.23 ± 2.84 | 72.60 ± 2.15
DCNN          | 56.6 ± 1.20  | 61.29 ± 1.60 | 56.61 ± 1.04 | 49.06 ± 1.37 | 33.49 ± 1.42 | 52.11 ± 0.53
DGK           | 57.32 ± 1.13 | 71.68 ± 0.50 | 62.48 ± 0.25 | 66.96 ± 0.56 | 44.55 ± 0.52 | 73.09 ± 0.25
▶ RGE consistently outperforms or matches other state-of-the-art graph kernels and deep learning approaches
▶ RGE is much faster than EMD
▶ The WL technique further improves performance
30 / 35
  • 31. Conclusion
We proposed a good graph kernel!
▶ Scalable (quasi-linear time and space)
▶ Takes the global graph properties into account
Thank you.
31 / 35
  • 32. Appendix I
▶ If two graphs are isomorphic, the eigenvalues of their adjacency matrices coincide; the converse does not hold.
Normalized Laplacian matrix:
L_{i,j} := 1                              if i = j and deg(vi) ≠ 0
L_{i,j} := −1/√(deg(vi) deg(vj))          if i ≠ j and vi is adjacent to vj
L_{i,j} := 0                              otherwise
deg(v): degree of node (vertex) v
32 / 35
  • 34. Appendix III
Table 4: Properties of the datasets.
Dataset        | MUTAG | PTC  | ENZYMES | PROTEINS | NCI1 | NCI109 | IMDB-B | IMDB-M | COLLAB
Max # Nodes    | 28    | 109  | 126     | 620      | 111  | 111    | 136    | 89     | 492
Min # Nodes    | 10    | 2    | 2       | 4        | 3    | 4      | 12     | 7      | 32
Ave # Nodes    | 17.9  | 25.6 | 32.6    | 39.05    | 29.9 | 29.7   | 19.77  | 13.0   | 74.49
Max # Edges    | 33    | 108  | 149     | 1049     | 119  | 119    | 1249   | 1467   | 40119
Min # Edges    | 10    | 1    | 1       | 5        | 2    | 3      | 26     | 12     | 60
Ave # Edges    | 19.8  | 26.0 | 62.1    | 72.81    | 32.3 | 32.1   | 96.53  | 65.93  | 2457.34
# Graphs       | 188   | 344  | 600     | 1113     | 4110 | 4127   | 1000   | 1500   | 5000
# Graph Labels | 2     | 2    | 6       | 2        | 2    | 2      | 2      | 3      | 3
# Node Labels  | 7     | 19   | 3       | 3        | 37   | 38     | —      | —      | —
Terms:
WL test:
▶ Technique to improve a kernel using node labels
RGE(ASG)-NodeLab:
▶ Data-dependent random graphs + incorporated label information
WL-RGE:
▶ Data-dependent random graphs + WL test
34 / 35
  • 35. References I
[1] Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Matching node embeddings for graph similarity. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
35 / 35