Using Local Spectral Methods to Robustify Graph-Based Learning

David F. Gleich, Purdue University
Joint work with Michael Mahoney @ Berkeley

Supported by NSF CAREER CCF-1149756
Code: www.cs.purdue.edu/homes/dgleich/codes/robust-diffusions
KDD 2015
The graph-based data analysis pipeline
[Illustration: a small 0/1 data matrix feeding the pipeline.]
Raw data
•  Relationships
•  Images
•  Text records
•  Etc.

Convert to a graph
•  Nearest neighbors
•  Kernels
•  2-mode to 1-mode
•  Etc.

Algorithm/Learning
•  Important nodes
•  Infer features
•  Clustering
•  Etc.
"Noise" in the initial data modeling decisions

Explicit graphs are those that are given to a data analyst.
"A social network"
•  Known spam accounts included?
•  Users not logged in for a year?
•  Etc.
A type of noise.

Constructed graphs are built based on some other primary data.
"Nearest neighbor graphs"
•  K-NN or ε-NN
•  Thresholding correlations to zero
Often made for computational convenience! (Graph too big.)
A different type of noise!

Labeled graphs occur in information diffusion/propagation.
"Function prediction"
•  Labeled nodes
•  Labeled edges
•  Some are wrong
A direct type of noise!

Do these decisions matter? Our experience: Yes! Dramatically so!
The graph-based data analysis pipeline
[The same pipeline as above: a raw data matrix, converted to a graph, fed to an algorithm/learning step.]
Most algorithmic and statistical research happens in the algorithm/learning step.
The potential downstream signal is determined by the graph-construction step.
Our goal: towards an integrative analysis
•  How does the graph creation process affect the outcomes of graph-based learning?
•  Is there anything we can do to make this process more robust?
Graph-based learning is usually only
one component of a big pipeline
[Illustration: many data matrices, one per database, feeding into the pipeline.]
Many databases over genes with survival rates for various cancers
→ List of possible genes responsible for survival
→ Cluster analysis
→ Reintegration of data

THIS STEP SHOULD BE ROBUST TO VARIATIONS ABOVE!
Scalable graph analytics

Local methods are one of the most successful classes of scalable graph analytics.
They don't even look at the entire graph.
•  Andersen-Chung-Lang (ACL) push method

Conjecture: Local methods regularize some variant of the original algorithm or problem.
Justification: For ACL and a few relatives this is exact!
Impact? Improved robustness to noise?
For instance, to answer "what function" is shared by the starting node, we'd only look at the circled region of the graph.
Our contributions

We study these issues in the case of semi-supervised learning (SSL) on graphs.
1.  We illustrate a common mincut framework for a variety of SSL methods.
2.  Show how to "localize" one (and make it scalable!).
3.  Provide a more robust SSL labeling method.
4.  Identify a weakness in SSL methods: they cannot use extra edges! We find one useful way to do so.
Semi-supervised graph-based learning
Given a graph and a few labeled nodes, predict the labels on the rest of the graph.

Algorithm
1.  Run a diffusion for each label (possibly with negative information from the other classes).
2.  Assign new labels based on the value of each diffusion (a minimal sketch of this loop follows).
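As a minimal sketch of this two-step loop (not the paper's code), with the diffusion itself left abstract; run_diffusion is a hypothetical placeholder for any of the diffusions discussed on the following slides:

```python
# Minimal sketch of the two-step SSL loop. `run_diffusion(A, seeds)` is a
# hypothetical placeholder returning one score per node for a given seed set.
import numpy as np

def ssl_predict(A, labels, run_diffusion):
    """A: (n x n) adjacency matrix; labels: dict node -> class id (the few known labels)."""
    classes = sorted(set(labels.values()))
    n = A.shape[0]
    scores = np.zeros((n, len(classes)))
    for j, c in enumerate(classes):
        seeds = [v for v, lab in labels.items() if lab == c]
        # Step 1: one diffusion per class, seeded on that class's labeled nodes.
        scores[:, j] = run_diffusion(A, seeds)
    # Step 2: assign each node the class whose diffusion value is largest.
    pred = np.array(classes)[np.argmax(scores, axis=1)]
    for v, lab in labels.items():       # keep the given labels fixed
        pred[v] = lab
    return pred
```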
The diffusions proposed for semi-supervised learning are s,t-cut minorants
[Figure: an example graph on nodes 1-10 with added terminals s and t.]
In the unweighted case, solve via max-flow.
In the weighted case, solve via network simplex or an industrial LP.

MINCUT LP:
  minimize  \sum_{ij \in E} C_{i,j} |x_i - x_j|
  subject to  x_s = 1, x_t = 0.

Spectral minorant (a linear system):
  minimize  \sqrt{ \sum_{ij \in E} C_{i,j} |x_i - x_j|^2 }
  subject to  x_s = 1, x_t = 0.
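For intuition, the square root does not change the minimizer, so the spectral minorant is just a graph-Laplacian linear system with the values at s and t held fixed. A minimal dense-matrix sketch, assuming C is a symmetric weight matrix that already includes the s/t attachment edges (names and the dense solver are mine, not the paper's):

```python
# Sketch: the spectral minorant as a Laplacian linear system.
# C: symmetric (n x n) numpy array of edge weights, s/t attachments included.
# fixed: dict node index -> boundary value (e.g. {s: 1.0, t: 0.0}).
import numpy as np

def spectral_minorant(C, fixed):
    n = C.shape[0]
    L = np.diag(C.sum(axis=1)) - C                         # weighted graph Laplacian
    F = np.array(sorted(fixed))                            # boundary (fixed) nodes
    U = np.array([i for i in range(n) if i not in fixed])  # free nodes
    x = np.zeros(n)
    x[F] = [fixed[i] for i in F]
    # Minimizing sum_{ij} C_ij (x_i - x_j)^2 with x fixed on F gives the
    # normal equations  L[U,U] x_U = -L[U,F] x_F.
    x[U] = np.linalg.solve(L[np.ix_(U, U)], -L[np.ix_(U, F)] @ x[F])
    return x
```

With 0/1 boundary values this is the harmonic-function solution on the augmented graph, which is why the choice of s/t attachment weights is what distinguishes the constructions on the next slide.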
Representative cut problems
[Figure: two cut constructions on the example graph. In ZGL, s and t are attached to the labeled nodes by infinite-weight edges; in Zhou et al., s and t are attached to the nodes by degree-scaled α-weighted edges. Positive labels, negative labels, and unlabeled nodes are marked.]

Andersen-Lang has a weighting variation too. Joachims has a variation too.

Zhou et al., NIPS 2003; Zhu et al., ICML 2003; Andersen-Lang, SODA 2008; Joachims, ICML 2003.
These help our intuition about the solutions
All spectral minorants are linear systems.
Implicit regularization views
on the Zhou et al. diffusion
[Figure: the Zhou et al. cut construction, with s and t attached to the nodes by degree-scaled α-weighted edges.]
RESULT: The spectral minorant of Zhou is equivalent to the weakly-local MOV solution.
PROOF: The two linear systems are the same (after working out a few equivalences).
IMPORTANCE: We'd expect Zhou to be "more robust."
Spectral minorant, for reference:
  minimize  \sqrt{ \sum_{ij \in E} C_{i,j} |x_i - x_j|^2 }
  subject to  x_s = 1, x_t = 0.
The Mahoney-Orecchia-Vishnoi (MOV) vector is a localized variation on the Fiedler vector used to find a small-conductance set near a seed set.
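For concreteness, the Zhou et al. (NIPS 2003) diffusion is commonly written as the linear system f = (1 - α)(I - αS)^{-1} y with S = D^{-1/2} A D^{-1/2}. A small dense sketch, meant only to show which system is being discussed, not the MOV equivalence itself:

```python
# Sketch: the Zhou et al. (NIPS 2003) diffusion as a dense linear solve,
# f = (1 - alpha) * (I - alpha * S)^{-1} y,  with  S = D^{-1/2} A D^{-1/2}.
import numpy as np

def zhou_diffusion(A, seeds, alpha=0.99):
    n = A.shape[0]
    d = A.sum(axis=1)
    dinv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))     # guard against isolated nodes
    S = (dinv_sqrt[:, None] * A) * dinv_sqrt[None, :]   # symmetric normalization
    y = np.zeros(n)
    y[list(seeds)] = 1.0                                # indicator of the labeled seeds
    return np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * y)
```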
A scalable, localized algorithm for Zhou et al.'s diffusion
RESULT: We can use a variation on coordinate descent methods related to the Andersen-Chung-Lang PUSH procedure to solve Zhou's diffusion in a scalable manner.
PROOF: See Gleich-Mahoney, ICML '14.
IMPORTANCE (1): We should be able to make Zhou et al. scale.
IMPORTANCE (2): Using this algorithm adds another implicit regularization term that should further improve robustness!
Spectral minorant:
  minimize  \sqrt{ \sum_{ij \in E} C_{i,j} |x_i - x_j|^2 }
  subject to  x_s = 1, x_t = 0.

Implicitly regularized variant:
  minimize  \sum_{ij \in E} C_{i,j} |x_i - x_j|^2 + \tau \sum_{i \in V} d_i x_i
  subject to  x_s = 1, x_t = 0, x_i \ge 0.
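The paper's solver is a coordinate-descent relative of the Andersen-Chung-Lang push procedure; the exact update rules are in Gleich-Mahoney, ICML '14. The sketch below is a generic push loop for a PageRank-style diffusion, not the paper's iteration; it shows why the method is local (only nodes with a large residual are ever touched) and why the residual threshold behaves like the extra regularization term above.

```python
# Rough sketch of an ACL-style "push" loop for a PageRank-like diffusion.
# NOT the exact iteration from the paper; illustrative only.
from collections import deque

def push_diffusion(adj, seeds, alpha=0.85, eps=1e-6):
    """adj: dict node -> list of neighbors; seeds: labeled nodes for one class."""
    seeds = list(seeds)
    x, r = {}, {}                              # solution and residual, both sparse
    for s in seeds:
        r[s] = r.get(s, 0.0) + 1.0 / len(seeds)
    queue = deque(r)
    while queue:
        u = queue.popleft()
        du = max(len(adj.get(u, [])), 1)
        if r.get(u, 0.0) < eps * du:           # residual too small: skip this node
            continue                           # (the threshold acts like regularization)
        ru = r.pop(u)
        x[u] = x.get(u, 0.0) + alpha * ru      # keep a fraction of the mass at u
        for v in adj.get(u, []):               # spread the rest to the neighbors
            r[v] = r.get(v, 0.0) + (1 - alpha) * ru / du
            if r[v] >= eps * max(len(adj.get(v, [])), 1):
                queue.append(v)
    return x                                   # only locally-touched nodes appear
```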
Semi-supervised graph-based learning
Given a graph and a few labeled nodes, predict the labels on the rest of the graph.

Algorithm
1.  Run a diffusion for each label (possibly with negative information from the other classes).
2.  Assign new labels based on the value of each diffusion.
Traditional rounding methods
for SSL are value-based
[Figure: Zhou's diffusion values for three classes on an example point set, with the resulting CLASS 1 / CLASS 2 / CLASS 3 regions.]

VALUE-BASED: use the largest value of the diffusion to pick the label.
But value-based rounding doesn't work for all diffusions
[Figure panels: (b) Zhou et al., l = 3; (c) Andersen-Lang, l = 3; (d) Joachims, l = 3; (e) ZGL, l = 3; (f) Zhou et al., l = 15; (g) Andersen-Lang, l = 15; (h) Joachims, l = 15; (i) ZGL, l = 15.]

VALUE-BASED rounding fails for most of these diffusions, BUT there is still a signal there!
Adding more labels doesn't help either; see the paper for those details.
Rank-based rounding is far
more robust. 
[Figure: the three-class example, rounded by rank.]

NEW IDEA: Look at the RANK of the item in each diffusion instead of its VALUE.

JUSTIFICATION: Based on the idea of sweep-cut rounding in spectral methods (use the order induced by the eigenvector, not its values).

IMPACT: Much more robust rounding to labels.
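A minimal sketch of the two rounding rules on a matrix of per-class diffusion scores (the paper's exact tie-breaking may differ):

```python
# Value-based vs. rank-based rounding of per-class diffusion scores.
# scores: (n_nodes x n_classes) array; column j holds class j's diffusion values.
import numpy as np

def value_based(scores):
    # Assign each node the class with the largest raw diffusion value.
    return np.argmax(scores, axis=1)

def rank_based(scores):
    # Within each class, replace values by ranks (largest value -> rank 0),
    # in the spirit of sweep-cut rounding: only the order matters, not the magnitude.
    n, k = scores.shape
    ranks = np.empty((n, k))
    for j in range(k):
        order = np.argsort(-scores[:, j])      # best node in class j comes first
        ranks[order, j] = np.arange(n)
    # Pick, for each node, the class in which it is ranked best.
    return np.argmin(ranks, axis=1)
```

A node with a small absolute diffusion value can still be the top-ranked node for its class, which is exactly the signal that value-based rounding throws away.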
Rank-based rounding has a big impact on a real study.
[Two plots of error rate vs. average training samples per class, comparing Zhou and Zhou+Push: the left uses VALUE-BASED rounding, the right uses RANK-BASED rounding.]

We used the digit prediction task from Zhou's paper and added just a bit of noise as label errors and switched parameters.
Main empirical results

1.  Zhou's diffusion seems to work best for sparse graphs, whereas the ZGL diffusion works best for dense graphs.
2.  On the digits dataset, dense graph constructions yield higher error rates.
3.  Densifying a super-sparse graph construction on the digits dataset yields lower error.
4.  A similar fact holds on an Amazon co-purchasing network.
An illustrative synthetic problem shows the differences.

Two-class block model, 150 nodes each; between-block probability = 0.02; within-block probability = 0.35 (dense) or 0.06 (sparse).
Reveal labels for k nodes (varied), with different error rates: 0%/10% (low/high) for the sparse graph and 20%/60% (low/high) for the dense graph. A sketch of this construction follows below.

[Four plots of number of mistakes vs. number of labels, comparing Joachims, Zhou, and ZGL: sparse graph with low error; sparse graph with high error (the "real-world scenario"); dense graph with low error rate; dense graph with high error rate.]
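A sketch of this synthetic construction with the stated probabilities. The label-noise routine assumes the quoted error rates are label errors on the revealed nodes, which is an assumption about the setup, and the helper names are mine:

```python
# Two-class stochastic block model matching the parameters stated above.
import numpy as np

def two_block_model(n_per_class=150, p_within=0.06, p_between=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    labels = np.repeat([0, 1], n_per_class)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_within, p_between)
    A = (rng.random((n, n)) < probs).astype(float)
    A = np.triu(A, 1)                 # keep one copy of each edge, no self-loops
    return A + A.T, labels            # symmetric adjacency matrix

def reveal_labels(labels, k, error_rate, rng):
    # Reveal k node labels and flip each with probability error_rate (assumed label noise).
    revealed = rng.choice(len(labels), size=k, replace=False)
    return {int(v): int(1 - labels[v]) if rng.random() < error_rate else int(labels[v])
            for v in revealed}
```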
Varying density in an SSL
construction.
A_{i,j} = \exp( -\| d_i - d_j \|_2^2 / (2 \sigma^2) )

[Figure: the weighted nearest-neighbor graphs on the digit data d_i for kernel widths σ = 2.5 and σ = 1.25.]

We use the digits experiment from Zhou et al. 2003: 10 digits and a few label errors.
We vary density either by the number of nearest neighbors or by the kernel width.
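A sketch of this construction: Gaussian kernel weights kept only between nearest neighbors, so density can be varied through either k or σ. The function name and the max-symmetrization are my choices, not taken from the paper:

```python
# Weighted k-NN graph with Gaussian kernel weights A_ij = exp(-||d_i - d_j||^2 / (2 sigma^2)).
import numpy as np

def knn_kernel_graph(X, k=10, sigma=1.25):
    """X: (n_points x n_features) data, e.g. vectorized digit images."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                                  # no self-loops
    A = np.zeros_like(W)
    for i in range(n):
        nbrs = np.argsort(-W[i])[:k]                          # the k most similar points
        A[i, nbrs] = W[i, nbrs]
    return np.maximum(A, A.T)                                 # keep an edge if either side chose it
```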
As density increases, the
results just get worse
[Two plots of error rate, comparing Zhou and Zhou+Push: "Varying kernel width" (σ from 0.8 to 2.5) and "Varying nearest neighbors" (5 to 250).]

•  Adding "more" edges seems to only hurt (unless there is no signal).
•  Zhou+Push seems to be slightly more robust (maybe).
Some observations and a
question.
Adding “more data” yields “worse results” for
this procedure (in a simple setting).

Suppose I have a real-world system that can
work with up to E edges on some graph. 
Is there a way I can create new edges?
Densifying the graph with
path expansions
A_k = \sum_{\ell=1}^{k} A^\ell

If A is the adjacency matrix, then this counts the total weight on all paths up to length k.
We now repeat the nearest neighbor computation, but with paired parameters such that we have the same average degree.

  Avg. Deg   Zhou, k = 1   Zhou, k > 1   Zhou w. Push, k = 1   Zhou w. Push, k > 1
  19         0.163         0.114         0.156                 0.117
  41         0.156         0.132         0.158                 0.113
  53         0.183         0.142         0.179                 0.136
  104        0.193         0.145         0.178                 0.144
  138        0.216         0.102         0.204                 0.101

(k = 4, nn = 3)
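A sketch of the path expansion A_k = Σ_{ℓ=1..k} A^ℓ with sparse matrices. Dropping the diagonal (closed walks back to the start) is my assumption; the paper may weight or truncate the sum differently:

```python
# Path-expansion densification: A_k = A + A^2 + ... + A^k on a sparse adjacency matrix.
import scipy.sparse as sp

def path_expansion(A, k):
    A = sp.csr_matrix(A)
    power, Ak = A.copy(), A.copy()
    for _ in range(2, k + 1):
        power = power @ A              # next power: walks one step longer
        Ak = Ak + power                # accumulate total walk weight up to this length
    Ak = Ak.tolil()
    Ak.setdiag(0)                      # drop closed walks (assumed; see note above)
    return Ak.tocsr()
```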
The same result holds for Amazon’s
co-purchasing network
         mean F1                  Confidence intervals
  k      Zhou    Zhou w. Push     Zhou            Zhou w. Push
  1      0.173   0.229            [0.15, 0.19]    [0.21, 0.25]
  2      0.197   0.231            [0.18, 0.22]    [0.21, 0.25]
  3      0.221   0.238            [0.17, 0.27]    [0.19, 0.28]
Amazon's co-purchasing network (from SNAP) is effectively a highly sparse nearest-neighbor network from their (denser) co-purchasing graph.

We attempt to predict the items in a product category
based on a small sample and study the F1 score for the
predictions. 
Some small details missing – see the full paper.
A_k = \sum_{\ell=1}^{k} A^\ell

[Figure 2, panels (a) K2 sparse, (b) K2 dense, (c) RK2; the caption (truncated in this transcript) describes artificially densifying the graph to A_k, comparing sparse and dense diffusions and regularization, with colors indicating the diffusions from the circled nodes and the unavoidable errors attributed to a mislabeling.]
Towards some theory, i.e. why are densified sparse graphs better?
How do sparsity, density, and regularization of a diffusion play into the results in a controlled setting?

[Figure annotations mark the given labels and "THE ERROR".]
Towards some theory, i.e. why are densified sparse graphs better? (continued)

[Figure 2 again, annotated with: the given labels, "THE ERROR", "Dense" diffusions on \sum_{\ell=1}^{5} A^\ell, "Regularization", and "Using Push algorithm".]
Recap, discussion, future work

Contributions
1.  Flow setup for SSL diffusions
2.  New robust rounding rule for class selection
3.  Localized Zhou's diffusion
4.  Empirical insights on density of graph constructions

Observations
•  Many of these insights translate to directed, weighted graphs with fuzzy labels and/or some parallel architectures.
•  Weakness: mainly empirical results on the density.
•  We need a theoretical basis for the densification theory!
Supported by NSF, ARO, DARPA
 CODE www.cs.purdue.edu/homes/dgleich/codes/robust-diffusions
