GRAPH PROCESSING
Why Graph Processing?
Graphs are everywhere!
Why Distributed Graph Processing?
They are getting bigger!
Road Scale: >24 million vertices, >58 million edges (Route Planning in Road Networks, 2008)
Social Scale: >1 billion vertices, ~1 trillion edges (Facebook Engineering Blog); ~41 million vertices, >1.4 billion edges (Twitter graph, 2010)
Web Scale: >50 billion vertices, >1 trillion edges (NSA Big Graph Experiment, 2013)
Brain Scale: >100 billion vertices, >100 trillion edges (NSA Big Graph Experiment, 2013)
CHALLENGES IN PARALLEL GRAPH PROCESSING
Lumsdaine, Andrew, et al. "Challenges in Parallel Graph Processing." Parallel Processing Letters 17.1 (2007).
Challenges
1. Structure-driven computation → data transfer issues
2. Irregular structure → partitioning issues
*Concept borrowed from Cristina Abad's PhD defense slides
Overcoming the Challenges
1. Extend existing paradigms
2. Build new frameworks!
Build New Graph Frameworks!
Key Requirements from Graph Processing Frameworks
1. Less pre-processing
2. Low and load-balanced computation
3. Low and load-balanced communication
4. Low memory footprint
5. Scalable with respect to cluster size and graph size
PREGEL
Malewicz, Grzegorz, et al. "Pregel: A System for Large-Scale Graph Processing." ACM SIGMOD (2010).
Life of a Vertex Program
[Timeline figure: placement of vertices, then iterations 0, 1, … separated by barriers; each iteration has a computation phase and a communication phase. Concept borrowed from LFGraph slides.]
[Figure: sample graph with vertices A-E; B and C point to A, and A points to D and E. Graph borrowed from the LFGraph paper.]
Shortest Path Example (source: B)
Iteration 0: B=0, A=∞, C=∞, D=∞, E=∞ (B, the source, sets its value to 0 and advertises it)
Iteration 1: B=0, A=1, C=∞, D=∞, E=∞ (A adopts the minimum received value and advertises it)
Iteration 2: B=0, A=1, C=∞, D=2, E=2
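To make the vertex-centric model concrete, here is a minimal runnable Python sketch of the shortest-path vertex program traced above. It is a sketch, not Pregel's actual API: the Vertex class, compute(), and the synchronous driver loop are illustrative stand-ins, and edge weights are fixed at 1 as in the example.

    import sys

    class Vertex:
        def __init__(self, vid, out_edges):
            self.id = vid
            self.out_edges = out_edges   # unit edge weights, as in the example
            self.value = sys.maxsize     # stands in for infinity
            self.outbox = []             # messages delivered next superstep

        def compute(self, messages, is_source_step):
            # Superstep 0: the source sets itself to 0. Later supersteps:
            # adopt the smallest distance received, if it improves the value.
            candidate = 0 if is_source_step else min(messages, default=sys.maxsize)
            if candidate < self.value:
                self.value = candidate
                # Value changed: advertise value + edge weight (1) on out-edges.
                self.outbox = [(dst, self.value + 1) for dst in self.out_edges]
            else:
                self.outbox = []         # nothing to send this round

    def sssp(graph, source):
        vertices = {v: Vertex(v, outs) for v, outs in graph.items()}
        inboxes = {v: [] for v in graph}
        for step in range(len(graph)):   # at most |V| supersteps for unit weights
            for v in vertices.values():
                v.compute(inboxes[v.id], v.id == source and step == 0)
            inboxes = {v: [] for v in graph}
            for v in vertices.values():
                for dst, msg in v.outbox:
                    inboxes[dst].append(msg)
        return {v.id: v.value for v in vertices.values()}

    # The sample graph: B and C point to A; A points to D and E. Source: B.
    print(sssp({"A": ["D", "E"], "B": ["A"], "C": ["A"], "D": [], "E": []}, "B"))
    # D and E settle at distance 2, matching iteration 2 above; C stays at "infinity".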
Can we do better?
GOAL            PREGEL
Computation     1 pass
Communication   ∝ #edge cuts
Pre-processing  Cheap (hash)
Memory          High (out edges + buffered messages)
LFGRAPH – YES, WE CAN!
Hoque, Imranul, and Indranil Gupta. "LFGraph: Simple and Fast Distributed Graph Analytics." TRIOS (2013).
Features
Cheap vertex placement: hash-based
Low graph initialization time
Features
Publish-subscribe, fetch-once information flow
Low communication overhead
Subscribe: Server 2 subscribes to vertex A, whose value its local vertices D and E read.
Publish: publish list of Server 1: (Server 2, A)
LFGraph model: after each iteration, Server 1 ships the value of A to Server 2 exactly once.
Features
Stores only in-neighbor vertices
Reduces memory footprint
In-neighbor storage
Local in-neighbor: simply read the value
Remote in-neighbor: read the locally available value shipped after the previous iteration
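A minimal sketch of how the publish-subscribe lists above might be built, assuming two servers and hash placement. The names server_of and build_publish_lists are illustrative, not LFGraph's actual code.

    NUM_SERVERS = 2

    def server_of(v):
        # Cheap hash-based vertex placement.
        return hash(v) % NUM_SERVERS

    def build_publish_lists(in_neighbors):
        # in_neighbors maps each vertex to its in-neighbor list, the only
        # adjacency LFGraph stores. During the subscribe phase, each server
        # asks the owners of its remote in-neighbors for their values; the
        # owner records (subscriber server, vertex), so the value is shipped
        # once per server per iteration, not once per reading vertex.
        publish = {s: set() for s in range(NUM_SERVERS)}
        for v, preds in in_neighbors.items():
            for u in preds:
                if server_of(u) != server_of(v):
                    publish[server_of(u)].add((server_of(v), u))
        return publish

    # Sample graph: A has in-neighbors B and C; D and E have in-neighbor A.
    # Output depends on hash placement; when D and E land on the same remote
    # server, (that_server, 'A') appears only once in the owner's publish list.
    print(build_publish_lists({"A": ["B", "C"], "D": ["A"], "E": ["A"]}))

After each iteration, the shipped values land in a local duplicate store on the subscribing server, so readers only ever do local reads.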
Iteration 0: B=0, A=∞, C=∞, D=∞, E=∞
Iteration 1: B=0, C=∞, D=∞, E=∞; A's value changes to 1 in the duplicate store, and the value of A is then shipped to its subscriber
Iteration 2: B=0, A=1, C=∞; local reads of A on Server 2 set D=2 and E=2
Features
Single-pass computation
Low computation overhead
Life of a Vertex Program (LFGraph)
[Timeline figure: placement of vertices, then iterations separated by barriers; within each iteration the computation phase and the communication phase are decoupled rather than overlapped. Concept borrowed from LFGraph slides.]
How Everything Works
[Architecture figure: front-end servers hold the vertex program and configuration and send them to all servers; graph loaders load the graph in parallel into the storage engine; job servers spawn computation workers, synchronize at a barrier server, run the communication phase, and repeat.]
GRAPHLAB
Low, Yucheng, et al. "GraphLab: A New Framework for Parallel Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
GraphLab model: [figure: Server 1 holds A, B, C plus ghosts of D and E; Server 2 holds D, E plus a ghost of A. Ghost vertices eliminate communication within an iteration; changed values are shipped to ghosts between iterations.]
Can we do better?
GOAL            GRAPHLAB
Computation     2 passes
Communication   ∝ #vertex ghosts
Pre-processing  Cheap (hash)
Memory          High (in & out edges + ghost values)
POWERGRAPH
Gonzalez, Joseph E., et al. "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." USENIX OSDI (2012).
PowerGraph model: [figure: vertex A is split into mirrors A1 on Server 1 and A2 on Server 2; a master mirror accumulates values and ships the final value to all mirrors.]
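Following the speaker notes, the master mirror accumulates the partial values computed at each mirror and broadcasts the result back. A toy sketch; the function and its signature are illustrative, not PowerGraph's API:

    def sync_mirrors(partial_values, combine):
        # Each server computes a partial value for its mirror of a vertex;
        # the master combines them and ships the final value to every mirror.
        total = partial_values[0]
        for p in partial_values[1:]:
            total = combine(total, p)
        return [total] * len(partial_values)

    # e.g. a PageRank-style sum over the two mirrors A1 and A2:
    print(sync_mirrors([0.3, 0.2], combine=lambda a, b: a + b))  # [0.5, 0.5]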
Can we do better?
GOAL            POWERGRAPH
Computation     2 passes
Communication   ∝ #vertex mirrors
Pre-processing  Expensive (intelligent)
Memory          High (in & out edges + mirror values)
Communication Analysis
Pregel: external messages on edge cuts
GraphLab: ghost vertices for in- and out-neighbors
PowerGraph: mirrors for in- and out-neighbors
LFGraph: external in-neighbors only
Computation Balance Analysis
• Power-law graphs have substantial load imbalance under hash partitioning.
• A power-law graph has degree d with probability proportional to d^(-α).
• Lower α means a denser graph with more high-degree vertices (see the sketch below).
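A quick, self-contained way to see this claim: sample an ideal power-law degree sequence and hash-partition the vertices across servers, then compare the maximum load against the mean. The sampler and the parameter choices below are illustrative, not the paper's methodology.

    import random

    def powerlaw_degree(alpha, dmax=10**6):
        # Inverse-transform sampling of P(d) proportional to d^(-alpha), d >= 1
        # (continuous Pareto approximation, capped at dmax).
        u = random.random()
        return min(dmax, int((1 - u) ** (-1.0 / (alpha - 1))))

    def imbalance(alpha, n_vertices=100_000, n_servers=32):
        load = [0] * n_servers
        for v in range(n_vertices):
            load[v % n_servers] += powerlaw_degree(alpha)   # hash-style placement
        return max(load) / (sum(load) / n_servers)          # max load vs. mean load

    # Lower alpha: heavier tail, more very-high-degree vertices, worse balance.
    for alpha in (1.8, 2.0, 2.2):
        print(alpha, round(imbalance(alpha), 2))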
Computation Balance Analysis: [figures: per-worker load for ideal power-law graphs shows high imbalance; for real-world graphs it is low]
Real World vs. Power Law: [figure: log-log degree plots; ideal power-law graphs have a straight tail, while real graphs show a funnel at the tail, so the many highest-degree vertices spread out across servers]
Communication Balance Analysis: [figure: vertices fetched per worker during communication phases show very little imbalance]
PageRank, runtime without partitioning time: [figure: LFGraph is about 2x faster than the best PowerGraph variant]
PageRank, runtime with partitioning time: [figure: 4x-100x faster on 8 servers, growing to 5x-380x on 32 servers]
PageRank, memory footprint: [figure: LFGraph uses 8x-12x less memory than PowerGraph]
PageRank, network communication: [figure: LFGraph transfers about 4x less data per server than PowerGraph]
Scalability: [figure: SSSP on synthetic graphs of 100M-1B vertices and up to 128B edges]
X-Stream: Edge-Centric Graph Processing using Streaming Partitions
* Some figures adapted from the authors' presentation
Motivation
• Can sequential access be used instead of random access?
• Can large-graph processing be done on a single machine?
Sequential Access: Key to Performance!
Speedup of sequential over random access in different media (all numbers MB/s):
Medium         | Read: Random | Read: Sequential | Speedup | Write: Random | Write: Sequential | Speedup
RAM (1 core)   | 567          | 2605             | 4.6     | 1057          | 2248              | 2.2
RAM (16 cores) | 14198        | 25658            | 1.9     | 10044         | 13384             | 1.4
SSD            | 22.5         | 667.69           | 29.7    | 48.6          | 576.5             | 11.9
Magnetic disk  | 0.6          | 328              | 546.7   | 2             | 316.3             | 158.2
Test bed: 64 GB RAM + 200 GB SSD + 3 TB drive
How to Use Sequential Access?
[Figure: sequential access via streaming]
Edge-Centric Processing
Vertex-Centric Scatter
    for each vertex v:
        if v's state has updated:
            for each output edge e of v:
                scatter update on e
Vertex-Centric Gather
    for each vertex v:
        for each input edge e of v:
            if e has an update:
                apply the update to v's state
BFS example: 8 vertices (1-8), edge list (SOURCE → DEST):
1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6
[Figure: vertex-centric vs. edge-centric access over this edge list: vertex-centric needs the edges indexed by source, edge-centric streams them in any order]
Edge-Centric Scatter
    for each edge e:
        if e.source has updated:
            scatter update on e
[Figure: scatter streams the edge list and appends an update u1 … un for each edge whose source was updated]
Edge-Centric Gather
    for each update u on edge e:
        apply update u to e.destination
[Figure: gather streams the update list u1, u2, … un and applies each update to its destination vertex]
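Putting edge-centric scatter and gather together, here is a minimal runnable sketch of BFS over the slide's edge list. In-memory Python lists stand in for the streamed edge and update files.

    def edge_centric_bfs(edges, n_vertices, source):
        level = {v: None for v in range(1, n_vertices + 1)}
        level[source] = 0
        frontier = {source}
        step = 0
        while frontier:
            # Scatter: stream ALL edges sequentially and emit an update for
            # each edge whose source changed last round (the rest are the
            # "wasted reads" discussed below).
            updates = [dst for src, dst in edges if src in frontier]
            # Gather: apply the update stream to destination vertices.
            step += 1
            frontier = set()
            for dst in updates:
                if level[dst] is None:
                    level[dst] = step
                    frontier.add(dst)
        return level

    edges = [(1, 3), (1, 5), (2, 7), (2, 4), (3, 2), (3, 8), (4, 3), (4, 7),
             (4, 8), (5, 6), (6, 1), (8, 5), (8, 6)]
    print(edge_centric_bfs(edges, 8, source=1))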
Sequential Access via Edge-Centric!
Vertices live in fast storage; edges and updates stream from slow storage.
[Figure: edge-centric BFS on the 8-vertex example; every iteration streams the full edge list]
Edge-centric BFS: lots of wasted reads! Every edge is streamed even when only a few sources have updates.
Most real-world graphs have a small diameter, so few iterations are needed.
A large diameter makes X-Stream slow and wasteful.
The edge list in its original order:
1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6
is equivalent to the same edges in any shuffled order:
1→3, 8→6, 5→6, 2→4, 3→2, 4→7, 4→3, 3→8, 4→8, 2→7, 6→1, 8→5, 1→5
Order is not important.
No pre-processing (sorting and indexing) needed!
But, still …
• Random access for vertices
• Vertices may not fit into fast storage
Streaming Partitions
Each streaming partition n holds:
• Vn = a subset of vertices; the subsets are mutually disjoint and constant
• En = outgoing edges of Vn; constant
• Un = incoming updates to Vn; changes in each scatter phase
Scatter and Shuffle
For a partition (V1, E1, U1): the vertex set v1, v2, v3, … sits in fast memory while the edges e1, e2, e3, … stream through an input buffer. For each edge, read the source vertex; if it has updated, append an update u'1, u'2, u'3, … to an output buffer. The shuffle then routes the output buffer through a stream buffer with k chunks, one per destination partition, and appends each chunk to that partition's updates.
Gather
For a partition (V1, U1): with the vertex set v1, v2, v3, … in fast memory, stream the updates u1, u2, u3, … through an update buffer and apply each one to its destination vertex. No output!
Parallelism
• State is stored in vertices
• Partitions have disjoint vertex sets
→ Partitions can be computed in parallel: parallel scatter and gather (see the sketch below)
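A compact sketch of one scatter-shuffle-gather round over streaming partitions, again using BFS-style updates. The partition count k and all names are illustrative, and in-memory lists stand in for the streamed files.

    from collections import defaultdict

    def make_partitions(vertices, edges, k):
        # V_i are fixed and disjoint; E_i holds the out-edges of V_i.
        part = {v: v % k for v in vertices}
        E = defaultdict(list)
        for src, dst in edges:
            E[part[src]].append((src, dst))
        return part, E

    def one_round(state, updated, part, E, k):
        # Scatter + shuffle: stream each E_i sequentially; route every update
        # to the destination's partition (the stream buffer's k chunks).
        U = defaultdict(list)
        for i in range(k):
            for src, dst in E[i]:
                if src in updated:                  # scatter only from updated sources
                    U[part[dst]].append((dst, state[src] + 1))
        # Gather: stream each U_i and apply updates; no output is produced.
        new_updated = set()
        for i in range(k):
            for dst, val in U[i]:
                if state[dst] is None or val < state[dst]:
                    state[dst] = val
                    new_updated.add(dst)
        return new_updated

    # Example: vertices 0..7, k=2; state holds BFS levels (None = unreached).
    vertices = list(range(8))
    edges = [(0, 2), (0, 4), (2, 1), (1, 7), (4, 5), (5, 6), (6, 0), (7, 3)]
    part, E = make_partitions(vertices, edges, k=2)
    state = {v: None for v in vertices}
    state[0] = 0
    updated = {0}
    while updated:
        updated = one_round(state, updated, part, E, 2)
    print(state)

Because the partitions' vertex sets are disjoint, the per-partition loops above could run in parallel threads, which is the parallelism the slide describes.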
Experimental Results
X-Stream speedup over GraphChi [bar chart, 0x-6x: Netflix/ALS, Twitter/PageRank, Twitter/Belief Propagation, RMAT27/WCC]:
• Mean speedup = 2.3 without counting GraphChi's pre-processing time
• Mean speedup = 3.7 counting GraphChi's pre-processing time
X-Stream Runtime vs. GraphChi Sharding
[Figure: total X-Stream runtime compared with GraphChi's sharding time alone (0-3000 s); X-Stream returns answers before GraphChi even finishes sharding in 3 of 4 benchmarks]
Disk Transfer Rates (PageRank on the Twitter workload)
Metric        | X-Stream | GraphChi
Data moved    | 224 GB   | 322 GB
Time taken    | 398 s    | 2613 s
Transfer rate | 578 MB/s | 126 MB/s
The SSD sustains 667 MB/s reads and 576 MB/s writes.
Scalability on Input Data Size
[Figure: runtime (log scale, 1 s to 24 h) vs. graph size, in RAM, on SSD, and on disk]
• 8 million vertices, 128 million edges: 8 s
• 256 million vertices, 4 billion edges: 33 min
• 4 billion vertices, 64 billion edges: 26 h
Discussion
• Features like global values, aggregation functions, and asynchronous computation are missing from LFGraph. Would the overhead of adding these features slow it down?
• LFGraph assumes that all edge values are the same. If they are not, either the receiving vertices or the server must incorporate the edge value. What are the overheads?
• LFGraph uses one-pass computation, but it executes the vertex program at every vertex, active or inactive. What is the trade-off?
Discussion
• Independent computation and communication rounds may not always be preferred; use bandwidth when it is available.
• Fault tolerance is another feature missing from LFGraph. What would it cost?
• Only three benchmarks are used in the experiments. Is that enough evaluation?
• The scalability comparison with Pregel uses different experimental settings, and the memory comparison with PowerGraph is based on heap values from logs. Are these fair experiments?
Discussion
• Could the system become asynchronous?
• Could the scatter and gather phases be combined into one phase?
• X-Stream does not support iterating over the edges/updates of a vertex. Can this be added?
• How well do they determine the number of partitions?
• Can the shuffle be optimized by counting the updates destined for each partition during scatter?
Thank you for listening!
Questions?
Backup Slides
Reason for Improvement: LFGraph's computation and communication phases are independent, so each can grow on its own and communication can be batched; in PowerGraph the two phases overlap, and communication slows computation.
Qualitative Comparison
GOAL            | PREGEL                               | GRAPHLAB                             | POWERGRAPH                            | LFGRAPH
Computation     | 2 passes, combiners                  | 2 passes                             | 2 passes                              | 1 pass
Communication   | ∝ #edge cuts                         | ∝ #vertex ghosts                     | ∝ #vertex mirrors                     | ∝ #external in-neighbors
Pre-processing  | Cheap (hash)                         | Cheap (hash)                         | Expensive (intelligent)               | Cheap (hash)
Memory          | High (out edges + buffered messages) | High (in & out edges + ghost values) | High (in & out edges + mirror values) | Low (in edges + remote values)
Backup Slides
Read Bandwidth - SSD
[Figure: read MB/s (0-1000) over a 5-minute window, X-Stream vs. GraphChi]
Write Bandwidth - SSD
[Figure: write MB/s (0-800) over a 5-minute window, X-Stream vs. GraphChi]
Scalability on Thread Count
Scalability on Number of I/O Devices
Sharding-Computing Breakdown in GraphChi
[Figure: fraction of runtime (0-1) per benchmark, split between compute + I/O and re-sorting shards]
X-Stream Not Always Perfect
A large diameter makes X-Stream slow!
In-Memory X-Stream Performance
[Figure: BFS on 32M vertices / 256M edges; runtime (s) vs. thread count (1-16), lower is better; X-Stream vs. BFS-1 [HPC 2010] and BFS-2 [PACT 2011]]
Ligra vs. X-Stream
Discussion
• The current implementation is on a single machine; can it be extended to clusters?
  – Would it still perform well?
  – How would fault tolerance and synchronization be provided?
• The waste rate is high (~65%). Could this be improved?
• Can the partitioning be more intelligent? Dynamic partitioning?
• Could all vertex-centric programs be converted to edge-centric ones?
• When does streaming outperform random access?
Editor's Notes
  • #2: Hi, I am Mayank and the second presenter for today is Shadi. We will be talking about Graph Processing.
  • #3: Let’s start from the very basic question. Why do we need graph processing? The answer is simple and I am sure all of you already know. Graphs are everywhere! We have the web, indexed web used by search engines, social networks-Facebook, Twitter, road networks and biological networks like brain networks, spread of disease networks, graphs representing relationship between genomic sequences and so on.
  • #5: Another simple question is why do we need distributed graph processing? Because these graphs are getting bigger! Let me give you an idea of the scale that we are talking about!
  • #6: Starting from the scale of road networks: about 24 million vertices and 58 million edges in the US road network.
  • #7: The social scale is bigger, with about a billion vertices and nearly a trillion edges.
  • #8: The web scale is even bigger with an estimated 50 billion vertices
  • #9: However, the biggest scale is the brain network scale with over 100 billion vertices and 100 trillion edges. Moving on to the challenges that arise in parallel graph processing when dealing with such large scale graphs.
  • #10: Now, MapReduce is a very popular framework for batch processing of data, so a natural question came up: why not simply use MapReduce for parallel graph processing? The framework would handle distribution, fault tolerance, speculative execution, and so on, and everything would be perfect. But some unique properties of graphs make using MapReduce harder, or let's say expensive, and that paradigm is not a perfect match for graph processing. I will talk briefly about this paper, published in Parallel Processing Letters, highlighting both software and hardware challenges for parallel graph processing. Remember, I am giving a brief history of how graph processing has evolved over the years. I will discuss only the software challenges here; some of them also highlight why MapReduce is not a perfect fit for graph processing.
  • #11: The first challenge is that graph computations are data-driven, in the sense that they incorporate the graph structure into the computation. This gives rise to two issues: it is not clear how the graph should be partitioned across machines (remember, with MapReduce, partitioning files is easy), and since graph processing computes over the graph structure, these frameworks incur a high overhead in the resulting data transfer. If it is not clear how data transfer comes into the picture with graphs, it will become clear once I move on to how graphs are actually processed. The next challenge is that graph data is irregular. Real-world graphs like social graphs are power-law-like graphs (why I say power-law-like and not power-law will also become clear in this presentation). This means some vertices have high degrees while others have low degrees. This irregularity raises two concerns: it is not clear how partitioning should be performed, since some vertices require more processing than others, and the scalability of a MapReduce cluster takes a hit because good load balancing is hard to achieve. The final challenge is achieving locality. Since the data is irregular, it is difficult to predict the access patterns of vertices: which vertex will be processed when, which vertex will require more data to be fetched over the network, and how to take these issues into account in MapReduce.
  • #12: Now, what possible solutions did the community come up with? The proposed solutions can be classified into two broad categories: extend MapReduce and make changes to solve these challenges, or come up with novel paradigms and frameworks. This talk is mainly about the second solution, but let me give you just a brief example of how the community tried to shoehorn graph processing into MapReduce.
  • #13: Moving on to the second approach, i.e., coming up with paradigms and frameworks designed just for graph processing. Before I discuss graph processing frameworks: what are the key requirements from these frameworks?
  • #14: The first depends on the number and distribution of vertices and edges across servers. The second depends on the quantity and distribution of data exchanged among vertices across servers. The third is for quick initialization of the graph.
  • #15: This started with Google coming up with a vertex centric form of computation called Pregel. The idea is simple – you have to think like a vertex.
  • #16: Think like a vertex. If you visualize how graph processing works, there is some computation done at the vertex, which passes the computation results along the edges. I will explain this with an example. Let us first look at the life of a vertex-centric program. The entire computation is broken down into iterations/supersteps. In each iteration, there is a computation phase and a communication phase: during the computation phase the vertices perform computation, and during the communication phase they transfer the results of the computation to their neighbors. More concretely, in each superstep a vertex receives the messages sent to it in superstep S-1, performs some computation in a user-defined function compute(), and sends messages to its neighbors or to any other vertex with a known identifier (these messages are received in superstep S+1). But the core idea is that the programmer should just have to worry about what each vertex does, and the framework does the rest itself: it ensures that the vertex programs are executed in parallel and that messages are delivered to the right vertices. Fault tolerance is also handled by the framework, as are things like speculative execution of a vertex program. Computation and communication overlap.
  • #17: Let us take the example of single-source shortest paths to make this clear. This is a sample graph with five vertices; all edge weights are the same and equal to 1. Each vertex has an id and a value, where the value denotes the shortest-path distance to the source. In superstep 0, no messages are received: the value is initialized to 0 (if the vertex is the source) or MAX (otherwise) and advertised to neighbors. In the other supersteps, the vertex determines the minimum of all the messages received; if this minimum is smaller than the value the vertex currently holds, the value is updated and advertised to neighbors.
  • #18: The aim is to compute the distance from a single source – in this case B. Initially, all vertices have this distance as infinity.
  • #19: In iteration zero, each vertex checks if it is the source. If it is, it makes its value 0. Otherwise, the value remains infinity. If the value of a vertex changes, it forwards the value plus the edge value along all outgoing edges. In this case, B sends a message with value 1 towards A.
  • #20: In the next iteration, A simply takes the minimum of the values received in the messages (minimum because only the minimum value matters). If the minimum is smaller than the value the vertex already had, it adopts that smaller value; in this case, A makes its value 1. Since its value changed, it propagates the change to its neighbors. If you haven't already guessed: this is basically breadth-first search happening iteratively.
  • #21: This continues till no more values change.
  • #22: What was the problem with this simple yet powerful paradigm? The main problem is that vertices communicate with each other. Hence, A sends its values to both D and E even though they are on the same server. Pregel provides an abstraction called combiners where all messages being sent to a particular vertex from the same server are combined. For example, in the single source shortest paths case, the vertex does not need to receive all messages, it just needs to receive the message which has the smallest value. Combiners do this. However, they need an additional pass over messages.
  • #23: So came LFGraph with the aim of minimizing the three requirements memory, computation and communication.
  • #24: Next feature that I have already touched upon is the cheap hash based initial vertex placement. This reduces the time required for the initialization of the graph.
  • #25: Moving on to the four main features of LFGraph that help it achieve superior performance compared to the existing frameworks. The first is the fetch-once behavior achieved using the publish-subscribe lists: Server 2 needs to fetch the value of vertex A only once, unlike in Pregel, where each vertex receives the value in a message.
  • #26: The first step is partitioning. LFGraph chooses a cheap partitioning strategy, unlike PowerGraph and GraphLab, which perform expensive initial vertex partitioning; we will see the benefits of this cheap initial partitioning in the experiments section. The cheap partitioning used is simple hashing. Next, servers subscribe for vertices: Server 2 realizes that its vertices need the value of vertex A, so it communicates this to Server 1.
  • #27: Based on this communication, each server builds a publish list. For each server, values of which vertices need to be sent to that server. This is maintained for each server and at each server.
  • #28: Next, the vertex-centric computations are performed on each server, and at the end of the iteration, each server simply looks at the publish list and sends the values of the vertices that each server needs. I did not discuss one important aspect of computation: when computation is performed, the values of vertices may change. The changed value is stored in one place while the original value is kept in another. These two copies are required because a vertex may have an edge to another vertex on the same server. For example, here the value of A may change during computation, so the vertices B and C should read the value that existed before the iteration started. The other way to do this is to lock the vertices, and locking is much more expensive than maintaining two copies. LFGraph opts for computation savings here at the cost of extra memory overhead; however, the authors show that the space overhead of keeping these values is small.
  • #29: LFGraph needs to store only the in-neighbors. Out neighbors are not required since each vertex that has in neighbors on other servers uses the publish subscribe list to inform the server.
  • #34: The third feature is the single pass computation. The framework just makes one pass over the vertices during the computation phase. An additional pass is obviously required in the communication phase. I hope I am clear about passes in computation. In each computation phase, the framework needs to go through the list of vertices and execute the vertex centric program once for each vertex.
  • #35: Think like a vertex, but unlike Pregel, LFGraph decouples the computation and communication phases, letting each grow independently of the other.
  • #36: Explain this?
  • #37: Front-end servers store the vertex program and a configuration file (graph data, number of servers, IPs, ports, number of computation and communication job servers, etc.) and send both to all servers. Graph loaders load the graph in parallel across servers and store it in the storage engine. Each server then enters the barrier; the barrier waits for all servers to enter and then signals them to continue. The job server spawns multiple workers, each computing on a part of the graph (a shard), and enters the barrier when all worker threads complete. After the signal from the barrier server, the communication phase proceeds just like the computation phase. And repeat.
  • #39: Next came GraphLab. The idea here is to create ghost vertices so that there is zero communication within an iteration. Ghost vertices are created for vertices that have an edge from a vertex on another server. In this example, A has edges to D and E, so ghost vertices D and E are created on Server 1; similarly, a ghost vertex is created for A on Server 2. All the values the vertices on each server need are present on that server. However, if a value changes after an iteration, it must be communicated to all its ghosts: if the value of A changes on Server 1 in iteration s, the new value is communicated to Server 2 after iteration s and before iteration s+1.
  • #40: The communication overhead here is proportional to the number of ghosts. Additionally, the servers need to store both in- and out-edges to handle the ghost vertices. There is also a two-pass computation: one pass for value updates and one for neighbor activation (determining whether a vertex needs to perform any computation; it computes only if it has an active neighbor, i.e., it is receiving a message).
  • #42: Next came PowerGraph, the idea here is mirrors. Instead of creating ghost vertices, the vertices which have edges on multiple servers have mirrors on all the servers. For example, the vertex A creates two mirrors – one on Server 1 and another on Server 2. Now, just like ghost vertices, the values of mirrors need to be accumulated. It assigns a master mirror that collects all values and sends the final value to all mirrors.
  • #43: Therefore, here the communication overhead is twice the number of mirrors. All of these are excellent approaches, but can we do better? There are three major requirements for a graph framework: low memory overhead, low computation overhead, and low communication overhead, and they do not all go hand in hand; the frameworks we discussed so far trade one for another. Can we minimize all of them in a single framework? I will discuss the comparison between all the frameworks in a later slide.
  • #44: The authors used the equations that they derived to calculate the expected communication overhead as the number of servers increases for real world graphs. This is Twitter. LFGraph is better than all. But one interesting observation is that LFGraph is still increasing at 250 while Pregel plateaus. No matter how many servers, the probability of having neighbors on other servers is uniform
  • #45: I mentioned before that cheap hash-based partitioning is better than expensive intelligent partitioning. Let us discuss exactly why this is the case. Several studies have shown that power-law graphs have substantial load imbalance if we use random hash-based partitioning, and people have used this to motivate an intelligent up-front partitioning strategy to reduce the imbalance. Remember: the more the imbalance, the longer the execution time, since with barriers each iteration is only as fast as the slowest worker. However, are real-world graphs the same as ideal power-law graphs?
  • #46: The runtime of a computation worker during an iteration is proportional to the total number of in edges processed by that worker If we use synthetic ideal power law graphs, it is true the imbalance is high.
  • #47: However, if we use real-world graphs, the imbalance is low. So what's the difference between power-law graphs and real-world graphs?
  • #48: While ideal power-law graphs have a straight tail in their log-log plot, each of the real graphs has a funnel at its tail. This indicates that real graphs have many vertices of the highest degrees, so these spread out and become load-balanced across servers, because the servers are far fewer in number than the vertices. In an idealized power-law graph, on the other hand, there is only one vertex of the very highest degree, and this causes unbalanced load.
  • #49: If we plot the total number of vertices fetched during the communication phases, we observe very little imbalance across the workers showing that communication is balanced in LFGraph.
  • #50: Random: randomly place edges across machines. Batch (greedy partitioning with global coordination): greedily place the next edge on the machine that minimizes the number of mirrors / the replication factor, coordinating through a global table. Oblivious (greedy partitioning without global coordination): each machine decides locally. LFGraph is 2x faster than the best PowerGraph variant.
  • #51: In a small cluster with 8 servers, LFGraph is between 4x to 100x faster than the PowerGraph variants. In a large cluster with 32 servers the improvement grows to 5x–380x. We observed that PowerGraph’s intelligent partitioning schemes contributed 90% –99% to the overall runtime for the PageRank benchmark. In a small cluster, distributed graph processing is compute-heavy thus intelligent partitioning (e.g., in PowerGraph) has little effect. In a large cluster, intelligent partitioning can speed up iterations – however, the partitioning cost itself is directly proportional to cluster size and contributes sizably to runtime.
  • #52: Used the smem tool to measure LFGraph's memory footprint, and the heap space reported in the debug logs to estimate PowerGraph's. LFGraph uses 8x to 12x less memory than PowerGraph.
  • #53: LFGraph transfers about 4x less data per server than PowerGraph. As the cluster size increases, there is first a quick rise in the total communication overhead (see Section 4 and Figure 3), so the per-server overhead rises at first. But as the total communication overhead plateaus, the growing cluster size takes over and the per-server overhead drops; this creates the peak in between. *Emulab has full-bisection bandwidth.
  • #54: We create 10 synthetic graphs varying in number of vertices from 100M to 1B, with up to 128B edges (μ = 4 and σ = 1.3), and run the SSSP benchmark on them. Note, however, the different numbers of servers and configurations for Pregel and LFGraph: 300 servers with 800 workers vs. 12 servers with 96 workers.
  • #56: Sequential access is by far faster than random access.
  • #57: Sequential access beats random access on every medium, with increasing benefit for slower media. Even in RAM it is faster, because of prefetching. Random write beats random read in RAM because of write-coalescing buffers in the architecture of the RAM used, and on disk because of the write cache (absorbing writes, letting the next write proceed while the previous one is outstanding). Why is 16 cores sometimes below 1 core? We need 16 cores to saturate the memory. Why does the speedup increase for slower media? Because of their architecture: moving the disk head is expensive, while RAM is designed for random access.
  • #59: State stored in vertices Scatter updates along outgoing edge Gather updates from incoming edges
  • #63: Scatter and gather run in a loop until the termination criterion is met. When the edge set is much larger than the vertex set, the benefit of streaming rather than random access increases.
  • #64: Streaming data from slow storage
  • #66: Wasted edges!
  • #67: No index, no clustering, no sorting.
  • #69: The union of the Vs must cover all vertices, and each V must fit in fast storage. The number of partitions is important for performance. Three streaming files.
  • #70: They have an extra in and out buffer for streaming
  • #71: Static allocation! Create an index array; copy updates to partition chunks.
  • #75: Why are some speedups smaller? Higher waste, from either the algorithm itself or a bigger diameter. WCC: weakly connected components; ALS: alternating least squares. GraphChi uses "shards": it partitions edges into sorted shards and was the first to bring the notion of sequential access, but a shard's vertices and in/out edges must be in memory at the same time, which means more shards, and it needs pre-sorting (by source) plus re-sorting of updates. X-Stream uses sequential scans and partitions edges into unsorted streaming partitions, with no pre-sorting time.
  • #76: Mainly because of re-sorting (by destination), about 60%, and incomplete use of the available streaming bandwidth from the SSD (more shards so that the edges fit as well).
  • #77: X-Stream returns answers before GraphChi finishes sharding, in 3 of 4 benchmarks.
  • #78: X-Stream uses all available bandwidth from the storage device
  • #79: Weakly connected components. The log-log plot is roughly linear. Because the edge list is unordered, X-Stream easily handles bigger graphs, adding 330M edges at a time for WCC. Why 3 out of 16?
  • #82: In choosing the number of partitions, they assume a uniform distribution of updates. How realistic is that, for example in traversal algorithms?
  • #85: In LFGraph, computation and communication phases are independent. Independent growth of each phase + Communication can be batched In PowerGraph, the two phases overlap ->Communication negatively impacts computation
  • #86: Here is the table given in the paper pointing out the qualitative comparison between various frameworks. As you can see, LFGraph achieves the minimum overheads across the categories. I will not spend time on the table too much but it is clear.
  • #88: Gather is all reads, no writes (write = 0); scatter is bursty. Sequential-access bandwidth: a GraphChi shard needs all its vertices and edges in memory at the same time, while an X-Stream partition needs only its vertices to fit, so there are more GraphChi shards than X-Stream partitions, which makes access more random for GraphChi.
  • #90: WCC: a BFS-style traversal.
  • #91: HDD: RMAT-30; SSD: RMAT-27. Takes advantage of the extra disk!
  • #94: Ratio of execution to streaming
  • #95: Less difference as more cores are used: with more cores, the penalty of random access decreases.
  • #96: Pre-processing is done only once and used many times! X-Stream is not always the best, especially for traversal algorithms (high waste). Pre-processing takes 7-8x more time than X-Stream.
  • #97: There are wasted reads with streaming, but streaming is faster than random access. When to choose which? With more edges there is more random access, which makes things slower, but wastage grows at the same time. Intelligent partitioning would try to put vertices that share an output vertex set together and keep the sets' edge-list sizes similar for load balancing; dynamic partitioning can cause high overhead, and we want low overhead!