Web-Scale Graph Analytics with Apache Spark with Tim Hunter

Tim Hunter, Databricks
#EUds6

About Me
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Co-author of TensorFrames, GraphFrames, Deep Learning Pipelines

Outline
• Why GraphFrames
• Writing scalable graph algorithms with Spark
– Where is my vertex? Indexing data
– Connected Components: implementing complex algorithms with Spark and GraphFrames
– The Social Network: real-world issues
• Future of GraphFrames

Graphs are everywhere
Example: airports & flights between them
[Figure: graph of airports JFK, IAD, LAX, SFO, SEA, DFW connected by flights]
Vertices:
id     City        State
"JFK"  "New York"  NY
Edges:
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923

Apache Spark’s GraphX library
• General-purpose graph processing library
• Built into Spark
• Optimized for fast distributed computing
• Library of algorithms: PageRank, Connected Components, etc.
Issues:
• No Java, Python APIs
• Lower-level RDD-based API (vs. DataFrames)
• Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management

The GraphFrames Spark Package
Brings the DataFrames API to graph processing on Spark
• Simplifies interactive queries
• Benefits from DataFrames optimizations
• Integrates with the rest of the Spark ecosystem
Collaboration between Databricks, UC Berkeley & MIT

DataFrame-based representation
[Figure: graph of airports JFK, IAD, LAX, SFO, SEA, DFW]
vertices DataFrame
id     City        State
"JFK"  "New York"  NY
"SEA"  "Seattle"   WA
edges DataFrame
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
"DFW"  "SFO"  -7     4100224

Supported graph algorithms
• Find Vertices:
– PageRank
– Motif finding
• Communities:
– Connected Components
– Strongly Connected Components
– Label propagation (LPA)
• Paths:
– Breadth-first search
– Shortest paths
• Other:
– Triangle count
– SVD++
(Bold: native DataFrame implementation)

Assigning integral vertex IDs
… lessons learned

Pros of integer vertex IDs
GraphFrames take arbitrary vertex IDs.
→ convenient for users
Algorithms prefer integer vertex IDs.
→ optimize in-memory storage
→ reduce communication
Our task: Map unique vertex IDs to unique (long) integers.

The hashing trick?
• Possible solution: hash vertex ID to long integer
• What is the chance of collision?
– 1 - (N-1)/N * (N-2)/N * …
– seems unlikely with long range N = 2^64
– with 1 billion nodes, the chance is ~5.4%
• Problem: collisions change graph topology.
Name      Hash
Sue Ann   84088
Joseph    -2070372689
Xiangrui  264245405
Felix     67762524
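The collision estimate above can be checked numerically. The sketch below is not from the talk: it uses the standard birthday-bound approximation 1 - exp(-k(k-1)/(2N)) for k hashed IDs in a range of size N, and the function name is made up for illustration. The result depends on the range assumed: roughly 2.7% for N = 2^64 and roughly 5.3% for N = 2^63 (non-negative longs only), the same order of magnitude as the figure quoted on the slide.

```python
import math


def collision_probability(num_ids: int, hash_bits: int) -> float:
    """Birthday-bound approximation of P(at least one hash collision)."""
    space = 2 ** hash_bits
    pairs = num_ids * (num_ids - 1) / 2
    # 1 - exp(-pairs / space); expm1 keeps precision for small exponents
    return -math.expm1(-pairs / space)


p64 = collision_probability(10**9, 64)  # full 64-bit range
p63 = collision_probability(10**9, 63)  # non-negative longs only
print(f"1 billion IDs, 64-bit range: {p64:.2%}")
print(f"1 billion IDs, 63-bit range: {p63:.2%}")
```

Either way the probability is far from negligible at this scale, which is the point of the slide: a single collision merges two vertices and changes the graph.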

Generating unique IDs
Spark has built-in methods to generate unique IDs.
• RDD: zipWithUniqueId(), zipWithIndex()
• DataFrame: monotonically_increasing_id()
Possible solution: just use these methods

How it works
Partition 1
Vertex    ID
Sue Ann   0
Joseph    1
Partition 2
Vertex    ID
Xiangrui  100 + 0
Felix     100 + 1
Partition 3
Vertex    ID
Veronica  200 + 0
…         200 + 1
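A minimal pure-Python sketch of the scheme pictured above: each partition gets its own offset, and records are numbered within the partition. The helper name and the stride of 100 mirror the slide and are illustrative only. Spark's monotonically_increasing_id() follows the same idea, documented as placing the partition ID in the upper 31 bits and the record number in the lower 33 bits.

```python
def assign_ids(partitions, stride):
    """Give record j of partition i the ID i * stride + j."""
    ids = {}
    for part_index, records in enumerate(partitions):
        for row_index, record in enumerate(records):
            ids[record] = part_index * stride + row_index
    return ids


partitions = [["Sue Ann", "Joseph"], ["Xiangrui", "Felix"], ["Veronica"]]
ids = assign_ids(partitions, stride=100)
print(ids)

# The IDs depend on the order of records inside each partition:
reordered = assign_ids([["Joseph", "Sue Ann"], ["Xiangrui", "Felix"], ["Veronica"]], stride=100)
print(reordered["Joseph"], ids["Joseph"])
```

IDs are unique as long as no partition holds more than `stride` records, but they are only stable if the record order within each partition is stable.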

… but not always
• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable orderings.
–distinct
–repartition
• cache() does not help.
Partition 1
Vertex    ID
Xiangrui  0
Joseph    1
→ after repartition / distinct / shuffle:
Partition 1
Vertex    ID
Joseph    0
Xiangrui  1

Our implementation
We implemented (v0.5.0) an expensive but correct version:
1. (hash) re-partition + distinct vertex IDs
2. sort vertex IDs within each partition
3. generate unique integer IDs
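The three steps above can be sketched in plain Python. This is an illustration of the idea, not the GraphFrames code: the function names, the use of CRC32 as the partitioning hash, and the stride of 2^33 are all assumptions made for the example.

```python
import zlib


def stable_partition(vertex_id: str, num_partitions: int) -> int:
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def index_vertices(vertex_ids, num_partitions=3, stride=2**33):
    # Step 1: hash-partition the IDs and drop duplicates within each partition
    partitions = [set() for _ in range(num_partitions)]
    for v in vertex_ids:
        partitions[stable_partition(v, num_partitions)].add(v)
    # Step 2: sort within each partition; step 3: assign partition-offset IDs
    ids = {}
    for p, members in enumerate(partitions):
        for j, v in enumerate(sorted(members)):
            ids[v] = p * stride + j
    return ids


names = ["Sue Ann", "Joseph", "Xiangrui", "Felix", "Veronica"]
first = index_vertices(names)
# Same vertices, different arrival order and with duplicates
second = index_vertices(list(reversed(names)) + names)
print(first == second)
```

Because the partition of a vertex depends only on its ID and the order inside a partition is fixed by the sort, the mapping no longer depends on how records happen to arrive.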

Connected Components

Assign each vertex a component ID such that vertices
receive the same component ID iff they are connected.
Applications:
–fraud detection
• Spark Summit 2016 keynote from Capital One
–clustering
–entity resolution

Naive implementation (GraphX)
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
– For each vertex v, update:
  component ID of v ← smallest component ID in neighborhood of v
Pro: easy to implement
Con: slow convergence on large-diameter graphs
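A single-machine sketch of the naive scheme, written for illustration (it is not the GraphX code, and the function name is made up). It also counts rounds, which shows the weakness named above: on a path graph the smallest ID travels one hop per round, so the number of rounds grows with the diameter.

```python
def naive_components(vertices, edges):
    """Min-label propagation; returns (component IDs, rounds that changed a label)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    comp = {v: v for v in vertices}  # step 1: unique component ID per vertex
    rounds = 0
    while True:
        # step 2: each vertex takes the smallest ID in its neighborhood
        new = {v: min([comp[v]] + [comp[n] for n in adj[v]]) for v in vertices}
        if new == comp:
            return comp, rounds
        comp = new
        rounds += 1


path = list(range(8))
path_edges = [(i, i + 1) for i in range(7)]
labels, rounds = naive_components(path, path_edges)
print(labels, rounds)
```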

Small-/large-star algorithm
Kiveris et al. "Connected Components in MapReduce and Beyond."
1. Assign each vertex a unique ID.
2. Iterate until convergence:
– (small-star) for each vertex, connect smaller neighbors to smallest neighbor
– (big-star, called large-star in the paper) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
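A compact single-machine sketch of the two operations, following my reading of Kiveris et al. rather than the GraphFrames source. Edges are stored as (larger, smaller) pairs, the small-star step also links the vertex itself to its smallest neighbor so that connectivity is preserved, and all function names are illustrative.

```python
from collections import defaultdict


def neighbors(edges):
    """Undirected adjacency map built from a set of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def norm(u, v):
    return (u, v) if u > v else (v, u)  # store the larger endpoint first


def large_star(edges):
    out = set()
    for u, nbrs in neighbors(edges).items():
        m = min(nbrs | {u})  # smallest neighbor, or u itself
        out.update(norm(v, m) for v in nbrs if v > u)
    return out


def small_star(edges):
    out = set()
    for u, nbrs in neighbors(edges).items():
        smaller = {v for v in nbrs if v < u}
        if smaller:
            m = min(smaller)
            out.update(norm(v, m) for v in smaller | {u} if v != m)
    return out


def connected_components(vertices, edges):
    current = {norm(u, v) for u, v in edges if u != v}
    while True:
        updated = small_star(large_star(current))
        if updated == current:
            break
        current = updated
    label = {v: v for v in vertices}
    for v, center in current:  # at convergence every edge points at a star center
        label[v] = center
    return label


labels = connected_components(range(1, 10), [(1, 2), (2, 3), (3, 4), (4, 5), (7, 8), (8, 9)])
print(labels)
```

Each pass only rewires edges inside a component, so isolated vertices (6 in the example) keep their own ID and every other vertex ends up attached to the smallest ID in its component.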

Small-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.

Big-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.

Another interpretation
[Figure: adjacency matrix of a graph on vertices 1, 5, 7, 8, 9, with one entry (x) in each of rows 1, 5, 7, and 8]

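Small-star operation
[Figure: two adjacency matrices on vertices 1, 5, 7, 8, 9 illustrating the small-star operation as "rotate & lift": one has a single entry in each of rows 1, 5, 7, and 8; the other has three entries in row 1 and one in row 8]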

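Big-star operation
[Figure: two adjacency matrices on vertices 1, 5, 7, 8, 9 illustrating the big-star operation as "lift": one has a single entry in each of rows 1, 5, 7, and 8; the other has two entries in row 1 and one each in rows 5 and 7]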

Convergence
    1  5  7  8  9
1   x  x  x  x  x
5
7
8
9

Properties of the algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
• Converges in log^2(#nodes) iterations

Implementation
Iterate:
• filter
• self-join
Challenge: handle these operations at scale.

Skewed joins
Real-world graphs contain big components.
The "Justin Bieber problem" at Twitter
→ data skew during connected components iterations
src  dst
0    1
0    2
0    3
0    4
…    …
0    2,000,000
1    3
2    5
join
src  component id  neighbors
0    0             2,000,000
1    0             10
2    3             5

Skewed joins
src  dst
0    1
0    2
0    3
0    4
…    …
0    2,000,000   → broadcast join (#nbrs > 1,000,000)
1    3
2    5           → hash join
union
src  component id  neighbors
0    0             2,000,000
1    0             10
2    3             5
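The split shown above can be imitated in plain Python. This is a sketch of the idea only, with made-up names and a tiny threshold. Keys with more neighbors than the threshold are joined against a small lookup table that every partition could hold (the "broadcast" side). All other keys are co-partitioned by hash and joined partition by partition. The two results are then unioned.

```python
from collections import Counter, defaultdict


def skew_aware_join(edges, components, threshold, num_partitions=4):
    """Join (src, dst) edges with {src: component id}, treating heavy keys separately."""
    degree = Counter(src for src, _ in edges)
    heavy = {src for src, n in degree.items() if n > threshold}

    # Broadcast side: the few component rows for heavy keys, copied to every worker
    broadcast = {src: components[src] for src in heavy}
    joined = [(s, d, broadcast[s]) for s, d in edges if s in heavy]

    # Hash-join side: remaining rows of both tables are partitioned by key
    edge_parts = defaultdict(list)
    comp_parts = defaultdict(dict)
    for s, d in edges:
        if s not in heavy:
            edge_parts[hash(s) % num_partitions].append((s, d))
    for s, c in components.items():
        if s not in heavy:
            comp_parts[hash(s) % num_partitions][s] = c
    for p, part in edge_parts.items():
        joined += [(s, d, comp_parts[p][s]) for s, d in part]

    return joined  # union of both branches


edges = [(0, d) for d in range(1, 21)] + [(1, 3), (2, 5)]
components = {0: 0, 1: 0, 2: 3}
result = skew_aware_join(edges, components, threshold=10)
print(len(result))
```

In a distributed setting the benefit is that the two million edges of vertex 0 are never shuffled to a single partition; only the small component row travels.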

Checkpointing
We checkpoint every 2 iterations to avoid:
• query plan explosion (exponential growth)
• optimizer slowdown
• running out of disk space for shuffle files
• losing work to unexpected node failures
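A toy model of the first point, not a measurement: assume each iteration self-joins the current result, so the plan for the next iteration contains the previous plan twice plus one join node. Checkpointing replaces the accumulated plan with a plain scan of saved data. The function and its cost model are assumptions made purely for illustration.

```python
def plan_sizes(iterations, checkpoint_every=None):
    """Number of plan nodes executed in each iteration under the toy model."""
    size, history = 1, []
    for i in range(1, iterations + 1):
        size = 2 * size + 1  # self-join: two copies of the plan plus a join node
        history.append(size)
        if checkpoint_every and i % checkpoint_every == 0:
            size = 1  # checkpoint: lineage is cut back to a single scan
    return history


no_checkpoint = plan_sizes(10)
every_two = plan_sizes(10, checkpoint_every=2)
print(no_checkpoint)
print(every_two)
```

Without checkpoints the plan roughly doubles every iteration; with a checkpoint every 2 iterations it stays bounded by a small constant.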

Experiments
twitter-2010 from WebGraph datasets (small diameter)
–42 million vertices, 1.5 billion edges
16 r3.4xlarge workers on Databricks
–GraphX: 4 minutes
–GraphFrames: 6 minutes
• slower due to algorithm difference, checkpointing, and skew checks

Experiments
uk-2007-05 from WebGraph datasets
–105 million vertices, 3.7 billion edges
–GraphX: 25 minutes
• slow convergence
–GraphFrames: 4.5 minutes

Experiments
regular grid 32,000 x 32,000 (large diameter)
–1 billion nodes, 4 billion edges
–GraphX: failed
–GraphFrames: 1 hour

Future improvements
GraphFrames
• better graph partitioning
• letting Spark SQL handle skewed joins and iterations
• graph compression
• local iterations
• node pruning and better stopping criteria

Thank you!
• http://graphframes.github.io
• https://docs.databricks.com

2 types of graph representations
Algorithm-based
• Standard & custom algorithms
• Optimized for batch processing
Query-based
• Motif finding
• Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)

Graph analysis with GraphFrames
Simple queries
Motif finding
Graph algorithms

Simple queries
SQL queries on vertices & edges
Simple graph queries (e.g., vertex degrees)
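As an illustration of what a vertex-degree query computes, here is a plain-Python equivalent of grouping the edges by src and by dst and counting. The first two edges come from the example tables earlier in the deck; the third is hypothetical, added so that one airport has more than one flight. In GraphFrames itself the corresponding results are exposed as the inDegrees, outDegrees, and degrees DataFrames of a graph.

```python
from collections import Counter

edges = [
    ("JFK", "SEA"),
    ("DFW", "SFO"),
    ("JFK", "SFO"),  # hypothetical extra flight
]

out_degree = Counter(src for src, _ in edges)  # edges leaving each vertex
in_degree = Counter(dst for _, dst in edges)   # edges arriving at each vertex
degree = out_degree + in_degree                # total edges touching each vertex

print(dict(degree))
```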

Motif finding
[Figure: airport graph with vertices JFK, IAD, LAX, SFO, SEA, DFW]
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")

Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
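To make the pattern concrete, the sketch below evaluates the same motif by brute force over a small edge list: two consecutive flights a → b → c with no flight from c back to a, followed by the delay filter. It is not how GraphFrames executes the query (which uses DataFrame joins). Only the JFK → SEA and DFW → SFO rows come from the earlier tables; the other rows and the function name are hypothetical.

```python
def find_motif(edges):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a) by nested loops."""
    pairs = {(e["src"], e["dst"]) for e in edges}
    matches = []
    for e1 in edges:
        for e2 in edges:
            chained = e1["dst"] == e2["src"]                   # (a)->(b), (b)->(c)
            no_return = (e2["dst"], e1["src"]) not in pairs    # !(c)->(a)
            if chained and no_return:
                matches.append((e1, e2))
    return matches


edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "SEA", "dst": "SFO", "delay": 10},  # hypothetical
    {"src": "SFO", "dst": "JFK", "delay": 5},   # hypothetical
    {"src": "IAD", "dst": "JFK", "delay": 30},  # hypothetical
]

paths = find_motif(edges)
# Equivalent of paths.filter("e1.delay > 20")
delayed = [(e1["src"], e1["dst"], e2["dst"]) for e1, e2 in paths if e1["delay"] > 20]
print(len(paths), delayed)
```

The JFK → SEA → SFO chain is rejected because SFO → JFK closes the triangle, which is exactly what the negated term expresses.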

Graph algorithms
Find important vertices
• PageRank
Find paths between sets of vertices
• Breadth-first search (BFS)
• Shortest paths
Find groups of vertices (components, communities)
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
Other
• Triangle counting
• SVDPlusPlus

Saving & loading graphs
Save & load the DataFrames.
from graphframes import GraphFrame
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
