GraphFrames: Graph Queries In Spark SQL

GraphFrames: Graph Queries in Apache Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

• 2009: Spark (relational queries)
• 2013: Apache Spark + GraphX (+ graph algorithms)
• 2016: Apache Spark + GraphFrames (+ graph queries)

Graph Algorithms vs. Graph Queries
[Figure: graph algorithms, such as PageRank and Alternating Least Squares, contrasted with graph queries.]

[Figure: side-by-side examples. Graph Algorithm: PageRank. Graph Query: Wikipedia Collaborators, matching Editor 1 and Editor 2 who both edited Article 1 and Article 2 on the same day.]

Graph Algorithm: PageRank
// Iterate until convergence
wikipedia.pregel(
  sendMsg = { e =>
    e.sendToDst(e.srcRank * e.weight)
  },
  mergeMsg = _ + _,
  vprog = { (id, oldRank, msgSum) =>
    0.15 + 0.85 * msgSum
  })
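The message-passing loop above can be sketched without Spark in plain Python. This is a minimal illustration under stated assumptions, not the GraphX implementation: the function name `pregel_pagerank`, the toy edge list, the fixed iteration count, and the choice of edge weight as 1 / out-degree of the source are all hypothetical.

```python
from collections import defaultdict

def pregel_pagerank(edges, num_iters=20):
    """Toy PageRank via message passing; edges is a list of (src, dst) pairs."""
    vertices = {v for edge in edges for v in edge}
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # sendMsg: each edge sends srcRank * weight to its destination.
        # Assumption: weight = 1 / out-degree of the source vertex.
        # mergeMsg: messages arriving at the same vertex are summed.
        msg_sum = defaultdict(float)
        for src, dst in edges:
            msg_sum[dst] += ranks[src] / out_degree[src]
        # vprog: compute the new rank from the merged message.
        # Simplification: vertices with no incoming message see a sum of 0.0.
        ranks = {v: 0.15 + 0.85 * msg_sum[v] for v in vertices}
    return ranks

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
ranks = pregel_pagerank(edges)
```

The sketch runs a fixed number of iterations rather than testing for convergence; on this small acyclic graph the ranks stop changing after a few rounds.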
Graph Query: Wikipedia Collaborators
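import org.apache.spark.sql.functions.{col, expr}

// Find pairs of users who both edited the same two articles
wikipedia.find(
    "(u1)-[e11]->(article1); " +
    "(u2)-[e21]->(article1); " +
    "(u1)-[e12]->(article2); " +
    "(u2)-[e22]->(article2)")
  .select(
    col("*"),
    // Date difference between the two users' edits of each article
    expr("e11.date - e21.date").as("d1"),
    expr("e12.date - e22.date").as("d2"))
  .sort(expr("d1 + d2").desc)
  .take(10)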

Separate Systems

Problem: Mixed Graph Analysis
[Figure: analysis pipeline spanning separate systems. Labels: Raw Wikipedia (XML), Text Table, Edit Table, Edit Graph, Hyperlinks, PageRank, Article Text, User–Article, User–User, Frequent Collaborators, Vandalism Suspects.]

Solution: GraphFrames
[Figure: architecture diagram. Components: GraphFrames API, Pattern Query Optimizer, Spark SQL.]

GraphFrames API
• Unifies graph algorithms, graph queries, and DataFrames
• Available in Scala, Java, and Python
class GraphFrame {
  def vertices: DataFrame
  def edges: DataFrame
  def find(pattern: String): DataFrame
  def registerView(pattern: String, df: DataFrame): Unit
  def degrees(): DataFrame
  def pageRank(): GraphFrame
  def connectedComponents(): GraphFrame
  ...
}

Implementation
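[Figure: query pipeline. A query string is parsed into a pattern; graph–relational translation turns the parsed pattern into a logical plan; join elimination and reordering, together with view selection over materialized views, produce an optimized logical plan; Spark SQL executes it to a DataFrame result. Graph algorithms run on GraphX.]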

Graph–Relational Translation
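[Figure: pattern graph with vertices A, B, C, D. The existing logical plan (output: A, B, C) is joined with the edge table (Src, Dst) on C = Src, and the result is joined with the vertex table (ID, Attr) on D = ID.]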
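The translation step on this slide can be sketched as two joins over in-memory tables. This is an illustrative sketch only, not the GraphFrames planner: the helper name `extend_with_edge`, the dict-based binding rows, and the sample data are assumptions made for the example.

```python
def extend_with_edge(bindings, edges, vertices, src_var, dst_var):
    """Extend existing pattern bindings by one pattern edge src_var -> dst_var.

    bindings: list of dicts mapping pattern variables to vertex IDs
    edges:    list of (src, dst) rows
    vertices: list of (id, attr) rows
    """
    vertex_ids = {vid for vid, _attr in vertices}
    result = []
    for row in bindings:
        for src, dst in edges:
            if row[src_var] != src:   # join with edge table on C = Src
                continue
            if dst not in vertex_ids:  # join with vertex table on D = ID
                continue
            result.append({**row, dst_var: dst})
    return result

vertices = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]

# Existing plan output: bindings for pattern variables A, B, C.
existing = [{"A": 1, "B": 2, "C": 3}]
extended = extend_with_edge(existing, edges, vertices, "C", "D")
```

Each additional pattern edge adds one such pair of joins, which is why eliminating or reordering joins matters for larger patterns.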

Materialized View Selection
GraphX: Triplet view enabled efficient message-passing algorithms
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
[Figure: the graph, its triplet view, and the updated PageRanks computed from it.]
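A triplet view pairs every edge with the attributes of both of its endpoints, so a message-passing step needs only one scan. The following is a minimal sketch using the slide's vertices and edges; the function name `triplet_view` and the attribute values (1.0 for every vertex) are assumptions for illustration.

```python
def triplet_view(vertices, edges):
    """Join each edge with the attributes of its source and destination."""
    return [(src, vertices[src], dst, vertices[dst]) for src, dst in edges]

# Vertex table as id -> attribute (hypothetical values, e.g. current ranks).
vertices = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

triplets = triplet_view(vertices, edges)

# One message-passing step over the view: each destination sums the
# attributes of its in-neighbors without any further lookups.
inbox = {}
for src, src_attr, dst, dst_attr in triplets:
    inbox[dst] = inbox.get(dst, 0.0) + src_attr
```

Materializing the view once lets repeated iterations reuse the joined data instead of redoing the vertex lookups each round.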

Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
[Figure: the graph with its triplet view and user-defined views. Labels: PageRank, Community Detection, …, Graph Queries.]

Join Elimination
Edges (Src, Dst): (1, 2), (1, 3), (2, 3), (2, 5)
Vertices (ID, Attr): (1, A), (2, B), (3, C), (4, D)
SELECT src, dst
FROM edges INNER JOIN vertices ON src = id;
Unnecessary join can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:
SELECT src, dst FROM edges;
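The equivalence of the two queries can be checked on the slide's tables with Python's built-in `sqlite3` module. This is a small sketch, not how Spark SQL performs the rewrite; the `ORDER BY` clauses and the extra edge (9, 1) used to break referential integrity are additions for the example. Note that the join is on `src` only, so the edge pointing at the missing vertex 5 does not affect it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attr TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (2, 5);
""")

JOIN_QUERY = ("SELECT src, dst FROM edges INNER JOIN vertices ON src = id "
              "ORDER BY src, dst")
PLAIN_QUERY = "SELECT src, dst FROM edges ORDER BY src, dst"

# Every src has exactly one matching vertex row, so the join is redundant.
with_join = conn.execute(JOIN_QUERY).fetchall()
without_join = conn.execute(PLAIN_QUERY).fetchall()

# Break referential integrity: an edge whose src has no vertex row.
conn.execute("INSERT INTO edges VALUES (9, 1)")
with_join_broken = conn.execute(JOIN_QUERY).fetchall()
without_join_broken = conn.execute(PLAIN_QUERY).fetchall()
```

Once the dangling edge is inserted, the join drops it while the plain scan keeps it, so the rewrite is only safe when referential integrity holds (and the vertex ID is unique, so the join cannot duplicate rows).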

Join Reordering
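[Figure: an example query shown as two join plans over pattern edges A → B, B → A, B → D, C → B, B → E, C → D, C → E. Left-Deep Plan: edges are joined one at a time (join keys A, B; B; B; B; C, D; C, E). Bushy Plan: sub-plans are joined with each other (join keys include A, B; B; and B, C), and one sub-plan is served by a user-defined view.]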
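Why join order matters can be seen on a toy example. The sketch below is not the GraphFrames optimizer: the helpers `join` and `edge_rel`, the sample edge list, and the three-edge pattern (A → B, B → A, B → D, a subset of the slide's query) are assumptions. Both orders shown are left-deep; a bushy plan would additionally join two such intermediate results with each other.

```python
def join(left, right):
    """Natural join of two lists of dict rows on their shared keys."""
    out = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

def edge_rel(edges, src_var, dst_var):
    """Bind a pattern edge src_var -> dst_var to every edge in the table."""
    return [{src_var: s, dst_var: d} for s, d in edges]

edges = [(1, 2), (2, 1), (2, 3), (2, 4), (5, 2), (6, 2)]
a_b = edge_rel(edges, "A", "B")
b_a = edge_rel(edges, "B", "A")
b_d = edge_rel(edges, "B", "D")

# Order 1: join A->B with B->D first, then filter with B->A.
inter1 = join(a_b, b_d)
result1 = join(inter1, b_a)

# Order 2: join A->B with the selective B->A first, then B->D.
inter2 = join(a_b, b_a)
result2 = join(inter2, b_d)

def canon(rows):
    """Order-independent form of a result for comparison."""
    return sorted((r["A"], r["B"], r["D"]) for r in rows)

same_result = canon(result1) == canon(result2)
sizes = (len(inter1), len(inter2))
```

Both orders return the same matches, but the second keeps a much smaller intermediate result because the reciprocal-edge condition is applied first.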

Evaluation
Faster than Neo4j for unanchored pattern queries
[Figure: two bar charts of query latency (s) for GraphFrames and Neo4j. Panels: Anchored Pattern Query; Unanchored Pattern Query.]
Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation
[Figure: bar chart "PageRank Performance" of per-iteration runtime (s) for GraphFrames, GraphX, and Naïve Spark.]
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

Evaluation
Registering the right views can greatly improve performance for some queries
Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generation for single node

Try It Out!
Released as a Spark Package at:
https://github.com/graphframes/graphframes
Thanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.
ankurd@eecs.berkeley.edu
