Web-Scale Graph Analytics
with Apache Spark
Tim Hunter, Databricks
#EUds6
About Me
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Co-author of TensorFrames, GraphFrames, Deep Learning Pipelines
Outline
• Why GraphFrames
• Writing scalable graph algorithms with Spark
– Where is my vertex? Indexing data
– Connected Components: implementing complex algorithms with Spark and GraphFrames
– The Social Network: real-world issues
• Future of GraphFrames
Graphs are everywhere
Example: airports & flights between them
[Figure: graph of airports JFK, IAD, LAX, SFO, SEA, and DFW connected by flights]
Vertices:
id City State
“JFK” “New York” NY
Edges:
src dst delay tripID
“JFK” “SEA” 45 1058923
Apache Spark’s GraphX library
• General-purpose graph processing library
• Built into Spark
• Optimized for fast distributed computing
• Library of algorithms: PageRank, Connected Components, etc.
Issues:
• No Java, Python APIs
• Lower-level RDD-based API (vs. DataFrames)
• Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
The GraphFrames Spark Package
Brings DataFrames API for Spark
• Simplifies interactive queries
• Benefits from DataFrames optimizations
• Integrates with the rest of Spark ecosystem
Collaboration between Databricks, UC Berkeley & MIT
DataFrame-based representation
[Figure: airport graph with vertices JFK, IAD, LAX, SFO, SEA, and DFW]
vertices DataFrame
id City State
“JFK” “New York” NY
“SEA” “Seattle” WA
edges DataFrame
src dst delay tripID
“JFK” “SEA” 45 1058923
“DFW” “SFO” -7 4100224
Supported graph algorithms
• Find Vertices:
– PageRank
– Motif finding
• Communities:
– Connected Components
– Strongly Connected Components
– Label propagation (LPA)
• Paths:
– Breadth-first search
– Shortest paths
• Other:
– Triangle count
– SVD++
(On the original slide, bold marks algorithms with a native DataFrame implementation.)
Assigning integral vertex IDs
… lessons learned
Pros of integer vertex IDs
GraphFrames take arbitrary vertex IDs.
→ convenient for users
Algorithms prefer integer vertex IDs.
→ optimize in-memory storage
→ reduce communication
Our task: Map unique vertex IDs to unique (long) integers.
The hashing trick?
• Possible solution: hash vertex ID to long integer
• What is the chance of collision?
– 1 - (N-1)/N * (N-2)/N * …
– seems unlikely with long range N = 2^64
– with 1 billion nodes, the chance is ~5.4%
• Problem: collisions change graph topology.
Name Hash
Sue Ann 84088
Joseph -2070372689
Xiangrui 264245405
Felix 67762524
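The collision estimate can be checked with the usual birthday-problem approximation, 1 - exp(-k(k-1)/(2N)). The sketch below is only an illustration: the function name `collision_probability` and the choice of approximation are assumptions, not taken from the talk. Under this approximation, 1 billion vertices give roughly 2.7% for a full 64-bit range and roughly 5.3% for a 63-bit range, the same order of magnitude as the ~5.4% quoted on the slide, so a collision is far from negligible either way.

```python
import math


def collision_probability(num_ids: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) when hashing num_ids distinct
    keys uniformly into a range of size 2**hash_bits (birthday problem)."""
    space = 2 ** hash_bits
    exponent = num_ids * (num_ids - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for bits in (64, 63, 32):
        p = collision_probability(1_000_000_000, bits)
        print(f"{bits}-bit hash, 1e9 vertices: P(collision) ~ {p:.4f}")
```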
Generating unique IDs
Spark has built-in methods to generate unique IDs.
• RDD: zipWithUniqueId(), zipWithIndex()
• DataFrame: monotonically_increasing_id()
Possible solution: just use these methods
How it works
Partition 1
Vertex ID
Sue Ann 0
Joseph 1
Partition 2
Vertex ID
Xiangrui 100 + 0
Felix 100 + 1
Partition 3
Vertex ID
Veronica 200 + 0
… 200 + 1
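The per-partition scheme above can be sketched in plain Python. This is a toy model, not Spark code: `assign_ids` and the `stride` parameter are hypothetical names, and the stride of 100 simply mirrors the slide. Spark's `monotonically_increasing_id()` is documented to follow the same idea, with the partition ID in the upper 31 bits and the record number in the lower 33 bits.

```python
def assign_ids(partitions, stride=100):
    """Give row j of partition i the ID i * stride + j.

    IDs are unique as long as no partition holds more than `stride` rows.
    """
    ids = {}
    for i, rows in enumerate(partitions):
        if len(rows) > stride:
            raise ValueError("partition larger than stride; IDs would overlap")
        for j, vertex in enumerate(rows):
            ids[vertex] = i * stride + j
    return ids


# Vertex names taken from the slide; the partitioning is illustrative.
partitions = [["Sue Ann", "Joseph"], ["Xiangrui", "Felix"], ["Veronica"]]
ids = assign_ids(partitions)
print(ids)
```

No coordination between partitions is needed, which is why the approach is cheap; the catch is that the result depends on which partition a row lands in and on its position there.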
… but not always
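• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable orderings, for example after:
– distinct
– repartition
• cache() does not help.
[Figure: after a repartition, distinct, or shuffle, the same partition can come back in a different order]
Partition 1 (one possible ordering)
Vertex ID
Xiangrui 0
Joseph 1
Partition 1 (another possible ordering)
Vertex ID
Joseph 0
Xiangrui 1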
Our implementation
We implemented (v0.5.0) an expensive but correct version:
1. (hash) re-partition + distinct vertex IDs
2. sort vertex IDs within each partition
3. generate unique integer IDs
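The three steps can be mimicked in plain Python to show why they make the result independent of input order. This is a sketch under assumptions, not the GraphFrames implementation: `stable_ids`, `num_partitions`, and `stride` are hypothetical names, and `zlib.crc32` stands in for whatever hash the real partitioner uses.

```python
import random
import zlib


def stable_ids(vertex_names, num_partitions=4, stride=1 << 33):
    """Assign unique integer IDs that do not depend on the input ordering."""
    # 1. Hash-partition the distinct vertex names (crc32 is stable across runs,
    #    unlike Python's built-in hash() for strings).
    partitions = [[] for _ in range(num_partitions)]
    for name in set(vertex_names):
        bucket = zlib.crc32(name.encode("utf-8")) % num_partitions
        partitions[bucket].append(name)

    ids = {}
    for p, names in enumerate(partitions):
        # 2. Sort within each partition so the local order is deterministic.
        for offset, name in enumerate(sorted(names)):
            # 3. Combine partition index and local offset into a unique ID.
            ids[name] = p * stride + offset
    return ids


names = ["Sue Ann", "Joseph", "Xiangrui", "Felix", "Veronica", "Joseph"]
shuffled = names[:]
random.shuffle(shuffled)
print(stable_ids(names) == stable_ids(shuffled))
```

The sort is the expensive part, but it removes the dependence on record order that makes the unsorted scheme unreliable after a shuffle.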
Connected Components
Assign each vertex a component ID such that vertices receive the same component ID iff they are connected.
Applications:
–fraud detection
• Spark Summit 2016 keynote from Capital One
–clustering
–entity resolution
Naive implementation (GraphX)
1. Assign each vertex a unique component ID.
2. Iterate until convergence:
– For each vertex v, update:
component ID of v ← smallest component ID in neighborhood of v
Pro: easy to implement
Con: slow convergence on large-diameter graphs
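A single-machine sketch of this naive scheme makes the convergence problem concrete. The function name `naive_components` and the synchronous update loop are assumptions for illustration; GraphX runs the same idea as a distributed computation.

```python
def naive_components(vertices, edges):
    """Min-label propagation: every vertex repeatedly adopts the smallest
    label in its neighborhood. Returns (labels, number of update rounds)."""
    vertices = list(vertices)
    label = {v: v for v in vertices}
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    rounds = 0
    while True:
        new_label = {
            v: min([label[v]] + [label[n] for n in adj[v]]) for v in vertices
        }
        if new_label == label:
            return label, rounds
        label = new_label
        rounds += 1


# A path graph 0-1-2-...-9: the smallest label moves only one hop per round.
path_edges = [(i, i + 1) for i in range(9)]
labels, rounds = naive_components(range(10), path_edges)
print(labels, rounds)
```

On a path, the number of rounds grows linearly with the length, which is the slow convergence on large-diameter graphs noted above.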
Small-/large-star algorithm
Kiveris et al. "Connected Components in MapReduce and Beyond."
1. Assign each vertex a unique ID.
2. Iterate until convergence:
– (small-star) for each vertex, connect smaller neighbors to smallest neighbor
– (big-star, called large-star in the paper) for each vertex, connect bigger neighbors to smallest neighbor (or itself)
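The two operations can be sketched on an in-memory edge set. This follows the description above and is not the GraphFrames code, which expresses the same steps as DataFrame filters and joins; the names `large_star`, `small_star`, and `connected_components` are hypothetical. Edges are stored as (larger, smaller) pairs.

```python
from collections import defaultdict


def _adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def large_star(edges):
    """For each vertex u, connect its bigger neighbors to min(neighbors + u)."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        m = min(nbrs | {u})
        out.update((v, m) for v in nbrs if v > u)
    return out


def small_star(edges):
    """For each vertex u, connect u and its smaller neighbors to the smallest."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        group = {v for v in nbrs if v < u} | {u}
        m = min(group)
        out.update((v, m) for v in group if v != m)
    return out


def connected_components(vertices, edges):
    """Alternate large-star and small-star until the edge set stops changing."""
    current = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        updated = small_star(large_star(current))
        if updated == current:
            break
        current = updated
    component = {v: v for v in vertices}
    for v, center in current:  # each component is now a star around its minimum
        component[v] = center
    return component


# The path 1-5-7-8-9 reuses the vertex labels from the slides.
print(connected_components([1, 5, 7, 8, 9], [(1, 5), (5, 7), (7, 8), (8, 9)]))
```

Because every emitted edge points to a smaller vertex, each component collapses toward its minimum vertex, and the final star gives the component ID directly.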
Small-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
Big-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
Another interpretation
1 5 7 8 9
1 x
5 x
7 x
8 x
9
adjacency matrix
Small-star operation
1 5 7 8 9
1 x x x
5
7
8 x
9
1 5 7 8 9
1 x
5 x
7 x
8 x
9
rotate & lift
Big-star operation
lift
1 5 7 8 9
1 x x
5 x
7 x
8
9
1 5 7 8 9
1 x
5 x
7 x
8 x
9
Convergence
1 5 7 8 9
1 x x x x x
5
7
8
9
Properties of the algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
• Converges in log2(#nodes) iterations
Implementation
Iterate:
• filter
• self-join
Challenge: handle these operations at scale.
Skewed joins
Real-world graphs contain big components.
The “Justin Bieber problem” at Twitter
→ data skew during connected components iterations
src dst
0 1
0 2
0 3
0 4
… …
0 2,000,000
1 3
2 5
src Component id neighbors
0 0 2,000,000
1 0 10
2 3 5
join
Skewed joins
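Split the join by neighbor count, then union the two results:
broadcast join (#nbrs > 1,000,000)
src dst
0 1
0 2
0 3
0 4
… …
0 2,000,000
hash join (all other vertices)
src dst
1 3
2 5
Both parts are joined against:
src Component id neighbors
0 0 2,000,000
1 0 10
2 3 5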
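The split can be imitated on plain Python lists to show the control flow. This is a toy model, not Spark code: `skew_aware_join`, `component_of`, and `threshold` are hypothetical names, and the dictionary lookups only stand in for the broadcast and hash joins that Spark would run.

```python
from collections import Counter


def skew_aware_join(edges, component_of, threshold):
    """Attach component_of[src] to every (src, dst) edge, treating sources
    with more than `threshold` edges separately from the rest."""
    degree = Counter(src for src, _ in edges)
    heavy = {src for src, n in degree.items() if n > threshold}

    # Heavy keys: the few matching rows of the small table are copied to every
    # worker (a broadcast join), so the huge edge list is never shuffled by key.
    heavy_lookup = {src: component_of[src] for src in heavy}
    broadcast_part = [(s, d, heavy_lookup[s]) for s, d in edges if s in heavy]

    # Remaining keys: an ordinary hash join, which is balanced once the
    # heavy keys have been taken out.
    hash_part = [(s, d, component_of[s]) for s, d in edges if s not in heavy]

    return broadcast_part + hash_part  # union of the two results


# Scaled-down version of the slide's example: vertex 0 is the heavy hitter.
edges = [(0, d) for d in range(1, 6)] + [(1, 3), (2, 5)]
component_of = {0: 0, 1: 0, 2: 3}
joined = skew_aware_join(edges, component_of, threshold=3)
print(joined)
```

The result equals a plain join; only the execution strategy differs between the two parts.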
Checkpointing
We checkpoint every 2 iterations to avoid:
• query plan explosion (exponential growth)
• optimizer slowdown
• running out of disk space for shuffle files
• unexpected node failures
Experiments
twitter-2010 from WebGraph datasets (small diameter)
–42 million vertices, 1.5 billion edges
16 r3.4xlarge workers on Databricks
–GraphX: 4 minutes
–GraphFrames: 6 minutes
• slower because of the algorithm difference, checkpointing, and skewness checks
Experiments
uk-2007-05 from WebGraph datasets
–105 million vertices, 3.7 billion edges
16 r3.4xlarge workers on Databricks
–GraphX: 25 minutes
• slow convergence
–GraphFrames: 4.5 minutes
Experiments
regular grid 32,000 x 32,000 (large diameter)
–1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
–GraphX: failed
–GraphFrames: 1 hour
Future improvements
GraphFrames
• better graph partitioning
• letting Spark SQL handle skewed joins and iterations
• graph compression
Connected Components
• local iterations
• node pruning and better stopping criteria
Thank you!
• http://graphframes.github.io
• https://docs.databricks.com
2 types of graph representations
Algorithm-based
• Standard & custom algorithms
• Optimized for batch processing
Query-based
• Motif finding
• Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
Graph analysis with GraphFrames
Simple queries
Motif finding
Graph algorithms
Simple queries
SQL queries on vertices & edges
Simple graph queries (e.g., vertex degrees)
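As a small illustration of a degree query, the sketch below counts edge endpoints over an in-memory copy of the edges table. It is plain Python rather than GraphFrames code, the function name `degrees` is hypothetical, and the third edge is made up to make the result less trivial; the first two rows come from the slides.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each vertex (in-degree plus out-degree)."""
    deg = Counter()
    for src, dst, *_ in edges:
        deg[src] += 1
        deg[dst] += 1
    return dict(deg)


edges = [
    ("JFK", "SEA", 45, 1058923),  # from the slides
    ("DFW", "SFO", -7, 4100224),  # from the slides
    ("SEA", "SFO", 0, 0),         # hypothetical edge, for illustration only
]
print(degrees(edges))
```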
Motif finding
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
[Figure: airport graph (JFK, IAD, LAX, SFO, SEA, DFW) with one match highlighted as (a), (b), (c)]
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
Graph algorithms
Find important vertices
• PageRank
Find paths between sets of vertices
• Breadth-first search (BFS)
• Shortest paths
Find groups of vertices (components, communities)
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
Other
• Triangle counting
• SVDPlusPlus
Saving & loading graphs
Save & load the DataFrames.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)


Editor's Notes

  • #2: Xiangrui’s talk: https://www.youtube.com/watch?v=D2kBcdldNT8&feature=youtu.be
    Xiangrui’s slides: https://www.slideshare.net/databricks/challenging-webscale-graph-analytics-with-apache-spark-with-xiangrui-meng
    GraphFrames doc: https://docs.databricks.com/spark/latest/graph-analysis/graphframes/graph-analysis-tutorial.html
  • #21: Raimondas Kiveris