Web-Scale Graph Analytics
with Apache Spark
Joseph K Bradley
Bay Area Apache Spark Meetup
September 7, 2017
2
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. Carnegie Mellon in Machine Learning
3
About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
4
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.2
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
5
Apache Spark Engine
Spark Core
Standard libraries: Spark SQL, Spark Streaming, MLlib, GraphX, …
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
7
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames: DataFrame-based graphs
• Bisecting K-Means: now part of MLlib
• Stanford CoreNLP integration: UDFs for NLP
spark-packages.org
8
Outline
Intro to GraphFrames
Moving implementations to DataFrames
•  Vertex indexing
•  Scaling Connected Components
•  Other challenges: skewed joins and checkpoints
Future of GraphFrames
9
Graphs
Example: airports & flights between them
Airports in the example graph: JFK, IAD, LAX, SFO, SEA, DFW

vertex
id      City        State
"JFK"   "New York"  NY

edge
src     dst     delay   tripID
"JFK"   "SEA"   45      1058923
10
Apache Spark’s GraphX library
Overview
•  General-purpose graph processing library
•  Optimized for fast distributed computing
•  Library of algorithms: PageRank, Connected Components, etc.
Challenges
•  No Java, Python APIs
•  Lower-level RDD-based API (vs. DataFrames)
•  Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
11
The GraphFrames Spark Package
Goal: DataFrame-based graphs on Apache Spark
•  Simplify interactive queries
•  Support motif-finding for structural pattern search
•  Benefit from DataFrame optimizations
Collaboration between Databricks, UC Berkeley & MIT
+ Now with community contributors & committers!
13
GraphFrames
vertices DataFrame
id      City        State
"JFK"   "New York"  NY
"SEA"   "Seattle"   WA

edges DataFrame
src     dst     delay   tripID
"JFK"   "SEA"   45      1058923
"DFW"   "SFO"   -7      4100224
14
Graph analysis with GraphFrames
Simple queries
Motif finding
Graph algorithms
15
Simple queries
SQL queries on vertices & edges
Simple graph queries (e.g., vertex degrees)
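A vertex-degree query amounts to a group-by count over the edge table. The following is a minimal sketch in plain Python rather than Spark; the `edges` rows are hypothetical (modeled on the flight example) and the function names are illustrative, not part of the GraphFrames API.

```python
from collections import Counter

# Hypothetical edge rows (src, dst, delay), modeled on the flight example.
edges = [
    ("JFK", "SEA", 45),
    ("DFW", "SFO", -7),
    ("JFK", "SFO", 12),
    ("SEA", "JFK", 3),
]

def out_degrees(edges):
    """Count outgoing edges per vertex (group by src)."""
    return Counter(src for src, _dst, _delay in edges)

def in_degrees(edges):
    """Count incoming edges per vertex (group by dst)."""
    return Counter(dst for _src, dst, _delay in edges)

def degrees(edges):
    """Total degree = in-degree + out-degree."""
    return out_degrees(edges) + in_degrees(edges)
```

On a DataFrame the same idea would be expressed as a `groupBy` on `src` or `dst` followed by a count.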
16
Motif finding
(Figure: flight graph over airports JFK, IAD, LAX, SFO, SEA, DFW)
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
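To make the motif semantics concrete, here is a small sketch in plain Python that matches the same pattern (a two-hop path from a to c with no edge from c back to a) over a hypothetical edge list, then applies the delay filter. It is an illustration of what the pattern means, not how GraphFrames evaluates it; the flight rows and the function name are assumptions.

```python
from collections import defaultdict

# Hypothetical flights: (src, dst, delay in minutes).
edges = [
    ("JFK", "SEA", 45),
    ("SEA", "SFO", 10),
    ("SFO", "JFK", 5),
    ("JFK", "IAD", 30),
    ("IAD", "LAX", 0),
]

def find_two_hop_no_return(edges):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)."""
    out_edges = defaultdict(list)
    edge_pairs = set()
    for src, dst, delay in edges:
        out_edges[src].append((src, dst, delay))
        edge_pairs.add((src, dst))
    matches = []
    for e1 in edges:
        a, b, _ = e1
        for e2 in out_edges[b]:
            c = e2[1]
            if (c, a) not in edge_pairs:  # negated term: no edge c -> a
                matches.append({"a": a, "e1": e1, "b": b, "e2": e2, "c": c})
    return matches

paths = find_two_hop_no_return(edges)
# Counterpart of paths.filter("e1.delay > 20"): filter on an edge attribute.
delayed = [p for p in paths if p["e1"][2] > 20]
```

The triangle JFK, SEA, SFO is excluded by the negated term, so only the two paths that run through IAD remain before filtering.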
21
Graph algorithms
Find important vertices
•  PageRank
Find paths between sets of vertices
•  Breadth-first search (BFS)
•  Shortest paths
Find groups of vertices (components, communities)
•  Connected components
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
Other
•  Triangle counting
•  SVDPlusPlus
22
Saving & loading graphs
Save & load the DataFrames.
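from graphframes import GraphFrame

# Load the vertex and edge DataFrames, then rebuild the graph.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)

# Save a graph by writing its two DataFrames.
g.vertices.write.parquet(...)
g.edges.write.parquet(...)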
23
GraphFrames vs. GraphX
                        GraphFrames                       GraphX
Built on                DataFrames                        RDDs
Languages               Scala, Java, Python               Scala
Use cases               Queries & algorithms              Algorithms
Vertex IDs              Any type (in Catalyst)            Long
Vertex/edge attributes  Any number of DataFrame columns   Any type (VD, ED)
24
2 types of graph libraries
Graph algorithms
•  Standard & custom algorithms
•  Optimized for batch processing
Graph queries
•  Motif finding
•  Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
26
Algorithm implementations
Mostly wrappers for GraphX
•  PageRank
•  Shortest paths
•  Strongly connected components
•  Label Propagation Algorithm (LPA)
•  SVDPlusPlus
Some algorithms implemented using DataFrames
•  Breadth-first search
•  Connected components
•  Triangle counting
•  Motif finding
27
Moving implementations to DataFrames
DataFrames are optimized for a huge number of small records.
•  columnar storage
•  code generation (“Project Tungsten”)
•  query optimization (“Project Catalyst”)
29
Pros of integer vertex IDs
GraphFrames take arbitrary vertex IDs.
→ convenient for users
Algorithms prefer integer vertex IDs.
→ optimize in-memory storage
→ reduce communication
Our task: Map unique vertex IDs to unique (long) integers.
30
The hashing trick?
• Possible solution: hash vertex ID to long integer
• What is the chance of collision?
•  for k vertices and N hash values: 1 - (1 - 1/N) * (1 - 2/N) * … * (1 - (k-1)/N)
•  seems unlikely with long range N = 2^64
•  with 1 billion nodes, the chance is ~5.4%
• Problem: collisions change graph topology.
Name      Hash
Tim       84088
Joseph    -2070372689
Xiangrui  264245405
Felix     67762524
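The collision chance is the classic birthday problem. The sketch below computes it two ways: the exact product for small inputs, and the standard approximation 1 - exp(-k(k-1)/(2N)) for inputs where looping over k terms is impractical. Function names are illustrative, and the hash range N is left as a parameter because the resulting figure depends on it (for example 2^63 versus 2^64); for a billion keys both ranges give a chance of a few percent.

```python
import math

def collision_probability_exact(k, n):
    """P(at least one collision) for k keys hashed uniformly into n values.

    Loops over k terms, so only suitable for small k.
    """
    p_no_collision = 1.0
    for i in range(1, k):
        p_no_collision *= 1.0 - i / n
    return 1.0 - p_no_collision

def collision_probability_approx(k, n):
    """Birthday approximation 1 - exp(-k(k-1)/(2n)); usable for very large k."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-k * (k - 1) / (2.0 * n))
```

Even a probability of a few percent is too high here, because a single collision merges two distinct vertices and silently changes the graph.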
31
Generating unique IDs
Spark has built-in methods to generate unique IDs.
•  RDD: zipWithUniqueId(), zipWithIndex()
•  DataFrame: monotonically_increasing_id()
Possible solution: just use these methods
32
How it works
Partition 1
Vertex    ID
Tim       0
Joseph    1
Partition 2
Vertex    ID
Xiangrui  100 + 0
Felix     100 + 1
Partition 3
Vertex    ID
…         200 + 0
…         200 + 1
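The per-partition numbering shown above can be sketched in a few lines of plain Python. The stride of 100 mirrors the slide and is purely illustrative; Spark's documentation for `monotonically_increasing_id()` describes the same idea with the partition ID in the upper 31 bits and the record number in the lower 33 bits. The function name `assign_ids` is an assumption for this sketch.

```python
def assign_ids(partitions, stride=100):
    """Give each record the ID partition_index * stride + position_in_partition."""
    ids = {}
    for p_index, records in enumerate(partitions):
        if len(records) > stride:
            raise ValueError("partition larger than stride; IDs would overlap")
        for position, record in enumerate(records):
            ids[record] = p_index * stride + position
    return ids

partitions = [["Tim", "Joseph"], ["Xiangrui", "Felix"]]
ids = assign_ids(partitions)
```

Note that each ID depends on where a record sits inside its partition, so the same data in a different order yields different IDs.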
33
… but not always
• DataFrames/RDDs are immutable and reproducible by design.
• However, records do not always have stable orderings.
•  distinct
•  repartition
• cache() does not help.
Partition 1
Vertex    ID
Tim       0
Joseph    1
Partition 1, after re-compute
Vertex    ID
Joseph    0
Tim       1
34
Our implementation
We implemented (v0.5.0) an expensive but correct version:
1.  (hash) re-partition + distinct vertex IDs
2.  sort vertex IDs within each partition
3.  generate unique integer IDs
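The three steps can be sketched in plain Python as follows. This is an illustration of the idea under stated assumptions, not the GraphFrames code: `stable_hash`, `index_vertices`, the partition count, and the stride of 2^33 are all hypothetical choices made for the example.

```python
import zlib

def stable_hash(vertex_id):
    """Deterministic hash (Python's built-in hash() of str is salted per process)."""
    return zlib.crc32(vertex_id.encode("utf-8"))

def index_vertices(vertex_ids, num_partitions=4, stride=1 << 33):
    """Map arbitrary vertex IDs to unique integers, independent of input order."""
    # 1. hash-partition and de-duplicate the vertex IDs
    partitions = [set() for _ in range(num_partitions)]
    for v in vertex_ids:
        partitions[stable_hash(v) % num_partitions].add(v)
    mapping = {}
    for p_index, members in enumerate(partitions):
        # 2. sort within each partition so positions are reproducible
        for position, v in enumerate(sorted(members)):
            # 3. unique ID = partition offset + position
            mapping[v] = p_index * stride + position
    return mapping
```

Because partition membership depends only on the hash and positions depend only on the sorted order, recomputing the mapping from reshuffled input gives the same IDs. The sort is what makes this version expensive.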
36
Connected Components
Assign each vertex a component ID such that vertices receive the same component ID iff they are connected.
Applications:
•  fraud detection
   • Spark Summit 2016 keynote from Capital One
•  clustering
•  entity resolution
37
Naive implementation (GraphX)
1.  Assign each vertex a unique component ID.
2.  Iterate until convergence:
•  For each vertex v, update:
   component ID of v ← smallest component ID in neighborhood of v
Pro: easy to implement
Con: slow convergence on large-diameter graphs
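The naive scheme is easy to write down, which also makes its weakness easy to see. The sketch below is a single-machine Python illustration (not the GraphX code) that uses synchronous updates and also returns the number of passes, so the dependence on graph diameter is visible; the function name is hypothetical.

```python
def naive_connected_components(vertices, edges):
    """Repeatedly replace each vertex's component ID with the smallest ID
    found in its neighborhood, until nothing changes."""
    component = {v: v for v in vertices}  # 1. unique component ID per vertex
    iterations = 0
    changed = True
    while changed:                        # 2. iterate until convergence
        changed = False
        iterations += 1
        updated = dict(component)
        for u, v in edges:
            # Use the IDs from the previous pass (synchronous update).
            smallest = min(component[u], component[v])
            for w in (u, v):
                if smallest < updated[w]:
                    updated[w] = smallest
                    changed = True
        component = updated
    return component, iterations
```

On a path of five vertices the smallest ID travels one hop per pass, so the loop needs as many passes as there are vertices; on a grid with a billion nodes that cost becomes prohibitive.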
38
Small-/large-star algorithm
Kiveris et al. "Connected Components in MapReduce and Beyond."
1.  Assign each vertex a unique ID.
2.  Iterate until convergence:
• (small-star) for each vertex,
connect smaller neighbors to smallest neighbor
• (big-star) for each vertex,
connect bigger neighbors to smallest neighbor (or itself)
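A compact way to see the two operations is to run them on an in-memory edge set. The sketch below is a single-machine Python illustration under stated assumptions: it alternates one large-star (big-star) step and one small-star step until the edge set stops changing, which is one of the schedules discussed by Kiveris et al.; it is not the GraphFrames DataFrame implementation, and all function names are hypothetical.

```python
from collections import defaultdict

def _adjacency(edges):
    """Undirected adjacency sets, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def large_star(edges):
    """For each vertex u, connect its strictly larger neighbors
    to the smallest of u and its neighbors."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        m = min(nbrs | {u})
        out.update((v, m) for v in nbrs if v > u)
    return out

def small_star(edges):
    """For each vertex u, connect u and its smaller neighbors
    to the smallest of those neighbors."""
    out = set()
    for u, nbrs in _adjacency(edges).items():
        smaller = {v for v in nbrs if v < u}
        if smaller:
            m = min(smaller)
            out.update((v, m) for v in smaller | {u} if v != m)
    return out

def connected_components(vertices, edges):
    """Alternate the two operations until the edge set is stable, then read
    the component ID of each vertex off the resulting stars."""
    current = {(max(u, v), min(u, v)) for u, v in edges if u != v}
    while True:
        updated = small_star(large_star(current))
        if updated == current:
            break
        current = updated
    component = {v: v for v in vertices}  # isolated vertices keep their own ID
    for v, center in current:
        component[v] = center
    return component
```

Each step only rewires edges within a component, so connectivity is preserved while every component collapses toward a star centered on its smallest vertex ID.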
39
Small-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
40
Big-star operation
Kiveris et al., Connected Components in MapReduce and Beyond.
41
Another interpretation
(Figure: adjacency matrix of an example graph over vertices 1, 5, 7, 8, 9)
42
Small-star operation
(Figure: adjacency matrix over vertices 1, 5, 7, 8, 9 before and after the small-star operation, labeled "rotate & lift")
43
Big-star operation
(Figure: adjacency matrix over vertices 1, 5, 7, 8, 9 before and after the big-star operation, labeled "lift")
44
Convergence
(Figure: adjacency matrix over vertices 1, 5, 7, 8, 9 at convergence; all entries lie in the row of vertex 1)
45
Properties of the algorithm
• Small-/big-star operations do not change graph connectivity.
• Extra edges are pruned during iterations.
• Each connected component converges to a star graph.
• Converges in O(log^2(#nodes)) iterations
46
Implementation
Iterate:
• filter
• self-join
Challenge: handle these operations at scale.
48
Skewed joins
Real-world graphs contain big components.
→ data skew during connected components iterations

edges
src   dst
0     1
0     2
0     3
0     4
…     …
0     2,000,000
1     3
2     5

joined with
src   component id   neighbors
0     0              2,000,000
1     0              10
2     3              5
49
Skewed joins
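Split the join by vertex degree, then union the two results:
•  edges of vertices with #nbrs > 1,000,000 (here src 0, with 2,000,000 edges): broadcast join
•  all other edges (here 1 → 3 and 2 → 5): hash join

src   component id   neighbors
0     0              2,000,000
1     0              10
2     3              5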
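The split can be sketched in plain Python, with a dictionary lookup standing in for the broadcast side and per-key grouping standing in for the shuffle of a hash join. This is a simplified single-machine illustration, not the GraphFrames implementation; `skewed_join`, its arguments, and the tiny threshold used in the usage example are assumptions.

```python
from collections import Counter, defaultdict

def skewed_join(edges, vertex_rows, threshold):
    """Inner-join edges (src, dst) with vertex_rows {src: component_id},
    handling high-frequency keys separately from the rest."""
    counts = Counter(src for src, _dst in edges)
    hubs = {src for src, n in counts.items() if n > threshold}

    # Broadcast side: the few hub rows are small enough to copy to every
    # worker, so their edges never have to be shuffled to a single partition.
    hub_lookup = {src: vertex_rows[src] for src in hubs if src in vertex_rows}
    broadcast_part = [(src, dst, hub_lookup[src])
                      for src, dst in edges if src in hub_lookup]

    # Hash side: the remaining edges are grouped by key, as a shuffle would do.
    buckets = defaultdict(list)
    for src, dst in edges:
        if src not in hubs:
            buckets[src].append(dst)
    hash_part = [(src, dst, vertex_rows[src])
                 for src, dsts in buckets.items() if src in vertex_rows
                 for dst in dsts]

    return broadcast_part + hash_part  # union of the two partial joins
```

The result equals an ordinary join; the only difference is that no single partition has to hold all edges of a hub vertex.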
50
Checkpointing
We checkpoint every 2 iterations to avoid:
•  query plan explosion (exponential growth)
•  optimizer slowdown
•  running out of disk space for shuffle data
•  unexpected node failures
51
Experiments
twitter-2010 from WebGraph datasets (small diameter)
•  42 million vertices, 1.5 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 4 minutes
•  GraphFrames: 6 minutes
–  slower due to algorithm difference, checkpointing, and skewness checks
52
Experiments
uk-2007-05 from WebGraph datasets
•  105 million vertices, 3.7 billion edges
16 r3.4xlarge workers on Databricks
•  GraphX: 25 minutes
–  slow convergence
•  GraphFrames: 4.5 minutes
53
Experiments
regular grid 32,000 x 32,000 (large diameter)
•  1 billion nodes, 4 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1 hour
54
Experiments
regular grid 50,000 x 50,000 (large diameter)
•  2.5 billion nodes, 10 billion edges
32 r3.8xlarge workers on Databricks
•  GraphX: failed
•  GraphFrames: 1.6 hours
55
Future improvements
GraphFrames
•  update inefficient code (due to Spark 1.6 compatibility)
•  better graph partitioning
•  letting Spark SQL handle skewed joins and iterations
•  graph compression
Connected Components
•  local iterations
•  node pruning and better stopping criteria
56
https://spark-summit.org/eu-2017/
15% Discount code: Databricks
http://dbricks.co/2sK35XT
https://databricks.com/company/careers
Thank you!
Get started with GraphFrames
Docs, downloads & tutorials
http://graphframes.github.io
https://docs.databricks.com
Dev community
GitHub issues & PRs
Twitter: @jkbatcmu → I’ll share my slides.
