Optimize Large-Scale Graph Applications Using Apache Spark, with 4-5x Performance Improvements
Agenda

• Challenges
• Our Lessons Learned
  • Improve the scalability of large graph computation
  • Optimization and enhancement in the production environment
• Learning Summary
Challenges

The main challenges we face:

• A large graph that is skewed by nature:
  • 2+ billion vertices
  • 100+ billion edges
  • Degrees: average 110, maximum 2+ million
• Strict SLAs, but various limitations in production:
  • Limited resources
  • Various production guidelines
  • A dedicated pool, but shared common services (e.g., the HDFS NameNode)
Our Lessons Learned: improving the scalability of large graph computation
Use Case #1: Community detection (SCALABILITY)

• We use connected components to group the communities.
• Reference: the paper "Connected Components in MapReduce and Beyond".
[Figure: a sample undirected graph with vertices 1-6. Finding its connected component links every vertex to the smallest member, yielding the pairs (1,2), (1,3), (1,4), (1,5), (1,6), that is, community 1.]
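As a baseline, GraphX ships a built-in connectedComponents operator that works well at moderate scale; a minimal sketch on the sample graph (our illustration, assuming a live SparkContext sc; at our scale the custom approach described below was still required):

    import org.apache.spark.graphx.{Edge, Graph}

    // Build the sample undirected graph (GraphX stores directed edges, but
    // connectedComponents ignores direction) and label each vertex with the
    // smallest vertex id in its component.
    val edges = sc.parallelize(Seq(
      Edge(1L, 6L, ()), Edge(2L, 6L, ()), Edge(2L, 5L, ()),
      Edge(2L, 4L, ()), Edge(2L, 3L, ())
    ))
    val graph = Graph.fromEdges(edges, defaultValue = ())
    graph.connectedComponents().vertices.collect().foreach {
      case (vid, cid) => println(s"vertex $vid -> community $cid") // all map to 1
    }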
The natural data skew caused a "bucket effect" and OOM (SCALABILITY)

Sample illustration. Each iteration of the algorithm transforms a list of vertex pairs:

1. Make it directed: orient every pair from the larger to the smaller vertex id.
2. Group by starting node.
3. Find the smallest node in each group, then generate new pairs by linking every node in the group to that smallest node.
4. Dedup the pairs and feed them into the next iteration.

The reducers find the connected components and identify the unique representative vertex within each community. A minimal code sketch of one iteration follows the figure below.

[Figure: two iterations over the sample graph. Starting from the edges (1,6), (2,6), (2,5), (2,4), (2,3), iteration 1 produces the deduped pairs (6,1), (2,1), (3,1), (4,1), (5,1), (3,2), (4,2), (5,2); iteration 2 contracts this intermediate graph further toward the representative vertex 1.]
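For concreteness, one iteration of this loop can be sketched with plain RDD operations (our illustration of the scheme above, not the production code; pairs holds (vertex, vertex) ids):

    import org.apache.spark.rdd.RDD

    // One iteration: every node learns the smallest id in its neighbourhood.
    // Repeating until the pair set stops changing links each node to its
    // component's representative (the minimum vertex id).
    def ccIteration(pairs: RDD[(Long, Long)]): RDD[(Long, Long)] = {
      pairs
        .flatMap { case (a, b) => Seq((a, b), (b, a)) } // both directions
        .groupByKey()                                   // group by starting node
        .flatMap { case (node, neighbours) =>
          val min = (neighbours ++ Iterator(node)).min  // smallest node in group
          (neighbours ++ Iterator(node)).filter(_ != min).map(n => (n, min))
        }
        .distinct()                                     // dedup
    }

Exactly this grouping step is where the skew bites: the group for a popular node can hold tens of millions of neighbours, as shown next.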
The natural data skew caused a "bucket effect" and OOM (cont.) (SCALABILITY)

[Figure: iteration 3 over the sample graph. Grouping by starting node yields (1,[2,3,4,5,6]), (2,[1,3,4,5]), (3,[1,2]), (4,[1,2]), (5,[1,2]), (6,[1]); finding the smallest node per group and relinking leaves (2,1), (3,1), (4,1), (5,1), (6,1), so the reducer finds one connected component whose id is 1 and whose members are [1,2,3,4,5,6].]

Why this breaks at production scale:

• The size of the connected components increases significantly in each iteration.
• This caused the "bucket effect": a few slow reduce tasks hold back the whole stage.
• Keeping a connected component in memory caused OOM in some reducers. For example, one component connected 50,000,000+ nodes.
Our approach to resolve the "bucket effect" and OOM (SCALABILITY)

Three ideas: separate the keys, split the huge keys, and spill to disk. A code sketch of the key-splitting idea follows the figure below.

1. Separate huge keys from normal keys. Normal keys are processed as introduced before: group by starting node, find the smallest node in each group, explode the map to rows, and dedup.
2. Split each huge key: find the min for the huge key, then divide the key by adding a random number as a prefix, so one giant group becomes several salted sub-groups handled by different reducers.
3. Spill to disk when grouping: group by key; if the list length of a single key exceeds a threshold, spill the list to an mmap file, keep the remaining list in memory, and keep the min value of the original key in each group. Then read the list of files and the in-memory list, generate the new pairs, and finally merge and dedup.

[Figure: the sample pairs flowing through this pipeline. The huge key 1 is split into the salted keys 01 and 11, giving (01,6), (01,2), (11,3), (11,4), (11,5); [3,4] is spilled into file1 and [2,6] into file2 while the min value 1 of the original key is retained, e.g. (11, ([file1],[5],1)) and (01, ([file2],[],1)); reading everything back and merging yields the deduped pairs (2,1), (3,1), (4,1), (5,1), (6,1).]
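A minimal sketch of the key-splitting idea (names and structure are ours, assuming hugeKeys was identified beforehand, e.g. by sampling key frequencies; the mmap-file spilling step is omitted for brevity):

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    def relinkWithSalting(pairs: RDD[(Long, Long)], hugeKeys: Set[Long],
                          numSplits: Int = 16): RDD[(Long, Long)] = {
      // Salt huge keys with a random prefix so no single reducer owns a whole group.
      val salted = pairs.map { case (k, v) =>
        val salt = if (hugeKeys.contains(k)) Random.nextInt(numSplits) else 0
        ((salt, k), v)
      }
      // Local min per salted sub-group, then global min per original key.
      val globalMin = salted
        .reduceByKey((a, b) => math.min(a, b))
        .map { case ((_, k), m) => (k, m) }
        .reduceByKey((a, b) => math.min(a, b))
      // Link every member of a group (and the key itself) to the group's min.
      pairs.join(globalMin)
        .flatMap { case (k, (n, m)) => Seq((n, m), (k, m)) }
        .filter { case (a, b) => a != b }
        .distinct() // dedup
    }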
Our lessons learned on scalability:

• Don't blame Spark when you see OOM.
• Elegant memory usage is KING.
• Data skew is inevitable, but scalability can still be achieved:
  • Split huge keys.
  • Spill to disk when necessary.
Our Lessons Learned: optimization and enhancement in the production environment
Use Case #2: Prepare the graph data using Hive (PERFORMANCE)

How does Spark choose a join strategy? (Notes: Spark 2.3.0; joins without joining keys are not covered here.) JoinSelection works roughly as follows:

1. canBroadcastByHints? If yes, use BroadcastJoin.
2. Otherwise, canBroadcastBySizes? If yes, use BroadcastJoin.
3. Otherwise, preferSortMergeJoin? If yes, use SortMergeJoin.
4. Otherwise, canBuildLocalHashMap? If yes, use ShuffleHashJoin; if no, fall back to SortMergeJoin.

Comparison among the join strategies:

• Broadcast: the smaller table is broadcast to every executor; no shuffle is needed.
• ShuffleHashJoin (local hash map): a shuffle is needed; a hash map is built for the smaller side in each reducer.
• SortMergeJoin: a shuffle is needed; each partition of both sides is sorted before merging.

Quiz: Broadcast, LocalHashMap, or SortMergeJoin?

    select * from A inner join B on A.id = B.id where B.dt = '2020-06-25'

• Both Table A and Table B are extra-large tables.
• Table B is partitioned on the date (dt) column; the selected partition is only around 1 MB.
• So this is an inner join between a small partition of Table B and the extra-large Table A.
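One way to answer the quiz empirically is to ask Spark which physical join it selected (a quick check, assuming tables A and B are registered in the metastore):

    // Print the physical plan chosen by JoinSelection for the quiz query.
    spark.sql(
      """select * from A inner join B on A.id = B.id
        |where B.dt = '2020-06-25'""".stripMargin
    ).explain()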
Expectation vs. execution: since only the ~1 MB partition of Table B participates in the join, one would expect a broadcast join. In execution, however, Spark 2.3 selected a SortMergeJoin. The plan walkthrough below shows why.
Our approach to enable broadcast join, with a 3x performance improvement (PERFORMANCE)

Before:

    select *
    from A inner join B on A.id = B.id
    where B.dt = '2020-06-25'

• Parser: builds the unresolved plan 'Project (*) over 'Filter (dt='2020-06-25') over 'Join (A.id=B.id) over 'UnresolvedRelation A and 'UnresolvedRelation B.
• Analyzer: resolves the relations, giving Project (*) over Filter (dt='2020-06-25') over Join (A.id=B.id) over HiveTableRelation A (sizeInBytes=1GB) and HiveTableRelation B (sizeInBytes=1GB).
• Optimizer: pushes Filter (dt='2020-06-25') down onto HiveTableRelation B; both relations still report sizeInBytes=1GB.
• Spark Strategies (including JoinSelection): because B still looks like 1GB, the planner emits ProjectExec (*) over SortMergeJoinExec (A.id=B.id), scanning HiveTableScanExec A and FilterExec (dt='2020-06-25') over HiveTableScanExec B.
Our approach to enable broadcast join, with a 3x performance improvement (cont.) (PERFORMANCE)

After: the same query goes through the same Parser and Analyzer stages, but the Optimizer now runs our new rule, PruneHiveTablePartitions.

• Optimizer with rule PruneHiveTablePartitions: Filter (dt='2020-06-25') is pushed down, B's partitions are pruned, and sizeInBytes on HiveTableRelation B is updated from 1GB to 1MB.
• Spark Strategies (including JoinSelection): with the updated sizeInBytes, a broadcast join is selected. The plan becomes ProjectExec (*) over BroadcastHashJoinExec (A.id=B.id), scanning HiveTableScanExec A and FilterExec (dt='2020-06-25') over HiveTableScanExec B.

See PR #26805, merged in Spark 3.0.
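For jobs stuck on Spark 2.x without this rule, a possible workaround (our suggestion, not something the slides prescribe) is to filter the partition into a DataFrame first and hint the broadcast explicitly, sidestepping the stale table-level statistics:

    import org.apache.spark.sql.functions.broadcast
    import spark.implicits._

    // Force the broadcast regardless of HiveTableRelation's sizeInBytes.
    val b = spark.table("B").filter($"dt" === "2020-06-25")
    val joined = spark.table("A").join(broadcast(b), Seq("id"))
    joined.explain() // should now show BroadcastHashJoinExec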
Use Case #3: Persist the graph data into Hive tables (PERFORMANCE)

Mis-partitioning the column(s) overloaded the HDFS NameNode in production.

Before:

Step 1. DDL auditing process. The DDL query below was reviewed and approved:

    create table default.emp (
      dept_id int,    --1
      emp_id int,     --2
      age int,        --3
      gender string,  --4
      address string  --5
    ) partitioned by (
      country string, --6
      city string     --7
    )

Step 2. Manipulate the data in a DataFrame. The DML query below was reviewed and approved:

    // a DataFrame df1 was created by earlier logic
    df1.registerTempTable("tmpTable")
    val df2 = sparkSession.sql(
      """select
           department_id as dept_id, --1
           employee_id as emp_id,    --2
           emp_age as age,           --3
           emp_gender as gender,     --4
           cnty as country,          --5
           addr as address,          --6
           city_name as city         --7
         from tmpTable""")
    df2.write.insertInto("default.emp")

What went wrong: insertInto matches columns by position, not by name.

• The address column was mis-matched to the country column: the DataFrame's 5th column is country, but the table's 5th column is address, so the address values landed in the country partition column.
• country has 200+ distinct values, while address has 10+ million distinct values.
• Tons of new partition folders and files were created.
• Platform alerts fired because the NameNode was continuously overloaded.
Our approach: refine the interface to be explicit (PERFORMANCE)

This avoids column or partition-column mismatches by failing fast instead of silently writing by position.

After:

Step 1. DDL auditing process: unchanged; the same create table default.emp DDL as before.

Step 2. Manipulate the data in a DataFrame, now passing an explicit byName flag when writing:

    // a DataFrame df1 was created by earlier logic
    df1.registerTempTable("tmpTable")
    val df2 = sparkSession.sql(
      """select
           department_id as dept_id, --1
           employee_id as emp_id,    --2
           emp_age as age,           --3
           emp_gender as gender,     --4
           cnty as country,          --5
           addr as address,          --6
           city_name as city         --7
         from tmpTable""")
    df2.write.insertInto("default.emp", true)

We added this overload:

    def insertInto(tableName: String, byName: Boolean): Unit

If byName is true, Spark will:

1. Match the columns between the DataFrame and the target table by name.
2. Throw an exception if a column name in the DataFrame does not exist in the target table.
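The insertInto(tableName, byName) overload above is our internal enhancement. On stock Spark, a similar by-name guarantee can be approximated in user code by reordering the DataFrame to the target schema; a sketch (helper name and error handling are ours, illustrative only):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    // Reorder the DataFrame columns into the target table's schema order
    // (data columns first, partition columns last) and fail fast if any
    // target column is missing from the DataFrame.
    def insertByName(spark: SparkSession, df: DataFrame, table: String): Unit = {
      val targetCols = spark.table(table).columns
      val missing = targetCols.filterNot(df.columns.contains)
      require(missing.isEmpty, s"DataFrame is missing columns: ${missing.mkString(", ")}")
      df.select(targetCols.map(col): _*).write.insertInto(table)
    }

    // Usage: insertByName(sparkSession, df2, "default.emp")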
Our lessons learned on optimization and enhancement in production:

• Nothing is too tiny to optimize for performance.
• A deep understanding of Spark internals is helpful.
• Misuse may cause serious impact on a shared service.
• Explicit interfaces help avoid misuse.
• Overall, performance has improved by 4-5x.
Learning Summary

Our learning summary, from our practice on real cases in production:

• Use memory elegantly in user code to improve scalability.
• Understanding Spark deeply is helpful for optimization.
• We improved the end-to-end runtime from 2 days to around 10 hours.

We are open to a new learning journey by connecting with you all.

Q & A