Optimizing Apache Spark SQL Joins
Vida Ha, Solutions Architect
About Me
2005 Mobile Web & Voice Search
2012 Reporting & Analytics
2014 Solutions Architect
Evolution of Spark…
2014:
• Spark 1.x
• RDD-based APIs
• Everyday I'm Shufflin'
2017:
• Spark 2.x
• DataFrames & Datasets
• Advanced SQL Catalyst optimizer
• Optimizing Joins
Spark SQL Joins
SELECT …
FROM TABLE A
JOIN TABLE B
ON A.KEY1 = B.KEY2
Topics Covered Today
Basic Joins:
• Shuffle Hash Join
• Troubleshooting
• Broadcast Hash Join
• Cartesian Join
Special Cases:
• Theta Join
• One to Many Join
Shuffle Hash Join
A Shuffle Hash Join is the most basic type of join, and goes back to MapReduce fundamentals:
• Map over the two DataFrames/tables.
• Use the fields in the join condition as the output key.
• Shuffle both datasets by the output key.
• In the reduce phase, join the two datasets: any rows of both tables with the same key are now on the same machine and are sorted (see the RDD-level sketch below).
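As a rough illustration, the same map/shuffle/reduce pattern can be written by hand at the RDD level. A minimal sketch, assuming the tables from the SQL example two slides down are also available as DataFrames:

people_by_state = people_in_the_us.rdd.map(lambda row: (row["state"], row))   # key = join field
states_by_name = states.rdd.map(lambda row: (row["name"], row))               # key = join field
joined = people_by_state.join(states_by_name)   # the shuffle brings equal keys to the same reducer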
Shuffle Hash Join
[Diagram: Table 1 and Table 2 go through MAP and SHUFFLE stages; the REDUCE stage produces the joined output partitions.]
join_rdd = sqlContext.sql("""SELECT *
  FROM people_in_the_us
  JOIN states
  ON people_in_the_us.state = states.name""")
Shuffle Hash Join Performance
Works best when the DFs:
• Distribute evenly on the key you are joining on.
• Have an adequate number of keys for parallelism (a quick skew check is sketched below).
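A quick way to sanity-check both properties is to look at the key distribution before joining. A minimal sketch, assuming the people_in_the_us DataFrame from the example above:

key_counts = people_in_the_us.groupBy("state").count()
key_counts.orderBy("count", ascending=False).show(10)   # a few huge keys => skewed shuffle
print(key_counts.count())                               # very few distinct keys => limited parallelism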
Uneven Sharding & Limited Parallelism
[Diagram: US DF partitions 1…N joined against a small state DF; after the shuffle, **all** the data for CA lands in one partition and **all** the data for RI in another.]
Problems:
● Uneven sharding: all the data for the US is shuffled into only 50 keys, one per state.
● Limited parallelism: only 50 output partitions.
A larger Spark cluster will not solve these problems!
Uneven Sharding & Limited Parallelism
[Same diagram as the previous slide: all the US data is still shuffled into only 50 keys, with the same uneven sharding and limited parallelism.]
Broadcast Hash Join can address this problem if one DF is small enough to fit in memory.
join_rdd = sqlContext.sql("""SELECT *
  FROM people_in_california
  LEFT JOIN all_the_people_in_the_world
  ON people_in_california.id = all_the_people_in_the_world.id""")
More Performance Considerations
Final output keys = # of people in CA, so we don't need a huge Spark cluster, right?
Left Join - Shuffle Step
The size of the Spark cluster needed to run this job is limited by the large table rather than the medium-sized table.
Not a problem here:
● Even sharding
● Good parallelism
The problem: everything from both tables is shuffled before any keys are dropped.
[Diagram: all of the CA DF and all of the World DF flow through the shuffle into the final joined output.]
A Better Solution
Filter the World DF down to only the entries that match a CA id before joining (see the sketch below).
Benefits:
● Less data shuffled over the network and less shuffle space needed.
● More transforms, but still faster.
[Diagram: a filter transform reduces the All World DF to a Partial World DF, which is then shuffled with the All CA DF to produce the final joined output.]
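One way to express that filter, sketched with the table names from the earlier query; the left-semi join and the broadcast are my choices here, not prescribed by the slide:

from pyspark.sql.functions import broadcast

ca_ids = people_in_california.select("id")
# Keep only world rows whose id appears in CA; broadcasting the (small) id set
# avoids shuffling the full world table just to apply the filter.
partial_world = all_the_people_in_the_world.join(broadcast(ca_ids), "id", "left_semi")
joined = people_in_california.join(partial_world, "id", "left")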
What's the Tipping Point for Huge?
● Can't tell you.
● There aren't always strict rules for optimizing.
● If you were only considering two small columns from the World RDD in Parquet format, the filtering step may not be worth it.
You should understand your data and its unique properties in order to best optimize your Spark job.
In Practice: Detecting Shuffle Problems
Things to look for:
● Tasks that take much longer to run than others.
● Speculative tasks being launched.
● Shards that have a lot more input or shuffle output.
Check the Spark UI pages for task-level detail about your Spark job.
Broadcast Hash Join
Optimization: when one of the DFs is small enough to fit in memory on a single machine, broadcast it to every partition of the large DF.
Parallelism of the large DF is maintained (n output partitions), and no shuffle is needed at all.
[Diagram: the small DF is broadcast to each of the large DF's partitions 1…N, which each join locally.]
Broadcast Hash Join
• Often optimal over Shuffle Hash Join.
• Use "explain" to determine whether the Spark SQL Catalyst optimizer has chosen Broadcast Hash Join.
• Should be automatic for many Spark SQL tables; you may need to provide hints for other types (see the sketch below).
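A minimal sketch of supplying that hint in PySpark (the DataFrame and column names are placeholders); explain() then shows whether the physical plan uses BroadcastHashJoin:

from pyspark.sql.functions import broadcast

joined = large_df.join(broadcast(small_df), "key")
joined.explain()   # look for BroadcastHashJoin in the physical plan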
Cartesian Join
• A Cartesian join can easily explode the number of output rows:
100,000 x 100,000 = 10 billion
• Alternative to a full-blown Cartesian join (see the sketch below):
• Create an RDD of UID-by-UID pairs.
• Force a broadcast of the rows of the table.
• Call a UDF, given the UID-by-UID pair, to look up the table rows and perform your calculation.
• Time your calculation on a sample set to size your cluster.
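A rough sketch of that alternative, with hypothetical names: uid_pairs holds the UID-by-UID combinations, lookup_df is the table whose rows feed the calculation, and the multiplication stands in for whatever the real per-pair calculation is:

from pyspark.sql import functions as F

lookup = {row["uid"]: row["value"] for row in lookup_df.collect()}   # pull the table to the driver...
lookup_bc = spark.sparkContext.broadcast(lookup)                     # ...and broadcast it to the executors

@F.udf("double")
def pair_score(uid_a, uid_b):
    rows = lookup_bc.value
    return float(rows[uid_a] * rows[uid_b])                          # placeholder calculation

scored = uid_pairs.withColumn("score", pair_score("uid_a", "uid_b"))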
One To Many Join
• A single row in one table can map to many rows in the second table.
• This can explode the number of output rows.
• Not a problem if you use Parquet: the output files don't grow much, since the duplicated data encodes (compresses) well.
Theta Join
join_rdd = sqlContext.sql("""SELECT *
  FROM tableA
  JOIN tableB
  ON (keyA < keyB + 10)""")
• Spark SQL will consider each keyA against each keyB in the example above and loop to see whether the theta condition is met.
• Better solution: create buckets that keyA and keyB can be matched on (a bucketed variant is sketched below).
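A minimal sketch of the bucketing idea, assuming tableA and tableB are available as DataFrames and assuming a bounded band condition (for example abs(keyA - keyB) < 10) rather than the one-sided condition above; bucketing turns the comparison into an equi-join, and the exact condition is re-checked afterwards:

from pyspark.sql import functions as F

width = 10
bucketed_b = tableB.withColumn("bucket", F.floor(F.col("keyB") / width))
# Duplicate each tableA row into its own bucket and both neighbours, so the
# bucket match can be a plain equi-join instead of an all-pairs comparison.
bucketed_a = tableA.withColumn(
    "bucket",
    F.explode(F.array(
        F.floor(F.col("keyA") / width) - 1,
        F.floor(F.col("keyA") / width),
        F.floor(F.col("keyA") / width) + 1)))

candidates = bucketed_a.join(bucketed_b, "bucket")                         # equi-join on the bucket
result = candidates.filter(F.abs(F.col("keyA") - F.col("keyB")) < width)   # re-apply the exact condition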
Thank you
Questions?