WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Daniel Tomes, Databricks
Spark Core – Proper
Optimization
#UnifiedAnalytics #SparkAISummit
TEMPORARY – NOTE FOR REVIEW
I gave this talk at Summit US 2019. I will be
adding supplemental slides for a SparkUI deep dive
and Delta optimization depending on how many
people in the crowd have seen this presentation on
YT. If none, I will give the same presentation; if
many, I will do the first half as review and the second
half with new info. New info to come soon.
3#UnifiedAnalytics #SparkAISummit
Talking Points
• Spark Hierarchy
• The Spark UI
• Understanding Partitions
• Common Opportunities For Optimization
4#UnifiedAnalytics #SparkAISummit
Spark Hierarchy
5#UnifiedAnalytics #SparkAISummit
Spark Hierarchy
• Actions are eager
– Made of transformations (lazy)
• narrow
• wide (requires shuffle)
– Spawn jobs
• Spawn Stages
– Spawn Tasks
» Do work & utilize hardware
6#UnifiedAnalytics #SparkAISummit
Minimize Data Movement
Fewer Jobs
Fewer Stages
More Tasks
More Ops/Task
7#UnifiedAnalytics #SparkAISummit
Navigating The Spark UI
DEMO
8#UnifiedAnalytics #SparkAISummit
Get A Baseline
• Is your action efficient?
– Spills?
• CPU Utilization
– GANGLIA / YARN / Etc
– Tails
9#UnifiedAnalytics #SparkAISummit
Goal
Understand Your Hardware
• Core Count & Speed
• Memory Per Core (Working & Storage)
• Local Disk Type, Count, Size, & Speed
• Network Speed & Topology
• Data Lake Properties (rate limits)
• Cost / Core / Hour
– Financial For Cloud
– Opportunity for Shared & On Prem
10#UnifiedAnalytics #SparkAISummit
Minimize Data Scans (Lazy Load)
• Data Skipping
– HIVE Partitions
– Bucketing
• Only Experts – Nearly Impossible to Maintain
– Databricks Delta Z-Ordering
• What is It
• How To Do It
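A minimal Scala sketch of the two data-skipping approaches above, assuming an active SparkSession, a HIVE-partitioned Parquet source, and a Databricks Delta table; paths and column names are illustrative:

// Partition filter: pruning happens before any data is read
import spark.implicits._
val events = spark.read
  .parquet("/mnt/lake/events")              // HIVE-partitioned by event_date
  .filter($"event_date" === "2019-04-01")   // only matching directories are scanned

// Databricks Delta Z-Ordering: co-locate values so file-level stats
// can skip files on a non-partition column
spark.sql("OPTIMIZE delta.`/mnt/lake/events_delta` ZORDER BY (customer_id)")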
11#UnifiedAnalytics #SparkAISummit
12#UnifiedAnalytics #SparkAISummit
Without Partition Filter
With Partition Filter
Shrink Partition Range Using a Filter on the Partitioned Column
13#UnifiedAnalytics #SparkAISummit
Simple
Extra Shuffle Partitions
With Broadcast
Partitions – Definition
Each of a number of portions into which some
operating systems divide memory or storage
14#UnifiedAnalytics #SparkAISummit
HIVE PARTITION != SPARK PARTITION
Spark Partitions – Types
• Input
– Controls
• spark.default.parallelism (don’t use)
• spark.sql.files.maxPartitionBytes (mutable)
– assuming source has sufficient partitions
• Shuffle
– Control = partition count
• spark.sql.shuffle.partitions
• Output
– Control = stage partition count split by max records per file
• Coalesce(n) to shrink
• Repartition(n) to increase and/or balance (shuffle)
• df.write.option("maxRecordsPerFile", N)
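A quick Scala sketch touching each control above; the values and output path are illustrative only:

spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024) // input partition size
spark.conf.set("spark.sql.shuffle.partitions", 480L)                    // shuffle partition count
df.repartition(480)                                                     // or coalesce(n) to shrink
  .write.option("maxRecordsPerFile", 500000L)                           // cap records per output file
  .parquet("/mnt/lake/out")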
15#UnifiedAnalytics #SparkAISummit
Partitions – Right Sizing – Shuffle – Master Equation
• Largest Shuffle Stage
– Target Size <= 200 MB/partition
• Partition Count = Stage Input Data / Target Size
– Solve for Partition Count
EXAMPLE
Shuffle Stage Input = 210GB
x = 210000MB / 200MB = 1050
spark.conf.set("spark.sql.shuffle.partitions", 1050)
BUT -> If cluster has 2000 cores
spark.conf.set("spark.sql.shuffle.partitions", 2000)
16#UnifiedAnalytics #SparkAISummit
17#UnifiedAnalytics #SparkAISummit
Stage 21 -> Shuffle Fed By Stage 19 & 20
THUS
Stage 21 Shuffle Input = 45.4g + 8.6g == 54g
Default Shuffle Partition Count == 200 -> 54000mb / 200 parts =~ 270mb per shuffle partition
Cluster Spec
96 cores @ 7.625g/core
3.8125g Working Mem
3.8125g Storage Mem
18#UnifiedAnalytics #SparkAISummit
Cluster Spec
96 cores @ 7.625g/core
3.8125g Working Mem
3.8125g Storage Mem
480 shuffle partitions – WHY?
Target shuffle part size == 100m
p = 54g / 100m == 540
540p / 96 cores == 5.625
96 * 5 == 480
If p == 540, another 60 partitions must be loaded and
processed after the first 5 full waves of 96 complete
NO SPILL
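The same sizing logic as a Scala sketch (values match this example): round the partition count to a full multiple of the core count so every wave of tasks is full:

val shuffleInputMB   = 54000L   // ~54 GB shuffle stage input
val targetPartSizeMB = 100L     // target size per shuffle partition
val cores            = 96L      // total executor cores

val rawCount   = shuffleInputMB / targetPartSizeMB   // 540
val fullWaves  = math.max(1L, rawCount / cores)      // 5
val partitions = fullWaves * cores                   // 480
spark.conf.set("spark.sql.shuffle.partitions", partitions)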
19#UnifiedAnalytics #SparkAISummit
Partitions – Right Sizing (input)
• Use Spark Defaults (128MB) unless…
– Source Structure is not optimal (upstream)
– Remove Spills
– Increase Parallelism
– Heavily Nested/Repetitive Data
– UDFs
20#UnifiedAnalytics #SparkAISummit
Get DF Partitions
df.rdd.partitions.size
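A hedged sketch of tuning input partitions: lower spark.sql.files.maxPartitionBytes before the read to get more, smaller partitions (e.g. to remove spills or raise parallelism); the path and value are illustrative:

spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024) // 32 MB input partitions
val df = spark.read.parquet("/mnt/lake/events")
println(df.rdd.partitions.size)                                        // verify the new partition count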
21#UnifiedAnalytics #SparkAISummit
14.7g/452part == 32.3mb/part
spark.sql.files.maxPartitionBytes == 128MB
Parquet Consideration: Source must have sufficient row blocks
sc.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024 * 16)
df.write…
FIX THIS SLIDE
Partitions – Right Sizing (output)
• Write Once -> Read Many
– More Time to Write but Faster to Read
• Perfect writes limit parallelism
– Compactions (minor & major)
Write Data Size = 14.7GB
Desired File Size = 1500MB
Max stage parallelism = 10
96 – 10 == 86 cores idle during write
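The same trade-off as a Scala sketch (path and sizes illustrative): ~14.7 GB at ~1.5 GB per file means roughly 10 output partitions, so only ~10 cores participate in the write:

// Write once -> read many: few large files, low write parallelism
df.repartition(10).write.parquet("/mnt/lake/gold/table")

// Alternative: keep the write fully parallel (~0.16 GB files) and compact later
df.repartition(96).write.parquet("/mnt/lake/stage/table")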
22#UnifiedAnalytics #SparkAISummit
23#UnifiedAnalytics #SparkAISummit
Only 10 Cores Used
All 96 Cores Used
Average File Size == 1.5g
Average File Size == 0.16g
Partitions – Why So Serious?
• Avoid The Spill
– If (Partition Size > Working Memory Size) Spill
– If (Storage Memory Available) Spill to Memory
– If (Storage Memory Exhausted) Spill to Disk
– If (Local Disk Exhausted) Fail Job
• Maximize Parallelism
– Utilize All Cores
– Provision only the cores you need
24#UnifiedAnalytics #SparkAISummit
Balance
• Maximizing Resources Requires Balance
– Task Duration
– Partition Size
• SKEW
– When some partitions are significantly larger than
most
25#UnifiedAnalytics #SparkAISummit
Input Partitions
Shuffle Partitions
Output Files
Spills
GC Times
Straggling Tasks
26#UnifiedAnalytics #SparkAISummit
75th percentile ~ 2m recs
max ~ 45m recs
stragglers take > 22X longer IF no spillage
With spillage, 100X longer
Skew Join Optimization
• OSS Fix
– Add a salt column to the skewed (large) side with a
random int between 0 and a chosen salt count – 1
(e.g. up to spark.sql.shuffle.partitions – 1)
– Replicate the other side once per salt value and
include the salt column in the join clause
– Drop the temp columns from the result
(see the salting sketch below)
• Databricks Fix (Skew Join)
val skewedKeys = List("id1", "id200", "id-99")
df.join(
  skewDF.hint("skew", "skewKey", skewedKeys),
  Seq(keyCol), "inner")
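A minimal Scala sketch of the OSS salting fix referenced above, assuming a large skewed DataFrame largeDF and a smaller smallDF joined on a column named key (all names hypothetical):

import org.apache.spark.sql.functions._

val saltBuckets = 32 // tuning knob; the slide suggests up to spark.sql.shuffle.partitions

// Skewed side: a random salt spreads a hot key across many partitions
val saltedLarge = largeDF.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Other side: replicate each row once per salt value so every match survives
val saltedSmall = smallDF.withColumn("salt",
  explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

val joined = saltedLarge.join(saltedSmall, Seq("key", "salt"), "inner").drop("salt")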
27#UnifiedAnalytics #SparkAISummit
Minimize Data Scans (Persistence)
• Persistence
– Not Free
• Repetition
– SQL Plan
28#UnifiedAnalytics #SparkAISummit
df.cache == df.persist(StorageLevel.MEMORY_AND_DISK)
• Types
– Default (MEMORY_AND_DISK)
• Deserialized
– Deserialized = Faster = Bigger
– Serialized = Slower = Smaller
– _2 = Safety = 2X bigger
– MEMORY_ONLY
– DISK_ONLY
Don’t Forget To Cleanup!
df.unpersist
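A short persistence sketch (input path hypothetical): persist only what is reused, materialize it once, and release it when done:

import org.apache.spark.storage.StorageLevel

val lookups = spark.read.parquet("/mnt/lake/lookups")
lookups.persist(StorageLevel.MEMORY_AND_DISK)   // same default level as df.cache
lookups.count()                                 // materialize eagerly, once
// ... downstream joins/aggregations reuse lookups ...
lookups.unpersist()                             // free working & storage memory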
29#UnifiedAnalytics #SparkAISummit
30#UnifiedAnalytics #SparkAISummit
TPCDS Query 4
Minimize Data Scans (Delta Cache)
CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
31#UnifiedAnalytics #SparkAISummit
Why So Fast?
HOW TO USE
AWS - i3s – On By Default
AZURE – Ls-series – On By Default
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
Hot Data Auto Cached
Super Fast
Relieves Memory Pressure
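Sketch of using the settings above on a Databricks cluster: the sizing configs belong in the cluster's Spark config, enablement can also be toggled at runtime, and hot data can be pre-warmed (table and column names hypothetical):

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.sql("CACHE SELECT sale_id, amount FROM sales WHERE sale_date >= '2019-01-01'")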
Join Optimization
• Default = SortMergeJoin
• Broadcast Joins
– Automatic If:
(one side < spark.sql.autoBroadcastJoinThreshold) (default 10MB)
– Risks
• Not Enough Driver Memory
• DF > spark.driver.maxResultSize
• DF > Single Executor Available Working Memory
– Prod – Mitigate The Risks
• Checker Functions
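A hedged sketch of a broadcast join with a simple checker function; the size estimate uses optimizer plan statistics and the threshold is illustrative, not a Databricks API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

def broadcastIfSmall(dim: DataFrame, maxBytes: Long = 100L * 1024 * 1024): DataFrame = {
  // Rough optimizer estimate; fall back to a plain (sort-merge) join if too big
  val estimated = dim.queryExecution.optimizedPlan.stats.sizeInBytes
  if (estimated <= BigInt(maxBytes)) broadcast(dim) else dim
}

val joined = factDF.join(broadcastIfSmall(dimDF), Seq("key"), "inner")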
32#UnifiedAnalytics #SparkAISummit
Persistence Vs. Broadcast
33#UnifiedAnalytics #SparkAISummit
Attempt to send compute to the data
Data availability guaranteed ->
each executor has entire dataset
34#UnifiedAnalytics #SparkAISummit
126MB
270MB
Range Join Optimization
• Range Join Types
– Point In Interval Range Join
• Predicate specifies value in one relation that is between two values from the other relation
– Interval Overlap Range Join
• Predicate specifies an overlap of intervals between two values from each relation
REFERENCE
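A point-in-interval sketch (names hypothetical): each event timestamp must fall inside a session's [start_ts, end_ts) interval; on Databricks runtimes the range_join hint supplies a bin size in the same units as the join columns:

val joined = events
  .hint("range_join", 60)   // assumed Databricks bin-size hint, 60-second bins
  .join(sessions,
    events("event_ts") >= sessions("start_ts") &&
    events("event_ts") <  sessions("end_ts"))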
35#UnifiedAnalytics #SparkAISummit
Omit Expensive Ops
• Repartition
– Use Coalesce or Shuffle Partition Count
• Count – Do you really need it?
• DistinctCount
– use approxCountDistinct()
• If distincts are required, put them in the right place
– Use dropDuplicates
– dropDuplicates BEFORE the join
– dropDuplicates BEFORE the groupBy
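A quick sketch of the cheaper alternatives above (column names hypothetical):

import org.apache.spark.sql.functions.approx_count_distinct

// Approximate distinct count instead of an exact countDistinct
val users = df.agg(approx_count_distinct("user_id", 0.05)).first().getLong(0)

// Deduplicate BEFORE the wide operations so less data is shuffled
val dedup  = df.dropDuplicates("user_id", "event_date")
val result = dedup.join(dimDF, Seq("user_id")).groupBy("event_date").count()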
36#UnifiedAnalytics #SparkAISummit
UDF Penalties
• Traditional UDFs cannot use Tungsten
– Use org.apache.spark.sql.functions
– Use PandasUDFs
• Utilizes Apache Arrow
– Use SparkR UDFs
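A small sketch of replacing a row-at-a-time Scala UDF with built-in, Tungsten-friendly functions (column names hypothetical):

import org.apache.spark.sql.functions._

// Instead of: val cleanUdf = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)
val cleaned = df.withColumn("name_clean", lower(trim(coalesce(col("name"), lit("")))))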
37#UnifiedAnalytics #SparkAISummit
Source Data Format
• File Size
• Splittable
• Compressed
• HIVE Partitioned
38#UnifiedAnalytics #SparkAISummit
QUESTIONS
39#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT