WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Daniel Tomes, Databricks
Spark Core – Proper
Optimization
#UnifiedAnalytics #SparkAISummit
TEMPORARY – NOTE FOR REVIEW
I gave this talk at Summit US 2019. I will be
adding supplemental slides for a SparkUI deep dive
and Delta optimization depending on how many
people in the crowd have seen this presentation on
YT. If none, I will give the same presentation; if
many, I will do the first half as review and the second
half with new info. New info to come soon.
3#UnifiedAnalytics #SparkAISummit
Talking Points
• Spark Hierarchy
• The Spark UI
• Understanding Partitions
• Common Opportunities For Optimization
4#UnifiedAnalytics #SparkAISummit
Spark Hierarchy
5#UnifiedAnalytics #SparkAISummit
Spark Hierarchy
• Actions are eager
– Made of transformations (lazy)
• narrow
• wide (requires shuffle)
– Spawn jobs
• Spawn Stages
– Spawn Tasks
» Do work & utilize hardware
6#UnifiedAnalytics #SparkAISummit
Minimize Data Movement
Fewer Jobs
Fewer Stages
More Tasks
More Ops/Task
7#UnifiedAnalytics #SparkAISummit
Navigating The Spark UI
DEMO
8#UnifiedAnalytics #SparkAISummit
Get A Baseline
• Is your action efficient?
– Spills?
• CPU Utilization
– GANGLIA / YARN / Etc
– Tails
9#UnifiedAnalytics #SparkAISummit
Goal
Understand Your Hardware
• Core Count & Speed
• Memory Per Core (Working & Storage)
• Local Disk Type, Count, Size, & Speed
• Network Speed & Topology
• Data Lake Properties (rate limits)
• Cost / Core / Hour
– Financial For Cloud
– Opportunity for Shared & On Prem
10#UnifiedAnalytics #SparkAISummit
Minimize Data Scans (Lazy Load)
• Data Skipping
– HIVE Partitions
– Bucketing
• Only Experts – Nearly Impossible to Maintain
– Databricks Delta Z-Ordering
• What is It
• How To Do It
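A minimal Scala sketch of the two data-skipping approaches above, assuming an active SparkSession, a HIVE-partitioned Parquet source, and a Databricks Delta table; paths and column names are illustrative:

// Partition filter: pruning happens before any data is read
import spark.implicits._
val events = spark.read
  .parquet("/mnt/lake/events")              // HIVE-partitioned by event_date
  .filter($"event_date" === "2019-04-01")   // only matching directories are scanned

// Databricks Delta Z-Ordering: co-locate values so file-level stats
// can skip files on a non-partition column
spark.sql("OPTIMIZE delta.`/mnt/lake/events_delta` ZORDER BY (customer_id)")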
11#UnifiedAnalytics #SparkAISummit
12#UnifiedAnalytics #SparkAISummit
Without Partition Filter
With Partition Filter
Shrink Partition Range Using a Filter on the Partitioned Column
13#UnifiedAnalytics #SparkAISummit
Simple
Extra Shuffle Partitions
With Broadcast
Partitions – Definition
Each of a number of portions into which some
operating systems divide memory or storage
14#UnifiedAnalytics #SparkAISummit
HIVE PARTITION != SPARK PARTITION
Spark Partitions – Types
• Input
– Controls
• spark.default.parallelism (don’t use)
• spark.sql.files.maxPartitionBytes (mutable)
– assuming source has sufficient partitions
• Shuffle
– Control = partition count
• spark.sql.shuffle.partitions
• Output
– Control = stage partition count split by max records per file
• Coalesce(n) to shrink
• Repartition(n) to increase and/or balance (shuffle)
• df.write.option("maxRecordsPerFile", N)
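A quick Scala sketch touching each control above; the values and output path are illustrative only:

spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024) // input partition size
spark.conf.set("spark.sql.shuffle.partitions", 480L)                    // shuffle partition count
df.repartition(480)                                                     // or coalesce(n) to shrink
  .write.option("maxRecordsPerFile", 500000L)                           // cap records per output file
  .parquet("/mnt/lake/out")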
15#UnifiedAnalytics #SparkAISummit
Partitions – Right Sizing – Shuffle – Master Equation
• Largest Shuffle Stage
– Target Size <= 200 MB/partition
• Partition Count = Stage Input Data / Target Size
– Solve for Partition Count
EXAMPLE
Shuffle Stage Input = 210GB
x = 210000MB / 200MB = 1050
spark.conf.set("spark.sql.shuffle.partitions", 1050)
BUT -> If cluster has 2000 cores
spark.conf.set("spark.sql.shuffle.partitions", 2000)
16#UnifiedAnalytics #SparkAISummit
17#UnifiedAnalytics #SparkAISummit
Stage 21 -> Shuffle Fed By Stage 19 & 20
THUS
Stage 21 Shuffle Input = 45.4g + 8.6g == 54g
Default Shuffle Partition Count == 200 -> 54000mb / 200 parts =~ 270mb per shuffle partition
Cluster Spec
96 cores @ 7.625g/core
3.8125g Working Mem
3.8125g Storage Mem
18#UnifiedAnalytics #SparkAISummit
Cluster Spec
96 cores @ 7.625g/core
3.8125g Working Mem
3.8125g Storage Mem
480 shuffle partitions – WHY?
Target shuffle part size == 100m
p = 54g / 100m == 540
540p / 96 cores == 5.625
96 * 5 == 480
If p == 540, another 60 partitions must be loaded and
processed after the first 5 full waves of 96 complete
NO SPILL
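The same sizing logic as a Scala sketch (values match this example): round the partition count to a full multiple of the core count so every wave of tasks is full:

val shuffleInputMB   = 54000L   // ~54 GB shuffle stage input
val targetPartSizeMB = 100L     // target size per shuffle partition
val cores            = 96L      // total executor cores

val rawCount   = shuffleInputMB / targetPartSizeMB   // 540
val fullWaves  = math.max(1L, rawCount / cores)      // 5
val partitions = fullWaves * cores                   // 480
spark.conf.set("spark.sql.shuffle.partitions", partitions)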
19#UnifiedAnalytics #SparkAISummit
Partitions – Right Sizing (input)
• Use Spark Defaults (128MB) unless…
– Source Structure is not optimal (upstream)
– Remove Spills
– Increase Parallelism
– Heavily Nested/Repetitive Data
– UDFs
20#UnifiedAnalytics #SparkAISummit
Get DF Partitions
df.rdd.partitions.size
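A hedged sketch of tuning input partitions: lower spark.sql.files.maxPartitionBytes before the read to get more, smaller partitions (e.g. to remove spills or raise parallelism); the path and value are illustrative:

spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024) // 32 MB input partitions
val df = spark.read.parquet("/mnt/lake/events")
println(df.rdd.partitions.size)                                        // verify the new partition count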
21#UnifiedAnalytics #SparkAISummit
14.7g/452part == 32.3mb/part
spark.sql.files.maxPartitionBytes == 128MB
Parquet Consideration: Source must have sufficient row blocks
sc.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024 * 16)
df.write…
FIX THIS SLIDE
Partitions – Right Sizing (output)
• Write Once -> Read Many
– More Time to Write but Faster to Read
• Perfect writes limit parallelism
– Compactions (minor & major)
Write Data Size = 14.7GB
Desired File Size = 1500MB
Max stage parallelism = 10
96 – 10 == 86 cores idle during write
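The same trade-off as a Scala sketch (path and sizes illustrative): ~14.7 GB at ~1.5 GB per file means roughly 10 output partitions, so only ~10 cores participate in the write:

// Write once -> read many: few large files, low write parallelism
df.repartition(10).write.parquet("/mnt/lake/gold/table")

// Alternative: keep the write fully parallel (~0.16 GB files) and compact later
df.repartition(96).write.parquet("/mnt/lake/stage/table")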
22#UnifiedAnalytics #SparkAISummit
23#UnifiedAnalytics #SparkAISummit
Only 10 Cores Used
All 96 Cores Used
Average File Size == 1.5g
Average File Size == 0.16g
Partitions – Why So Serious?
• Avoid The Spill
– If (Partition Size > Working Memory Size) Spill
– If (Storage Memory Available) Spill to Memory
– If (Storage Memory Exhausted) Spill to Disk
– If (Local Disk Exhausted) Fail Job
• Maximize Parallelism
– Utilize All Cores
– Provision only the cores you need
24#UnifiedAnalytics #SparkAISummit
Balance
• Maximizing Resources Requires Balance
– Task Duration
– Partition Size
• SKEW
– When some partitions are significantly larger than
most
25#UnifiedAnalytics #SparkAISummit
Input Partitions
Shuffle Partitions
Output Files
Spills
GC Times
Straggling Tasks
26#UnifiedAnalytics #SparkAISummit
75th percentile ~ 2m recs
max ~ 45m recs
stragglers take > 22X longer IF no spillage
With spillage, 100X longer
Skew Join Optimization
• OSS Fix
– Add a salt column to the skewed (large) side with a
random int between 0 and a chosen salt count – 1
(e.g. up to spark.sql.shuffle.partitions – 1)
– Replicate the other side once per salt value and
include the salt column in the join clause
– Drop the temp columns from the result
(see the salting sketch below)
• Databricks Fix (Skew Join)
val skewedKeys = List("id1", "id200", "id-99")
df.join(
  skewDF.hint("skew", "skewKey", skewedKeys),
  Seq(keyCol), "inner")
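A minimal Scala sketch of the OSS salting fix referenced above, assuming a large skewed DataFrame largeDF and a smaller smallDF joined on a column named key (all names hypothetical):

import org.apache.spark.sql.functions._

val saltBuckets = 32 // tuning knob; the slide suggests up to spark.sql.shuffle.partitions

// Skewed side: a random salt spreads a hot key across many partitions
val saltedLarge = largeDF.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Other side: replicate each row once per salt value so every match survives
val saltedSmall = smallDF.withColumn("salt",
  explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

val joined = saltedLarge.join(saltedSmall, Seq("key", "salt"), "inner").drop("salt")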
27#UnifiedAnalytics #SparkAISummit
Minimize Data Scans (Persistence)
• Persistence
– Not Free
• Repetition
– SQL Plan
28#UnifiedAnalytics #SparkAISummit
df.cache == df.persist(StorageLevel.MEMORY_AND_DISK)
• Types
– Default (MEMORY_AND_DISK)
• Deserialized
– Deserialized = Faster = Bigger
– Serialized = Slower = Smaller
– _2 = Safety = 2X bigger
– MEMORY_ONLY
– DISK_ONLY
Don’t Forget To Cleanup!
df.unpersist
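A short persistence sketch (input path hypothetical): persist only what is reused, materialize it once, and release it when done:

import org.apache.spark.storage.StorageLevel

val lookups = spark.read.parquet("/mnt/lake/lookups")
lookups.persist(StorageLevel.MEMORY_AND_DISK)   // same default level as df.cache
lookups.count()                                 // materialize eagerly, once
// ... downstream joins/aggregations reuse lookups ...
lookups.unpersist()                             // free working & storage memory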
29#UnifiedAnalytics #SparkAISummit
30#UnifiedAnalytics #SparkAISummit
TPCDS Query 4
Minimize Data Scans (Delta Cache)
CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
31#UnifiedAnalytics #SparkAISummit
Why So Fast?
HOW TO USE
AWS - i3s – On By Default
AZURE – Ls-series – On By Default
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
Hot Data Auto Cached
Super Fast
Relieves Memory Pressure
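Sketch of using the settings above on a Databricks cluster: the sizing configs belong in the cluster's Spark config, enablement can also be toggled at runtime, and hot data can be pre-warmed (table and column names hypothetical):

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.sql("CACHE SELECT sale_id, amount FROM sales WHERE sale_date >= '2019-01-01'")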
Join Optimization
• Default = SortMergeJoin
• Broadcast Joins
– Automatic If:
(one side < spark.sql.autoBroadcastJoinThreshold) (default 10MB)
– Risks
• Not Enough Driver Memory
• DF > spark.driver.maxResultSize
• DF > Single Executor Available Working Memory
– Prod – Mitigate The Risks
• Checker Functions
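A hedged sketch of a broadcast join with a simple checker function; the size estimate uses optimizer plan statistics and the threshold is illustrative, not a Databricks API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

def broadcastIfSmall(dim: DataFrame, maxBytes: Long = 100L * 1024 * 1024): DataFrame = {
  // Rough optimizer estimate; fall back to a plain (sort-merge) join if too big
  val estimated = dim.queryExecution.optimizedPlan.stats.sizeInBytes
  if (estimated <= BigInt(maxBytes)) broadcast(dim) else dim
}

val joined = factDF.join(broadcastIfSmall(dimDF), Seq("key"), "inner")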
32#UnifiedAnalytics #SparkAISummit
Persistence Vs. Broadcast
33#UnifiedAnalytics #SparkAISummit
Attempt to send compute to the data
Data availability guaranteed ->
each executor has entire dataset
34#UnifiedAnalytics #SparkAISummit
126MB
270MB
Range Join Optimization
• Range Join Types
– Point In Interval Range Join
• Predicate specifies value in one relation that is between two values from the other relation
– Interval Overlap Range Join
• Predicate specifies an overlap of intervals between two values from each relation
REFERENCE
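A point-in-interval sketch (names hypothetical): each event timestamp must fall inside a session's [start_ts, end_ts) interval; on Databricks runtimes the range_join hint supplies a bin size in the same units as the join columns:

val joined = events
  .hint("range_join", 60)   // assumed Databricks bin-size hint, 60-second bins
  .join(sessions,
    events("event_ts") >= sessions("start_ts") &&
    events("event_ts") <  sessions("end_ts"))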
35#UnifiedAnalytics #SparkAISummit
Omit Expensive Ops
• Repartition
– Use Coalesce or Shuffle Partition Count
• Count – Do you really need it?
• DistinctCount
– use approxCountDistinct()
• If distincts are required, put them in the right place
– Use dropDuplicates
– dropDuplicates BEFORE the join
– dropDuplicates BEFORE the groupBy
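A quick sketch of the cheaper alternatives above (column names hypothetical):

import org.apache.spark.sql.functions.approx_count_distinct

// Approximate distinct count instead of an exact countDistinct
val users = df.agg(approx_count_distinct("user_id", 0.05)).first().getLong(0)

// Deduplicate BEFORE the wide operations so less data is shuffled
val dedup  = df.dropDuplicates("user_id", "event_date")
val result = dedup.join(dimDF, Seq("user_id")).groupBy("event_date").count()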
36#UnifiedAnalytics #SparkAISummit
UDF Penalties
• Traditional UDFs cannot use Tungsten
– Use org.apache.spark.sql.functions
– Use PandasUDFs
• Utilizes Apache Arrow
– Use SparkR UDFs
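A small sketch of replacing a row-at-a-time Scala UDF with built-in, Tungsten-friendly functions (column names hypothetical):

import org.apache.spark.sql.functions._

// Instead of: val cleanUdf = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)
val cleaned = df.withColumn("name_clean", lower(trim(coalesce(col("name"), lit("")))))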
37#UnifiedAnalytics #SparkAISummit
Source Data Format
• File Size
• Splittable
• Compressed
• HIVE Partitioned
38#UnifiedAnalytics #SparkAISummit
QUESTIONS
39#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT