Azure Databricks – performance tuning
Adrian Chodkowski
About me
• Adrian Chodkowski
• Microsoft Data Platform MVP
• Data Engineer & Architect at Elitmind
• Specialization: Microsoft data platform (Azure & on-premises)
• Data Community
• seequality.net
• Adrian.Chodkowski@outlook.com
• Twitter: @Adrian_SQL
• LinkedIn: http://tinyurl.com/adrian-sql
AGENDA
• Databricks disk cache
• AutoLoader
• Static & Dynamic Partition pruning
• File pruning
• Z-Ordering
• Additional tips & summary
Level: 300
Databricks disk cache
aka Delta cache, DBIO
Disk caching
• The disk cache is a proprietary Databricks feature, previously called Delta Cache and DBIO,
• It is different from the Spark cache!
• Works with Parquet and Delta,
• Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in the nodes' local storage,
• Successive reads of the same data are then performed locally,
• Once enabled, it is managed automatically, or caching can be forced with the CACHE SELECT syntax (see the sketch below),
• You benefit most from cache-accelerated worker instance types.
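A minimal sketch of enabling the cache and prewarming it explicitly; the table and filter names are hypothetical:

# Enable the Databricks disk cache for this session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Prewarm the cache for a hypothetical Delta table; subsequent reads
# of these files are served from the workers' local storage.
spark.sql("CACHE SELECT * FROM sales WHERE event_date >= '2023-01-01'")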
Disk cache vs Spark cache
• Prefer the disk cache!
Disk caching – configuration
• spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
• Be careful with autoscaling:
• spark.databricks.io.cache.maxDiskUsage – disk space per node reserved for cached data, in bytes,
• spark.databricks.io.cache.maxMetaDataCache – disk space per node reserved for cached metadata, in bytes,
• spark.databricks.io.cache.compression.enabled – whether the cached data is stored in compressed format.
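A sizing sketch under illustrative values; tune them to the node's local disk capacity:

# Reserve 50 GB per node for cached data and 1 GB for cached metadata;
# the values here are illustrative, not recommendations.
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")
spark.conf.set("spark.databricks.io.cache.compression.enabled", "false")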
Ingestion
with Autoloader
Autoloader
• Functionality based on Structured Streaming that identifies and loads new files,
• Works in two modes:
• Directory Listing – internally lists the structure of files and partitions; can be done incrementally,
• File Notification – leverages Event Grid and storage queues to track the appearance of new files,
• Avoid overwriting files,
• Have a backfill strategy defined!
• In many cases consider it the default loading mechanism (a minimal sketch follows)!
• There is also COPY INTO!
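A minimal Auto Loader sketch; the paths, file format, and target table are hypothetical placeholders, and spark is the session predefined in Databricks notebooks:

# Incrementally ingest new JSON files from a landing zone into a Delta table.
# Auto Loader tracks already-processed files, so reruns pick up only new data.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
    .load("/mnt/landing/events")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("bronze.events"))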
Partitioning
Dynamic Pruning & Design
Partitioning
• Division of a table into a hierarchy of folders,
• Can be beneficial when reading data (partition discovery) and when optimizing (OPTIMIZE ... WHERE on the partition key),
• Databricks recommendation:
• don't use it for tables smaller than 1 TB,
• each partition should be >1 GB,
• When reading with a filter on the partition key, only the matching partitions are read (the rest are skipped),
• Can be created using the PARTITIONED BY or PARTITION keywords (see the sketch below the folder layout),
• Don't use CREATE TABLE ... PARTITIONED BY ... AS SELECT – it adds Hive overhead and can take ages to finish!
Year=2003/
├─ Month=01/  *.parquet, *.parquet
├─ Month=02/  *.parquet, *.parquet
├─ Month=03/  *.parquet, *.parquet
└─ Month=04/  *.parquet, *.parquet
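In line with the warning above, a sketch that creates the partitioned table first and loads it in a separate step; the table and column names are hypothetical:

# Create the partitioned Delta table first...
spark.sql("""
CREATE TABLE IF NOT EXISTS sales (
    id BIGINT,
    amount DOUBLE,
    year INT,
    month INT
) USING DELTA
PARTITIONED BY (year, month)
""")

# ...then populate it separately instead of CREATE TABLE ... AS SELECT.
spark.sql("INSERT INTO sales SELECT id, amount, year, month FROM staging_sales")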
Dynamic Partition Pruning
• Especially useful in star schemas,
• Introduced in Spark 3.0,
• Let's assume the query:
SELECT d.c1, f.c2
FROM Fact AS f
JOIN Dimension AS d
  ON f.join_key = d.join_key
WHERE d.c2 = 10
• Since there is a filter on the dimension table (d.c2 = 10), DPP can internally create a subquery:
SELECT d.join_key FROM Dimension AS d WHERE d.c2 = 10;
• and then broadcast this subquery's result, so it can be used to prune partitions of the Fact table.
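DPP needs no special setup; it is governed by a stock Spark 3.x flag, which a quick sanity check can confirm:

# Dynamic partition pruning is enabled by default in Spark 3.x.
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))  # expect "true"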
File pruning
By using stats
File pruning
• In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible,
• Delta Lake automatically collects metadata about data files, so files can be skipped without accessing the data files themselves,
• Statistics are collected for the first 32 columns (configurable – see the sketch after the example).
SELECT * FROM Dimension AS d
WHERE d.category_id IN (1, 2, 5);

File   | Column      | Min | Max
File 1 | Category_id | 1   | 2
File 2 | Category_id | 1   | 10
File 3 | Category_id | 6   | 10
File 4 | Category_id | 8   | 20
File 5 | Category_id | 4   | 100
File 6 | Category_id | 1   | 1

Given these min/max statistics, File 3 (6–10) and File 4 (8–20) cannot contain the values 1, 2, or 5, so only Files 1, 2, 5, and 6 are read.
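The 32-column statistics limit is a per-table Delta property; a minimal sketch of narrowing it (the table name is hypothetical, and the right value depends on your query patterns):

# Collect file-level min/max statistics only for the first 5 columns;
# keep frequently filtered columns within that prefix.
spark.sql("""
ALTER TABLE dimension
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
""")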
Z-Ordering
By using OPTIMIZE
OPTIMIZE Z-Order
• OPTIMIZE optimizes the layout of Delta Lake data files,
• By default, OPTIMIZE sets the max file size to 1 GB (1073741824 bytes),
• You can control the size using spark.databricks.delta.optimize.maxFileSize,
• It can run as bin-packing or as column-based (Z-order) optimization:
• Bin-packing – an idempotent technique that aims to produce evenly balanced data files with respect to their size on disk (not the number of rows),
• Z-ordering – a non-idempotent technique that aims to produce evenly balanced data files with respect to the number of rows (not size on disk).
OPTIMIZE events

OPTIMIZE events WHERE date >= '2017-01-01'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

(The WHERE clause of OPTIMIZE accepts only partition-key predicates; date is assumed here to be the partition column.)
OPTIMIZE Z-Order
Bin-packing example – input files File 1 (800 MB), File 2 (2200 MB), File 3 (900 MB), File 4 (300 MB), and File 5 (100 MB) are rewritten into File 1 (1 GB), File 2 (1 GB), and File 3 (1 GB).

Z-ordering example – file statistics before:

File   | Column      | Min | Max
File 1 | Category_id | 1   | 2
File 2 | Category_id | 1   | 10
File 3 | Category_id | 6   | 10
File 4 | Category_id | 8   | 20
File 5 | Category_id | 4   | 100

OPTIMIZE table ZORDER BY (category_id)

File statistics after:

File   | Column      | Min | Max
File 1 | Category_id | 1   | 10
File 2 | Category_id | 11  | 20
File 3 | Category_id | 20  | 100
Dynamic File pruning
• Files can be skipped based on join values, not just literal values,
• To make it happen, the following requirements must be met (a configuration sketch follows this list):
• The inner (probe) table being joined is in Delta format,
• The join type is INNER or LEFT SEMI,
• The join strategy is BROADCAST HASH JOIN,
• The number of files in the inner table is greater than spark.databricks.optimizer.deltaTableFilesThreshold (default 1000),
• spark.databricks.optimizer.dynamicFilePruning should be true (default),
• The size of the inner table should be greater than spark.databricks.optimizer.deltaTableSizeThreshold (default 10 GB).
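A sketch of inspecting and, for experimentation only, lowering these thresholds so DFP also fires on small test tables:

# Dynamic file pruning is enabled by default on Databricks.
print(spark.conf.get("spark.databricks.optimizer.dynamicFilePruning"))  # expect "true"

# For experiments on small test tables only – not a production recommendation.
spark.conf.set("spark.databricks.optimizer.deltaTableFilesThreshold", "10")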
Other
• Use the newest Databricks Runtime,
• For fast UPDATE/MERGE, rewrite the least amount of files: set spark.databricks.delta.optimize.maxFileSize to 16–128 MB, or turn on optimized writes (see the sketch below),
• Don't use Python or Scala UDFs if a native function exists – transferring data between Python and Spark requires serialization, which drastically slows down queries,
• Move numeric columns, keys, and high-cardinality query predicates to the left; move long strings that are not distinct enough for stats collection to the right (only the first 32 columns have statistics),
• OPTIMIZE benefits from compute-optimized clusters (because of all the Parquet encoding and decoding),
• Think about spot instances,
• For some operations, consider the Databricks Standard tier.
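Optimized writes can be enabled per table via a Delta table property; a minimal sketch (the table name is hypothetical):

# Coalesce many small writes into fewer, larger files at write time.
spark.sql("""
ALTER TABLE sales
SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
""")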
Adaptive Query Execution
• A game changer in Spark 3.x,
• For example: initially SortMergeJoin is chosen, but once the ingest stages complete, the plan is updated to use BroadcastHashJoin.
Other
• Turn on Adaptive Query Execution (spark.sql.adaptive.enabled – on by default),
• Turn on partition coalescing (spark.sql.adaptive.coalescePartitions.enabled),
• Turn on skew-join handling (spark.sql.adaptive.skewJoin.enabled),
• Turn on the local shuffle reader (spark.sql.adaptive.localShuffleReader.enabled),
• Tune the broadcast join threshold (spark.sql.autoBroadcastJoinThreshold),
• Don't prefer SortMergeJoin: set spark.sql.join.preferSortMergeJoin to false.
A consolidated sketch of these settings follows.
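The broadcast threshold value below is illustrative, not a recommendation:

# AQE and its sub-features – stock Spark 3.x flags, mostly on by default.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

# Broadcast tables up to ~100 MB and de-prioritize sort-merge joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")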
• Turn off stats collection where it adds no value: dataSkippingNumIndexedCols = 0,
• OPTIMIZE ZORDER BY the merge keys between Bronze & Silver,
• Turn on Optimized Writes,
• Restructure columns for data skipping,
• OPTIMIZE ZORDER BY join keys or high-cardinality columns used in WHERE,
• Turn on Optimized Writes,
• Enable the Databricks IO cache,
• Consider using Photon,
• Consider using Premium Storage,
• Build pre-aggregated tables.
Thank you!