Azure Databricks – performance tuning
Adrian Chodkowski
About me
• Adrian Chodkowski
• Microsoft Data Platform MVP
• Data Engineer & Architect at Elitmind
• Specialization: Microsoft data platform (Azure & on-premises)
• Data Community
• seequality.net
• Adrian.Chodkowski@outlook.com
• Twitter: @Adrian_SQL
• LinkedIn: http://tinyurl.com/adrian-sql
AGENDA
• Databricks disk cache
• AutoLoader
• Static & Dynamic Partition pruning
• File pruning
• Z-Ordering
• Additional tips & summary
Level: 300
Databricks disk cache
aka Delta cache, DBIO
Disk caching
• The disk cache is a proprietary Databricks feature, previously called Delta Cache and DBIO,
• It is different from the Spark cache!
• Works with Parquet and Delta,
• Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in the nodes' local storage,
• Successive reads of the same data are then performed locally,
• Once enabled, it is managed automatically, or caching can be forced with the CACHE SELECT syntax (see the sketch below),
• You benefit most from cache-accelerated worker instance types.
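A minimal sketch of enabling the cache and prewarming it explicitly; the table and filter names are hypothetical:

# Enable the Databricks disk cache for this session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Prewarm the cache for a hypothetical Delta table; subsequent reads
# of these files are served from the workers' local storage.
spark.sql("CACHE SELECT * FROM sales WHERE event_date >= '2023-01-01'")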
Disk cache vs Spark cache
• Prefer the disk cache!
Disk caching – configuration
• spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
• Be careful with autoscaling:
• spark.databricks.io.cache.maxDiskUsage – disk space per node reserved for cached data, in bytes,
• spark.databricks.io.cache.maxMetaDataCache – disk space per node reserved for cached metadata, in bytes,
• spark.databricks.io.cache.compression.enabled – whether the cached data is stored in compressed format.
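A sizing sketch under illustrative values; tune them to the node's local disk capacity:

# Reserve 50 GB per node for cached data and 1 GB for cached metadata;
# the values here are illustrative, not recommendations.
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")
spark.conf.set("spark.databricks.io.cache.compression.enabled", "false")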
Ingestion
with Autoloader
Autoloader
• Functionality based on Structured Streaming that identifies and loads new files,
• Works in two modes:
• Directory Listing – internally lists the structure of files and partitions; can be done incrementally,
• File Notification – leverages Event Grid and storage queues to track the appearance of new files,
• Avoid overwriting files,
• Have a backfill strategy defined!
• In many cases consider it the default loading mechanism (a minimal sketch follows)!
• There is also COPY INTO!
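A minimal Auto Loader sketch; the paths, file format, and target table are hypothetical placeholders, and spark is the session predefined in Databricks notebooks:

# Incrementally ingest new JSON files from a landing zone into a Delta table.
# Auto Loader tracks already-processed files, so reruns pick up only new data.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
    .load("/mnt/landing/events")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("bronze.events"))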
Partitioning
Dynamic Pruning & Design
Partitioning
• Division of a table into a hierarchy of folders,
• Can be beneficial when reading data (partition discovery) and when optimizing (OPTIMIZE ... WHERE on the partition key),
• Databricks recommendation:
• don't use it for tables smaller than 1 TB,
• each partition should be >1 GB,
• When reading with a filter on the partition key, only the matching partitions are read (the rest are skipped),
• Can be created using the PARTITIONED BY or PARTITION keywords (see the sketch below the folder layout),
• Don't use CREATE TABLE ... PARTITIONED BY ... AS SELECT – it adds Hive overhead and can take ages to finish!
Year=2003/
├─ Month=01/  *.parquet, *.parquet
├─ Month=02/  *.parquet, *.parquet
├─ Month=03/  *.parquet, *.parquet
└─ Month=04/  *.parquet, *.parquet
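In line with the warning above, a sketch that creates the partitioned table first and loads it in a separate step; the table and column names are hypothetical:

# Create the partitioned Delta table first...
spark.sql("""
CREATE TABLE IF NOT EXISTS sales (
    id BIGINT,
    amount DOUBLE,
    year INT,
    month INT
) USING DELTA
PARTITIONED BY (year, month)
""")

# ...then populate it separately instead of CREATE TABLE ... AS SELECT.
spark.sql("INSERT INTO sales SELECT id, amount, year, month FROM staging_sales")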
Dynamic Partition Pruning
• Especially useful in star schemas,
• Introduced in Spark 3.0,
• Let's assume the query:
SELECT d.c1, f.c2
FROM Fact AS f
JOIN Dimension AS d
  ON f.join_key = d.join_key
WHERE d.c2 = 10
• Since there is a filter on the dimension table (d.c2 = 10), DPP can internally create a subquery:
SELECT d.join_key FROM Dimension AS d WHERE d.c2 = 10;
• and then broadcast this subquery's result, so it can be used to prune partitions of the Fact table.
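DPP needs no special setup; it is governed by a stock Spark 3.x flag, which a quick sanity check can confirm:

# Dynamic partition pruning is enabled by default in Spark 3.x.
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))  # expect "true"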
File pruning
By using stats
File pruning
• In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible,
• Delta Lake automatically collects metadata about data files, so files can be skipped without accessing the data files themselves,
• Statistics are collected for the first 32 columns (configurable – see the sketch after the example).
SELECT * FROM Dimension AS d
WHERE d.category_id IN (1, 2, 5);

File   | Column      | Min | Max
File 1 | Category_id | 1   | 2
File 2 | Category_id | 1   | 10
File 3 | Category_id | 6   | 10
File 4 | Category_id | 8   | 20
File 5 | Category_id | 4   | 100
File 6 | Category_id | 1   | 1

Given these min/max statistics, File 3 (6–10) and File 4 (8–20) cannot contain the values 1, 2, or 5, so only Files 1, 2, 5, and 6 are read.
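The 32-column statistics limit is a per-table Delta property; a minimal sketch of narrowing it (the table name is hypothetical, and the right value depends on your query patterns):

# Collect file-level min/max statistics only for the first 5 columns;
# keep frequently filtered columns within that prefix.
spark.sql("""
ALTER TABLE dimension
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
""")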
Z-Ordering
By using OPTIMIZE
OPTIMIZE Z-Order
• OPTIMIZE optimizes the layout of Delta Lake data files,
• By default, OPTIMIZE sets the max file size to 1 GB (1073741824 bytes),
• You can control the size using spark.databricks.delta.optimize.maxFileSize,
• It can run as bin-packing or as column-based (Z-order) optimization:
• Bin-packing – an idempotent technique that aims to produce evenly balanced data files with respect to their size on disk (not the number of rows),
• Z-ordering – a non-idempotent technique that aims to produce evenly balanced data files with respect to the number of rows (not size on disk).
OPTIMIZE events

OPTIMIZE events WHERE date >= '2017-01-01'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

(The WHERE clause of OPTIMIZE accepts only partition-key predicates; date is assumed here to be the partition column.)
OPTIMIZE Z-Order
Bin-packing example – input files File 1 (800 MB), File 2 (2200 MB), File 3 (900 MB), File 4 (300 MB), and File 5 (100 MB) are rewritten into File 1 (1 GB), File 2 (1 GB), and File 3 (1 GB).

Z-ordering example – file statistics before:

File   | Column      | Min | Max
File 1 | Category_id | 1   | 2
File 2 | Category_id | 1   | 10
File 3 | Category_id | 6   | 10
File 4 | Category_id | 8   | 20
File 5 | Category_id | 4   | 100

OPTIMIZE table ZORDER BY (category_id)

File statistics after:

File   | Column      | Min | Max
File 1 | Category_id | 1   | 10
File 2 | Category_id | 11  | 20
File 3 | Category_id | 20  | 100
Dynamic File pruning
• Files can be skipped based on join values, not just literal values,
• To make it happen, the following requirements must be met (a configuration sketch follows this list):
• The inner (probe) table being joined is in Delta format,
• The join type is INNER or LEFT SEMI,
• The join strategy is BROADCAST HASH JOIN,
• The number of files in the inner table is greater than spark.databricks.optimizer.deltaTableFilesThreshold (default 1000),
• spark.databricks.optimizer.dynamicFilePruning should be true (default),
• The size of the inner table should be greater than spark.databricks.optimizer.deltaTableSizeThreshold (default 10 GB).
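A sketch of inspecting and, for experimentation only, lowering these thresholds so DFP also fires on small test tables:

# Dynamic file pruning is enabled by default on Databricks.
print(spark.conf.get("spark.databricks.optimizer.dynamicFilePruning"))  # expect "true"

# For experiments on small test tables only – not a production recommendation.
spark.conf.set("spark.databricks.optimizer.deltaTableFilesThreshold", "10")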
Other
• Use the newest Databricks Runtime,
• For fast UPDATE/MERGE, rewrite the least amount of files: set spark.databricks.delta.optimize.maxFileSize to 16–128 MB, or turn on optimized writes (see the sketch below),
• Don't use Python or Scala UDFs if a native function exists – transferring data between Python and Spark requires serialization, which drastically slows down queries,
• Move numeric columns, keys, and high-cardinality query predicates to the left; move long strings that are not distinct enough for stats collection to the right (only the first 32 columns have statistics),
• OPTIMIZE benefits from compute-optimized clusters (because of all the Parquet encoding and decoding),
• Think about spot instances,
• For some operations, consider the Databricks Standard tier.
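Optimized writes can be enabled per table via a Delta table property; a minimal sketch (the table name is hypothetical):

# Coalesce many small writes into fewer, larger files at write time.
spark.sql("""
ALTER TABLE sales
SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
""")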
Adaptive Query Execution
• A game changer in Spark 3.x,
• For example: initially SortMergeJoin is chosen, but once the ingest stages complete, the plan is updated to use BroadcastHashJoin.
Other
• Turn on Adaptive Query Execution (spark.sql.adaptive.enabled – on by default),
• Turn on partition coalescing (spark.sql.adaptive.coalescePartitions.enabled),
• Turn on skew-join handling (spark.sql.adaptive.skewJoin.enabled),
• Turn on the local shuffle reader (spark.sql.adaptive.localShuffleReader.enabled),
• Tune the broadcast join threshold (spark.sql.autoBroadcastJoinThreshold),
• Don't prefer SortMergeJoin: set spark.sql.join.preferSortMergeJoin to false.
A consolidated sketch of these settings follows.
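The broadcast threshold value below is illustrative, not a recommendation:

# AQE and its sub-features – stock Spark 3.x flags, mostly on by default.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

# Broadcast tables up to ~100 MB and de-prioritize sort-merge joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")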
• Turn off stats collection where it adds no value: dataSkippingNumIndexedCols = 0,
• OPTIMIZE ZORDER BY the merge keys between Bronze & Silver,
• Turn on Optimized Writes,
• Restructure columns for data skipping,
• OPTIMIZE ZORDER BY join keys or high-cardinality columns used in WHERE,
• Turn on Optimized Writes,
• Enable the Databricks IO cache,
• Consider using Photon,
• Consider using Premium Storage,
• Build pre-aggregated tables.
Thank you!