Spark SQL Beyond Official
Documentation
David Vrba Ph.D.
Senior ML Engineer
About Myself
▪ Senior ML Engineer at Socialbakers
▪ developing and optimizing Spark jobs
▪ productionizing Spark applications and deploying ML models
▪ Spark Trainer
▪ 1-day and 2-day trainings
▪ reach out to me at https://www.linkedin.com/in/vrba-david/
▪ Writer
▪ publishing articles on Medium
▪ follow me at https://medium.com/@vrba.dave
Goal
▪ Knowledge sharing
▪ Free continuation of my previous talk
▪ Physical Plans in Spark SQL
▪ https://databricks.com/session_eu19/physical-plans-in-spark-sql
▪ Describe the non-obvious behavior of some Spark features
▪ Go beyond the documentation
▪ Focus on practical aspects of Spark SQL
Topics
▪ Statistics
▪ Saving data in sorted state to a file format
Statistics
▪ How to see them
▪ How they are computed
▪ Where they are used
▪ What to be careful about
Statistics - how to see them
▪ Table level:
▪ DESCRIBE EXTENDED
▪ DESCRIBE FORMATTED
spark.sql("DESCRIBE EXTENDED table_name").show(n=50)
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS").show(n=50)
Statistics - how to see them
▪ Column level:
spark.sql("DESCRIBE EXTENDED table_name column_name").show()
Statistics - how to see them
▪ From the plan - since Spark 3.0
spark.table(table_name).explain(mode="cost")
Statistics - how they are propagated
▪ Plan tree: Aggregate → Project → Filter → Relation
▪ The leaf node (Relation) is responsible for computing the statistics
▪ Statistics are propagated up through the tree and adjusted along the way
Statistics - how they are propagated
▪ Simple way
▪ propagates only sizeInBytes
▪ propagation through the plan is very basic (Filter is not adjusted at all)
from pyspark.sql.functions import col

(
    spark.table(table_name)
    .filter(col("user_id") < 0)
    .explain(mode="cost")
)
Statistics - how they are propagated
▪ More advanced
▪ propagates sizeInBytes and rowCount + column-level statistics
▪ since Spark 2.2
▪ better propagation through the plan (selectivity for Filter)
▪ CBO has to be enabled (OFF by default):
spark.conf.set("spark.sql.cbo.enabled", True)
▪ works with the metastore
No change in Filter statistics yet - selectivity requires column-level stats to be computed
Statistics - how they are propagated
▪ Selectivity estimation requires column-level stats
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS user_id")
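Putting it together, a minimal end-to-end sketch (assuming a metastore-backed table table_name with a numeric user_id column; both names are illustrative):

from pyspark.sql.functions import col

# Enable the cost-based optimizer so column-level stats are used
spark.conf.set("spark.sql.cbo.enabled", True)

# Compute table-level and column-level statistics in the metastore
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS user_id")

# With column stats available, the Filter node's statistics should now be
# reduced by the estimated selectivity instead of being passed through unchanged
(
    spark.table("table_name")
    .filter(col("user_id") < 0)
    .explain(mode="cost")
)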
Statistics - how they are computed
▪ The leaf node (Relation) computes the statistics from one of three sources:
1. Taken from the metastore
2. Computed using the Hadoop API (only sizeInBytes)
3. Default value: sizeInBytes = 8 EB (spark.sql.defaultSizeInBytes)
Statistics - how they are computed
When you call spark.table(...), the source of the statistics is decided as follows:
▪ ANALYZE TABLE was run and CBO is ON (spark.sql.cbo.enabled): all stats are taken from the metastore (CatalogTable); the size is computed from rowCount
▪ ANALYZE TABLE was run and CBO is OFF: only sizeInBytes is used, taken directly from the metastore
▪ No stats in the metastore and the table is not partitioned: InMemoryFileIndex computes sizeInBytes using the Hadoop API
▪ No stats in the metastore and the table is partitioned: CatalogFileIndex falls back to the maximum value, sizeInBytes = 8 EB (spark.sql.defaultSizeInBytes)
Statistics - how they are computed
Partitioned table, ANALYZE TABLE hasn't been run yet: sizeInBytes falls back to the 8 EB default.
Non-partitioned table, ANALYZE TABLE hasn't been run yet: sizeInBytes is computed from the data files using the Hadoop API.
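To check which case applies, inspect the relation's statistics in the cost-mode plan (a sketch; the table names are illustrative):

# Partitioned table without metastore stats: expect the 8 EB default
spark.table("partitioned_table").explain(mode="cost")

# Non-partitioned table without metastore stats: expect a realistic
# sizeInBytes computed from the data files via the Hadoop API
spark.table("non_partitioned_table").explain(mode="cost")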
Statistics - where they are used
▪ joinReorder - in case you join more than two tables
▪ finds the optimal join order for multiple joins
▪ OFF by default
spark.conf.set("spark.sql.cbo.joinReorder.enabled", True)
▪ join selection - decides whether to use BroadcastHashJoin
▪ spark.sql.autoBroadcastJoinThreshold - 10 MB by default
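A sketch of enabling both CBO and join reordering (assuming three tables t1, t2, and t3 joined on an id column; all names are illustrative, and statistics must exist for the reordering to be cost-based):

spark.conf.set("spark.sql.cbo.enabled", True)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", True)

# joinReorder relies on statistics, so compute them first
for t in ("t1", "t2", "t3"):
    spark.sql("ANALYZE TABLE {} COMPUTE STATISTICS FOR COLUMNS id".format(t))

(
    spark.table("t1")
    .join(spark.table("t2"), "id")
    .join(spark.table("t3"), "id")
    .explain(mode="cost")  # the chosen join order may differ from the query order
)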
Saving data in a sorted state to a file format
▪ Functions for sorting
▪ How to save in sorted state
Sorting in Spark SQL
▪ orderBy / sort
▪ DataFrame transformation
▪ samples the data in a separate job (to determine range-partition boundaries)
▪ creates a shuffle to achieve a global sort
▪ sortWithinPartitions
▪ DataFrame transformation
▪ sorts each partition
▪ sortBy
▪ called on DataFrameWriter after calling write
▪ used together with bucketing - sorts each bucket (see the sketch below)
▪ requires using saveAsTable
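A minimal sketch of sortBy with bucketing (the table name, bucket count, and columns are illustrative; sortBy is only valid after bucketBy, and the write must go through saveAsTable):

(
    df.write
    .mode('overwrite')
    .bucketBy(8, 'user_id')     # the bucket count is an arbitrary choice here
    .sortBy('user_id')          # each bucket file is written sorted
    .option('path', output_path)
    .saveAsTable('bucketed_table')
)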
Example - save in sorted state
▪ Partition your data by the column: year
▪ Have each partition sorted by the column: user_id
▪ Have one file per partition (this file should be sorted by user_id)
Example - save in sorted state
(
    df.repartition('year')
    .sortWithinPartitions('user_id')
    .write
    .mode('overwrite')
    .partitionBy('year')
    .option('path', output_path)
    .saveAsTable(table_name)
)
This will not save the data sorted!
When saving to a file format, Spark requires this ordering:
(partitionColumns + bucketingIdExpression + sortColumns)
If this requirement is not satisfied, Spark discards the user-specified sort and sorts the data again using the required ordering - here by year only, so the order by user_id is lost.
Example - save in sorted state
(
    df.repartition('year')
    .sortWithinPartitions('user_id')
    .write
    .mode('overwrite')
    .partitionBy('year')
    .option('path', output_path)
    .saveAsTable(table_name)
)
requiredOrdering = (partitionColumns) = (year)
actualOrdering = (user_id)
The requirement is not satisfied.
Example - save in sorted state
Instead, call it as follows:
(
    df.repartition('year')
    .sortWithinPartitions('year', 'user_id')
    .write
    .mode('overwrite')
    .partitionBy('year')
    .option('path', output_path)
    .saveAsTable(table_name)
)
requiredOrdering = (partitionColumns) = (year)
actualOrdering = (year, user_id)
The requirement is satisfied - Spark will keep the order.
Conclusion
▪ Using statistics can improve the performance of your joins
▪ Don't forget to call ANALYZE TABLE, especially if your table is partitioned
▪ Saving sorted data requires caution
▪ Don't forget to sort by the partition columns first
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.