Enhancing Spark SQL Optimizer with
Reliable Statistics
Ron Hu, Fang Cao, Min Qiu*, Yizhen Liu
Huawei Technologies, Inc.
* Former Huawei employee
Agenda
• Review of Catalyst Architecture
• Rule-based optimizations
• Reliable statistics collected
• Cost-based rules
• Future Work
• Q & A
Page 2
Catalyst Architecture
[Figure: Catalyst pipeline; Spark optimizes the query plan in the optimizer phase]
Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a Databricks engineering blog
Page 3
Rule-Based Optimizer in Spark SQL
• Most of Spark SQL optimizer’s rules are heuristic rules.
– They do NOT consider the cost of each operator
– They do NOT consider the cost of equivalent logical plans
• Join order is decided by the tables’ positions in the SQL query
• Join type is chosen based on some very simple system assumptions
• The number of shuffle partitions is a fixed number.
• Our community work:
– Ex.: Fixed bugs in Spark.
– Spark Summit East 2016 talk, https://spark-summit.org/east-2016/events/enhancements-on-spark-sql-optimizer/
Page 4
Statistics Collected
• Collect Table Statistics information
• Collect Column Statistics information
• Only consider static system statistics (configuration
file: CPU, Storage, Network) at this stage.
• Goal:
– Calculate the cost for each database operator
• in terms of number of output rows, size of output rows, etc.
– Based on the cost calculation, adjust the query execution
plan
Page 5
Table Statistics Collected
• Use a modified Hive Analyze Table statement to
collect statistics of a table.
– Ex: Analyze Table lineitem compute statistics
• It collects table-level statistics and saves them into the metastore.
– Number of rows
– Number of files
– Table size in bytes
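For reference, current Apache Spark ships an equivalent statement natively (the talk used a modified Hive version); a minimal sketch, assuming a SparkSession named spark:
  spark.sql("ANALYZE TABLE lineitem COMPUTE STATISTICS")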
Page 6
Column Statistics Collected
• Use the Analyze statement to collect column-level statistics for individual columns.
– Ex: Analyze Table lineitem compute statistics for
columns l_orderkey, l_partkey, l_suppkey,
l_returnflag, l_linestatus, l_shipdate, ……..
• It collects column-level statistics and saves them into the metastore.
– Minimum value, maximum value
– Number of distinct values, number of null values
– Column maximum length, column average length
– Uniqueness of a column
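Column-level statistics can likewise be collected with a native statement in current Spark; a minimal sketch (column list abbreviated, SparkSession named spark assumed):
  spark.sql("ANALYZE TABLE lineitem COMPUTE STATISTICS FOR COLUMNS l_orderkey, l_shipdate")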
Page 7
Column 1-D Histogram
We provide two kinds of histograms: equi-width and equi-depth.
- Across buckets, the data distribution is captured by the histogram
- Within one bucket, we still assume data is evenly distributed
Max number of buckets: 256
- If the number of distinct values <= 256, use equi-width
- If the number of distinct values > 256, use equi-depth
We used the Hive Analyze command and Hive Metastore API (a small sketch follows).
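A minimal Scala sketch of the bucket-kind choice and equi-depth bounds described above (illustrative names, not Hive's implementation; assumes at least as many sorted values as buckets):

  val maxBuckets = 256

  // pick the histogram kind from the column's number of distinct values
  def histogramKind(ndv: Long): String =
    if (ndv <= maxBuckets) "equi-width" else "equi-depth"

  // equi-depth bucket upper bounds: each bucket holds ~equal row counts
  def equiDepthBounds(sorted: Array[Double], buckets: Int): Array[Double] =
    (1 to buckets).map(i => sorted(i * sorted.length / buckets - 1)).toArray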
Page 8
[Figure: Equi-Width and Equi-Depth histograms, frequency vs. column interval]
Column 2-D Histogram
• Developed a 2-dimensional equi-depth histogram for the column combination (c1, c2)
– In a 2-dimensional histogram, there are 2 levels of buckets.
– B(c1) is the number of major buckets for column C1.
– Within each C1 bucket, B(c2) is the number of sub-buckets for C2 (see the sketch below)
• Lessons learned:
– Users do not use the 2-D histogram often, as they do not know which 2 columns are correlated.
– What granularity to use: 256 buckets or 256×256 buckets?
– Difficult to extend to 3-D or more dimensions
– Can be replaced by hints
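A minimal sketch of the two-level bucket layout (hypothetical field names, not the actual implementation):

  case class Bucket(lo: Double, hi: Double, rowCount: Long)
  // each major bucket over c1 carries its own sub-buckets over c2
  case class Hist2D(c1Buckets: Seq[(Bucket, Seq[Bucket])])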
Page 9
Cost-Based Rules
• The optimizer is a RuleExecutor.
– Each individual optimization is defined as a Rule (see the sketch below)
• We added new rules to estimate the number of output rows and the output size in bytes for each execution operator:
– MetastoreRelation, Filter, Project, Join, Sort, Aggregate, Exchange, Limit, Union, etc.
• The node’s cost = nominal scale of (output_rows, output_size)
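As a shape reference, a Catalyst rule is a function from plan to plan; a minimal sketch of how an estimation rule could plug in (annotateStats is a hypothetical stub, not Spark's API):

  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  object EstimateStatistics extends Rule[LogicalPlan] {
    // walk the plan bottom-up, attaching row/size estimates to each node
    def apply(plan: LogicalPlan): LogicalPlan =
      plan transformUp { case node => annotateStats(node) }
    private def annotateStats(node: LogicalPlan): LogicalPlan = node // stub
  }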
Page 10
Filter Operator Statistics
• Between Filter’s expressions: AND, OR, NOT
• In each Expression: =, <, <=, >, >=, like, in, etc
• Currently supported types in expressions
– For <, <=, >, >=: String, Integer, Double, etc.
– For =: String, Integer, Double, Date, and User-Defined Types, etc.
• Sample: A <= B
– Based on A’s and B’s min/max/NDV values, decide the relationship between A and B; after evaluating this expression, determine the new min/max/NDV for A and B
– We use histograms to adjust min/max/NDV values
– Assume all the data is evenly distributed if no histogram information is available.
Page 11
Filter Operator Example
• Column A (op) value B
– (op) can be “=“, “<”, “<=”, “>”, “>=”, “like”
– e.g., “l_orderkey = 3”, “l_shipdate <= '1995-03-21'”
– The column’s max/min/NDV should be updated accordingly
– Sample: Column A < value B
Case 1: B.value <= A.min
– Filtering Factor = 0%
– no need to change A’s statistics (A will not appear in further processing)
Case 2: B.value > A.max
– Filtering Factor = 100%
– no need to change A’s statistics
Case 3: A.min < B.value <= A.max, with histograms
– Filtering Factor = calculated using the histogram
– A.min = no change
– A.max = B.value
– A.ndv = A.ndv × Filtering Factor
Case 4: A.min < B.value <= A.max, without histograms (assume data is evenly distributed)
– Filtering Factor = (B.value − A.min) / (A.max − A.min)
– A.min = no change
– A.max = B.value
– A.ndv = A.ndv × Filtering Factor
[Figure: value-frequency histogram example with buckets 1–5, 6–10, 11–15, 16–20, 21–25]
A minimal sketch of these cases follows below.
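A minimal Scala sketch of the "Column A < value B" estimate on the no-histogram path (field names are illustrative, not the actual implementation):

  case class ColumnStat(min: Double, max: Double, ndv: Long)

  // returns (filtering factor, updated stats for A)
  def estimateLessThan(a: ColumnStat, b: Double): (Double, ColumnStat) =
    if (b <= a.min) (0.0, a)           // Case 1: nothing passes
    else if (b > a.max) (1.0, a)       // Case 2: everything passes
    else {                             // Case 4: uniform-data assumption
      val factor = (b - a.min) / (a.max - a.min)
      (factor, a.copy(max = b, ndv = (a.ndv * factor).toLong))
    }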
Page 12
Filter Operator Example
• Column A (op) Column B
– In practice, we observed that this expression appears in Project rather than in Filter
– Note: for column-to-column comparisons we currently don’t support histograms; since we cannot assume the data is evenly distributed, the empirical filtering factor is set to 1/3
– (op) can be “<”, “<=”, “>”, “>=”
– A’s and B’s min/max/NDV need to be adjusted after filtering
– Sample: Column A < Column B
Case 1: A’s range lies entirely below B’s range: A filtering = 100%, B filtering = 100%
Case 2: A’s range lies entirely above B’s range: A filtering = 0%, B filtering = 0%
Cases 3 and 4: A’s and B’s ranges overlap: A filtering = 33.3%, B filtering = 33.3%
[Figure: four range-overlap diagrams for columns A and B]
A minimal decision sketch follows below.
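A minimal decision sketch for the cases above (illustrative only):

  // column-vs-column "A < B": disjoint ranges give all-or-nothing;
  // any overlap falls back to the empirical 1/3 factor (no histograms)
  def columnLessThanFactor(aMin: Double, aMax: Double,
                           bMin: Double, bMax: Double): Double =
    if (aMax < bMin) 1.0          // A entirely below B: every row passes
    else if (aMin >= bMax) 0.0    // A entirely at/above B: no row passes
    else 1.0 / 3.0                // overlapping ranges: empirical factor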
Page 13
Join Order
• Only for two-table joins
• We calculate the cost of a hash join using the stats of the left and right nodes.
– Nominal Cost = <nominal-rows> × 0.7 + <nominal-size> × 0.3
• Choose the lower-cost child as the build side of the hash join (prior to Spark 1.5).
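A minimal sketch of this costing (the "nominal" inputs are assumed to be already-normalized row and size figures):

  case class NodeStats(nominalRows: Double, nominalSize: Double) {
    def cost: Double = nominalRows * 0.7 + nominalSize * 0.3
  }

  // choose the lower-cost child as the hash join's build side
  def buildSide(left: NodeStats, right: NodeStats): String =
    if (left.cost <= right.cost) "BuildLeft" else "BuildRight"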
Page 14
Multi-way Join Reorder
• Currently Spark SQL’s join order is not decided by the cost of multi-way join operations.
• We decide the join order based on the output rows and output size of the intermediate tables.
– The join with the smaller output is performed first.
– This can benefit star-join queries (as in TPC-DS).
• We use dynamic programming to search the join order (see the sketch below).
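A minimal sketch of subset dynamic programming for join ordering (Plan, baseRows, and sel are hypothetical placeholders for per-table row counts and join selectivity, not the actual implementation):

  case class Plan(order: List[String], rows: Double, cost: Double)

  // enumerate subsets bottom-up, keeping the cheapest plan per subset;
  // "cost" here is simply the accumulated intermediate output rows
  def reorder(tables: List[String],
              baseRows: Map[String, Double],
              sel: (Set[String], Set[String]) => Double): Plan = {
    import scala.collection.mutable
    val best = mutable.Map[Set[String], Plan]()
    tables.foreach(t => best(Set(t)) = Plan(List(t), baseRows(t), 0.0))

    for (size <- 2 to tables.size; subset <- tables.combinations(size)) {
      val s = subset.toSet
      for (left <- s.subsets if left.nonEmpty && left != s) {
        val right = s -- left
        (best.get(left), best.get(right)) match {
          case (Some(l), Some(r)) =>
            val outRows = l.rows * r.rows * sel(left, right) // estimated join output
            val cost = l.cost + r.cost + outRows             // smaller outputs join first
            if (best.get(s).forall(_.cost > cost))
              best(s) = Plan(l.order ++ r.order, outRows, cost)
          case _ =>
        }
      }
    }
    best(tables.toSet)
  }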
Page 15
Sample: Q3, 3-table join + aggregate
• ParquetRelation node
• Filter node
• Project node
• Join node
• Aggregation
• Limit
Page 16
[Plan diagram: build side changed from Build Right to Build Left]
Limitation without Key Information
• Spark SQL does not support indexes or primary keys.
– Without this information, we cannot properly estimate the join output of a primary/foreign-key join.
• When estimating the number of GROUP BY
operator output records, we multiply the number of
distinct values for each GROUP BY column.
– This formula is valid only if every GROUP BY column is
independent.
Page 17
Column Uniqueness
• We know that a column is unique (a primary key) if the number of distinct values divided by the number of records in the table is close to 1.0.
– We can size the hash join table properly if one join column is unique.
– When computing the number of GROUP BY output records, if one GROUP BY column is unique, we do NOT multiply in the other, non-unique columns (see the sketch below).
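A minimal sketch of the GROUP BY estimate just described (field names are illustrative, not the actual implementation):

  case class ColStats(ndv: Long, isUnique: Boolean)

  // product of per-column NDVs, unless a unique column already bounds
  // the output; always capped by the input row count
  def estimateGroupByRows(inputRows: Long, groupCols: Seq[ColStats]): Long = {
    val est = groupCols.find(_.isUnique) match {
      case Some(u) => BigInt(u.ndv)
      case None    => groupCols.map(c => BigInt(c.ndv)).product
    }
    est.min(BigInt(inputRows)).toLong
  }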
Page 18
Unique Column Example, tpc-h Q10
/* tpc-h Q10: c_custkey is unique */
SELECT c_custkey, c_name, sum(l_extendedprice * (1 - l_discount))
AS revenue, c_acctbal, n_name, c_address, c_phone, c_comment
FROM nation join customer join orders join lineitem
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
AND o_orderdate >= '1993-10-01' AND o_orderdate < '1994-01-01'
AND l_returnflag = 'R' AND c_nationkey = n_nationkey
GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
ORDER BY revenue DESC limit 20
Number of group-by outputs can be:
• 1708M if there is no unique column information,
• 82K if we know there is a unique group-by column
Page 19
SQL Hints
• Some information cannot be derived directly from table/column statistics. Example: tpc-h Q13 (shown below).
– Supported hints /*+ …. */: Like_FilterFactor, NDV_Correlated_Columns, Join_Build, Join_Type, ……
Page 20
SELECT c_count, count(*) as custdist
FROM
(SELECT c_custkey, count(o_orderkey) c_count
FROM customer LEFT OUTER JOIN orders
ON c_custkey = o_custkey
and o_comment not like '%special%request%'
GROUP BY c_custkey
) c_orders
GROUP BY c_count
ORDER BY custdist desc, c_count desc
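A hypothetical application of one of these hints to Q13's LIKE predicate; the hint name comes from the slide above, but the parameter syntax here is assumed for illustration only:

SELECT /*+ Like_FilterFactor(0.98) */ c_count, count(*) as custdist
FROM ( ... same subquery as above ... )
GROUP BY c_count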
Actual vs Estimated Output Rows
Query   Actual   Estimated
Q1      4        6
Q2      460      1756
Q3      11621    496485
Q4      5        5
Q5      5        25
Q6      1        1
Q7      4        5
Q8      2        5
Q9      175      222
Q10     37967    81611
Q11     28574    32000
Q12     2        2
Q13     42       100
Q14     1        1
Q15     1        2
Q16     18314    14700
Q17     1        1
Q18     57       1621
Q19     1        1
Q20     186      558
Q21     411      558
Page 21
Wrong Output Rows Estimate for Q3
• We do not handle correlated columns across different tables.
TPC-H Q3:
select l_orderkey, sum(l_extendedprice *(1 - l_discount)) as revenue,
o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = 'BUILDING'
and c_custkey = o_custkey and l_orderkey = o_orderkey
and o_orderdate < date '1995-3-15'
and l_shipdate > date '1995-3-15'
group by l_orderkey, o_orderdate, o_shippriority
order by l_orderkey, revenue desc, o_orderdate
Page 22
Possible Future Work
• How to collect table histogram information quickly and correctly
– A full table scan is correct but slow, especially for big data
– Possible method: sampling/approximate counting (see the sketch after this list)
• Linear Counting, LogLog, Adaptive Counting, HyperLogLog, HyperLogLog++, etc.
• Expression statistics
– Currently only raw columns’ statistics are collected, not those of derived columns
– Derived columns result from evaluating expressions
• Ex: alias columns, aggregation expressions, arithmetic expressions, UDFs
• Collect real-world runtime statistics for future query plan optimization
– Continuous feedback optimization
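A minimal sketch of approximate NDV collection with Spark's built-in HyperLogLog++-based aggregate (assumes a SparkSession named spark and the lineitem table from earlier slides):

  import org.apache.spark.sql.functions.approx_count_distinct

  // estimate NDV with ~5% relative standard deviation instead of an exact scan
  val ndv = spark.table("lineitem")
    .agg(approx_count_distinct("l_orderkey", rsd = 0.05))
    .first().getLong(0)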
Page 23
THANK YOU.
ron.hu@huawei.com fang.cao@huawei.com
yizhen.liu@huawei.com