Generalized Sub-Query Fusion for
Eliminating Redundant I/O from
Big-Data Queries
Kaushik Rajan, Partho Sarthi, Akash Lal, Abhishek Modi, Ashit Gosalia,
Prakhar Jain, Mo Liu, Saurabh Kalikar
OSDI '20
Presented by Mingcong Han (GitHub: francis0407)
Credits: some pictures from the conference presentation slides
Big data query compilation
Redundant stages of processing
• In TPC-DS, 40% of queries perform redundant I/O
• High impact: 16% of all queries spend at least 50% of their time on redundant I/O
• Medium impact: 9% of queries spend 10-50% of their time on redundant I/O
State-of-the-art Query Optimizer
SparkSQL Catalyst Optimizer
• SQL -> LogicalPlan -> PhysicalPlan -> MapReduce
RESIN: MapReduce reasoning during optimization
Example 1 —— ResinMap
Example 2 —— ResinReduce
F1 = SELECT id as 𝑐1,
max(𝑠𝑖𝑔𝑛𝑎𝑙1) as 𝑠1
FROM iotLogs
WHERE ℎ𝑟1 ≤ 12
GROUP BY id
F2 = SELECT id as 𝑐2,
max(𝑠𝑖𝑔𝑛𝑎𝑙2) as 𝑠2
FROM iotLogs
WHERE ℎ𝑟2 ≤ 18
GROUP BY id
R = SELECT 𝑐1, 𝑠1, 𝑠2
FROM F1 JOIN F2
ON 𝑐1 = 𝑐2
Example 2 —— ResinReduce
𝑚𝑎𝑥1 = 𝑚𝑎𝑥2 = −∞
𝑟𝑐1 = 𝑟𝑐2 = 0
foreach (<id, 𝑠𝑖𝑔𝑛𝑎𝑙1, 𝑠𝑖𝑔𝑛𝑎𝑙2, ℎ𝑟1, ℎ𝑟2> in partition)
{
if (ℎ𝑟1 ≤ 12) {
𝑚𝑎𝑥1=max(𝑚𝑎𝑥1, 𝑠𝑖𝑔𝑛𝑎𝑙1); 𝑟𝑐1++;
}
if (ℎ𝑟2 ≤ 18) {
𝑚𝑎𝑥2=max(𝑚𝑎𝑥2, 𝑠𝑖𝑔𝑛𝑎𝑙2); 𝑟𝑐2++;
}
}
output(id, 𝑚𝑎𝑥1, 𝑚𝑎𝑥2, 𝑟𝑐1, 𝑟𝑐2)
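The loop above can be tried out on a few rows. The following is a minimal Python sketch, not the RESIN implementation: the dictionary-based row format and the names `resin_reduce` and `baseline` are assumptions made for illustration. It keeps both conditional aggregates per id in one pass over iotLogs and compares the result with evaluating F1, F2 and the join separately.

```python
from collections import defaultdict
from math import inf

def resin_reduce(rows):
    """One pass over the rows, keeping two conditional max aggregates per id."""
    # state[id] = [max1, max2, rc1, rc2]; maxima start at -inf, row counts at 0
    state = defaultdict(lambda: [-inf, -inf, 0, 0])
    for r in rows:
        s = state[r["id"]]
        if r["hr1"] <= 12:
            s[0] = max(s[0], r["signal1"])
            s[2] += 1
        if r["hr2"] <= 18:
            s[1] = max(s[1], r["signal2"])
            s[3] += 1
    # rc1 > 0 and rc2 > 0 keeps exactly the ids that the join of F1 and F2 would keep
    return {k: (m1, m2) for k, (m1, m2, rc1, rc2) in state.items()
            if rc1 > 0 and rc2 > 0}

def baseline(rows):
    """F1, F2 and the join evaluated separately: two scans of the same table."""
    f1, f2 = {}, {}
    for r in rows:  # scan 1
        if r["hr1"] <= 12:
            f1[r["id"]] = max(f1.get(r["id"], -inf), r["signal1"])
    for r in rows:  # scan 2
        if r["hr2"] <= 18:
            f2[r["id"]] = max(f2.get(r["id"], -inf), r["signal2"])
    return {k: (f1[k], f2[k]) for k in f1 if k in f2}

# Hypothetical sample rows, not taken from the slides
rows = [
    {"id": "a", "signal1": 3, "signal2": 9, "hr1": 10, "hr2": 20},
    {"id": "a", "signal1": 7, "signal2": 4, "hr1": 11, "hr2": 17},
    {"id": "b", "signal1": 5, "signal2": 1, "hr1": 13, "hr2": 2},
]
result = resin_reduce(rows)
```

The row counts rc1 and rc2 are what make the later join removable: an id with no row passing a filter has a count of 0 and is dropped, just as the inner join would drop it.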
Resin Operators
Resin Optimization Rules
1. Sub-query fusion
– i.e. fuse the operators applied on the same table
2. Binary operator elimination
– i.e. eliminate the redundant Union/Join after fusion
Sub-query Fusion
• Basic query fusion
– φ: filter condition
– 𝐶 ← 𝐸: project mapping
– λ[φ, 𝐶 ← 𝐸]: filter(φ) + project(𝐶 ← 𝐸) (λ is also called ResinSimpleMap)
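As a rough illustration of the notation, λ[φ, 𝐶 ← 𝐸] can be modeled as a function that filters rows with φ and then computes each output column 𝐶 from its expression 𝐸. This is only a sketch: the helper name `simple_map`, the dictionary rows and the sample data are assumptions, not part of RESIN.

```python
def simple_map(phi, mapping):
    """Model of λ[φ, C ← E]: keep rows where phi holds, then evaluate each E into its column C."""
    def apply(rows):
        return [{c: e(row) for c, e in mapping.items()} for row in rows if phi(row)]
    return apply

# Hypothetical rows of a table signals(id, hr, signal)
signals = [
    {"id": 1, "hr": 6, "signal": 40},
    {"id": 2, "hr": 22, "signal": 55},
]

# λ[hr >= 5 and hr <= 19, (id, signal) ← (id, signal)]
lam = simple_map(
    lambda r: 5 <= r["hr"] <= 19,
    {"id": lambda r: r["id"], "signal": lambda r: r["signal"]},
)
result = lam(signals)
```

Treating filter and project as one operator means a later fusion step only has to reason about a single (filter, mapping) pair per sub-query.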
Unary Operator Fusion
• Fuse the operators where 𝑜𝑝1, 𝑜𝑝2 are one of
GroupBy(γ), ResinReduce(ρ), and ResinSimpleMap(λ)
Unary Operator Fusion
• Case 1: 𝑜𝑝1, 𝑜𝑝2 are ResinSimpleMap(λ)
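– Fused operator over the shared input Q:
λ[(φ1 ∧ φ𝑟1) ∨ (φ2 ∧ φ𝑟2),
(𝐶1 ← 𝐸1) ∪ (𝐶2 ← 𝐸2)
∪ 𝐼(𝑐𝑜𝑙𝑠(φ1)) ∪ 𝐼(𝑐𝑜𝑙𝑠(φ𝑟1))
∪ 𝐼(𝑐𝑜𝑙𝑠(φ2)) ∪ 𝐼(𝑐𝑜𝑙𝑠(φ𝑟2))]
– Restoring operators, one per original sub-query:
λ[φ1 ∧ φ𝑟1, 𝐼(𝐶1)] and λ[φ2 ∧ φ𝑟2, 𝐼(𝐶2)]
Basic Rule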
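A small Python sketch may help with this case. It assumes both maps read the same table, so the child restore filters φ𝑟1 and φ𝑟2 are simply true, and it combines the two filters with a disjunction, consistent with the fused filter `ht <= 2 or ht >= 11` in Example 3. The function `fuse` and the sample rows are hypothetical.

```python
def fuse(phi1, cols1, phi2, cols2, filter_cols):
    """Return (fused, restore1, restore2) for two filter+project maps over one input."""
    # The fused map keeps every column either side projects, plus the columns
    # the filters read, so that the filters can be re-applied afterwards.
    keep = sorted(set(cols1) | set(cols2) | set(filter_cols))

    def fused(rows):
        return [{c: r[c] for c in keep} for r in rows if phi1(r) or phi2(r)]

    def restore(phi, cols):
        return lambda rows: [{c: r[c] for c in cols} for r in rows if phi(r)]

    return fused, restore(phi1, cols1), restore(phi2, cols2)

# Hypothetical rows of dInfo(did, city, ht, area)
dinfo = [
    {"did": 1, "city": "X", "ht": 1, "area": 5},
    {"did": 2, "city": "Y", "ht": 12, "area": 6},
    {"did": 3, "city": "Z", "ht": 7, "area": 7},
]

fused, restore1, restore2 = fuse(
    lambda r: r["ht"] <= 2, ["did", "city"],
    lambda r: r["ht"] >= 11, ["did", "city"],
    ["ht"],
)
shared = fused(dinfo)     # the table is read once
left = restore1(shared)   # same rows as filtering on ht <= 2 directly
right = restore2(shared)  # same rows as filtering on ht >= 11 directly
```

The fused map is deliberately wider than either original: it carries `ht` so that each restoring map can recover exactly its own sub-query's rows from the shared result.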
Unary Operator Fusion
• Case 2: 𝑜𝑝1, 𝑜𝑝2 are GroupBy(γ)
– ρ[k, List(φ, agg)]: ResinReduce, which groups by k
Binary Operator Fusion
• Fuse the operators where 𝑜𝑝1, 𝑜𝑝2 are one of Join(ψ,
jt) and Union()
Binary Operator Fusion
• Case 1: 𝑜𝑝1, 𝑜𝑝2 are Join(ψ, jt)
– ψ: join condition
– jt: join type (inner, outer, …)
Binary Operator Fusion
• Case 2: 𝑜𝑝1, 𝑜𝑝2 are Union
Resin Optimization Rules
1. Sub-query fusion
– i.e. fuse the operators applied on the same table
2. Binary operator elimination
– i.e. eliminate the redundant Union/Join after fusion
Binary operator elimination
• Case 1: Union elimination
– μ[List(φ,C ← E)]: ResinMap
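The slide only gives the notation for μ, so the sketch below rests on an assumed reading: μ scans its input once and, for every row, emits one output row per (φ, 𝐶 ← 𝐸) pair whose filter the row satisfies. Under that assumption it produces the same multiset of rows as a UNION ALL of the separately filtered and projected scans. The names `resin_map` and `union_all` and the sample rows are hypothetical.

```python
def resin_map(branches):
    """Assumed model of μ[List(φ, C ← E)]: one scan, one output row per satisfied branch."""
    def apply(rows):
        out = []
        for row in rows:
            for phi, mapping in branches:
                if phi(row):
                    out.append({c: e(row) for c, e in mapping.items()})
        return out
    return apply

def union_all(branches, rows):
    """Baseline: one full scan of the input per branch, results concatenated."""
    out = []
    for phi, mapping in branches:
        out += [{c: e(r) for c, e in mapping.items()} for r in rows if phi(r)]
    return out

# Hypothetical rows of a table signals(id, hr)
rows = [{"id": 1, "hr": 6}, {"id": 2, "hr": 18}, {"id": 3, "hr": 23}]
project_id = {"id": lambda r: r["id"]}
branches = [
    (lambda r: 5 <= r["hr"] <= 19, project_id),
    (lambda r: r["hr"] <= 7 or r["hr"] >= 17, project_id),
]

single_scan = sorted(r["id"] for r in resin_map(branches)(rows))
two_scans = sorted(r["id"] for r in union_all(branches, rows))
```

The outputs are compared after sorting because the single scan interleaves the two branches, while the baseline concatenates them.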
Binary operator elimination
• Case 2: Join elimination
– ρ[k, List(φ, agg)]: ResinReduce, which groups by k
Example 3
Standard Plan
Join (c1 = c2)
  GroupBy (c1 ← city, s1 ← max(signal))
    Project (city, signal)
      Join (id = did)
        Project (id, signal)
          Filter (hr>=5 and hr<=19)
            Table: signals (id, hr, signal)
        Project (did, city)
          Filter (ht <= 2)
            Table: dInfo (did, city, ht, area)
  GroupBy (c2 ← city, s2 ← max(signal))
    Project (city, signal)
      Join (id = did)
        Project (id, signal)
          Filter (hr<=7 or hr>=17)
            Table: signals (id, hr, signal)
        Project (did, city)
          Filter (ht >= 11)
            Table: dInfo (did, city, ht, area)
Step 1: Fuse Filter+Project
Resulting plan:
Join (c1 = c2)
  GroupBy (c1 ← city, s1 ← max(signal))
    Project (city, signal)
      Join (id = did)
        ResinSimpleMap [hr>=5 and hr<=19; id, signal]
          Table: signals (id, hr, signal)
        ResinSimpleMap [ht <= 2; did, city]
          Table: dInfo (did, city, ht, area)
  GroupBy (c2 ← city, s2 ← max(signal))
    Project (city, signal)
      Join (id = did)
        ResinSimpleMap [hr<=7 or hr>=17; id, signal]
          Table: signals (id, hr, signal)
        ResinSimpleMap [ht >= 11; did, city]
          Table: dInfo (did, city, ht, area)
Step 2: Fuse Join
Resulting plan (both branches read the same fused join):
Join (c1 = c2)
  GroupBy (c1 ← city, s1 ← max(signal))
    ResinSimpleMap [hr>=5 and hr<=19 and ht<=2; city, signal]
      fused join
  GroupBy (c2 ← city, s2 ← max(signal))
    ResinSimpleMap [(hr<=7 or hr>=17) and ht>=11; city, signal]
      fused join
Fused join, computed once:
Join (id = did)
  ResinSimpleMap [True; id, signal, hr]
    Table: signals (id, hr, signal)
  ResinSimpleMap [ht <= 2 or ht >= 11; did, city, ht, area]
    Table: dInfo (did, city, ht, area)
Step 3: Fuse GroupBy
Resulting plan (both ResinSimpleMaps read the same ResinReduce):
Join (c1 = c2)
  ResinSimpleMap [rc1 > 0; c1 ← c, s1]
    ResinReduce
  ResinSimpleMap [rc2 > 0; c2 ← c, s2]
    ResinReduce
ResinReduce, computed once:
ResinReduce
  GroupBy: c ← city
  Filter1: hr>=5 and hr<=19 and ht<=2; Agg1: s1 ← max(signal), rc1 ← count(*)
  Filter2: (hr<=7 or hr>=17) and ht>=11; Agg2: s2 ← max(signal), rc2 ← count(*)
    Join (id = did)
      ResinSimpleMap [True; id, signal, hr]
        Table: signals (id, hr, signal)
      ResinSimpleMap [ht <= 2 or ht >= 11; did, city, ht, area]
        Table: dInfo (did, city, ht, area)
Step 4: Eliminate Join
Resulting plan:
ResinSimpleMap [rc1 > 0 and rc2 > 0; c1 ← c, s1, s2]
  ResinReduce
    GroupBy: c ← city
    Filter1: hr>=5 and hr<=19 and ht<=2; Agg1: s1 ← max(signal), rc1 ← count(*)
    Filter2: (hr<=7 or hr>=17) and ht>=11; Agg2: s2 ← max(signal), rc2 ← count(*)
      Join (id = did)
        ResinSimpleMap [True; id, signal, hr]
          Table: signals (id, hr, signal)
        ResinSimpleMap [ht <= 2 or ht >= 11; did, city, ht, area]
          Table: dInfo (did, city, ht, area)
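To check that the final plan still computes what the standard plan does, both can be run on a few rows. The sketch below is an illustration in plain Python, not Spark or RESIN code: the toy data and the functions `standard_plan` and `fused_plan` are assumptions, and the join on id = did is modeled as a dictionary lookup.

```python
from collections import defaultdict

# Hypothetical toy data, not from the slides
signals = [(1, 6, 10), (1, 18, 30), (2, 6, 50), (2, 12, 20), (3, 23, 99)]  # (id, hr, signal)
dinfo = {1: ("A", 1), 2: ("A", 12), 3: ("B", 12)}  # did -> (city, ht); area omitted

def standard_plan():
    """Two independent branches, each scanning signals and joining with dInfo."""
    def branch(hr_ok, ht_ok):
        best = {}
        for sid, hr, sig in signals:  # one scan of signals per branch
            city, ht = dinfo[sid]     # join on id = did
            if hr_ok(hr) and ht_ok(ht):
                best[city] = max(best.get(city, sig), sig)
        return best
    left = branch(lambda hr: 5 <= hr <= 19, lambda ht: ht <= 2)
    right = branch(lambda hr: hr <= 7 or hr >= 17, lambda ht: ht >= 11)
    return {c: (left[c], right[c]) for c in left if c in right}  # join on c1 = c2

def fused_plan():
    """Single scan and single join, then one ResinReduce-style pass with two aggregates."""
    acc = defaultdict(lambda: [None, 0, None, 0])  # city -> [s1, rc1, s2, rc2]
    for sid, hr, sig in signals:  # the only scan of signals
        city, ht = dinfo[sid]
        if not (ht <= 2 or ht >= 11):  # fused dInfo filter
            continue
        a = acc[city]
        if 5 <= hr <= 19 and ht <= 2:             # Filter1
            a[0] = sig if a[0] is None else max(a[0], sig)
            a[1] += 1
        if (hr <= 7 or hr >= 17) and ht >= 11:    # Filter2
            a[2] = sig if a[2] is None else max(a[2], sig)
            a[3] += 1
    # rc1 > 0 and rc2 > 0 replaces the eliminated join on c1 = c2
    return {c: (s1, s2) for c, (s1, rc1, s2, rc2) in acc.items() if rc1 > 0 and rc2 > 0}

result = fused_plan()
```

City B shows why the row counts matter: it has rows passing Filter2 but none passing Filter1, so rc1 stays 0 and the final ResinSimpleMap drops it, matching the inner join in the standard plan.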
Evaluation
• TPC-DS 10G, 40 of 104 queries are affected
Conclusion
• RESIN, a query optimizer that eliminates the
redundant I/O for big-data queries
– Two new operators: ResinMap, ResinReduce
– Two new rules: sub-query fusion, binary-operator elimination
• The optimizations are useful for 40% of queries in
TPCDS with 1.4x improvement on average
Q & A
Thanks
