Don’t Optimize my Queries;
Optimize my Data!
Julian Hyde
DataEngConf NYC
2017/10/30
@julianhyde
SQL
Query planning
Query federation
OLAP
Streaming
Hadoop
ASF member
Original author of Apache Calcite
PMC Apache Arrow, Calcite, Drill, Eagle, Kylin
Architect at Hortonworks
Overview
How do you tune a data system? How can (or should) a data system tune itself?
What problems have we solved to bring these things to Apache Calcite?
Part 1: Strategies for organizing data. (We rely heavily on relational algebra,
especially materialized views.)
Part 2: How to make systems self-organizing? (Algorithms for designing
materialized views, inferring relationships between data sets, and gathering
statistics about data sets.)
SELECT d.name, COUNT(*) AS c
FROM Emps AS e
JOIN Depts AS d USING (deptno)
WHERE e.age < 40
GROUP BY d.deptno
HAVING COUNT(*) > 5
ORDER BY c DESC
Relational algebra
Based on set theory, plus operators:
Project, Filter, Aggregate, Union, Join,
Sort
Requires: declarative language (SQL),
query planner
Original goal: data independence
Enables: query optimization, new
algorithms and data structures
Scan [Emps] Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 40]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
Apache Calcite
Apache top-level project since October, 2015
Query planning framework used in many
projects and products
Also works standalone: embedded federated
query engine with SQL / JDBC front end
Apache community development model
1. Organizing data
A “simple” query
Data
● 2010 U.S. census
● 100 million records
● 1KB per record
● 100 GB total
System
● 4x SATA 3 disks
● Total read throughput 1 GB/s
Query
SELECT SUM(householdSize)
FROM CensusHouseholds;
Goal
● Compute the answer to the query in under 5 seconds
Solutions
Sequential scan: Query takes 100 s (100 GB at 1 GB/s)
Parallelize: Spread the data over 40 disks in 10 machines. Query takes 10 s
Cache: Keep the data in memory. 2nd query: 10 ms; 3rd query: 10 s
Materialize: Summarize the data on disk. All queries: 100 ms
Materialize + cache + adapt: As above, building summaries on demand
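The scan timings quoted for the sequential and parallel solutions follow from throughput arithmetic. A minimal sketch, assuming each disk delivers 0.25 GB/s (1 GB/s across the 4 disks of the example system); the function name is illustrative only.

```python
def scan_seconds(data_gb: float, disks: int, gb_per_sec_per_disk: float = 0.25) -> float:
    """Seconds to read data_gb when it is striped evenly across `disks` disks."""
    return data_gb / (disks * gb_per_sec_per_disk)

sequential = scan_seconds(100, disks=4)   # 100 GB at 1 GB/s
parallel = scan_seconds(100, disks=40)    # 100 GB at 10 GB/s
print(sequential, parallel)
```

Neither figure meets the 5-second goal, which is why the remaining solutions change what data is read rather than how fast it is read.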
Ways of organizing data
Format (CSV, JSON, binary)
Layout: row- vs. column-oriented (e.g. Parquet, ORC), cache friendly (e.g. Arrow)
Storage medium (disk, flash, RAM, NVRAM, ...)
Non-lossy copy: sorted / partitioned
Lossy copies of data: project, filter, aggregate, join
Combinations of the above
Logical optimizations >> physical optimizations
Index
A sorted, projected materialized
view
Accelerates queries that use
ranges, correlated lookups, sorting,
aggregate, distinct
CREATE TABLE Emp (empno INT,
name VARCHAR(20), deptno INT);
CREATE INDEX I_Emp_Deptno
ON Emp (deptno, name);
SELECT DISTINCT deptno FROM Emp
WHERE deptno BETWEEN 20 AND 40
ORDER BY deptno;
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno name rowid
10 Barney af5634.0001
10 Dino af5634.0003
20 Fred af5634.0000
30 Wilma af5634.0002
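To make the "sorted, projected materialized view" idea concrete, here is a small sketch that builds the index as a sorted projection of Emp and answers the DISTINCT / BETWEEN / ORDER BY query from the index alone. The list position stands in for the rowid, and the helper name is an assumption of this sketch, not part of any real engine.

```python
from bisect import bisect_left, bisect_right

emp = [  # (empno, name, deptno), in insertion order
    (100, "Fred", 20),
    (110, "Barney", 10),
    (120, "Wilma", 30),
    (130, "Dino", 10),
]

# The index is a projection of Emp sorted by (deptno, name); row_pos plays the role of rowid.
index = sorted((deptno, name, row_pos) for row_pos, (_, name, deptno) in enumerate(emp))

def distinct_deptno_between(lo, hi):
    """Answer the DISTINCT / BETWEEN / ORDER BY query using only the index."""
    keys = [entry[0] for entry in index]
    start, end = bisect_left(keys, lo), bisect_right(keys, hi)  # range scan
    result = []
    for deptno in keys[start:end]:
        if not result or result[-1] != deptno:  # sorted input: DISTINCT is a one-pass dedupe
            result.append(deptno)
    return result

print(distinct_deptno_between(20, 40))
```

Because the index is already sorted on deptno, the range filter, the DISTINCT, and the ORDER BY all come for free from one scan of a slice.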
Add the remaining columns
No longer need “rowid”
Lossless
During planning, treat indexes
as tables, and index lookups
as joins
Covering index
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno name empno
10 Barney 110
10 Dino 130
20 Fred 100
30 Wilma 120
CREATE INDEX I_Emp_Deptno2 (
deptno INTEGER,
name VARCHAR(20))
COVER (empno);
Materialized view
CREATE MATERIALIZED
VIEW EmpsByDeptno AS
SELECT deptno, name, empno
FROM Emp
ORDER BY deptno, name;
Scan [Emps]
Scan [EmpsByDeptno]
Sort [deptno, name]
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno name empno
10 Barney 110
10 Dino 130
20 Fred 100
30 Wilma 120
As a materialized view, an
index is now just another
table
Several tables contain the
information necessary to
answer the query - just pick
the best
Spatial query
Find all restaurants within 1.5 distance units of
where I am:
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
[Figure: the four restaurants (Zachary’s pizza, Filippo’s, King Yen, Station burger) plotted at their (x, y) positions]
Hilbert space-filling curve
● A space-filling curve invented by mathematician David Hilbert
● Every (x, y) point has a unique position on the curve
● Points near to each other typically have Hilbert indexes close together
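As an illustration of how a point maps to a position on the curve, the sketch below uses the widely published iterative conversion for a Hilbert curve on an n × n grid (n a power of two). The grid size and curve orientation are assumptions of this sketch, so the values it returns will not necessarily match the h column shown on the following slides or the output of Calcite's ST_Hilbert.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance along a Hilbert curve covering an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)   # which quadrant, in curve order
        if ry == 0:                    # rotate/flip so the sub-curve lines up
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Walk the curve over a 4 x 4 grid: each step moves to a neighbouring cell.
order = sorted(((hilbert_index(4, x, y), (x, y)) for x in range(4) for y in range(4)))
print([cell for _, cell in order])
```

Every cell gets a distinct index and consecutive indexes are always adjacent cells, which is the property that lets a range of h values approximate a region of the plane.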
Add restriction based on h, a restaurant’s distance
along the Hilbert curve
Must keep original restriction due to false positives
Using Hilbert index
restaurant x y h
Zachary’s pizza 3 1 5
King Yen 7 7 41
Filippo’s 7 4 52
Station burger 5 6 36
SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
OR r.h BETWEEN 46 AND 46)
AND ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
Telling the optimizer
1. Declare h as a generated column
2. Sort table by h
Planner can now convert spatial range
queries into a range scan
Does not require specialized spatial
index such as r-tree
Very efficient on a sorted table such as
HBase
CREATE TABLE Restaurants (
restaurant VARCHAR(20),
x DOUBLE,
y DOUBLE,
h DOUBLE GENERATED ALWAYS AS
ST_Hilbert(x, y) STORED)
SORT KEY (h);
restaurant x y h
Zachary’s pizza 3 1 5
Station burger 5 6 36
King Yen 7 7 41
Filippo’s 7 4 52
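The rewritten query can be mimicked on the sorted table above: a binary search finds each h range, and the exact distance test then removes any false positives. This is a sketch only; the h ranges are copied from the query on the earlier slide rather than computed, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right
from math import hypot

# (restaurant, x, y, h), sorted by h as in the table above
restaurants = [
    ("Zachary's pizza", 3, 1, 5),
    ("Station burger", 5, 6, 36),
    ("King Yen", 7, 7, 41),
    ("Filippo's", 7, 4, 52),
]
h_keys = [row[3] for row in restaurants]

def within(px, py, radius, h_ranges):
    """Range-scan each h interval, then apply the original distance predicate."""
    hits = []
    for lo, hi in h_ranges:
        candidates = restaurants[bisect_left(h_keys, lo):bisect_right(h_keys, hi)]
        for name, x, y, _h in candidates:
            if hypot(x - px, y - py) < radius:  # exact check removes false positives
                hits.append(name)
    return hits

print(within(6, 7, 1.5, [(35, 42), (46, 46)]))
```

Only the rows inside the h intervals are ever touched, which is what makes the approach attractive on a store that is sorted by key.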
Streaming
Much valuable data is “data in flight”
Use SQL to query streams (or streams + tables)
Historic query:
SELECT AVG(unitPrice)
FROM Orders
WHERE units > 1000
AND orderDate
BETWEEN '2014-06-01'
AND '2015-12-31'
Streaming query:
SELECT STREAM *
FROM Orders
WHERE units > 1000
Hybrid query combines a stream with its
own history
● Orders is used both as a stream
and as a “stream history” virtual table
● “Average order size over last year”
should be maintained by the system,
i.e. as a materialized view
SELECT STREAM *
FROM Orders AS o -- “Orders” used as a stream
WHERE units > (
SELECT AVG(units)
FROM Orders AS h -- “Orders” used as a “stream history” virtual table
WHERE h.productId = o.productId
AND h.rowtime
> o.rowtime - INTERVAL '1' YEAR)
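One way to picture the system maintaining "average order size over last year" is a per-product sliding window whose running total is updated as each order arrives. The sketch below is an illustration under stated assumptions, not how Calcite executes the query: rowtime is taken to be a Unix timestamp in seconds, orders arrive in rowtime order, and the history excludes the order currently being tested.

```python
from collections import defaultdict, deque

YEAR = 365 * 24 * 3600  # seconds; assumes rowtime is a Unix timestamp

def above_trailing_average(orders):
    """Yield orders whose units exceed the product's average over the prior year.

    `orders` is an iterable of (rowtime, product_id, units) in rowtime order.
    """
    history = defaultdict(deque)  # product_id -> deque of (rowtime, units)
    totals = defaultdict(int)     # product_id -> sum of units inside the window
    for rowtime, product_id, units in orders:
        window = history[product_id]
        while window and window[0][0] <= rowtime - YEAR:  # evict rows older than a year
            totals[product_id] -= window.popleft()[1]
        if window and units > totals[product_id] / len(window):
            yield (rowtime, product_id, units)
        window.append((rowtime, units))
        totals[product_id] += units

sample = [(0, "a", 10), (10, "a", 20), (20, "a", 12), (YEAR + 5, "a", 18), (YEAR + 6, "b", 1)]
print(list(above_trailing_average(sample)))
```

The window and its running total play the role of the materialized view: they are kept up to date incrementally so that each incoming row needs only a constant-time comparison.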
Summary - data optimization via
materialized views
Many forms of data optimization can be modeled as materialized views:
● Blocks in cache
● B-tree indexes
● Summary tables
● Spatial indexes
● History of streams
Allows the optimizer to “understand” the optimization and use it (if beneficial)
But who designs the optimizations?
2. Learning
How do data systems learn?
[Figure: diagram connecting queries, DML, statistics, adaptations, and a recommender]
Goals
● Improve response time, throughput, storage cost
● Predictable, adaptive (short and long term), allow human intervention
How?
● Humans
● Adaptive systems
● Smart algorithms
Example adaptations
● Cache disk blocks in memory
● Cached query results
● Data organization, e.g. partition on a different key
● Secondary structures, e.g. b-tree and r-tree indexes
Tiled, in-memory materialized views
A vision for an adaptive data system (we’re not there yet)
[Figure: tables on disk and in-memory materializations]
SELECT x, SUM(n) FROM t GROUP BY x
Building materialized views
Challenges:
● Design: Which materializations to create?
● Populate: Load them with data
● Maintain: Incrementally populate when data changes
● Rewrite: Transparently rewrite queries to use materializations
● Adapt: Design and populate new materializations, drop unused ones
● Express: Need a rich algebra, to model how data is derived
Initial focus: summary tables (materialized views over star schemas)
Designing summary tables via lattices
CREATE LATTICE Sales AS
SELECT t.*, c.*, COUNT(*), SUM(s.units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
JOIN Products AS p USING (productId);
CREATE MATERIALIZED VIEW SalesYearZipcode AS
SELECT t.year, c.state, c.zipcode,
COUNT(*), SUM(units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
GROUP BY 1, 2, 3;
[Figure: star schema over the tables sales, customers, time, product, and product class]
Many possible summary tables
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
Lattice nodes (grouping columns and row count):
raw 1m
(z, s, g, y, m) 912k: fewer than you would expect, because 5m combinations cannot occur in a 1m-row table
(s, g, y, m) 6k
(g, y, m) 120
(z, s) 43.4k: fewer than you would expect, because state depends on zipcode
(y, m) 60
(g, y) 10
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
() 1
Algorithm: Design summary tables
Given a database with 30 columns, 10M rows. Find X summary tables with under
Y rows that improve query response time the most.
AdaptiveMonteCarlo algorithm [1]:
● Based on research [2]
● Greedy algorithm that takes a combination of summary tables and tries to
find the table that yields the greatest cost/benefit improvement
● Models “benefit” of the table as query time saved over simulated query load
● The “cost” of a table is its size
[1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm
[2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
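To give a feel for the greedy step, here is a simplified sketch in the spirit of the greedy selection in [2]. It is not the AdaptiveMonteCarloAlgorithm from [1]: it treats every grouping as one equally likely query (instead of a simulated query load), scores a candidate by total rows saved, and ignores the row-count limit Y. The row counts come from the lattice slides; the function and variable names are assumptions of this sketch.

```python
def greedy_select(sizes, raw_rows, k):
    """Pick up to k groupings to materialize, greedily by total rows saved.

    sizes maps each candidate grouping (a frozenset of column names) to its
    row count. Each grouping also counts as one equally weighted query.
    """
    groupings = list(sizes)
    # Cost of a query = rows scanned in the cheapest table that can answer it.
    cost = {q: raw_rows for q in groupings}
    chosen = []
    for _ in range(k):
        best, best_benefit = None, 0
        for cand in groupings:
            if cand in chosen:
                continue
            # cand can answer any query whose columns are a subset of its own.
            benefit = sum(max(0, cost[q] - sizes[cand]) for q in groupings if q <= cand)
            if benefit > best_benefit:
                best, best_benefit = cand, benefit
        if best is None:
            break
        chosen.append(best)
        for q in groupings:
            if q <= best:
                cost[q] = min(cost[q], sizes[best])
    return chosen

sizes = {  # g = gender, y = year, m = month
    frozenset(): 1,
    frozenset("g"): 2, frozenset("y"): 5, frozenset("m"): 12,
    frozenset("gy"): 10, frozenset("gm"): 24, frozenset("ym"): 60,
    frozenset("gym"): 120,
}
picks = greedy_select(sizes, raw_rows=1_000_000, k=3)
print([sorted(p) for p in picks])
```

The first pick is the widest grouping, since it rescues every query from a scan of the raw table; later picks only earn credit for what the earlier ones did not already make cheap.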
Lattice (optimized)
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
Lattice nodes (grouping columns and row count):
raw 1m
(z, s, g, y, m) 912k
(z, g, y, m) 909k
(z, s, y, m) 831k
(z, s, g, m) 644k
(z, s, g, y) 392k
(s, g, y, m) 6k
(z, s, g) 83.6k
(g, y, m) 120
(z, s) 43.4k
(y, m) 60
(g, m) 24
(g, y) 10
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
() 1
Data profiling
Algorithm needs count(distinct a, b, ...) for each combination of attributes:
● Previous example had 2^5 = 32 possible tables
● Schema with 30 attributes has 2^30 (about 10^9) possible tables
● Algorithm considers a significant fraction of these
● Approximations are OK
Attempts to solve the profiling problem:
1. Compute each combination: scan, sort, unique, count; repeat 2^30 times!
2. Sketches (HyperLogLog)
3. Sketches + parallelism + information theory [CALCITE-1616]
Sketches
HyperLogLog is an algorithm that computes
approximate distinct count. It can estimate
cardinalities of 10^9 with a typical error rate of
2%, using 1.5 kB of memory. [3][4]
With 16 MB memory per machine we can
compute 10,000 combinations of attributes
each pass.
So, we’re down from 10^9 to 10^5 passes.
[3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm"
[4] https://github.com/mrjgreen/HyperLogLog
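For readers who have not met the technique, below is a compact HyperLogLog sketch following the estimator described in [3]. It is illustrative only and unrelated to the implementation in [4] or to whatever Calcite uses; the class name, the choice of SHA-1 as the hash, and the use of a plain Python list for registers are all assumptions. With p = 11 there are 2,048 registers, which at 6 bits each would occupy about 1.5 kB.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                 # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")         # 64-bit hash
        bucket = h >> (64 - self.p)                   # leading p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias correction for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:        # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return estimate

sketch = HyperLogLog()
for i in range(100_000):
    sketch.add(i)
    sketch.add(i % 1000)  # duplicates never change the registers
print(round(sketch.count()))
```

To profile a combination of attributes, one would feed the sketch a tuple of the column values per row; many such sketches fit in memory at once, which is where the 10,000 combinations per pass comes from.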
Combining probability & information theory
Given                                               Expected cardinality                Actual cardinality  Surprise
(gender): 2; (state): 50                            (gender, state): 100.0              100                 0.000
(month): 12; (zipcode): 43,000                      (month, zipcode): 441,699.3         442,700             0.001
(state): 50; (zipcode): 43,000                      (state, zipcode): 799,666.7         43,400              0.897
(state, zipcode): 43,400; (gender, state): 100;     (gender, state, zipcode): 86,799    83,567              0.019
(gender, zipcode): 85,995                           = min(86,799, 892,234, 892,228)
● Surprise = abs(actual - expected) / (actual + expected)
● E(card(x, y)) = n * (1 - ((n - 1) / n) ^ p), where n = card(x) * card(y) and p = row count
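The two formulas on this slide are short enough to check directly. The sketch below assumes p = 1,000,000 rows (the "raw 1m" table from the lattice slides); the function names are illustrative.

```python
def expected_cardinality(card_x, card_y, row_count):
    """Expected distinct (x, y) pairs if x and y were independent and uniform."""
    n = card_x * card_y
    return n * (1 - ((n - 1) / n) ** row_count)

def surprise(actual, expected):
    """0 when the actual count matches expectation, approaching 1 as they diverge."""
    return abs(actual - expected) / (actual + expected)

rows = 1_000_000  # assumption: the 1m-row table from the lattice slides
expected = expected_cardinality(50, 43_000, rows)  # (state, zipcode)
print(round(expected, 1), round(surprise(43_400, expected), 3))
```

Under that assumption the (state, zipcode) row should come out close to the 799,666.7 and 0.897 shown in the table: far fewer distinct pairs than independence predicts, because state is determined by zipcode.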
Algorithm
Three ways “surprise” can help:
● If a cardinality is not
surprising, we don’t need to
store it -- we can derive it
● If a combination’s cardinality
is not surprising, it is unlikely
to have surprising children
● If we’re not seeing surprising
results, it’s time to stop
surprise_threshold := 1
queue := {singleton combinations} // (a), (b), ...
while queue is not empty {
batch := remove first 10,000 entries in queue
compute cardinality of each combination in batch
for each actual (computed) cardinality a {
e := expected cardinality of combination
s := surprise(a, e)
if s > surprise_threshold {
store combination and its cardinality
add child combinations to queue // (x, a), (x, b), ...
}
increase surprise_threshold
}
}
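A runnable approximation of the loop above, for experimentation on small data. It departs from the pseudocode in ways that should be read as assumptions of this sketch: cardinalities are computed exactly with sets rather than with sketches, the threshold starts at 0 and rises by a fixed step after each batch, singleton combinations are given surprise 1, and the expected cardinality of a larger combination is the smallest estimate obtainable by splitting off one column.

```python
def expected_cardinality(n, p):
    """Expected distinct count when p rows are drawn from n equally likely values."""
    return n * (1 - ((n - 1) / n) ** p)

def surprise(actual, expected):
    return abs(actual - expected) / (actual + expected)

def profile(rows, columns, batch_size=10_000, threshold_step=0.1):
    """rows: list of dicts. Returns {combination: cardinality} for surprising combinations."""
    p = len(rows)
    stored = {}

    def estimate(combo):
        """Stored cardinality if known, else the smallest estimate derived from sub-combinations."""
        if combo in stored:
            return stored[combo]
        return min(
            expected_cardinality(estimate(tuple(c for c in combo if c != col)) * stored[(col,)], p)
            for col in combo
        )

    threshold = 0.0  # assumption: start low, raise after every batch
    queue = [(col,) for col in columns]
    seen = set(queue)
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        for combo in batch:
            actual = len({tuple(row[c] for c in combo) for row in rows})  # exact count
            s = 1.0 if len(combo) == 1 else surprise(actual, estimate(combo))
            if s > threshold:
                stored[combo] = actual
                for col in columns:  # enqueue child combinations
                    if col not in combo:
                        child = tuple(sorted(combo + (col,), key=columns.index))
                        if child not in seen:
                            seen.add(child)
                            queue.append(child)
        threshold += threshold_step
    return stored

# Toy data in which state is determined by zip, and gender is independent of both.
rows = [{"zip": i % 100, "state": (i % 100) // 10, "gender": (i // 100) % 2} for i in range(1000)]
result = profile(rows, ["gender", "state", "zip"], batch_size=3)
print(result)
```

On this toy data only (state, zip) is kept beyond the singletons; the other pairs and the triple are close enough to their derived estimates that storing them would add nothing.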
Algorithm progress and “surprise” threshold
[Figure: surprise of each combination plotted against progress of the algorithm]
● Singleton combinations have surprise = 1
● Surprise threshold rises after we have completed the first batch
● Surprise threshold rises as algorithm progresses
● Combinations below the threshold are rejected as not sufficiently surprising
Data profiling - summary
The algorithm defeats a combinatorial search space using sketches +
information theory + parallelism
Recommending data structures is an optimization problem; profiling provides
the cost & benefit function
As a by-product, the algorithm discovers unique keys, “almost” keys, and foreign
keys
But which tables are actually joined together in practice?
Designing summary tables via lattices (2)
CREATE LATTICE Sales AS
SELECT t.*, c.*, COUNT(*), SUM(s.units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
JOIN Products AS p USING (productId);
CREATE MATERIALIZED VIEW SalesYearZipcode AS
SELECT t.year, c.state, c.zipcode,
COUNT(*), SUM(units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
GROUP BY 1, 2, 3;
[Figure: star schema over the tables sales, customers, time, product, and product class]
The lattice generates the summary tables. But who writes the lattice?
Designing summary tables via lattices (3)
CREATE LATTICE Sales AS
SELECT t.*, c.*, COUNT(*), SUM(s.units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
JOIN Products AS p USING (productId);
CREATE MATERIALIZED VIEW SalesYearZipcode AS
SELECT t.year, c.state, c.zipcode,
COUNT(*), SUM(units)
FROM Sales AS s
JOIN Time AS t USING (timeId)
JOIN Customers AS c USING (customerId)
GROUP BY 1, 2, 3;
ALTER SCHEMA Sales
INFER LATTICES;
[Figure: star schema over the tables sales, customers, time, product, and product class]
Growing and evolving lattices based on queries
[Figure: table graphs for Query 1, Query 2, and the lattice after Query 1 + 2, over the tables sales, customers, product, and product class]
See: [CALCITE-1870] “Lattice suggester”
Summary
Learning systems = manual tuning + adaptive + smart algorithms
Query history + data profiling → lattices → summary tables
We have discussed summary tables (materialized views based on
join/aggregate in a star schema) but the approach can be applied to other kinds
of materialized views
Relational algebra, incorporating materialized views, is a powerful language that
allows us to combine many forms of data optimization
Thank you! Questions?
@julianhyde · @ApacheCalcite · http://calcite.apache.org
Resources
[CALCITE-1616] Data profiler
[CALCITE-1870] Lattice suggester
[CALCITE-1861] Spatial indexes
[CALCITE-1968] OpenGIS
[CALCITE-1991] Generated columns
Talk: “Data profiling with Apache Calcite” (Hadoop Summit, 2017)
Talk: “SQL on everything, in memory” (Strata, 2014)
Zhang, Qi, Stradling, Huang (2014). “Towards a Painless Index for Spatial Objects”
Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
Image credit
https://www.flickr.com/photos/defenceimages/6938469933/
Don’t optimize my queries, optimize my data!
Extra slides
Architecture
[Figure: architecture of a conventional database compared with Calcite]
Planning queries
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
[Figure: query plan spanning MySQL and Splunk, with operators scan (Table: splunk), scan (Table: products), join (Key: productId), filter (Condition: action = 'purchase'), group (Key: productName, Agg: count), sort (Key: c desc)]
Optimized query
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
[Figure: optimized plan for the same query, spanning MySQL and Splunk, with operators scan (Table: splunk), scan (Table: products), join (Key: productId), filter (Condition: action = 'purchase'), group (Key: productName, Agg: count), sort (Key: c desc)]
Calcite framework
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• FilterMergeRule
• AggregateUnionTransposeRule
• 100+ more
Global transformations
• Unification (materialized view)
• Column trimming
• De-correlation
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• RelDistribution (partitioning)
RelBuilder
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice
Materialized views, lattices, tiles
Materialized view - A table whose contents are
guaranteed to be the same as executing a given query.
Lattice - Recommends, builds, and recognizes summary
materialized views (tiles) based on a star schema.
A query defines the tables and many:1 relationships in
the star schema.
Tile - A summary materialized view that belongs to a
lattice. A tile may or may not be materialized. Might be:
● Declared in lattice, or
● Generated via recommender algorithm, or
● Created in response to query.
CREATE MATERIALIZED VIEW t AS
SELECT * FROM emps
WHERE deptno = 10;
CREATE LATTICE star AS
SELECT *
FROM sales_fact_1997 AS s
JOIN product AS p ON …
JOIN product_class AS pc ON …
JOIN customer AS c ON …
JOIN time_by_day AS t ON …;
CREATE MATERIALIZED VIEW zg IN star
SELECT gender, zipcode, COUNT(*),
SUM(unit_sales) FROM star
GROUP BY gender, zipcode;
Combining past and future
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime > o.rowtime - interval '1' year)
➢ Orders is used as both stream and table
➢ System determines where to find the records
➢ Query is invalid if records are not available
Controlling when data is emitted
Early emission is the defining
characteristic of a streaming query.
The emit clause is a SQL extension
inspired by Apache Beam’s “trigger”
notion. (Still experimental… and
evolving.)
A relational (non-streaming) query is
just a query with the most conservative
possible emission strategy.
select stream productId,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
emit at watermark,
early interval '2' minute,
late limit 1;
select *
from Orders
emit when complete;
Other applications of data profiling
Query optimization:
● Planners are poor at estimating selectivity of conditions after N-way join
(especially on real data)
● New join-order benchmark: “Movies made by French directors tend to have
French actors”
● Predict number of reducers in MapReduce & Spark
“Grokking” a data set
Identifying problems in normalization, partitioning, quality
Applications in machine learning?
Further improvements to data profiling
● Build sketches in parallel
● Run algorithm in a distributed framework (Spark or MapReduce)
● Compute histograms
○ For example, Median age for male/female customers
● Seek out functional dependencies
○ Once you know FDs, a lot of cardinalities are no longer “surprising”
○ FDs occur in denormalized tables, e.g. star schemas
● Smarter criteria for stopping algorithm
● Skew/heavy hitters. Are some values much more frequent than others?
● Conditional cardinalities and functional dependencies
○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)

More Related Content

PDF
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
PDF
Parquet performance tuning: the missing guide
PDF
Dynamic Partition Pruning in Apache Spark
PDF
Hive Bucketing in Apache Spark with Tejas Patil
PDF
Understanding Query Plans and Spark UIs
PDF
Spark SQL
PDF
Deep Dive: Memory Management in Apache Spark
PDF
A Deep Dive into Query Execution Engine of Spark SQL
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Parquet performance tuning: the missing guide
Dynamic Partition Pruning in Apache Spark
Hive Bucketing in Apache Spark with Tejas Patil
Understanding Query Plans and Spark UIs
Spark SQL
Deep Dive: Memory Management in Apache Spark
A Deep Dive into Query Execution Engine of Spark SQL

What's hot (20)

PDF
Incremental View Maintenance with Coral, DBT, and Iceberg
PDF
Apache Iceberg - A Table Format for Hige Analytic Datasets
PPTX
Optimizing Apache Spark SQL Joins
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
PDF
CDC Stream Processing with Apache Flink
PDF
The Apache Spark File Format Ecosystem
PDF
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
PDF
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
PDF
Apache Spark in Depth: Core Concepts, Architecture & Internals
PDF
Cost-Based Optimizer in Apache Spark 2.2
PDF
Building Robust ETL Pipelines with Apache Spark
PDF
Apache Calcite Tutorial - BOSS 21
PDF
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
PPTX
Dynamic filtering for presto join optimisation
PDF
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
PPTX
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
PDF
Cosco: An Efficient Facebook-Scale Shuffle Service
PDF
Hudi architecture, fundamentals and capabilities
PDF
Building robust CDC pipeline with Apache Hudi and Debezium
PDF
Introduction to Apache Calcite
Incremental View Maintenance with Coral, DBT, and Iceberg
Apache Iceberg - A Table Format for Hige Analytic Datasets
Optimizing Apache Spark SQL Joins
Efficient Data Storage for Analytics with Apache Parquet 2.0
CDC Stream Processing with Apache Flink
The Apache Spark File Format Ecosystem
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Spark in Depth: Core Concepts, Architecture & Internals
Cost-Based Optimizer in Apache Spark 2.2
Building Robust ETL Pipelines with Apache Spark
Apache Calcite Tutorial - BOSS 21
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Dynamic filtering for presto join optimisation
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Cosco: An Efficient Facebook-Scale Shuffle Service
Hudi architecture, fundamentals and capabilities
Building robust CDC pipeline with Apache Hudi and Debezium
Introduction to Apache Calcite
Ad

Viewers also liked (12)

PDF
Apache Calcite: One planner fits all
PDF
Bi on Big Data - Strata 2016 in London
PPTX
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
PPTX
Apache Arrow - An Overview
PDF
The twins that everyone loved too much
PDF
Data Science Languages and Industry Analytics
PPTX
Building a Virtual Data Lake with Apache Arrow
PPTX
Options for Data Prep - A Survey of the Current Market
PPTX
Apache Arrow: In Theory, In Practice
PDF
SQL on everything, in memory
PPTX
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
PPTX
Apache Calcite overview
Apache Calcite: One planner fits all
Bi on Big Data - Strata 2016 in London
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Apache Arrow - An Overview
The twins that everyone loved too much
Data Science Languages and Industry Analytics
Building a Virtual Data Lake with Apache Arrow
Options for Data Prep - A Survey of the Current Market
Apache Arrow: In Theory, In Practice
SQL on everything, in memory
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Apache Calcite overview
Ad

Similar to Don’t optimize my queries, optimize my data! (20)

PPTX
Lazy beats Smart and Fast
PDF
Don't optimize my queries, organize my data!
PDF
Tactical data engineering
PDF
Data Profiling in Apache Calcite
PDF
Data profiling with Apache Calcite
PDF
Data profiling in Apache Calcite
PPT
The thinking persons guide to data warehouse design
PDF
Cost-Based query optimization
PDF
Cost-based Query Optimization
PDF
phoenix-on-calcite-hadoop-summit-2016
PDF
Why you care about
 relational algebra (even though you didn’t know it)
PDF
Spatial query on vanilla databases
PDF
Infobright Column-Oriented Analytical Database Engine
PPT
What to do when one size does not fit all?!
PDF
Optimized cluster index generation
PPTX
Dbms schemas for decision support
PDF
unit 3 DBMS.docx.pdf geometric transformer in query processing
PDF
unit 3 DBMS.docx.pdf geometry in query p
PDF
Issues in Query Processing and Optimization
PPTX
19CS3052R-CO1-7-S7 ECE
Lazy beats Smart and Fast
Don't optimize my queries, organize my data!
Tactical data engineering
Data Profiling in Apache Calcite
Data profiling with Apache Calcite
Data profiling in Apache Calcite
The thinking persons guide to data warehouse design
Cost-Based query optimization
Cost-based Query Optimization
phoenix-on-calcite-hadoop-summit-2016
Why you care about
 relational algebra (even though you didn’t know it)
Spatial query on vanilla databases
Infobright Column-Oriented Analytical Database Engine
What to do when one size does not fit all?!
Optimized cluster index generation
Dbms schemas for decision support
unit 3 DBMS.docx.pdf geometric transformer in query processing
unit 3 DBMS.docx.pdf geometry in query p
Issues in Query Processing and Optimization
19CS3052R-CO1-7-S7 ECE

More from Julian Hyde (20)

PPTX
Measures in SQL (SIGMOD 2024, Santiago, Chile)
PDF
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
PDF
Building a semantic/metrics layer using Calcite
PDF
Cubing and Metrics in SQL, oh my!
PDF
Adding measures to Calcite SQL
PDF
Morel, a data-parallel programming language
PDF
Is there a perfect data-parallel programming language? (Experiments with More...
PDF
Morel, a Functional Query Language
PDF
Apache Calcite (a tutorial given at BOSS '21)
PDF
The evolution of Apache Calcite and its Community
PDF
What to expect when you're Incubating
PDF
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
PDF
Efficient spatial queries on vanilla databases
PDF
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
PDF
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
PDF
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
PDF
Streaming SQL
PDF
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
PDF
Streaming SQL
PDF
Streaming SQL
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Building a semantic/metrics layer using Calcite
Cubing and Metrics in SQL, oh my!
Adding measures to Calcite SQL
Morel, a data-parallel programming language
Is there a perfect data-parallel programming language? (Experiments with More...
Morel, a Functional Query Language
Apache Calcite (a tutorial given at BOSS '21)
The evolution of Apache Calcite and its Community
What to expect when you're Incubating
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Efficient spatial queries on vanilla databases
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
Streaming SQL
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL
Streaming SQL

Recently uploaded (20)

PPTX
assetexplorer- product-overview - presentation
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
Transform Your Business with a Software ERP System
PDF
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PPTX
Reimagine Home Health with the Power of Agentic AI​
PPTX
Why Generative AI is the Future of Content, Code & Creativity?
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PDF
Nekopoi APK 2025 free lastest update
PDF
wealthsignaloriginal-com-DS-text-... (1).pdf
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PDF
Digital Strategies for Manufacturing Companies
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PDF
PTS Company Brochure 2025 (1).pdf.......
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
Cost to Outsource Software Development in 2025
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PPTX
Introduction to Artificial Intelligence
PDF
System and Network Administration Chapter 2
assetexplorer- product-overview - presentation
Design an Analysis of Algorithms I-SECS-1021-03
Transform Your Business with a Software ERP System
Claude Code: Everyone is a 10x Developer - A Comprehensive AI-Powered CLI Tool
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
Reimagine Home Health with the Power of Agentic AI​
Why Generative AI is the Future of Content, Code & Creativity?
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
Nekopoi APK 2025 free lastest update
wealthsignaloriginal-com-DS-text-... (1).pdf
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Digital Strategies for Manufacturing Companies
Upgrade and Innovation Strategies for SAP ERP Customers
PTS Company Brochure 2025 (1).pdf.......
CHAPTER 2 - PM Management and IT Context
Cost to Outsource Software Development in 2025
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
Introduction to Artificial Intelligence
System and Network Administration Chapter 2

Don’t optimize my queries, optimize my data!

  • 1. Don’t Optimize my Queries; Optimize my Data! Julian Hyde DataEngConf NYC 2017/10/30
  • 2. @julianhyde SQL Query planning Query federation OLAP Streaming Hadoop ASF member Original author of Apache Calcite PMC Apache Arrow, Calcite, Drill, Eagle, Kylin Architect at Hortonworks
  • 3. Overview How do you tune a data system? How can (or should) a data system tune itself? What problems have we solved to bring these things to Apache Calcite? Part 1: Strategies for organizing data. (We rely heavily on relational algebra, especially materialized views.) Part 2: How to make systems self-organizing? (Algorithms for design materialized views, infer relationships between data sets, gathering statistics about data sets.)
  • 4. SELECT d.name, COUNT(*) AS c FROM Emps AS e JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC Relational algebra Based on set theory, plus operators: Project, Filter, Aggregate, Union, Join, Sort Requires: declarative language (SQL), query planner Original goal: data independence Enables: query optimization, new algorithms and data structures Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age < 30] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]
  • 5. Apache Calcite Apache top-level project since October, 2015 Query planning framework used in many projects and products Also works standalone: embedded federated query engine with SQL / JDBC front end Apache community development model
  • 7. A “simple” query Data ● 2010 U.S. census ● 100 million records ● 1KB per record ● 100 GB total System ● 4x SATA 3 disks ● Total read throughput 1 GB/s Query Goal ● Compute the answer to the query in under 5 seconds SELECT SUM(householdSize) FROM CensusHouseholds;
  • 8. Solutions Sequential scan Query takes 100 s (100 GB at 1 GB/s) Parallelize Spread the data over 40 disks in 10 machines Query takes 10 s Cache Keep the data in memory 2nd query: 10 ms 3rd query: 10 s Materialize Summarize the data on disk All queries: 100 ms Materialize + cache + adapt As above, building summaries on demand
  • 9. Ways of organizing data Format (CSV, JSON, binary) Layout: row- vs. column-oriented (e.g. Parquet, ORC), cache friendly (e.g. Arrow) Storage medium (disk, flash, RAM, NVRAM, ...) Non-lossy copy: sorted / partitioned Lossy copies of data: project, filter, aggregate, join Combinations of the above Logical optimizations >> physical optimizations
  • 10. Index A sorted, projected materialized view Accelerates queries that use ranges, correlated lookups, sorting, aggregate, distinct CREATE TABLE Emp (empno INT, name VARCHAR(20), deptno INT); CREATE INDEX I_Emp_Deptno ON Emp (deptno, name); SELECT DISTINCT deptno FROM Emp WHERE deptno BETWEEN 20 AND 40 ORDER BY deptno; empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno name rowid 10 Barney af5634.0001 10 Dino af5634.0003 20 Fred af5634.0000 30 Wilma af5634.0002
  • 11. Add the remaining columns No longer need “rowid” Lossless During planning, treat indexes as tables, and index lookups as joins Covering index empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno name empno 10 Barney 110 10 Dino 130 20 Fred 100 30 Wilma 120 CREATE INDEX I_Emp_Deptno2 ( deptno INTEGER, name VARCHAR(20)) COVER (empno);
  • 12. Materialized view CREATE MATERIALIZED VIEW EmpsByDeptno AS SELECT deptno, name, empno FROM Emp ORDER BY deptno, name; Scan [Emps] Scan [EmpsByDeptno] Sort [deptno, name] empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno name empno 10 Barney 110 10 Dino 130 20 Fred 100 30 Wilma 120 As a materialized view, an index is now just another table Several tables contain the information necessary to answer the query - just pick the best
  • 13. Spatial query Find all restaurants within 1.5 distance units of where I am: restaurant x y Zachary’s pizza 3 1 King Yen 7 7 Filippo’s 7 4 Station burger 5 6 SELECT * FROM Restaurants AS r WHERE ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 • • • • Zachary’s pizza Filippo’s King Yen Station burger
  • 14. Hilbert space-filling curve ● A space-filling curve invented by mathematician David Hilbert ● Every (x, y) point has a unique position on the curve ● Points near to each other typically have Hilbert indexes close together
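As an illustration of how a point maps to a position on the curve, the sketch below uses the widely published iterative conversion for a square grid whose side is a power of two. It is not the implementation behind `ST_Hilbert`; the function name, the grid size, and the curve's orientation are assumptions, so the indexes it produces need not match the h values shown on the following slides.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along a Hilbert curve on an n-by-n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has a standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Walk the 4x4 curve in index order to see that neighbours on the curve
# are neighbours on the grid.
order = sorted((hilbert_index(4, x, y), (x, y)) for x in range(4) for y in range(4))
print([cell for _, cell in order])
```

Every cell gets a distinct index, and consecutive indexes are always adjacent cells. The converse holds only "typically", which is why a spatial query on h can return false positives.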
  • 15. • • • • Add restriction based on h, a restaurant’s distance along the Hilbert curve Must keep original restriction due to false positives Using Hilbert index restaurant x y h Zachary’s pizza 3 1 5 King Yen 7 7 41 Filippo’s 7 4 52 Station burger 5 6 36 Zachary’s pizza Filippo’s SELECT * FROM Restaurants AS r WHERE (r.h BETWEEN 35 AND 42 OR r.h BETWEEN 46 AND 46) AND ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 King Yen Station burger
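The two-part WHERE clause can be read as a cheap pre-filter followed by an exact check. A minimal sketch using the rows and h ranges from the slide; the function name is illustrative, and the linear pass over the rows stands in for what would be a range scan on a table sorted by h:

```python
from math import hypot

# (restaurant, x, y, h) as shown on the slide.
restaurants = [
    ("Zachary's pizza", 3, 1, 5),
    ("King Yen", 7, 7, 41),
    ("Filippo's", 7, 4, 52),
    ("Station burger", 5, 6, 36),
]

# Hilbert ranges covering the search circle, taken from the slide's query.
h_ranges = [(35, 42), (46, 46)]

def nearby(rows, ranges, cx, cy, radius):
    # Step 1: keep only rows whose h falls in one of the ranges (may over-select).
    candidates = [r for r in rows if any(lo <= r[3] <= hi for lo, hi in ranges)]
    # Step 2: apply the original distance restriction to remove false positives.
    return [r[0] for r in candidates if hypot(r[1] - cx, r[2] - cy) < radius]

print(nearby(restaurants, h_ranges, 6, 7, 1.5))
```

In this small data set both candidates survive the exact check, but the second step is still required for correctness because the h ranges cover a region larger than the circle.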
  • 16. Telling the optimizer 1. Declare h as a generated column 2. Sort table by h Planner can now convert spatial range queries into a range scan Does not require specialized spatial index such as r-tree Very efficient on a sorted table such as HBase CREATE TABLE Restaurants ( restaurant VARCHAR(20), x DOUBLE, y DOUBLE, h DOUBLE GENERATED ALWAYS AS ST_Hilbert(x, y) STORED) SORT KEY (h); restaurant x y h Zachary’s pizza 3 1 5 Station burger 5 6 36 King Yen 7 7 41 Filippo’s 7 4 52
  • 17. Much valuable data is “data in flight” Use SQL to query streams (or streams + tables) Streaming Data center SELECT AVG(unitPrice) FROM Orders WHERE units > 1000 AND orderDate BETWEEN '2014-06-01' AND '2015-12-31' SELECT STREAM * FROM Orders WHERE units > 1000 Streaming query Historic query
  • 18. Hybrid query combines a stream with its own history ● Orders is used both as a stream and as a “stream history” virtual table ● “Average order size over last year” should be maintained by the system, i.e. a materialized view SELECT STREAM * FROM Orders AS o WHERE units > ( SELECT AVG(units) FROM Orders AS h WHERE h.productId = o.productId AND h.rowtime > o.rowtime - INTERVAL '1' YEAR) “Orders” used as a stream “Orders” used as a “stream history” virtual table
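One way to picture "maintained by the system" is a per-product running total that is updated as orders arrive and as old orders age out of the window. The sketch below is an assumption-laden illustration, not Calcite's mechanism: the class name is made up, rowtime is a plain number of days, rows are assumed to arrive in rowtime order, and the history excludes the order currently being tested.

```python
from collections import defaultdict, deque

class TrailingAverageFilter:
    """Emit an order if its units exceed the product's average over the trailing window."""

    def __init__(self, window: float):
        self.window = window
        self.history = defaultdict(deque)   # productId -> deque of (rowtime, units)
        self.totals = defaultdict(int)      # productId -> sum of units in the window

    def process(self, rowtime: float, product_id: str, units: int) -> bool:
        hist = self.history[product_id]
        # Evict rows that no longer satisfy h.rowtime > o.rowtime - window.
        while hist and hist[0][0] <= rowtime - self.window:
            _, old_units = hist.popleft()
            self.totals[product_id] -= old_units
        # With no history, AVG is NULL in SQL and the comparison is not true.
        emit = bool(hist) and units > self.totals[product_id] / len(hist)
        hist.append((rowtime, units))
        self.totals[product_id] += units
        return emit

f = TrailingAverageFilter(window=365)
for row in [(0, "p", 10), (1, "p", 20), (2, "p", 12)]:
    print(row, f.process(*row))
```

Each arriving order costs a constant amount of work on average, instead of a scan over a year of history.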
  • 19. Summary - data optimization via materialized views Many forms of data optimization can be modeled as materialized views: ● Blocks in cache ● B-tree indexes ● Summary tables ● Spatial indexes ● History of streams Allows the optimizer to “understand” the optimization and use it (if beneficial) But who designs the optimizations?
  • 21. How do data systems learn? queries DML statistics adaptations recommender Goals ● Improve response time, throughput, storage cost ● Predictable, adaptive (short and long term), allow human intervention How? ● Humans ● Adaptive systems ● Smart algorithms Example adaptations ● Cache disk blocks in memory ● Cached query results ● Data organization, e.g. partition on a different key ● Secondary structures, e.g. b-tree and r-tree indexes
  • 22. Tiled, in-memory materialized views A vision for an adaptive data system (we’re not there yet) tables on disk in-memory materializations SELECT x, SUM(n) FROM t GROUP BY x
  • 23. Building materialized views Challenges: ● Design Which materializations to create? ● Populate Load them with data ● Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived Initial focus: summary tables (materialized views over star schemas)
  • 24. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); Designing summary tables via lattices CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; product product class sales customers time
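A summary table is useful beyond the exact query that defines it because coarser groupings can be rolled up from it. The sketch below answers a year/state query from rows shaped like `SalesYearZipcode`; the sample rows and the function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical contents of SalesYearZipcode:
# (year, state, zipcode, COUNT(*), SUM(units))
summary = [
    (2016, "CA", "94105", 3, 30),
    (2016, "CA", "94110", 2, 5),
    (2016, "NY", "10001", 4, 12),
    (2017, "CA", "94105", 1, 7),
]

def rollup_year_state(rows):
    """Answer GROUP BY year, state from the finer (year, state, zipcode) summary."""
    totals = defaultdict(lambda: [0, 0])
    for year, state, _zipcode, cnt, units in rows:
        totals[(year, state)][0] += cnt      # COUNT(*) rolls up as a SUM of counts
        totals[(year, state)][1] += units    # SUM(units) rolls up as a SUM of sums
    return {key: tuple(value) for key, value in totals.items()}

print(rollup_year_state(summary))
```

The same idea does not hold for every aggregate (a distinct count, for example, cannot be summed), which is part of why the planner needs an algebra describing how each table is derived.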
  • 25. Many possible summary tables Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 raw 1m (y, m) 60 (g, y) 10 (z, s) 43.4k (g, y, m) 120 Fewer than you would expect, because 5m combinations cannot occur in 1m row table Fewer than you would expect, because state depends on zipcode
  • 26. Algorithm: Design summary tables Given a database with 30 columns, 10M rows. Find X summary tables with under Y rows that improve query response time the most. AdaptiveMonteCarlo algorithm [1]: ● Based on research [2] ● Greedy algorithm that takes a combination of summary tables and tries to find the table that yields the greatest cost/benefit improvement ● Models “benefit” of the table as query time saved over simulated query load ● The “cost” of a table is its size [1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm [2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
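To make the greedy idea concrete, here is a sketch of the basic selection loop from reference [2], not of `AdaptiveMonteCarloAlgorithm` itself. It assumes every candidate grouping is an equally likely query, that answering a query costs the row count of the smallest chosen table containing its columns, and it picks a fixed number of tables rather than weighing benefit against size. Row counts are taken from the lattice slides; the function name is illustrative.

```python
def greedy_summary_tables(candidates, raw_rows, k):
    """Pick up to k summary tables greedily by total rows saved.

    candidates maps a frozenset of column codes to that grouping's row count.
    """
    # Cheapest known way to answer each grouping; initially only the raw table.
    cost = {w: raw_rows for w in candidates}
    chosen = []
    for _ in range(k):
        best, best_benefit = None, 0
        for v, rows in candidates.items():
            if v in chosen:
                continue
            # v can answer any grouping w whose columns are a subset of v's.
            benefit = sum(max(0, cost[w] - rows) for w in candidates if w <= v)
            if benefit > best_benefit:
                best, best_benefit = v, benefit
        if best is None:
            break
        chosen.append(best)
        for w in candidates:
            if w <= best:
                cost[w] = min(cost[w], candidates[best])
    return chosen

# z = zipcode, s = state, g = gender, y = year, m = month
candidates = {
    frozenset("zsgym"): 912_000,
    frozenset("sgym"): 6_000,
    frozenset("gym"): 120,
    frozenset("zs"): 43_400,
    frozenset("ym"): 60,
    frozenset("s"): 50,
}
picked = greedy_summary_tables(candidates, raw_rows=1_000_000, k=2)
print([sorted(t) for t in picked])
```

Note that the largest table, (z, s, g, y, m), is a poor first choice despite answering every query: at 912k rows it saves little over the 1m-row raw table.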
  • 27. Lattice (optimized) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24 Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)
  • 28. Data profiling Algorithm needs count(distinct a, b, ...) for each combination of attributes: ● Previous example had 2^5 = 32 possible tables ● Schema with 30 attributes has 2^30 (about 10^9) possible tables ● Algorithm considers a significant fraction of these ● Approximations are OK Attempts to solve the profiling problem: 1. Compute each combination: scan, sort, unique, count; repeat 2^30 times! 2. Sketches (HyperLogLog) 3. Sketches + parallelism + information theory [CALCITE-1616]
  • 29. Sketches HyperLogLog is an algorithm that computes approximate distinct count. It can estimate cardinalities of 10^9 with a typical error rate of 2%, using 1.5 kB of memory. [3][4] With 16 MB memory per machine we can compute 10,000 combinations of attributes each pass. So, we’re down from 10^9 to 10^5 passes. [3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" [4] https://github.com/mrjgreen/HyperLogLog
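The pass counts on this slide can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 kB = 1,000 bytes) and the slide's rounding of sketches per pass to 10,000; the variable names are illustrative.

```python
combinations = 2 ** 30            # attribute combinations for a 30-column schema
sketch_bytes = 1_500              # one HyperLogLog sketch, per the slide
memory_bytes = 16_000_000         # memory budget per machine

sketches_per_pass = memory_bytes // sketch_bytes   # how many sketches fit at once
passes = -(-combinations // 10_000)                # ceiling division, slide's rounded batch size

print(sketches_per_pass, passes)
```

Roughly ten thousand sketches fit in memory, so about a hundred thousand passes would still be needed, which motivates pruning the search rather than only speeding up each count.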
  • 30. Combining probability & information theory
      Given | Expected cardinality | Actual cardinality | Surprise
      (gender): 2; (state): 50 | (gender, state): 100.0 | 100 | 0.000
      (month): 12; (zipcode): 43,000 | (month, zipcode): 441,699.3 | 442,700 | 0.001
      (state): 50; (zipcode): 43,000 | (state, zipcode): 799,666.7 | 43,400 | 0.897
      (state, zipcode): 43,400; (gender, state): 100; (gender, zipcode): 85,995 | (gender, state, zipcode): 86,799 = min(86,799, 892,234, 892,228) | 83,567 | 0.019
      ● Surprise = abs(actual - expected) / (actual + expected)
      ● E(card(x, y)) = n * (1 - ((n - 1) / n) ^ p), where n = card(x) * card(y) and p = row count
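The two formulas on this slide are easy to check numerically. The sketch below assumes a row count of 1,000,000 (the "raw 1m" figure from the lattice slides); the function names are illustrative.

```python
import math

def expected_cardinality(card_x: int, card_y: int, row_count: int) -> float:
    """E(card(x, y)) = n * (1 - ((n - 1) / n) ^ p) with n = card(x) * card(y)."""
    n = card_x * card_y
    # expm1/log1p evaluate 1 - (1 - 1/n)^p without losing precision for large n.
    return -n * math.expm1(row_count * math.log1p(-1.0 / n))

def surprise(actual: float, expected: float) -> float:
    """0 when the estimate is exact, approaching 1 as the two diverge."""
    return abs(actual - expected) / (actual + expected)

rows = 1_000_000  # assumed size of the raw table
for name, cx, cy, actual in [
    ("(gender, state)", 2, 50, 100),
    ("(month, zipcode)", 12, 43_000, 442_700),
    ("(state, zipcode)", 50, 43_000, 43_400),
]:
    e = expected_cardinality(cx, cy, rows)
    print(f"{name}: expected {e:,.1f}, actual {actual:,}, surprise {surprise(actual, e):.3f}")
```

The (state, zipcode) pair stands out because state is functionally dependent on zipcode, so the independence assumption behind the estimate fails badly.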
  • 31. Algorithm Three ways “surprise” can help:
      ● If a cardinality is not surprising, we don’t need to store it -- we can derive it
      ● If a combination’s cardinality is not surprising, it is unlikely to have surprising children
      ● If we’re not seeing surprising results, it’s time to stop
      surprise_threshold := 1
      queue := {singleton combinations} // (a), (b), ...
      while queue is not empty {
        batch := remove first 10,000 entries in queue
        compute cardinality of each combination in batch
        for each actual (computed) cardinality a {
          e := expected cardinality of combination
          s := surprise(a, e)
          if s > surprise_threshold {
            store combination and its cardinality
            add child combinations to queue // (x, a), (x, b), ...
          }
          increase surprise_threshold
        }
      }
  • 32. Algorithm progress and “surprise” threshold Progress of algorithm Rejected as not sufficiently surprising Surprise threshold rises as algorithm progresses Singleton combinations have surprise = 1 Surprise threshold rises after we have completed the first batch
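The pseudocode above can be turned into a small runnable sketch. Several details here are assumptions rather than the talk's method: cardinalities are counted exactly with Python sets instead of HyperLogLog sketches, the threshold starts at 0 and rises by a fixed step after each batch, singletons are given surprise 1 so they are always kept, and the expected value of a combination is the minimum estimate over its already-computed parents. The input is assumed to be a non-empty list of dicts.

```python
import math

def profile(rows, columns, batch_size=10_000, threshold_step=0.1):
    """Return {combination: distinct count} for combinations judged surprising."""
    p = len(rows)
    known = {}    # every cardinality computed so far
    stored = {}   # only the surprising ones

    def cardinality(combo):
        # Exact count; a real profiler would use a sketch here.
        return len({tuple(row[c] for c in combo) for row in rows})

    def expected(combo):
        estimates = []
        for c in combo:
            parent = tuple(x for x in combo if x != c)
            if parent in known:
                n = known[parent] * known[(c,)]
                if n == 1:
                    estimates.append(1.0)
                else:
                    estimates.append(-n * math.expm1(p * math.log1p(-1.0 / n)))
        return min(estimates)

    threshold = 0.0
    queue = [(c,) for c in columns]
    seen = set(queue)
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        for combo in batch:
            known[combo] = cardinality(combo)
        for combo in batch:
            a = known[combo]
            if len(combo) == 1:
                s = 1.0  # nothing to predict a singleton from
            else:
                e = expected(combo)
                s = abs(a - e) / (a + e)
            if s > threshold:
                stored[combo] = a
                for c in columns:
                    if c not in combo:
                        child = tuple(sorted(combo + (c,)))
                        if child not in seen:
                            seen.add(child)
                            queue.append(child)
        threshold += threshold_step
    return stored

# Column b is a function of a; column c is independent of both.
data = [{"a": i % 10, "b": (i % 10) // 2, "c": i % 7} for i in range(1000)]
print(profile(data, ["a", "b", "c"]))
```

On this sample only the (a, b) pair is kept beyond the singletons, because b is determined by a; the other pairs behave as independence predicts and their children are never explored.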
  • 33. Data profiling - summary The algorithm defeats a combinatorial search space using sketches + information theory + parallelism Recommending data structures is an optimization problem; profiling provides the cost & benefit function As a by-product, the algorithm discovers unique keys, “almost” keys, and foreign keys But which tables are actually joined together in practice?
  • 34. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; product product class sales customers time The lattice generates the summary tables. But who writes the lattice? Designing summary tables via lattices (2)
  • 35. CREATE LATTICE Sales AS SELECT t.*, c.*, COUNT(*), SUM(s.units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) JOIN Products AS p USING (productId); CREATE MATERIALIZED VIEW SalesYearZipcode AS SELECT t.year, c.state, c.zipcode, COUNT(*), SUM(units) FROM Sales AS s JOIN Time AS t USING (timeId) JOIN Customers AS c USING (customerId) GROUP BY 1, 2, 3; ALTER SCHEMA Sales INFER LATTICES; product product class sales customers time Designing summary tables via lattices (3)
  • 36. Lattice after Query 1 + 2 Query 2 Query 1 Growing and evolving lattices based on queries sales customers product product class sales product product class sales customers See: [CALCITE-1870] “Lattice suggester”
  • 37. Summary Learning systems = manual tuning + adaptive + smart algorithms Query history + data profiling → lattices → summary tables We have discussed summary tables (materialized views based on join/aggregate in a star schema) but the approach can be applied to other kinds of materialized views Relational algebra, incorporating materialized views, is a powerful language that allows us to combine many forms of data optimization
  • 38. Thank you! Questions? @julianhyde · @ApacheCalcite · http://calcite.apache.org Resources [CALCITE-1616] Data profiler [CALCITE-1870] Lattice suggester [CALCITE-1861] Spatial indexes [CALCITE-1968] OpenGIS [CALCITE-1991] Generated columns Talk: “Data profiling with Apache Calcite” (Hadoop Summit, 2017) Talk: “SQL on everything, in memory” (Strata, 2014) Zhang, Qi, Stradling, Huang (2014). “Towards a Painless Index for Spatial Objects” Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently” Image credit https://www.flickr.com/photos/defenceimages/6938469933/
  • 42. Planning queries MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc Table: splunk
  • 43. Optimized query MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: splunk Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc
  • 44. Calcite framework Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniqueness • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • FilterMergeRule • AggregateUnionTransposeRule • 100+ more Global transformations • Unification (materialized view) • Column trimming • De-correlation Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • RelDistribution (partitioning) RelBuilder JDBC driver Metadata Schema Table Function • TableFunction • TableMacro Lattice
  • 45. Materialized views, lattices, tiles Materialized view - A table whose contents are guaranteed to be the same as executing a given query. Lattice - Recommends, builds, and recognizes summary materialized views (tiles) based on a star schema. A query defines the tables and many:1 relationships in the star schema. Tile - A summary materialized view that belongs to a lattice. A tile may or may not be materialized. Might be: ● Declared in lattice, or ● Generated via recommender algorithm, or ● Created in response to query. CREATE MATERIALIZED VIEW t AS SELECT * FROM emps WHERE deptno = 10; CREATE LATTICE star AS SELECT * FROM sales_fact_1997 AS s JOIN product AS p ON … JOIN product_class AS pc ON … JOIN customer AS c ON … JOIN time_by_day AS t ON …; CREATE MATERIALIZED VIEW zg IN star SELECT gender, zipcode, COUNT(*), SUM(unit_sales) FROM star GROUP BY gender, zipcode;
  • 46. Combining past and future select stream * from Orders as o where units > ( select avg(units) from Orders as h where h.productId = o.productId and h.rowtime > o.rowtime - interval '1' year) ➢ Orders is used as both stream and table ➢ System determines where to find the records ➢ Query is invalid if records are not available
  • 47. Controlling when data is emitted Early emission is the defining characteristic of a streaming query. The emit clause is a SQL extension inspired by Apache Beam’s “trigger” notion. (Still experimental… and evolving.) A relational (non-streaming) query is just a query with the most conservative possible emission strategy. select stream productId, count(*) as c from Orders group by productId, floor(rowtime to hour) emit at watermark, early interval '2' minute, late limit 1; select * from Orders emit when complete;
  • 48. Other applications of data profiling Query optimization: ● Planners are poor at estimating selectivity of conditions after N-way join (especially on real data) ● New join-order benchmark: “Movies made by French directors tend to have French actors” ● Predict number of reducers in MapReduce & Spark “Grokking” a data set Identifying problems in normalization, partitioning, quality Applications in machine learning?
  • 49. Further improvements to data profiling ● Build sketches in parallel ● Run algorithm in a distributed framework (Spark or MapReduce) ● Compute histograms ○ For example, Median age for male/female customers ● Seek out functional dependencies ○ Once you know FDs, a lot of cardinalities are no longer “surprising” ○ FDs occur in denormalized tables, e.g. star schemas ● Smarter criteria for stopping algorithm ● Skew/heavy hitters. Are some values much more frequent than others? ● Conditional cardinalities and functional dependencies ○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)