Data profiling with
Apache Calcite
Julian Hyde
Apache: Big Data, Miami
2017/05/16
@julianhyde
SQL
Query planning
Query federation
OLAP
Streaming
Hadoop
ASF member
Original author of Apache Calcite
PMC Apache Arrow, Drill, Eagle, Kylin
Overview
Apache Calcite
Motivating problem: Automatically designing summary tables
What is data profiling?
Naive profiling algorithm
Improving the algorithm using sketches, parallelism, information theory
Applying data profiling to other problems
Apache Calcite
Apache top-level project since October, 2015
Query planning framework
➢ Relational algebra, rewrite rules
➢ Cost model & statistics
➢ Federation via adapters
➢ Extensible
Packaging
➢ Library
➢ Optional SQL parser, JDBC server
➢ Community-authored rules, adapters
Planning queries
[Plan over MySQL and Splunk: scan(Table: splunk) feeds filter(Condition: action = 'purchase'), joined with scan(Table: products) on Key: productId, then group(Key: productName, Agg: count) and sort(Key: c desc).]
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Optimized query
[Optimized plan over MySQL and Splunk, with the same operators: scan(Table: splunk), filter(Condition: action = 'purchase'), scan(Table: products), join(Key: productId), group(Key: productName, Agg: count), sort(Key: c desc).]
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Want to learn more about Calcite?
Come to my other talk:
● “Building a Smarter Pig”
● A Calcite adapter for Apache Pig
● Eli Levine & Julian Hyde
● Thursday 3.40pm
Optimizing queries
Problem
10 TB database, disk with 1 GB/s throughput, and a query that reads 1 TB of data.
Solutions
1. Sequential scan: query takes 1,000s.
2. Parallelize: spread the data over 100 disks in 25 machines. Query takes 10s.
3. Cache: keep the data in memory. 2nd query: 10ms. 3rd query: 10s.
4. Materialize: summarize the data on disk. All queries: 100ms.
5. Materialize + cache + adapt: as above, building summaries on demand.
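The first two figures follow directly from the throughput arithmetic; a quick check, using the slide's numbers:

```python
GB = 1
TB = 1_000 * GB

disk_throughput = 1 * GB    # GB per second, per disk
query_read = 1 * TB         # the query reads 1 TB

# 1. Sequential scan on one disk
print(query_read / disk_throughput)          # 1000.0 (seconds)

# 2. Spread over 100 disks, read in parallel
print(query_read / (100 * disk_throughput))  # 10.0 (seconds)
```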
Optimizing data
A materialized view (“materialization”) is a table that contains the result of a
query. The DBMS maintains it, and uses it to answer queries on other tables.
Challenges:
● Design: which materializations to create?
● Populate: load them with data
● Maintain: incrementally populate when data changes
● Rewrite: transparently rewrite queries to use materializations
● Adapt: design and populate new materializations, drop unused ones
● Express: need a rich algebra, to model how data is derived
create materialized view EmpSummary as
select deptno, COUNT(*) as c, SUM(sal) as s
from Emp
group by deptno
Lattice
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
()              1
(z) 43k   (s) 50   (g) 2   (y) 5   (m) 12
(z, s) 43.4k   (g, y) 10   (y, m) 60   (g, y, m) 120
(s, g, y, m)    6k
(z, s, g, y, m) 912k
raw             1m

(z, s, g, y, m) has fewer rows than you would expect, because 5m combinations
cannot occur in a 1m-row table. (z, s) has fewer than you would expect, because
state depends on zipcode.
Algorithm: Design summary tables
Given a database with 30 columns, 10M rows. Find X summary tables with under
Y rows that improve query response time the most.
AdaptiveMonteCarlo algorithm [1]:
● Based on research [2]
● Greedy algorithm that takes a combination of summary tables and tries to
find the table that yields the greatest cost/benefit improvement
● Models “benefit” of the table as query time saved over simulated query load
● The “cost” of a table is its size
[1] org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm
[2] Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently”
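In spirit (this is not the org.pentaho.aggdes implementation), the greedy loop repeatedly picks the candidate summary with the greatest benefit per row of cost, where benefit is query rows saved over a simulated workload. A minimal sketch, with hypothetical inputs:

```python
def design_summaries(candidates, queries, raw_rows, max_tables):
    """Greedy summary-table design: repeatedly add the candidate whose benefit
    (query rows saved over the workload) per row of cost is greatest.
    candidates: {frozenset_of_columns: row_count}; queries: list of frozensets."""

    def rows_scanned(q, tables):
        # a summary can answer a query if it groups by a superset of its columns
        usable = [rows for combo, rows in tables.items() if q <= combo]
        return min(usable, default=raw_rows)

    chosen = {}
    for _ in range(max_tables):
        base = sum(rows_scanned(q, chosen) for q in queries)
        best, best_ratio = None, 0.0
        for combo, rows in candidates.items():
            if combo in chosen:
                continue
            trial = {**chosen, combo: rows}
            benefit = base - sum(rows_scanned(q, trial) for q in queries)
            if benefit / rows > best_ratio:
                best, best_ratio = combo, benefit / rows
        if best is None:
            break                 # no remaining candidate helps
        chosen[best] = candidates[best]
    return chosen
```

For the lattice above, a workload of queries on (s, g) and (y, m) would make the algorithm pick (s, g, y, m) first: 6k rows of cost saves roughly 1m rows per query.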
Lattice (optimized)
()              1
(z) 43k   (s) 50   (g) 2   (y) 5   (m) 12
(g, y) 10   (g, m) 24   (y, m) 60   (g, y, m) 120
(z, s) 43.4k   (z, s, g) 83.6k   (s, g, y, m) 6k
(z, g, y, m) 909k   (z, s, y, m) 831k   (z, s, g, m) 644k   (z, s, g, y) 392k
(z, s, g, y, m) 912k
raw             1m
Key: z zipcode (43k), s state (50), g gender (2), y year (5), m month (12)
Aggregate   | Cost (rows) | Benefit (query rows saved) | % queries
s, g, y, m  | 6k          | 497k                       | 50%
z, s, g     | 87k         | 304k                       | 33%
g, y        | 10          | 1.5k                       | 25%
g, m        | 24          | 1.5k                       | 25%
s, g        | 100         | 1.5k                       | 25%
y, m        | 60          | 1.5k                       | 25%
Data profiling
Algorithm needs count(distinct a, b, ...) for each combination of attributes:
● Previous example had 2⁵ = 32 possible tables
● Schema with 30 attributes has 2³⁰ (about 10⁹) possible tables
● Algorithm considers a significant fraction of these
● Approximations are OK
Attempts to solve the profiling problem:
1. Compute each combination: scan, sort, unique, count; repeat 2³⁰ times!
2. Sketches (HyperLogLog)
3. Sketches + parallelism + information theory (CALCITE-1616)
Sketches
HyperLogLog is an algorithm that computes approximate distinct counts. It can
estimate cardinalities of 10⁹ with a typical error rate of 2%, using 1.5 kB of
memory. [3][4]
With 16 MB memory per machine we can compute 10,000 combinations of attributes
each pass.
So, we’re down from 10⁹ to 10⁵ passes.
[3] Flajolet, Fusy, Gandouet, Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm"
[4] https://github.com/mrjgreen/HyperLogLog
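To make the memory/accuracy trade-off concrete, here is a minimal HyperLogLog sketch (an illustrative implementation, not the one in [4] or in Calcite; the register count and hash choice are assumptions):

```python
import hashlib
import math

class HyperLogLog:
    """Illustrative HyperLogLog sketch for approximate distinct counting."""

    def __init__(self, p=12):
        self.p = p                  # 2^p registers; p=12 -> 4096 registers
        self.m = 1 << p             # packed as bytes this is a few kB
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: first p bits pick a register, the rest give the rank
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

Adding 10,000 distinct values and calling count() typically lands within a couple of percent of the true cardinality, using only the fixed-size register array.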
Given                            | Expected cardinality               | Actual cardinality | Surprise
(gender): 2, (state): 50         | (gender, state): 100.0             | 100                | 0.000
(month): 12, (zipcode): 43,000   | (month, zipcode): 441,699.3        | 442,700            | 0.001
(state): 50, (zipcode): 43,000   | (state, zipcode): 799,666.7        | 43,400             | 0.897
(state, zipcode): 43,400,        | (gender, state, zipcode): 86,799   | 83,567             | 0.019
(gender, state): 100,            |   = min(86,799, 892,234, 892,228)  |                    |
(gender, zipcode): 85,995        |                                    |                    |

● Surprise = |actual - expected| / (actual + expected)
● E(card(x, y)) = n · (1 - ((n - 1) / n)^p), where n = card(x) × card(y) and p = row count
Combining probability & information theory
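Both formulas can be checked directly against the rows of the table. A small sketch:

```python
def expected_cardinality(card_x, card_y, row_count):
    """E(card(x, y)) = n * (1 - ((n - 1) / n) ** p), n = card(x) * card(y), p = row count."""
    n = card_x * card_y
    return n * (1 - ((n - 1) / n) ** row_count)

def surprise(actual, expected):
    """abs(actual - expected) / (actual + expected); always in [0, 1]."""
    return abs(actual - expected) / (actual + expected)

# First row of the table: gender (2) x state (50) over 1m rows
print(round(expected_cardinality(2, 50, 1_000_000), 1))      # 100.0
# Third row: state x zipcode; expected ~799,666.7, actual only 43,400
print(round(surprise(43_400, expected_cardinality(50, 43_000, 1_000_000)), 3))
```

A surprise near 0 means the combination behaves as if the attributes were independent; a surprise near 1 (as for state and zipcode) signals a dependency worth storing.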
Algorithm
Three ways “surprise” can help:
● If a cardinality is not surprising, we don’t need to store it -- we can derive it
● If a combination’s cardinality is not surprising, it is unlikely to have surprising children
● If we’re not seeing surprising results, it’s time to stop
surprise_threshold := 1
queue := {singleton combinations} // (a), (b), ...
while queue is not empty {
batch := remove first 10,000 entries in queue
compute cardinality of each combination in batch
for each actual (computed) cardinality a {
e := expected cardinality of combination
s := surprise(a, e)
if s > surprise_threshold {
store combination and its cardinality
add child combinations to queue // (x, a), (x, b), ...
}
increase surprise_threshold
}
}
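A runnable rendering of this pseudocode (the threshold schedule, the expected() callback, and the tiny two-column example are assumptions for illustration; cardinality_of stands in for a HyperLogLog pass, and the real implementation is CALCITE-1616):

```python
def profile(columns, cardinality_of, expected, surprise,
            batch_size=10_000, threshold_step=0.1):
    """Surprise-driven breadth-first profiling, a sketch of the pseudocode above."""
    threshold = 0.0                  # singletons (surprise = 1) always pass
    stats = {}                       # combination -> cardinality
    queue = [frozenset({c}) for c in columns]
    seen = set(queue)
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        for combo in batch:
            actual = cardinality_of(combo)
            if surprise(actual, expected(combo, stats)) > threshold:
                stats[combo] = actual
                for c in columns:    # enqueue child combinations
                    child = combo | {c}
                    if len(child) > len(combo) and child not in seen:
                        seen.add(child)
                        queue.append(child)
        threshold += threshold_step  # rises after each batch
    return stats
```

With two columns a and b where card(a) = 2, card(b) = 3 and card(a, b) = 6, the pair's cardinality equals its expectation, so only the singletons are stored and profiling stops.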
Algorithm progress and surprise threshold
[Chart: progress of the algorithm. Combinations rejected as not sufficiently
surprising are discarded. Singleton combinations have surprise = 1; the
surprise threshold rises after we have completed the first batch, and keeps
rising as the algorithm progresses.]
Hierarchies considered harmful
Hierarchies are a feature of most OLAP systems.
Does it make sense to store (year, quarter, month, date) and roll up to
(year, quarter)?
No -- the algorithm can deduce hierarchies; less configuration means fewer
mistakes.
The summary optimizer naturally includes attributes that don’t increase
summary cardinality by much.
Feel free to specify a “drill path” in the slice & dice UI.
True hierarchy
(year)
↑
(year, quarter)
↑
(year, quarter, month)
↑
(year, quarter, month, date)
Almost a hierarchy
(nation)
↑
(nation, state)
↑
(nation, state, zipcode)
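The deduction works from profiled cardinalities alone: if card(x, y) is (nearly) equal to card(x), then x (nearly) determines y. A sketch, where the tolerance parameter is an assumption for handling "almost" hierarchies such as zipcode → state:

```python
def functional_dependencies(card, tolerance=0.0):
    """Deduce x -> y whenever card({x, y}) is within `tolerance` of card({x}).
    card: {frozenset_of_columns: cardinality}."""
    singles = [c for c in card if len(c) == 1]
    fds = []
    for x in singles:
        for y in singles:
            if x == y or x | y not in card:
                continue
            if card[x | y] <= card[x] * (1 + tolerance):
                fds.append((min(x), min(y)))   # x determines y
    return fds
```

With the figures from the lattice (zipcode 43k, state 50, (zipcode, state) 43.4k), a small tolerance recovers zipcode → state but, correctly, not state → zipcode.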
Other applications of data profiling
Query optimization:
● Planners are poor at estimating selectivity of conditions after N-way join
(especially on real data)
● New join-order benchmark: “Movies made by French directors tend to have
French actors”
● Predict number of reducers in MapReduce & Spark
“Grokking” a data set
Identifying problems in normalization, partitioning, quality
Applications in machine learning?
Further improvements
● Build sketches in parallel
● Run algorithm in a distributed framework (Spark or MapReduce)
● Compute histograms
○ For example, median age for male/female customers
● Seek out functional dependencies
○ Once you know FDs, a lot of cardinalities are no longer “surprising”
○ FDs occur in denormalized tables, e.g. star schemas
● Smarter criteria for stopping algorithm
● Skew/heavy hitters. Are some values much more frequent than others?
● Conditional cardinalities and functional dependencies
○ Does one partition of the data behave differently from others? (e.g. year=2005, state=LA)
Thank you!
https://issues.apache.org/jira/browse/CALCITE-1788
https://calcite.apache.org
@ApacheCalcite
@julianhyde
