Apache Calcite
One Front-end to Rule Them All
Michael Mior, PMC Chair
Overview
● What is Apache Calcite?
● Calcite components
● Streaming SQL
● Next steps and contributing to Calcite
What is Apache Calcite?
● An ANSI-compliant SQL parser
● A logical query optimizer
● A heterogeneous data processing framework (see the sketch below)
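A rough sketch of how those pieces come together in practice: open an in-memory Calcite JDBC connection, plug in a schema, and run SQL. The HrSchema/Employee classes and the "hr" schema name are invented for this example (Calcite's ReflectiveSchema maps public fields of a Java object to tables); lex=JAVA just keeps identifiers case-sensitive as written.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class CalciteQuickStart {
  // Hypothetical POJO-backed schema; any adapter-provided Schema is registered the same way.
  public static class HrSchema {
    public final Employee[] emps = {
        new Employee(100, "Alice"), new Employee(200, "Bob")};
  }
  public static class Employee {
    public final int empid;
    public final String name;
    Employee(int empid, String name) { this.empid = empid; this.name = name; }
  }

  public static void main(String[] args) throws Exception {
    Properties info = new Properties();
    info.setProperty("lex", "JAVA");  // case-sensitive, Java-style identifiers
    // An "empty" Calcite connection: no data source until schemas are added.
    Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
    CalciteConnection calcite = connection.unwrap(CalciteConnection.class);
    SchemaPlus root = calcite.getRootSchema();

    // Plug in a schema; adapters (Cassandra, MongoDB, ...) plug in the same way.
    root.add("hr", new ReflectiveSchema(new HrSchema()));

    try (Statement statement = calcite.createStatement();
         ResultSet rs = statement.executeQuery(
             "select name from hr.emps where empid > 100")) {
      while (rs.next()) {
        System.out.println(rs.getString("name"));
      }
    }
  }
}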
Origins
2004 LucidEra and SQLstream were each building SQL systems
2012 Code pared down and entered the ASF incubator
2015 Graduated from incubator
2016 I joined the Calcite project as a committer
2017 Joined the PMC and was voted as chair
2018 Paper presented at SIGMOD
Powered by Calcite
● Many open source projects
(Apache Hive, Apache Drill, Apache Phoenix, Lingual, …)
● Commercial products
(MapD, Dremio, Qubole, …)
● Contributors from Huawei, Uber, Intel, Salesforce, …
Conventional Architecture
(Diagram: JDBC client, JDBC server, SQL parser, optimizer, metadata, operators, and datastore, all bundled inside a single system)
Calcite Architecture
(Diagram: JDBC client → JDBC server (Avatica) → SQL parser → optimizer, with pluggable metadata, pluggable rules, and adapters connecting to 3rd-party data sources)
Optimizer
● Operates on relational algebra by matching rules
● Calcite contains 100+ rewrite rules
● Currently working on validating these using Cosette
● Optimization is cost-based
● “Calling convention” allows optimization across backends
Example rules
● Join order transposition
● Transpose different operators (e.g. push a project or filter below a join — see the sketch after this list)
● Merge adjacent operators
● Materialized view query rewriting
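As an illustration of how a rule gets applied, here is a minimal sketch using Calcite's heuristic planner to fire the filter-into-join rewrite shown later in this deck. It assumes a RelNode has already been built (e.g. with RelBuilder); rule constants have moved between classes across releases (CoreRules.* in recent versions), so treat the exact names as assumptions.

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public final class RuleDemo {
  /** Rewrites a logical plan by pushing filters into joins and merging projects. */
  static RelNode applyRewriteRules(RelNode logicalPlan) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(CoreRules.FILTER_INTO_JOIN)  // push WHERE conditions into the join
        .addRuleInstance(CoreRules.PROJECT_MERGE)     // merge adjacent projects
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(logicalPlan);
    return planner.findBestExp();  // the plan after all matching rules have fired
  }
}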
Optimizer
● Based on the Volcano optimizer generator
○ Logical operators are functions (e.g. join)
○ Physical operators implement logical operators
○ Physical properties are attributes of the data
(e.g. sorting, partitioning)
● Start with logical expressions and physical properties
● Optimization produces a plan containing only physical operators (see the planner sketch below)
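A hedged sketch of that flow through Calcite's Frameworks API: parse, validate, convert to logical relational algebra, then ask the Volcano-based planner for a plan in the Enumerable physical convention. The rule set below is only a placeholder, and builder options vary somewhat by Calcite version.

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.adapter.enumerable.EnumerableRules;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;
import org.apache.calcite.tools.RuleSets;

public final class PlannerDemo {
  static RelNode toPhysicalPlan(SchemaPlus schema, String sql) throws Exception {
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(schema)
        // Placeholder rule set; a real program registers many more logical and physical rules.
        .ruleSets(RuleSets.ofList(
            EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE,
            EnumerableRules.ENUMERABLE_FILTER_RULE,
            EnumerableRules.ENUMERABLE_PROJECT_RULE))
        .build();
    Planner planner = Frameworks.getPlanner(config);

    SqlNode parsed = planner.parse(sql);           // SQL text -> SqlNode
    SqlNode validated = planner.validate(parsed);  // name and type checking
    RelNode logical = planner.rel(validated).rel;  // logical relational algebra

    // Request a plan whose nodes all carry the Enumerable physical convention.
    RelTraitSet physicalTraits = logical.getTraitSet().replace(EnumerableConvention.INSTANCE);
    return planner.transform(0, physicalTraits, logical);  // 0 = index of the rule set above
  }
}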
Materialized views
Performance
Relational Algebra and Streaming
● Scan
● Filter
● Project
● Join
● Sort
● Aggregate
● Union
● Values
● Delta (relation to stream)
● Chi (stream to relation)
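These operators can also be composed programmatically with Calcite's RelBuilder, which is handy when prototyping plans without SQL. A sketch, assuming a FrameworkConfig whose default schema contains a hypothetical EMP table with DEPTNO and ENAME columns:

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.RelBuilder;

public final class AlgebraDemo {
  static RelNode buildPlan(FrameworkConfig config) {
    RelBuilder builder = RelBuilder.create(config);
    RelNode plan = builder
        .scan("EMP")                                                            // Scan
        .filter(builder.equals(builder.field("DEPTNO"), builder.literal(10)))   // Filter
        .project(builder.field("ENAME"), builder.field("DEPTNO"))               // Project
        .aggregate(builder.groupKey("DEPTNO"),
                   builder.count(false, "C"))                                   // Aggregate
        .sort(builder.field("C"))                                               // Sort
        .build();
    System.out.println(RelOptUtil.toString(plan));  // print the operator tree
    return plan;
  }
}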
Adapters
● Connect to different backends (not just relational)
● The only required operation is a table scan (see the table sketch below)
● Allow push down of filter, sort, etc.
● Calcite implements remaining operators
● Calling convention allows Calcite to separate backend-specific operators from generic implementations
● Any relational algebra operator can be pushed down
● Operator push down simply requires a new optimizer rule
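A rough sketch of that minimum contract: a table that only knows how to scan itself, with Calcite layering filtering, projection, joins, aggregation, and sorting on top (push-down rules like those just mentioned can then move work back to the backend). The table name, row type, and in-memory rows are invented for the example.

import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

/** A toy table: full scans only; every other operator is handled by Calcite. */
public class PlaylistsTable extends AbstractTable implements ScannableTable {
  private final Object[][] rows = {
      {1, "Be My Escape", "Relient K"},
      {2, "Trademark", "Relient K"},
  };

  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("song_order", SqlTypeName.INTEGER)
        .add("title", SqlTypeName.VARCHAR)
        .add("artist", SqlTypeName.VARCHAR)
        .build();
  }

  @Override public Enumerable<Object[]> scan(DataContext root) {
    // The one operation a backend must support: produce every row.
    return Linq4j.asEnumerable(rows);
  }
}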
Conventions
1. Plans start as logical nodes
2. Assign each Scan its table's native convention
3. Fire rules to propagate conventions to other nodes
4. The best plan may use an engine not tied to any native format
(Diagram: the same Join/Filter/Scan plan tree shown at each of the four steps, with conventions spreading upward from the Scans)
Conventions
● Conventions are a uniform representation of hybrid queries
● A physical property of nodes (like ordering, distribution)
● Adapter = Schema factory + Convention + Rules to convert to a convention (see the schema factory sketch below)
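A sketch of the "schema factory" part of that equation, exposing the toy table from the earlier adapter sketch. The convention and converter rules are omitted here, all names are invented for the example, and each class would normally live in its own file.

import java.util.Collections;
import java.util.Map;
import org.apache.calcite.schema.Schema;
import org.apache.calcite.schema.SchemaFactory;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;

/** Exposes a fixed set of tables; a real adapter would inspect its backend. */
class PlaylistSchema extends AbstractSchema {
  @Override protected Map<String, Table> getTableMap() {
    return Collections.<String, Table>singletonMap("playlists", new PlaylistsTable());
  }
}

/** Lets the schema be created from a JSON model file or registered directly. */
class PlaylistSchemaFactory implements SchemaFactory {
  @Override public Schema create(SchemaPlus parentSchema, String name,
      Map<String, Object> operand) {
    // 'operand' carries adapter-specific options from the model (hosts, keyspace, ...).
    return new PlaylistSchema();
  }
}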
Apache Cassandra Adapter
● Column store database
● Uses tables partitioned across servers and clustered
● Supports limited filtering and sorting
Query example
CREATE TABLE playlists (id uuid, song_order int,
song_id uuid, title text, artist text,
PRIMARY KEY (id, song_order));
SELECT title FROM playlists WHERE
id=62c36092-82a1-3a00-93d1-46196ee77204 AND
artist='Relient K' ORDER BY song_order;
Query example
SELECT * FROM playlists;
(Plan diagram: Project, Filter, Sort, and a Cassandra Scan — only the Scan runs in Cassandra)
● Start with a table scan
● Remaining operations performed by Calcite
Query example
SELECT * FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204;
(Plan diagram: Project, Filter, Sort over a Cassandra Scan + Filter — the partition-key filter runs in Cassandra)
● Push the filter on the partition key to Cassandra
● The remaining filter is performed by Calcite
Query example
SELECT * FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204 ORDER BY song_order;
(Plan diagram: Project and Filter over a Cassandra Scan + Filter + Sort — the filter and ordering run in Cassandra)
● Push the ordering to Cassandra
● This uses the table's clustering key
Query example
SELECT title, artist FROM playlists WHERE id=62c36092-82a1-3a00-93d1-46196ee77204 ORDER BY song_order;
(Plan diagram: Filter and Project over a Cassandra Scan + Filter + Sort + Project — the scan, filter, sort, and projection of needed fields all run in Cassandra)
● Push down the project of necessary fields
● This is the query sent to Cassandra
● Only the filter and project are done by Calcite (an EXPLAIN sketch follows)
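To see how much of a query was pushed down, one can run EXPLAIN PLAN FOR over a Calcite connection and read which operators carry the backend's convention. A hedged sketch: it assumes a connection whose schema already exposes the playlists table (as in the earlier JDBC example), and the quoted id literal is only illustrative.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public final class ExplainDemo {
  static void printPlan(Connection calciteConnection) throws Exception {
    try (Statement statement = calciteConnection.createStatement();
         ResultSet rs = statement.executeQuery(
             "EXPLAIN PLAN FOR "
                 + "SELECT title FROM playlists "
                 + "WHERE id = '62c36092-82a1-3a00-93d1-46196ee77204' "
                 + "ORDER BY song_order")) {
      while (rs.next()) {
        // The textual plan: operators pushed to the backend appear with its
        // convention; the rest run in Calcite's generic (Enumerable) engine.
        System.out.println(rs.getString(1));
      }
    }
  }
}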
What we get for free
● Materialized view maintenance
● View-based query rewriting
● Full SQL support
● Join with other data sources
Data representation
● All data must be modeled as relations
● Easy for relational databases
● Also relatively easy for many wide column stores
● What about document stores?
Semistructured Data
● Columns can have complex types (e.g. arrays and maps)
● Add UNNEST operator to relational algebra
● New rules can be added to optimize these queries
Before UNNEST:
name   age   pets
Sally  29    [{name: Fido, type: Dog}, {name: Jack, type: Cat}]

After UNNEST:
name   age   pets
Sally  29    {name: Fido, type: Dog}
Sally  29    {name: Jack, type: Cat}
MongoDB Adapter
_MAP
{ _id : 02401, city : BROCKTON, loc : [ -71.03434799999999, 42.081571 ], pop : 59498, state : MA }
{ _id : 06902, city : STAMFORD, loc : [ -73.53742800000001, 41.052552 ], pop : 54605, state : CT }
● Use one column with the whole document
● Unnest attributes as needed
● This is very messy, but we have no schema to work with
MongoDB Adapter
id     city      longitude    latitude    population  state
02401  BROCKTON  -71.034348   42.081571   59498       MA
06902  STAMFORD  -73.537428   41.052552   54605       CT
● Views to the rescue!
● Users of adapters can define structured views over semistructured data (or do this lazily — see Apache Drill); a registration sketch follows
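For the MongoDB example, such a view is just SQL over the raw _MAP column. The sketch below registers it with ViewTable.viewMacro; the schema name "mongo", the column casts, and the exact viewMacro signature should all be treated as assumptions (a JSON model file with a table of type "view" achieves the same result).

import java.util.Arrays;
import java.util.Collections;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.impl.ViewTable;

public final class ZipsView {
  static void register(SchemaPlus rootSchema) {
    // Structured columns carved out of the raw document column (_MAP);
    // the loc array (longitude/latitude) is omitted for brevity.
    String viewSql = "SELECT CAST(_MAP['_id'] AS VARCHAR(5)) AS id, "
        + "CAST(_MAP['city'] AS VARCHAR(20)) AS city, "
        + "CAST(_MAP['pop'] AS INTEGER) AS population, "
        + "CAST(_MAP['state'] AS VARCHAR(2)) AS state "
        + "FROM \"mongo\".\"zips\"";
    rootSchema.add("zips_view",
        ViewTable.viewMacro(rootSchema, viewSql,
            Collections.singletonList("mongo"),           // schema path for name resolution
            Arrays.asList("mongo", "zips_view"), false)); // view path; not modifiable
  }
}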
Available Adapters
SELECT p.productName, COUNT(*) AS c
FROM splunk.splunk AS s
JOIN mysql.products AS p
ON s.productId = p.productId
WHERE s.action = 'purchase'
GROUP BY p.productName
ORDER BY c DESC
(Plan diagram before FilterIntoJoin: sort [key: c desc] over group [key: productName, agg: count] over filter [action = 'purchase'] over join [key: productId] of scan [table: splunk] and scan [table: products, MySQL])
(Plan diagram after FilterIntoJoin, same query: sort [key: c desc] over group [key: productName, agg: count] over join [key: productId] of (filter [action = 'purchase'] over scan [table: splunk]) and scan [table: products, MySQL] — the filter has been pushed below the join, onto the Splunk input)
Streaming Data
● Calcite supports multiple windowing algorithms
(e.g. tumbling, sliding, hopping)
● Streaming queries can be combined with tables
● Streaming queries can be optimized using the same rules
along with new rules specifically for streaming queries
Streaming Data
● Relations can be used both as streams and tables
● Calcite is a reference implementation for streaming SQL
(still being standardized)
SELECT STREAM * FROM Orders AS o WHERE units >
(SELECT AVG(units) FROM Orders AS h WHERE
h.productId = o.productId AND h.rowtime >
o.rowtime - INTERVAL '1' YEAR)
Windowing
Tumbling window:
SELECT STREAM … FROM Orders
GROUP BY FLOOR(rowtime TO HOUR)

SELECT STREAM … FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR)

Hopping window:
SELECT STREAM … FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '2' HOUR)

Session window:
SELECT STREAM … FROM Orders
GROUP BY SESSION(rowtime, INTERVAL '1' HOUR)
My Use Case
● Perform view-based query rewriting to provide a logical
model over a denormalized data store
● Denormalized tables are views over (non-materialized)
logical tables
● Queries can be rewritten from logical tables to the most
cost-efficient choice of materialized views
Use Cases
● Parsing and validating SQL (not so easy)
● Adding a relational front end to an existing system
● Prototyping new query processing algorithms
● Integrating data from multiple backends
● Allowing RDBMS tools to work with non-relational DBs
Calcite Project Future Work
● Geospatial queries
● Processing scientific data formats
● Sharing data in-memory between backends
● Additional query execution engines
My Future Work
● Better cost modeling
● Query-based data source selection
● Cost-based database system selection
Contributing to Apache Calcite
● Pick an existing issue or file a new one and start coding!
● Mailing list is generally very active
● New committers and PMC members regularly added
● Many opportunities for projects at various scales
Additional areas for contribution
● Testing (SQL is hard!)
● Incorporating state-of-the-art in DB research
● Access control across multiple systems
● Adapters for new classes of database (e.g. array DBs)
● Implement missing SQL features (e.g. set operations)
…
Thanks to
● Edmon Begoli, Oak Ridge National Laboratory
● Jesús Camacho-Rodríguez, Hortonworks
● Julian Hyde, Hortonworks
● Daniel Lemire, Université du Québec (TÉLUQ)
● All other Calcite contributors!
Questions?
https://calcite.apache.org/
