1 © Hortonworks Inc. 2011–2018. All rights reserved
Accelerating query processing with materialized views in Apache Hive
Jesús Camacho Rodríguez
DataWorks Summit Berlin
April 18, 2018
Apache Hive
• Initial use case: batch processing
• Read-only data
• HiveQL (SQL-like query language)
• MapReduce
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Upcoming release: Apache Hive 3.0 (May 2018)
• Extensive renovation to improve three different axes
• Latency: allow interactive and sub-second queries
• Scalability: from TB to PB of data
• SQL support: move from HiveQL to SQL standard
Apache Hive
• Multiple execution engines: Apache Tez and Apache Spark
• More efficient join execution algorithms
• Vectorized query execution
• Integration with columnar storage formats: Apache ORC, Apache Parquet
• LLAP (Live Long and Process)
• Persistent daemons for low-latency queries
• Rule-based and cost-based optimizer
• Better statistics
• Tighter integration with other data processing systems: Druid
Important internals improvements
Accelerating query processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns
Accelerating query processing
• Establish relationship between original and new tables
• Has a similar table already been created?
• Rewrite your queries to use new tables
• What happens when access patterns change?
• Maintain your new tables when original tables change
• Do I have to fully rebuild new tables?
Optimization based on access patterns
Currently, Hive users have to do it manually
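As a rough illustration of the manual approach described above, the sketch below pre-aggregates a table into a summary table, queries the summary directly, and later refreshes it by hand. The table and column names (`sales`, `sales_by_day`, `sale_date`, `amount`) are hypothetical and chosen only for illustration; nothing in this workflow keeps the summary in sync with its source.

```sql
-- Hypothetical example: manual pre-aggregation without materialized views.

-- Step 1: the user creates and populates a summary table by hand.
CREATE TABLE sales_by_day AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;

-- Step 2: the user must rewrite each query to read the summary table.
SELECT sale_date, total_amount
FROM sales_by_day
WHERE sale_date >= '2018-01-01';

-- Step 3: after new rows land in sales, the user must refresh the summary,
-- here with a full rebuild.
INSERT OVERWRITE TABLE sales_by_day
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;
```

Each of the three steps corresponds to one of the questions on this slide: discovering the summary table, rewriting queries, and maintaining the copy.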
Materialized views
• A materialized view is an entity that contains the result of evaluating a query
• Important property: the system is aware of the materialized view definition semantics
• Optimizer can exploit them for automatic query rewriting
• System can handle maintenance of the materialized views
• Generally, materializations can be created in different forms depending on the scope
• DBA writes “CREATE MATERIALIZED VIEW” statement
• Daemon creates materialized view based on recent query activity
• Cached result of previous similar query
• Query factorization identifies common pieces within a single query
Possible workflow
1. Create materialized view using Hive tables
• Stored by Hive or Druid
2. User or dashboard sends queries to Hive
• Hive rewrites queries using available materialized views
• Execute rewritten query
[Figure: ① a DBA or recommendation system creates the materialized view over the data; ② dashboards and BI tools send queries]
CREATE MATERIALIZED VIEW `ssb_mv`
ENABLE REWRITE
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
<query>;
Materialized views in Apache Hive
• First implementation will be part of Apache Hive 3.0
• Multiple storage options: Hive, Druid
• Automatic rewriting of incoming queries to use materialized views
• Efficient view maintenance
• Incremental refresh
• Multiple options to control the materialized view lifecycle
Management of materialized views in Hive
Materialized view creation
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[ENABLE REWRITE | DISABLE REWRITE]
[COMMENT materialized_view_comment]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
⇢ Supports custom table properties, storage format, etc.
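A concrete instance of the syntax above might look like the following sketch. The view name `emps_by_dept_mv`, the source table `emps`, and the choice of ORC with Snappy compression are assumptions made for illustration, not part of the original deck.

```sql
-- Hypothetical instance of the CREATE MATERIALIZED VIEW syntax.
-- emps is an assumed source table with a deptno column.
CREATE MATERIALIZED VIEW IF NOT EXISTS default.emps_by_dept_mv
COMMENT 'Employee count per department'
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS
SELECT deptno, COUNT(*) AS emp_count
FROM emps
GROUP BY deptno;
```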
Materialized view creation (stored in Druid)
• CREATE MATERIALIZED VIEW statement
CREATE MATERIALIZED VIEW druid_wiki_mv -- Hive materialized view name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' -- Hive storage handler classname
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Other operations for materialized view management
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards'];
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
⇢ More operations to be added and extended
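The statements above could be used as in the sketch below. The database `default` and the view name `emps_by_dept_mv` are hypothetical placeholders.

```sql
-- List the materialized views in a database (database name is an assumption).
SHOW MATERIALIZED VIEWS IN default;

-- Inspect the definition and storage details of one view (hypothetical name).
DESCRIBE FORMATTED default.emps_by_dept_mv;

-- Remove the view once it is no longer useful.
DROP MATERIALIZED VIEW default.emps_by_dept_mv;
```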
Materialized view-based query rewriting
Materialized view-based rewriting algorithm
• Automatically rewrite incoming queries using materialized views
• Optimizer exploits materialized view definition semantics
• Built on the ideas presented in [GL01] using Apache Calcite
• Supports queries containing TableScan, Project, Filter, Join, Aggregate operators
• Includes some extensions
• Generation of additional rewritings without needing to do join permutation
• Partial rewritings using union operators
• More information about the rewriting coverage
• http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information
[GL01] Jonathan Goldstein and Per-Åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proc. ACM SIGMOD Conf., 2001.
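To make the idea of a partial rewriting with a union operator more tangible, here is a conceptual sketch. It assumes a materialized view `mv` over hypothetical `emps` and `depts` tables that only keeps rows with `hire_date >= '2016-01-01'`; the plan the optimizer actually produces may differ in shape.

```sql
-- Assumed materialized view (hypothetical), shown here as a comment:
--   CREATE MATERIALIZED VIEW mv AS
--   SELECT empid, deptname, hire_date
--   FROM emps JOIN depts ON (emps.deptno = depts.deptno)
--   WHERE hire_date >= '2016-01-01';

-- Incoming query: it needs rows from 2015 onward, so mv covers only part of it.
SELECT empid, deptname
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2015-01-01';

-- Conceptual partial rewriting: covered rows come from mv,
-- the remaining range is still read from the source tables.
SELECT empid, deptname
FROM mv
UNION ALL
SELECT empid, deptname
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2015-01-01'
  AND hire_date < '2016-01-01';
```

The two branches of the union read disjoint date ranges, so no row is counted twice.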
Enable materialized view-based rewriting
• Global property to enable materialized view rewriting for queries
SET hive.materializedview.rewriting=true;
• Users can selectively enable/disable materialized views for rewriting
• Materialized views are enabled by default for rewriting
• Behavior can be altered after materialized view has been created
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
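A possible way to exercise these switches is sketched below. The view name `mv` and the tables `emps` and `depts` are assumptions; `EXPLAIN` is used only to inspect which tables the chosen plan scans.

```sql
-- Enable rewriting for the session.
SET hive.materializedview.rewriting=true;

-- Exclude one materialized view (hypothetical name) from rewriting.
ALTER MATERIALIZED VIEW mv DISABLE REWRITE;

-- Inspect the plan: with the view disabled, the plan should scan the source tables.
EXPLAIN
SELECT empid, deptname
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01';

-- Make the view available to the optimizer again.
ALTER MATERIALIZED VIEW mv ENABLE REWRITE;
```

Running the same `EXPLAIN` after re-enabling the view is a simple check of whether the optimizer picks it up for that query.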
Materialized view-based rewriting (example)
• Materialized view definition
Employees that were hired on or after January 1, 2016
CREATE MATERIALIZED VIEW mv
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
• Query
Employees that were hired last quarter
SELECT empid, deptname
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
• Materialized view-based rewriting
SELECT empid, deptname
FROM mv
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
[Figure: emps joined with depts]
mv contents
empid  deptname  hire_date
10001  IT        2016-03-01
10002  IT        2017-01-02
10003  HR        2017-07-01
10004  Finance   2018-01-15
10005  HR        2018-02-02
Query results
empid  deptname
10004  Finance
10005  HR
Materialized view-based rewriting (example 2)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extendedprice * lo_discount AS d_price,
lo_revenue - lo_supplycost
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
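[Figure: join graph linking lineorder to customer, dates, part, and supplier]
Exploit SQL PK-FK and NOT NULL constraints: they guarantee that the extra joins in the materialized view neither drop nor duplicate lineorder rows, so a query that joins only lineorder and dates can still be answered from mv
mv contents
d_year  lo_discount  <dims>  d_price
2013    2            ...     7.55
2014    4            ...     432.60
2013    2            ...     34.45
2012    2            ...     2.05
…       …            ...     …
Query results
sum
42.0
…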
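For the optimizer to rely on key relationships like the ones in this example, the constraints have to be declared on the tables. The sketch below shows how that might be done for two of the joins; the constraint names are made up, and the use of `DISABLE NOVALIDATE RELY` (declared but not enforced, trusted by the optimizer) is an assumption about how such a schema would typically be set up.

```sql
-- Sketch only: informational constraints on SSB-style tables.
-- Constraint names (pk_dates, fk_lo_dates, ...) are hypothetical.

-- Primary keys on the dimension tables.
ALTER TABLE dates
  ADD CONSTRAINT pk_dates PRIMARY KEY (d_datekey) DISABLE NOVALIDATE RELY;
ALTER TABLE customer
  ADD CONSTRAINT pk_customer PRIMARY KEY (c_custkey) DISABLE NOVALIDATE RELY;

-- Foreign keys from the fact table to the dimensions.
ALTER TABLE lineorder
  ADD CONSTRAINT fk_lo_dates FOREIGN KEY (lo_orderdate)
  REFERENCES dates (d_datekey) DISABLE NOVALIDATE RELY;
ALTER TABLE lineorder
  ADD CONSTRAINT fk_lo_customer FOREIGN KEY (lo_custkey)
  REFERENCES customer (c_custkey) DISABLE NOVALIDATE RELY;
```

The same pattern would apply to `part` and `supplier`. The foreign key columns would also need to be declared NOT NULL for the joins to be provably lossless.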
Materialized view-based rewriting (example 3)
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT floor(time to minute) AS __time, page,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY floor(time to minute), page;
• Query
SELECT floor(time to month),
SUM(added) AS c_added
FROM wiki
GROUP BY floor(time to month);
• Materialized view-based rewriting
SELECT floor(__time to month),
SUM(c_added) as c_added
FROM mv
GROUP BY floor(__time to month);
[Figure: wiki source table]
mv contents
__time               page    c_added  c_rmv
2011-01-01 01:05:00  Justin  1800     25
2011-01-20 19:00:00  Justin  2912     42
2011-01-01 11:06:00  Ke$ha   1953     17
2011-02-02 13:15:00  Ke$ha   3194     170
2011-01-02 18:00:00  Miley   2232     34
Query results
__time               c_added
2011-01-01 00:00:00  8897
2011-02-01 00:00:00  3194
Materialized view maintenance
Rebuilding materialized views
• Rebuild needs to be triggered manually by user
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
• Multiple benefits
• Decrease rebuild step execution time
• Preserves LLAP cache for existing data
• Materialized view should only use transactional tables (micromanaged or ACID)
• Current implementation only supports incremental rebuild for insert operations
• Update/delete operations force full rebuild
• Optimizer will attempt incremental rebuild
• Otherwise, fall back to full rebuild (INSERT OVERWRITE with MV definition)
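The requirement on transactional source tables can be sketched as follows. The table names `wiki_acid` and `wiki_mm`, the view name `wiki_mv`, and the column types are assumptions for illustration; the table properties shown are the standard way to declare full ACID and insert-only tables in Hive.

```sql
-- Full ACID source table (hypothetical schema).
CREATE TABLE wiki_acid (
  page STRING, `user` STRING, added INT, removed INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Insert-only (micromanaged) variant of the same table.
CREATE TABLE wiki_mm (
  page STRING, `user` STRING, added INT, removed INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

-- Materialized view over the transactional table.
CREATE MATERIALIZED VIEW wiki_mv AS
SELECT page, SUM(added) AS c_added
FROM wiki_acid
GROUP BY page;

-- New rows arrive through inserts only, so an incremental rebuild is possible.
INSERT INTO wiki_acid VALUES ('Justin', 'Boxer', 10, 2);

-- The rebuild is still triggered explicitly by the user.
ALTER MATERIALIZED VIEW wiki_mv REBUILD;
```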
Incremental view maintenance algorithm
• Relies on materialized view rewriting algorithm
• Materialized view stores write ID for its tables when it is created/refreshed
• Write ID associates rows with transactions
• When rebuild is triggered, introduce filter condition on write ID column in MV definition
• Read only new rows from source tables
• Execute materialized view rewriting
• Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
• INSERT (table scan, filter, project, join)
• MERGE (table scan, filter, project, join, aggregate)
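For a materialized view without aggregation, the INSERT case could look like the conceptual sketch below. It follows the deck's convention of showing the write ID as a filterable column named `writeID`, which is illustrative rather than an actual column users can reference. It also assumes a view `mv` over hypothetical `emps` and `depts` tables, that only `emps` received new rows, and that 9999 is the write ID stored at the last rebuild.

```sql
-- Conceptual sketch, not runnable as written: writeID stands in for
-- the internal write ID that Hive tracks per table.

-- Full rebuild: recompute the whole view from its definition.
INSERT OVERWRITE TABLE mv
SELECT empid, deptname, hire_date
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';

-- Incremental rebuild: append only the rows derived from new emps records.
INSERT INTO TABLE mv
SELECT empid, deptname, hire_date
FROM emps JOIN depts ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01'
  AND emps.writeID > 9999;
```

With an aggregate in the view definition, appended rows could collide with existing groups, which is why that case needs MERGE instead of a plain INSERT.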
Incremental view maintenance algorithm (example)
CREATE MATERIALIZED VIEW mv1 AS
SELECT page, user,
SUM(added) AS c_added,
SUM(removed) AS c_rmv
FROM wiki
GROUP BY page, user;
mv1 contents
page    user   c_added  c_rmv
Justin  Boxer  1800     25
Justin  Reach  2912     42
Ke$ha   Xeno   1953     17
Ke$ha   Helz   3194     170
Miley   Ashu   2232     34
wiki contents (new records)
page    user  …  added  removed  …  writeID
…       …     …  …      …        …  …
Miley   Ashu  …  68     16       …  10000
Justin  Zaka  …  392    239      …  10000
⇢ ALTER MATERIALIZED VIEW mv1 REBUILD;
① Rebuild statement rewriting
INSERT OVERWRITE mv1
SELECT page, user, SUM(c_added) AS c_added, SUM(c_rmv) AS c_rmv
FROM (
SELECT page, user, c_added, c_rmv
FROM mv1
UNION ALL
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) subq
GROUP BY page, user; -- roll up existing and new data
② Rewrite INSERT OVERWRITE into MERGE statement
MERGE INTO mv1
USING (
SELECT page, user, SUM(added) AS c_added, SUM(removed) AS c_rmv
FROM wiki
WHERE writeID > 9999
GROUP BY page, user) src
ON mv1.page = src.page AND mv1.user = src.user
WHEN MATCHED
THEN UPDATE SET c_added = mv1.c_added + src.c_added,
c_rmv = mv1.c_rmv + src.c_rmv
WHEN NOT MATCHED
THEN INSERT VALUES (src.page, src.user, src.c_added, src.c_rmv);
Materialized view lifecycle
Management of materialized view lifecycle
• Do not accept stale data (default)
• If content of the materialized view is not fresh, we do not use it for automatic query rewriting
• Still possible to trigger partial rewritings that read both the stale materialized view and new data from source tables
• Accept stale data
• Freshness defined as a time parameter
• If the MV has not been rebuilt within a certain time period and there were changes in base tables, ignore it for rewriting
• SET hive.materializedview.rewriting.time.window=10min;
• Can also be overridden for a specific materialized view using table properties
• Periodically rebuild materialized view, e.g., every 5 minutes
t=0min     t=5min   t=10min  t=15min  t=20min
Create MV  Rebuild  Rebuild  Rebuild  Rebuild
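Put together, the stale-data setup on this slide might be driven by statements like the ones below. The view name `mv` is hypothetical, and the periodic trigger is assumed to come from an external scheduler, since the rebuild itself has to be issued explicitly.

```sql
-- Accept materialized views that were rebuilt within the last 10 minutes,
-- even if their source tables have changed since.
SET hive.materializedview.rewriting.time.window=10min;

-- Issued by an external scheduler (assumption), e.g. every 5 minutes,
-- so the view never falls outside the 10-minute window.
ALTER MATERIALIZED VIEW mv REBUILD;
```

Choosing a rebuild period shorter than the time window leaves some slack in case a rebuild runs late.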
Road ahead
• Improvements to current materialized views implementation
• Rewriting performance and scalability
• Single/many MVs
• Control physical distribution of data
• DISTRIBUTE BY, SORT BY, CLUSTER BY
• Increase incremental view maintenance coverage
• Support update/delete in source tables
• Materialized view recommender
• Ease the identification of access patterns for a given workload
Demo
Thank you
https://cwiki.apache.org/confluence/display/Hive/Materialized+views

More Related Content

PDF
Hive Data Modeling and Query Optimization
PPTX
Apache Arrow: In Theory, In Practice
PDF
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
PPTX
How to understand and analyze Apache Hive query execution plan for performanc...
PDF
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
PDF
Hive spark-s3acommitter-hbase-nfs
PDF
Side by Side with Elasticsearch & Solr, Part 2
PPTX
ORC File - Optimizing Your Big Data
Hive Data Modeling and Query Optimization
Apache Arrow: In Theory, In Practice
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
How to understand and analyze Apache Hive query execution plan for performanc...
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
Hive spark-s3acommitter-hbase-nfs
Side by Side with Elasticsearch & Solr, Part 2
ORC File - Optimizing Your Big Data

What's hot (20)

PPTX
From cache to in-memory data grid. Introduction to Hazelcast.
PDF
Iceberg: A modern table format for big data (Strata NY 2018)
PPTX
BI, Reporting and Analytics on Apache Cassandra
PPTX
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
PDF
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
PDF
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
PDF
Introduction à ElasticSearch
PPTX
Building Reliable Lakehouses with Apache Flink and Delta Lake
PDF
Parquet - Data I/O - Philadelphia 2013
PDF
[❤PDF❤] Oracle 19c Database Administration Oracle Simplified
PDF
Introduction to Apache Beam
PDF
EBS on ACFS white paper
PDF
From Zero to Hero with Kafka Connect
PDF
RMAN - New Features in Oracle 12c - IOUG Collaborate 2017
PDF
Kafka 101 and Developer Best Practices
PDF
Flash for Apache Spark Shuffle with Cosco
PDF
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
PPTX
Tuning Apache Kafka Connectors for Flink.pptx
PDF
ORC Files
PPTX
YARN Ready: Integrating to YARN with Tez
From cache to in-memory data grid. Introduction to Hazelcast.
Iceberg: A modern table format for big data (Strata NY 2018)
BI, Reporting and Analytics on Apache Cassandra
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Introduction à ElasticSearch
Building Reliable Lakehouses with Apache Flink and Delta Lake
Parquet - Data I/O - Philadelphia 2013
[❤PDF❤] Oracle 19c Database Administration Oracle Simplified
Introduction to Apache Beam
EBS on ACFS white paper
From Zero to Hero with Kafka Connect
RMAN - New Features in Oracle 12c - IOUG Collaborate 2017
Kafka 101 and Developer Best Practices
Flash for Apache Spark Shuffle with Cosco
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Tuning Apache Kafka Connectors for Flink.pptx
ORC Files
YARN Ready: Integrating to YARN with Tez
Ad

Similar to Accelerating query processing with materialized views in Apache Hive (20)

PPTX
Accelerating query processing
PDF
Accelerating query processing with materialized views in Apache Hive
PPTX
Discardable In-Memory Materialized Queries With Hadoop
PPTX
Discardable In-Memory Materialized Query for Hadoop
PPTX
Hive Performance Dataworks Summit Melbourne February 2019
PDF
Fast SQL on Hadoop, Really?
PDF
What's New in Apache Hive 3.0 - Tokyo
PDF
What's New in Apache Hive 3.0?
PPTX
What's new in apache hive
PPT
materialized view description presentation
PDF
Autonomous ETL with Materialized Views
PDF
Selection & Maintenance of Materialized View and It’s Application for Fast Qu...
PDF
Fast SQL on Hadoop, really?
PPTX
Improve data warehouse performance by preprocessing
PDF
PGConf.ASIA 2019 Bali - Toward Implementing Incremental View Maintenance on P...
PDF
Cassandra Materialized Views
PDF
Fg33950952
PDF
Fg33950952
PDF
Flexviews materialized views for my sql
PDF
Data Warehousing 101(and a video)
Accelerating query processing
Accelerating query processing with materialized views in Apache Hive
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Query for Hadoop
Hive Performance Dataworks Summit Melbourne February 2019
Fast SQL on Hadoop, Really?
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0?
What's new in apache hive
materialized view description presentation
Autonomous ETL with Materialized Views
Selection & Maintenance of Materialized View and It’s Application for Fast Qu...
Fast SQL on Hadoop, really?
Improve data warehouse performance by preprocessing
PGConf.ASIA 2019 Bali - Toward Implementing Incremental View Maintenance on P...
Cassandra Materialized Views
Fg33950952
Fg33950952
Flexviews materialized views for my sql
Data Warehousing 101(and a video)
Ad

More from DataWorks Summit (20)

PPTX
Data Science Crash Course
PPTX
Floating on a RAFT: HBase Durability with Apache Ratis
PPTX
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
PDF
HBase Tales From the Trenches - Short stories about most common HBase operati...
PPTX
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
PPTX
Managing the Dewey Decimal System
PPTX
Practical NoSQL: Accumulo's dirlist Example
PPTX
HBase Global Indexing to support large-scale data ingestion at Uber
PPTX
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
PPTX
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
PPTX
Supporting Apache HBase : Troubleshooting and Supportability Improvements
PPTX
Security Framework for Multitenant Architecture
PDF
Presto: Optimizing Performance of SQL-on-Anything Engine
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
PPTX
Extending Twitter's Data Platform to Google Cloud
PPTX
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
PPTX
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
PDF
Computer Vision: Coming to a Store Near You
PPTX
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Data Science Crash Course
Floating on a RAFT: HBase Durability with Apache Ratis
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
HBase Tales From the Trenches - Short stories about most common HBase operati...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Managing the Dewey Decimal System
Practical NoSQL: Accumulo's dirlist Example
HBase Global Indexing to support large-scale data ingestion at Uber
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Security Framework for Multitenant Architecture
Presto: Optimizing Performance of SQL-on-Anything Engine
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Extending Twitter's Data Platform to Google Cloud
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Computer Vision: Coming to a Store Near You
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Recently uploaded (20)

PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Spectroscopy.pptx food analysis technology
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPT
Teaching material agriculture food technology
PDF
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Advanced IT Governance
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Modernizing your data center with Dell and AMD
PDF
GamePlan Trading System Review: Professional Trader's Honest Take
NewMind AI Weekly Chronicles - August'25 Week I
Big Data Technologies - Introduction.pptx
Spectroscopy.pptx food analysis technology
Diabetes mellitus diagnosis method based random forest with bat algorithm
Teaching material agriculture food technology
GDG Cloud Iasi [PUBLIC] Florian Blaga - Unveiling the Evolution of Cybersecur...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Review of recent advances in non-invasive hemoglobin estimation
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Dropbox Q2 2025 Financial Results & Investor Presentation
Advanced IT Governance
Reach Out and Touch Someone: Haptics and Empathic Computing
Understanding_Digital_Forensics_Presentation.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
The Rise and Fall of 3GPP – Time for a Sabbatical?
Spectral efficient network and resource selection model in 5G networks
NewMind AI Monthly Chronicles - July 2025
Per capita expenditure prediction using model stacking based on satellite ima...
Modernizing your data center with Dell and AMD
GamePlan Trading System Review: Professional Trader's Honest Take

Accelerating query processing with materialized views in Apache Hive

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing with materialized views in Apache Hive Jesús Camacho Rodríguez DataWorks Summit Berlin April 18, 2018
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Apache Hive • Initial use case: batch processing • Read-only data • HiveQL (SQL-like query language) • MapReduce • Effort to take Hive beyond its batch processing roots • Started in Apache Hive 0.10.0 (January 2013) • Upcoming release: Apache Hive 3.0 (May 2018) • Extensive renovation to improve three different axes • Latency: allow interactive and sub-second queries • Scalability: from TB to PB of data • SQL support: move from HiveQL to SQL standard
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Apache Hive • Multiple execution engines: Apache Tez and Apache Spark • More efficient join execution algorithms • Vectorized query execution • Integration with columnar storage formats: Apache ORC, Apache Parquet • LLAP (Live Long and Process) • Persistent deamons for low-latency queries • Rule-based and cost-based optimizer • Better statistics • Tighter integration with other data processing systems: Druid Important internals improvements
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing • Change data physical properties (distribute, sort) • Filter rows • Denormalize • Preaggregate Optimization based on access patterns
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Accelerating query processing • Establish relationship between original and new tables • Has a similar table already been created? • Rewrite your queries to use new tables • What happens when access patterns change? • Maintain your new tables when original tables change • Do I have to fully rebuild new tables? Optimization based on access patterns Currently, Hive users have to do it manually
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Materialized views • A materialized view is an entity that contains the result of a evaluating a query • Important property  Awareness of the materialized view definition semantics • Optimizer can exploit them for automatic query rewriting • System can handle maintenance of the materialized views • Generally, materializations can be created in different forms depending on the scope • DBA writes “CREATE MATERIALIZED VIEW” statement • Daemon creates materialized view based on recent query activity • Cached result of previous similar query • Query factorization identifies common pieces within a single query
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Possible workflow 1. Create materialized view using Hive tables • Stored by Hive or Druid 2. User or dashboard sends queries to Hive • Hive rewrites queries using available materialized views • Execute rewitten query Dashboards, BI tools CREATE MATERIALIZED VIEW `ssb_mv` STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' ENABLE REWRITE AS <query>; DBA, recommendation system ① ② Data Queries
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Materialized views in Apache Hive • First implementation will be part of Apache Hive 3.0 • Multiple storage options: Hive, Druid • Automatic rewriting of incoming queries to use materialized views • Efficient view maintenance • Incremental refresh • Multiple options to control materialized views lifecycle
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Management of materialized views in Hive
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view creation • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name [ENABLE REWRITE | DISABLE REWRITE] [COMMENT materialized_view_comment] [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] ] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] AS <query>; ⇢ Supports custom table properties, storage format, etc.
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view creation (stored in Druid) • CREATE MATERIALIZED VIEW statement CREATE MATERIALIZED VIEW druid_wiki_mv STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT __time, page, user, c_added, c_removed FROM src; Hive materialized view name Hive storage handler classname
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved Other operations for materialized view management DROP MATERIALIZED VIEW [db_name.]materialized_view_name; SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’]; DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name; ⇢ More operations to be added and extended
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based query rewriting
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting algorithm • Automatically rewrite incoming queries using materialized views • Optimizer exploits materialized view definition semantics • Built on the ideas presented in [GL01] using Apache Calcite • Supports queries containing TableScan, Project, Filter, Join, Aggregate operators • Includes some extensions • Generation of additional rewritings without needing to do join permutation • Partial rewritings using union operators • More information about the rewriting coverage • http://calcite.apache.org/docs/materialized_views#rewriting-using-plan-structural-information [GL01] Jonathan Goldstein and Per-åke Larson. Optimizing queries using materialized views: A practical, scalable solution. In Proc. ACM SIGMOD Conf., 2001.
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved Enable materialized view-based rewriting • Global property to enable materialized view rewriting for queries SET hive.materializedview.rewriting=true; • User can selectively use enable/disable materialized views for rewriting • Materialized views are enabled by default for rewriting • Behavior can be altered after materialized view has been created ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved depts Materialized view-based rewriting (example) • Materialized view definition Employees that were hired after 2016 CREATE MATERIALIZED VIEW mv AS SELECT empid, deptname, hire_date FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2016-01-01'; • Query Employees that were hired last quarter SELECT empid, deptname FROM emps JOIN depts ON (emps.deptno = depts.deptno) WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; • Materialized view-based rewriting SELECT empid, deptname FROM mv WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; deptsemps empid depname hire_date 10001 IT 2016-03-01 10002 IT 2017-01-02 10003 HR 2017-07-01 10004 Finance 2018-01-15 10005 HR 2018-02-02 mv contents empid depname 10004 Finance 10005 HR Query results
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting (example 2) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT <dims>, lo_revenue, lo_extprice * lo_disc AS d_price, lo_revenue - lo_supplycost, FROM customer, dates, lineorder, part, supplier WHERE lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and lo_custkey = c_custkey; • Query SELECT sum(lo_extendedprice*lo_discount) FROM lineorder, dates WHERE lo_orderdate = d_datekey and d_year = 2013 and lo_discount between 1 and 3; • Materialized view-based rewriting SELECT SUM(d_price) FROM mv WHERE d_year = 2013 and lo_discount between 1 and 3; supplier part dates customerlineorder Exploit SQL PK-FK and NOT NULL constraints d_year lo_discount <dims> d_price 2013 2 ... 7.55 2014 4 ... 432.60 2013 2 ... 34.45 2012 2 ... 2.05 … … ... … mv contents sum 42.0 … Query results
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved Materialized view-based rewriting (example 3) • Materialized view definition CREATE MATERIALIZED VIEW mv AS SELECT floor(time to minute), page, SUM(added) AS c_added, SUM(removed) AS c_rmv FROM wiki GROUP BY floor(time to minute), page; • Query SELECT floor(time to month), SUM(added) AS c_added FROM wiki GROUP BY floor(time to month); • Materialized view-based rewriting SELECT floor(time to month), SUM(c_added) as c_added FROM mv GROUP BY floor(time to month); wiki __time page c_added c_rmv 2011-01-01 01:05:00 Justin 1800 25 2011-01-20 19:00:00 Justin 2912 42 2011-01-01 11:06:00 Ke$ha 1953 17 2011-02-02 13:15:00 Ke$ha 3194 170 2011-01-02 18:00:00 Miley 2232 34 mv contents __time c_added 2011-01-01 00:00:00 8897 2011-02-01 00:00:00 3194 Query results
  • 19. Materialized view maintenance
  • 20. Rebuilding materialized views
      • Rebuild needs to be triggered manually by the user
          ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
      • Incremental materialized view maintenance
          • Only refresh data that has changed in source tables
          • Multiple benefits
              • Decreases rebuild step execution time
              • Preserves LLAP cache for existing data
          • Materialized view should only use transactional tables (micromanaged or ACID)
          • Current implementation only supports incremental rebuild for insert operations
              • Update/delete operations force a full rebuild
      • Optimizer will attempt an incremental rebuild
          • Otherwise, it falls back to a full rebuild (INSERT OVERWRITE with MV definition)
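    The requirements on this slide can be sketched end to end as follows. The `wiki` schema is hypothetical (the slides do not give column types), ORC storage is an assumption, and `page_x`/`user_x` are made-up values. Whether the final rebuild runs incrementally or as a full rebuild is decided by the optimizer.

    ```sql
    -- Source table declared transactional (full ACID in this sketch).
    -- For an insert-only (micromanaged) table, add
    -- 'transactional_properties'='insert_only' to TBLPROPERTIES.
    CREATE TABLE wiki (
      page    STRING,
      `user`  STRING,
      added   INT,
      removed INT
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    CREATE MATERIALIZED VIEW mv1 AS
    SELECT page, `user`,
           SUM(added)   AS c_added,
           SUM(removed) AS c_rmv
    FROM wiki
    GROUP BY page, `user`;

    -- New rows arrive through an insert, the only operation
    -- that currently allows an incremental rebuild.
    INSERT INTO wiki VALUES ('page_x', 'user_x', 10, 2);

    -- The rebuild is always triggered manually.
    ALTER MATERIALIZED VIEW mv1 REBUILD;
    ```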
  • 21. Incremental view maintenance algorithm
      • Relies on materialized view rewriting algorithm
      • Materialized view stores write ID for its tables when it is created/refreshed
          • Write ID associates rows with transactions
      • When rebuild is triggered, introduce filter condition on write ID column in MV definition
          • Read only new rows from source tables
      • Execute materialized view rewriting
          • Rewrite INSERT OVERWRITE (full rebuild) into more efficient plan
              • INSERT (table scan, filter, project, join)
              • MERGE (table scan, filter, project, join, aggregate)
  • 22. Incremental view maintenance algorithm (example)
      • Materialized view definition
          CREATE MATERIALIZED VIEW mv1 AS
          SELECT page, user,
                 SUM(added) AS c_added,
                 SUM(removed) AS c_rmv
          FROM wiki
          GROUP BY page, user;
      • mv1 contents
          page    user   c_added  c_rmv
          Justin  Boxer  1800     25
          Justin  Reach  2912     42
          Ke$ha   Xeno   1953     17
          Ke$ha   Helz   3194     170
          Miley   Ashu   2232     34
      • wiki contents (the last two rows are new records)
          page    user  …  added  removed  …  writeID
          …       …     …  …      …        …  …
          Miley   Ashu  …  68     16       …  10000
          Justin  Zaka  …  392    239      …  10000
      • Rebuild trigger
          ALTER MATERIALIZED VIEW mv1 REBUILD;
  • 23. Incremental view maintenance algorithm (example)
      • mv1 definition, mv1 contents and new wiki records as on slide 22
      • ① Rebuild statement rewriting
          INSERT OVERWRITE mv1
          SELECT page, user,
                 SUM(c_added) AS c_added,
                 SUM(c_rmv) AS c_rmv
          FROM (
            SELECT page, user, c_added, c_rmv
            FROM mv1
            UNION ALL
            SELECT page, user,
                   SUM(added) AS c_added,
                   SUM(removed) AS c_rmv
            FROM wiki
            WHERE writeID > 9999
            GROUP BY page, user) subq
          GROUP BY page, user;
      • The outer aggregation rolls up the existing mv1 rows together with the aggregated new rows
  • 24. Incremental view maintenance algorithm (example)
      • mv1 definition, mv1 contents and new wiki records as on slide 22
      • ② Rewrite INSERT OVERWRITE into MERGE statement
          MERGE INTO mv1
          USING (
            SELECT page, user,
                   SUM(added) AS c_added,
                   SUM(removed) AS c_rmv
            FROM wiki
            WHERE writeID > 9999
            GROUP BY page, user) src
          ON mv1.page = src.page
            AND mv1.user = src.user
          WHEN MATCHED THEN
            UPDATE SET c_added = mv1.c_added + src.c_added,
                       c_rmv = mv1.c_rmv + src.c_rmv
          WHEN NOT MATCHED THEN
            INSERT VALUES (src.page, src.user, src.c_added, src.c_rmv);
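    Working the example through by hand, the MERGE should update one existing `mv1` row and insert one new row. The query below is only a sketch of how to check this. The expected rows in the comments are derived from the slide's data, not taken from a recorded run.

    ```sql
    -- Run after ALTER MATERIALIZED VIEW mv1 REBUILD;
    SELECT page, `user`, c_added, c_rmv
    FROM mv1
    WHERE (page = 'Miley'  AND `user` = 'Ashu')
       OR (page = 'Justin' AND `user` = 'Zaka');

    -- Expected rows, derived from the slide data:
    --   Miley  | Ashu | 2300 | 50    matched: 2232 + 68 and 34 + 16
    --   Justin | Zaka |  392 | 239   not matched: inserted as new row
    ```

    The other four rows of `mv1` have no counterpart among the new records and stay unchanged.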
  • 25. Materialized view lifecycle
  • 26. Management of materialized view lifecycle
      • Do not accept stale data (default)
          • If the content of the materialized view is not fresh, it is not used for automatic query rewriting
          • Still possible to trigger partial rewritings that read both the stale materialized view and new data from source tables
      • Accept stale data
          • Freshness defined as a time parameter
          • If the MV was not rebuilt within that time period and there were changes in base tables, it is ignored for rewriting
              SET hive.materializedview.rewriting.time.window=10min;
          • Can also be overridden for a specific materialized view using table properties
          • Periodically rebuild materialized view, e.g., every 5 minutes
      • (Timeline: create MV at t=0 min, then rebuild at t=5, 10, 15 and 20 min)
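    The "accept stale data" policy can be sketched as configuration. The property name comes from the slide. The view name `mv_recent`, its query and the `5min` value are hypothetical and chosen only for illustration.

    ```sql
    -- Session level: materialized views rebuilt within the last
    -- 10 minutes may be used for rewriting even if base tables changed.
    SET hive.materializedview.rewriting.time.window=10min;

    -- Per-view override through a table property set at creation time.
    CREATE MATERIALIZED VIEW mv_recent
    TBLPROPERTIES ('hive.materializedview.rewriting.time.window'='5min')
    AS
    SELECT page, SUM(added) AS c_added
    FROM wiki
    GROUP BY page;
    ```

    The periodic rebuild is the same `ALTER MATERIALIZED VIEW ... REBUILD` statement issued on a schedule. How it is scheduled is left open here.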
  • 27. Road ahead
  • 28. Road ahead
      • Improvements to current materialized views implementation
          • Rewriting performance and scalability
              • Single/many MVs
          • Control physical distribution of data
              • DISTRIBUTE BY, SORT BY, CLUSTER BY
          • Increase incremental view maintenance coverage
              • Support update/delete in source tables
      • Materialized view recommender
          • Ease the identification of access patterns for a given workload
  • 29. Demo
  • 30. Thank you
      https://cwiki.apache.org/confluence/display/Hive/Materialized+views

Editor's Notes

  • #3: Intro-Hive evolution from batch to interactive (modify a bit original slide)
  • #4: Important improvements in Hive in general. Integration with Druid. Mention improvements to optimization too, to link with the next slide about accelerating query processing using materializations.
  • #5: Access patterns
  • #6: Access patterns
  • #7: A traditional technique for accelerating query execution is precalculation of materialized views. Awareness of semantics enables materialized view rewriting and automatic maintenance of materialized views.
  • #8: Possible workflow. Three important points. You can query the materialized view as with any other table. Druid integration goes beyond materialized views: you can just query Druid from Hive. Materialized views do not work exclusively with Druid; in fact, we expect them to play well with LLAP.
  • #9: Work that we have done in Hive, main goals
  • #15: Implemented in Calcite, based on paper
  • #16: How to enable? Rewriting is enabled by default in Hive 3.0; a materialized view can be altered to enable or disable it.
  • #18: Example 2 (materialized views exploit constraints)
  • #19: Example 3 (rollup based on time, richer semantics)
  • #21: Manual rebuild: full vs incremental
  • #22: Manual rebuild: full vs incremental
  • #23: Manual rebuild: full vs incremental
  • #24: Manual rebuild: full vs incremental
  • #25: Manual rebuild: full vs incremental
  • #27: Data freshness, different options: fresh data vs. accepting data staleness
  • #29: Control physical distribution of data (distribute by, sort by, cluster by). MV recommender. From other slides: e.g., scaling as the number of materialized views grows.