Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance
Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine
What we do
o Manage Spark SQL / Presto / Hive workloads
o Offer an Open API and a self-serve platform
o Optimize the Spark SQL / Presto / Hive engines
o Design data architecture for most business lines in ByteDance
Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ What are the main issues of nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems
Spark SQL at ByteDance
Timeline, 2016 to 2020: small-scale experiments, ad-hoc workload, a few ETL pipelines in production, full production deployment, main engine in the DW area
Why nested types are widely used
▪ Event log
  ▪ A lot of new tracking events are created every day
  ▪ It is not a good idea to create a new column for each new type of event
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of service backends
  ▪ Service backends may add new fields on demand. These fields may not be useful now, but they may be useful in the future
Main issues for nested types
▪ Unnecessary data is read, which wastes IO
▪ Vectorized read cannot be exploited when a nested-type column is read
▪ Filter pushdown cannot be utilized when a nested column is read
▪ Duplicated computation, e.g. JSON parsing, which is CPU-intensive
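The first and last issues can be pictured with a toy model. This is a minimal sketch, not the engine's actual read path: the `rows` data, the JSON encoding of the nested column, and the byte counting are all assumptions made for illustration. It only shows that extracting one subfield from a nested value means reading and parsing the whole value on every query, while a dedicated column holds just the field that is needed.

```python
import json

# Hypothetical toy table: every attribute lives in one nested JSON string column.
rows = [
    {"item": "apple", "people": json.dumps({"age": "18", "name": "jack", "gender": "male"})},
    {"item": "pear", "people": json.dumps({"age": "25", "name": "rose", "gender": "female"})},
]


def scan_nested(rows):
    """Read the whole nested value and parse it just to get one subfield."""
    bytes_read = 0
    ages = []
    for row in rows:
        raw = row["people"]
        bytes_read += len(raw.encode("utf-8"))    # the full nested value is read
        ages.append(int(json.loads(raw)["age"]))  # parsing is repeated by every query
    return ages, bytes_read


def scan_flat(age_column):
    """Read a dedicated column that already holds the extracted subfield."""
    bytes_read = sum(len(str(age)) for age in age_column)
    return list(age_column), bytes_read


nested_ages, nested_bytes = scan_nested(rows)
flat_ages, flat_bytes = scan_flat([18, 25])
```

Both scans return the same ages, but the flat scan touches far fewer bytes and does no parsing, which is the effect the remaining slides try to obtain without asking users to change their queries.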
Optional solutions
Optional solutions – A separate table
▪ DW users design their own solution to these problems
▪ Maintain a new table that adds new columns extracted from the nested columns
▪ Downstream users should query this new table and its new columns for better performance
▪ Pros
  ▪ Queries run on simple types, so all of the problems above are solved
▪ Cons
  ▪ All downstream users must be pushed to migrate their queries / pipelines to the new table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot handle frequently changing subfields
Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to support vectorized read for nested types
▪ Support predicate pushdown for struct
▪ Pros
  ▪ Enables vectorized read without any storage overhead
▪ Cons
  ▪ The vectorized readers for Parquet and ORC must each be refactored
  ▪ Filter pushdown for Array/Map is still not available
  ▪ Vectorized read on nested types does not perform as well as on simple types
▪ Improves performance with struct by about 100%
▪ Improves performance with map by about 163%
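To see why filter pushdown matters here, recall that Parquet keeps min/max statistics per column chunk in each row group, so a reader can skip a row group when the statistics rule out every row. The sketch below is a simplified model of that idea, not Spark's or Parquet's reader code; the `row_groups` layout and the `scan_with_pushdown` helper are hypothetical.

```python
# Hypothetical row groups of a flat "age" column, each with min/max statistics.
row_groups = [
    {"min_age": 18, "max_age": 25, "rows": [18, 21, 25]},
    {"min_age": 40, "max_age": 63, "rows": [40, 52, 63]},
]


def scan_with_pushdown(row_groups, lower_bound):
    """Return ages > lower_bound, skipping row groups whose max rules them out."""
    matched = []
    groups_read = 0
    for group in row_groups:
        if group["max_age"] <= lower_bound:
            continue  # statistics prove no row can match, so the group is never read
        groups_read += 1
        matched.extend(age for age in group["rows"] if age > lower_bound)
    return matched, groups_read


ages_over_30, groups_read = scan_with_pushdown(row_groups, 30)
```

With a predicate of `age > 30`, only the second row group is read. When the value sits inside a map, there is no such per-key statistic to consult, which is consistent with the slide's point that pushdown for Array/Map remains unavailable.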
How does Materialized Column solve these problems
Original table
CREATE TABLE base_table (
  item STRING,
  count INT,
  people MAP<STRING, STRING>,
  date STRING
)
USING parquet
PARTITIONED BY (date);
Add materialized column
ALTER TABLE base_table ADD COLUMNS (
  age INT MATERIALIZED CAST(people['age'] AS INTEGER)
);
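One way to read the ALTER TABLE statement above is that the engine evaluates the declared expression once, at write time, and stores the result as an ordinary column. The following is a minimal sketch of that reading, not the actual Spark implementation: `MATERIALIZED_COLUMNS`, `cast_int`, and `insert_row` are hypothetical names, and returning `None` for a missing or non-numeric value is an assumption that loosely stands in for a NULL result.

```python
def cast_int(value):
    """Return int(value), or None when the value is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Hypothetical registry: materialized column name -> expression over the row.
MATERIALIZED_COLUMNS = {
    "age": lambda row: cast_int(row["people"].get("age")),
}


def insert_row(table, row):
    """Store the row plus every materialized column computed from it."""
    stored = dict(row)
    for name, expression in MATERIALIZED_COLUMNS.items():
        stored[name] = expression(row)  # evaluated once, at write time
    table.append(stored)
    return stored


base_table = []
stored = insert_row(
    base_table,
    {
        "item": "appole",
        "count": 1,
        "people": {"age": "18", "name": "jack", "gender": "male"},
        "date": "20201010",
    },
)
```

The writer still supplies only the original columns; `age` is filled in on its behalf, so existing pipelines do not need to change.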
Write with materialized column
explain extended insert into base_table partition(date='20201010') select 'appole', 1, map('age','18','name','jack','gender','male')
Figure panels: Query without materialized column rewrite; Query with materialized column rewrite
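The query rewrite can be thought of as an expression substitution: wherever a query contains the expression that defines a materialized column, that subtree is replaced by a plain reference to the column. The slides do not show how the rewrite is implemented, so the sketch below is only a guess at its general shape. The tuple encoding of expressions and the `REWRITES` table are assumptions for illustration.

```python
# Expressions as nested tuples, e.g. CAST(people['age'] AS INT):
MATERIALIZED_EXPR = ("cast", ("map_get", "people", "age"), "int")

# Hypothetical rewrite table: defining expression -> reference to the stored column.
REWRITES = {MATERIALIZED_EXPR: ("col", "age")}


def rewrite(expr):
    """Replace every subtree that matches a materialized expression."""
    if expr in REWRITES:
        return REWRITES[expr]
    if isinstance(expr, tuple):
        return tuple(rewrite(child) for child in expr)
    return expr


# SELECT ... FROM base_table WHERE CAST(people['age'] AS INT) > 18
query = ("filter", (">", MATERIALIZED_EXPR, 18), ("scan", "base_table"))
rewritten = rewrite(query)
```

After the rewrite the filter compares a simple INT column, so the nested column need not be read for it, and vectorized read and filter pushdown apply as they would for any flat column. Users keep writing the original expression.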
Test case     Without Materialized   With Materialized    Performance   Read data size
              Column rewrite         Column rewrite
SQL_adhoc_1   6.3 min / 797.6 GB     3.4 min / 111.8 GB   85.3% ↑       86% ↓
SQL_adhoc_2   16.5 min / 3.2 TB      5.0 min / 111.1 GB   230% ↑        96.6% ↓
SQL_etl_1     24 min / 3.7 TB        9.1 min / 686.1 GB   130.8% ↑      82% ↓
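The percentage columns appear to be derived from the raw figures as (time before / time after − 1) for performance and (1 − size after / size before) for read data size. The sketch below recomputes them under that assumption, and under the further assumption that 1 TB is counted as 1024 GB; neither is stated on the slide. The two ad-hoc rows and all three size reductions reproduce the slide's numbers, while the SQL_etl_1 speedup comes out nearer 164% from the rounded times shown, so the slide's 130.8% may rest on different or unrounded measurements.

```python
GB_PER_TB = 1024  # assumption: the slide does not say whether TB means 1000 or 1024 GB


def speedup_pct(before_min, after_min):
    """Percentage improvement in run time, as (before / after - 1) * 100."""
    return (before_min / after_min - 1) * 100


def read_reduction_pct(before_gb, after_gb):
    """Percentage reduction in data read, as (1 - after / before) * 100."""
    return (1 - after_gb / before_gb) * 100


# (minutes before, GB before, minutes after, GB after), copied from the table above
cases = {
    "SQL_adhoc_1": (6.3, 797.6, 3.4, 111.8),
    "SQL_adhoc_2": (16.5, 3.2 * GB_PER_TB, 5.0, 111.1),
    "SQL_etl_1": (24, 3.7 * GB_PER_TB, 9.1, 686.1),
}

results = {
    name: (
        round(speedup_pct(before_min, after_min), 1),
        round(read_reduction_pct(before_gb, after_gb), 1),
    )
    for name, (before_min, before_gb, after_min, after_gb) in cases.items()
}

for name, (speedup, reduction) in results.items():
    print(f"{name}: {speedup}% faster, {reduction}% less data read")
```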