Materialized Column: An Efficient Way to Optimize Queries on Nested Columns

Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance

Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze
PB-level data by writing SQL without caring about the
underlying execution engine

What we do
o Manage Spark SQL / Presto / Hive workloads
o Offer an Open API and a self-serve platform
o Optimize the Spark SQL / Presto / Hive engines
o Design the data architecture for most business lines in ByteDance

Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ What are the main issues with nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems

Spark SQL at ByteDance
Timeline, 2016 to 2020 (stages in order):
▪ Small-scale experiments
▪ Ad-hoc workload
▪ A few ETL pipelines in production
▪ Full-production deployment
▪ Main engine in the DW area

Why nested types are widely used
▪ Event log
  ▪ Many new tracking events are created every day
  ▪ Creating a new column for each new type of event is not practical
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of service backends
  ▪ Service backends may add new fields on demand. These fields may not be
    useful now, but they may be useful in the future

Main issues with nested types
▪ Unnecessary data is read, which wastes IO
▪ Vectorized read cannot be exploited when a nested type column is read
▪ Filter pushdown cannot be utilized when a nested column is read
▪ Duplicated computation, e.g. JSON parsing, which is CPU-intensive
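
The first and last issues can be illustrated with a toy model. This is a minimal sketch, assuming the nested column is stored as a JSON string and using made-up rows; the function names are hypothetical, and the code only models the cost, not how Spark or Parquet actually read data.

```python
import json

# Made-up rows: 'people' is a JSON-encoded nested column (an assumption
# for illustration; the talk's example table uses a map column).
ROWS = [
    {"item": "apple", "people": json.dumps({"age": "18", "name": "jack", "gender": "male"})},
    {"item": "pear", "people": json.dumps({"age": "25", "name": "rose", "gender": "female"})},
]


def read_age_from_nested(rows):
    """Fetch one subfield by reading and parsing the whole nested value."""
    bytes_read = 0
    ages = []
    for row in rows:
        raw = row["people"]
        bytes_read += len(raw.encode("utf-8"))  # every subfield is read
        ages.append(int(json.loads(raw)["age"]))  # parsed again on every query
    return ages, bytes_read


def read_age_from_flat(age_column):
    """Fetch the same values from a dedicated column of plain integers."""
    bytes_read = sum(len(str(age)) for age in age_column)
    return list(age_column), bytes_read


nested_ages, nested_bytes = read_age_from_nested(ROWS)
flat_ages, flat_bytes = read_age_from_flat([18, 25])
```

Both paths return the same ages, but the nested path touches every subfield and repeats the parse for each query that needs `age`.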

Optional solutions – A separate table
▪ DW users design their own solution to these problems
▪ Maintain a new table that adds new columns extracted from the
nested columns
▪ Downstream users should query this new table and its new columns
for better performance

▪ Pros
  ▪ Queries run on simple types, so all of the problems above are solved
▪ Cons
  ▪ All downstream users must be pushed to migrate their queries / pipelines
    to the new table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot handle frequent subfield changes

Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to
support vectorized read for nested types
▪ Support predicate pushdown for struct
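
Predicate pushdown relies on per-column statistics: Parquet keeps min/max values per row group, so a reader can skip row groups that cannot satisfy a filter. The following is a minimal sketch of that idea on a flat integer column. `RowGroup` and `scan_age_greater_than` are hypothetical names, not Spark or Parquet APIs, and the data is made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics kept for one column."""
    ages: list
    min_age: int
    max_age: int


def make_row_group(ages):
    return RowGroup(ages=ages, min_age=min(ages), max_age=max(ages))


def scan_age_greater_than(row_groups, threshold):
    """Return matching ages and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group.max_age <= threshold:
            continue  # statistics prove no row can match, so skip the read
        groups_read += 1
        matches.extend(age for age in group.ages if age > threshold)
    return matches, groups_read


groups = [
    make_row_group([18, 21, 19]),
    make_row_group([35, 42, 38]),
    make_row_group([20, 22, 17]),
]
matches, groups_read = scan_age_greater_than(groups, 30)
```

Only the middle row group is read here. When the filtered value is buried inside a map or array, this kind of skipping is what the reader loses.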

▪ Pros
  ▪ Enables vectorized read without any storage overhead
▪ Cons
  ▪ The vectorized readers for Parquet and ORC must each be refactored
  ▪ Filter pushdown for Array/Map is still not available
  ▪ Vectorized read on nested types does not perform as well as on
    simple types
▪ Improves performance with struct by about 100%
▪ Improves performance with map by about 163%

How does Materialized Column solve these problems

Original table

CREATE TABLE base_table (
  item STRING,
  count INT,
  people MAP<STRING, STRING>,
  date STRING
)
USING parquet
PARTITIONED BY (date);

Add materialized column

ALTER TABLE base_table ADD COLUMNS
(
  age INT MATERIALIZED CAST(people['age'] AS INTEGER)
);

Write with materialized column
explain extended
insert into base_table partition(date='20201010')
select 'apple', 1, map('age','18','name','jack','gender','male');
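
On the write path, the insert supplies only the base columns and the engine fills in the materialized column from its defining expression. The following is a minimal sketch of that behavior in plain Python, not the actual Spark implementation. `MATERIALIZED_COLUMNS`, `BASE_COLUMNS` and `insert` are hypothetical names, and mapping a missing key to `None` is an assumption.

```python
# Hypothetical table definition: materialized column name -> expression
# evaluated over the base columns at write time.
MATERIALIZED_COLUMNS = {
    # Mirrors: age INT MATERIALIZED CAST(people['age'] AS INTEGER)
    # Assumption: a missing 'age' key yields None.
    "age": lambda row: int(row["people"]["age"]) if "age" in row["people"] else None,
}

BASE_COLUMNS = ("item", "count", "people")


def insert(table, values):
    """Append a row given only base-column values and derive the rest."""
    row = dict(zip(BASE_COLUMNS, values))
    for name, expression in MATERIALIZED_COLUMNS.items():
        row[name] = expression(row)  # computed once, at write time
    table.append(row)
    return row


table = []
written = insert(table, ("apple", 1, {"age": "18", "name": "jack", "gender": "male"}))
```

The writer never mentions `age`; the stored row carries it anyway, so later reads can use the flat column instead of the map.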

Query without materialized column rewrite vs. query with materialized column rewrite
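
The rewrite can be pictured as expression substitution: wherever a query contains the expression that defines a materialized column, the expression is replaced by a reference to that column. The talk does not show the optimizer rule itself, so this is only a sketch under that assumption. The tuple encoding of expressions and the `rewrite` function are hypothetical.

```python
# Expressions are modeled as nested tuples (a hypothetical encoding):
#   ("col", name), ("map_get", map_expr, key), ("cast", expr, type)
MATERIALIZED = {
    # CAST(people['age'] AS INTEGER) is stored as column 'age'
    ("cast", ("map_get", ("col", "people"), "age"), "int"): ("col", "age"),
}


def rewrite(expr):
    """Replace any subexpression that matches a materialized column."""
    if expr in MATERIALIZED:
        return MATERIALIZED[expr]
    if isinstance(expr, tuple):
        return tuple(rewrite(part) for part in expr)
    return expr  # plain strings and literals are left untouched


# WHERE CAST(people['age'] AS INTEGER) > 18
original_filter = (
    "gt",
    ("cast", ("map_get", ("col", "people"), "age"), "int"),
    ("lit", 18),
)
rewritten_filter = rewrite(original_filter)

# people['name'] has no materialized column, so it is left as is
untouched = rewrite(("map_get", ("col", "people"), "name"))
```

After the rewrite the filter refers to the flat `age` column, so the query is written against the nested column but executed against the simple one. Users do not need to change their SQL.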

Test case     Without Materialized Column rewrite   With Materialized Column rewrite   Performance   Read data size
SQL_adhoc_1   6.3 min / 797.6 GB                    3.4 min / 111.8 GB                 85.3% ↑       86% ↓
SQL_adhoc_2   16.5 min / 3.2 TB                     5.0 min / 111.1 GB                 230% ↑        96.6% ↓
SQL_etl_1     24 min / 3.7 TB                       9.1 min / 686.1 GB                 130.8% ↑      82% ↓
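
The slides do not state how the two percentage columns are derived. One reading that reproduces the first two rows is a throughput gain (old time divided by new time, minus one) and a relative drop in bytes read. The sketch below uses that assumed reading; the function names are hypothetical.

```python
def speedup_percent(minutes_before, minutes_after):
    """Throughput gain: extra work done per unit of time after the rewrite."""
    return (minutes_before / minutes_after - 1) * 100


def read_reduction_percent(size_before, size_after):
    """Relative drop in data read; both sizes must use the same unit."""
    return (1 - size_after / size_before) * 100


# Inputs taken from the SQL_adhoc_1 and SQL_adhoc_2 rows
adhoc_1_speedup = speedup_percent(6.3, 3.4)
adhoc_1_reduction = read_reduction_percent(797.6, 111.8)
adhoc_2_speedup = speedup_percent(16.5, 5.0)
```

Rounded, these give 85.3%, 86% and 230%, matching the table. Applied to the rounded times shown for SQL_etl_1 (24 min and 9.1 min), the same formula gives about 164% rather than the reported 130.8%, so that row's inputs may have been rounded or transcribed differently.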

