How to understand and analyze
Apache Hive query execution plan
for performance debugging
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pengcheng Xiong and Ashutosh Chauhan
Hortonworks Inc., Apache Hive
Community
{pxiong,ashutosh}@hortonworks.com
Goals:
• WHY the old-style Hive explain plan is hard to read
  • Compare the old-style explain with Postgres over a body of 500+ realistic SQL queries
• WHAT the new-style Hive explain plan is
  • Show the orchestration of tasks and operator trees, join sequences and algorithms, and operator execution costs
• HOW to performance-debug a real query by analyzing the new Hive explain plan
  • Identify potential improvements from changing the join sequence, the join algorithm, etc.
  • Show the real improvement by running the query on a real cluster
• Integration/interaction with other systems/tools
• Future work
WHY old style Hive explain plan is hard to read
• M**** company’s schema and queries.
• Comparison of explain plans between Hive and Postgres for the 528
queries they can both execute.
• Hive: “explain”, Postgres: “explain verbose”
Metric (N = 528 queries)         Hive “old style”   Postgres
Average lines per explain plan   233.5              53.8
Average words per explain plan   1289               328.6
The old-style Hive explain is quite verbose. Is all of it necessary?
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
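To reproduce the comparison for this query, each engine's explain command can be prefixed to it. This is a minimal sketch, assuming a Hive session in which hive.explain.user=false selects the old-style output, and a Postgres database holding the same two tables. On Hive releases that predate HIVE-9780 the old style is the only one, so the set line should be dropped.

```sql
-- Hive: request the old-style plan.
-- Assumption: hive.explain.user=false switches the new-style output off.
set hive.explain.user=false;

explain
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
  join LU_MONTH a12
    on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;

-- Postgres: the same query with the verbose plan (run in a Postgres session).
explain verbose
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
  join LU_MONTH a12
    on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
```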
High level plan comparison: Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
  Output: a11.pbtname
  Group Key: a11.pbtname
  -> Sort (cost=3.83..3.84 rows=2 width=18)
       Output: a11.pbtname
       Sort Key: a11.pbtname
       -> Hash Join (cost=2.62..3.83 rows=2 width=18)
            Output: a11.pbtname
            Hash Cond: (a11.quarter_id = a12.quarter_id)
            -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
                 Output: a11.quarter_id, a11.pbtname
            -> Hash (cost=2.60..2.60 rows=2 width=4)
                 Output: a12.quarter_id
                 -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
                      Output: a12.quarter_id
                      Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
16 lines
High level plan comparison: Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 2 <- Map 1 (BROADCAST_EDGE)
Reducer 3 <- Map 2 (SIMPLE_EDGE)
DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reducer 3
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Too verbose, need a magnifier!
80+ lines
Map 2 Operator
Map 2
    Map Operator Tree:
        TableScan
          alias: a12
          filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
          Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
            Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
            Map Join Operator
              condition map:
                   Inner Join 0 to 1
              keys:
                0 quarter_id (type: int)
                1 quarter_id (type: int)
              outputColumnNames: _col1
              input vertices:
                0 Map 1
              Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
              HybridGraceHashJoin: true
              Group By Operator
                keys: _col1 (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
    Execution mode: vectorized
Data flows from top to bottom.
Each operator has 0 or 1 child.
The Map Join input is given only as "input vertices: 0 Map 1"; you must scroll to another part of the plan to see what it is.
Map 1 Operator
Map 1
    Map Operator Tree:
        TableScan
          alias: a11
          filterExpr: quarter_id is not null (type: boolean)
          Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: quarter_id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: quarter_id (type: int)
              sort order: +
              Map-reduce partition columns: quarter_id (type: int)
              Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
              value expressions: pbtname (type: string)
    Execution mode: vectorized
The actual table name (PMT_INVENTORY) is not mentioned anywhere, only the alias (a11).
How much of this information do SQL users really need?
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
  Output: a11.pbtname
  Group Key: a11.pbtname
  -> Sort (cost=3.83..3.84 rows=2 width=18)
       Output: a11.pbtname
       Sort Key: a11.pbtname
       -> Hash Join (cost=2.62..3.83 rows=2 width=18)
            Output: a11.pbtname
            Hash Cond: (a11.quarter_id = a12.quarter_id)
            -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
                 Output: a11.quarter_id, a11.pbtname
            -> Hash (cost=2.60..2.60 rows=2 width=4)
                 Output: a12.quarter_id
                 -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
                      Output: a12.quarter_id
                      Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Data flows bottom to top
Operators have multiple children when it makes sense.
The join is done using a scan of pmt_inventory and a hash built from a scan of lu_month.
All of this is visible without referring to a separate stage plan; in other words, you don’t have to jump around in the plan.
Actual schema and table names are visible.
Cost is mentioned once per operator.
Cost never decreases as you go up the tree.
WHAT is new style Hive explain plan (HIVE-9780)
• Set hive.explain.user=true; (the default) when running on Tez, LLAP, etc.
Stage-1
  Reducer 3
  File Output Operator [FS_14]
    Group By Operator [GBY_12] (rows=8 width=101)
      Output:["_col0"],keys:KEY._col0
    <-Map 2 [SIMPLE_EDGE]
      SHUFFLE [RS_11]
        PartitionCols:_col0
        Group By Operator [GBY_10] (rows=8 width=101)
          Output:["_col0"],keys:_col1
          Map Join Operator [MAPJOIN_19] (rows=33 width=101)
            Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
          <-Map 1 [BROADCAST_EDGE]
            BROADCAST [RS_6]
              PartitionCols:_col0
              Select Operator [SEL_2] (rows=12 width=105)
                Output:["_col0","_col1"]
                Filter Operator [FIL_17] (rows=12 width=105)
                  predicate:quarter_id is not null
                  TableScan [TS_0] (rows=12 width=105)
                    m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
          <-Select Operator [SEL_5] (rows=48 width=8)
              Output:["_col1"]
              Filter Operator [FIL_18] (rows=48 width=8)
                predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
                TableScan [TS_3] (rows=48 width=8)
                  m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Immediate Notes:
1. Much smaller
2. Can be read in order
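A plan in this style can be requested as sketched here. The hive.execution.engine line is an assumption for sessions that are not already on Tez; the slide itself only names hive.explain.user.

```sql
-- Assumption: the session may not already be running on Tez.
set hive.execution.engine=tez;
-- New-style (user-level) explain introduced by HIVE-9780.
set hive.explain.user=true;

explain
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
  join LU_MONTH a12
    on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
```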
Data flows bottom to top
Operators have multiple children when it makes sense.
The join information is clear: pmt_inventory is broadcast to lu_month and a map join is performed.
Cost mentioned once per operator
HOW to performance debug a real query (TPC-DS Q3, 1TB)
select
dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
from
date_dim dt, store_sales, item
where
dt.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and item.i_manufact_id = 436
and dt.d_moy = 12
group by dt.d_year , item.i_brand , item.i_brand_id
order by dt.d_year , sum_agg desc , brand_id
limit 10
Note: store_sales is partitioned by ss_sold_date_sk.
Original plan:
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_15]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_14] (rows=1 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64
Select Operator [SEL_13] (rows=76515 width=128)
Output:["_col6","_col65","_col64","_col45"]
Filter Operator [FIL_23] (rows=76515 width=128)
predicate:((_col0 = _col53) and (_col32 = _col57))
Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128)
Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_34]
PartitionCols:i_item_sk
Filter Operator [FIL_33] (rows=434 width=111)
predicate:(i_item_sk is not null and (i_manufact_id = 436))
TableScan [TS_2] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_8]
PartitionCols:_col32
Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20)
Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_30]
PartitionCols:d_date_sk
Filter Operator [FIL_29] (rows=5619 width=12)
predicate:(d_date_sk is not null and (d_moy = 12))
TableScan [TS_0] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:ss_sold_date_sk
Filter Operator [FIL_31] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_1] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
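The original plan runs in 163.33s. Column pruning and predicate pushdown appear to be working fine.
However, the join sequence store_sales ✖ date_dim ✖ item is not good enough: the filter on item is far more selective than the filter on date_dim, so joining date_dim first leaves a large intermediate result (an estimated 211,562,452 rows out of MERGEJOIN_27). A better sequence is store_sales ✖ item ✖ date_dim.
Table      Cardinality   Cardinality after filter   Selectivity
date_dim   73K           5619                       7.6%
item       300K          434                        0.14%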
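The "cardinality after filter" figures are the optimizer's row estimates from the plan. One way to check them against the data is to count the filtered rows directly. The queries below are a sketch that reuses the table and column names from Q3 and assumes the current database is the one holding the TPC-DS tables.

```sql
-- Rows of date_dim that pass the d_moy filter (the plan estimates 5619).
select count(*) from date_dim where d_moy = 12;

-- Rows of item that pass the i_manufact_id filter (the plan estimates 434).
select count(*) from item where i_manufact_id = 436;
```

Dividing each count by the table's total row count gives the observed selectivity to compare with the table above.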
Plan after join re-ordering (CBO on):
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_17]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_16] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_15] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112)
Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_38]
PartitionCols:_col0
Select Operator [SEL_37] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_36] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_12]
PartitionCols:_col2
Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112)
Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:_col0
Select Operator [SEL_31] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_30] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_35]
PartitionCols:_col0
Select Operator [SEL_34] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_33] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
With CBO on, the new plan runs in 143.97s using the new join sequence store_sales ✖ item ✖ date_dim.
One input of each join is quite small (434 rows from item, 5619 rows from date_dim), so a map join should be used rather than a merge join.
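Turning CBO on for a session can be sketched as below. hive.cbo.enable is the switch; the statistics lines are assumptions about what a cluster might need, since the slides only show Tbl:COMPLETE and Col:COMPLETE in the plans and not how the statistics were gathered.

```sql
-- Enable the cost-based optimizer for this session.
set hive.cbo.enable=true;

-- Assumption: column statistics are needed for good join ordering.
set hive.stats.fetch.column.stats=true;
analyze table date_dim compute statistics for columns;
analyze table item compute statistics for columns;

-- Re-check the join sequence in the plan for Q3.
explain
select
  dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
from
  date_dim dt, store_sales, item
where
  dt.d_date_sk = store_sales.ss_sold_date_sk
  and store_sales.ss_item_sk = item.i_item_sk
  and item.i_manufact_id = 436
  and dt.d_moy = 12
group by dt.d_year, item.i_brand, item.i_brand_id
order by dt.d_year, sum_agg desc, brand_id
limit 10;
```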
Plan after join selection (map join):
SHUFFLE [RS_44]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_43] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_42] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_41] (rows=306061 width=112)
Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_37]
PartitionCols:_col0
Select Operator [SEL_36] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_35] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112)
Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_34]
PartitionCols:_col0
Select Operator [SEL_33] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_32] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_39] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_38] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
After increasing hive.auto.convert.join.noconditionaltask.size to 1,359,688,499, the plan uses map join operators. The new plan runs in 45.84s.
Note that store_sales is partitioned on ss_sold_date_sk, which is also its join key with the date_dim table.
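As session settings, this step might look like the following sketch. The size value is the one quoted above; the first two lines are assumptions, included in case automatic map-join conversion is not already enabled on the cluster.

```sql
-- Assumption: automatic map-join conversion may not be on by default.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;

-- Threshold in bytes, taken from the slide (1,359,688,499).
set hive.auto.convert.join.noconditionaltask.size=1359688499;

-- Re-run EXPLAIN on Q3 and look for "Map Join Operator" with BROADCAST_EDGE inputs.
```

The threshold bounds the combined size of the small tables that are loaded into memory, so it should be chosen with the container memory in mind rather than copied as is.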
Plan after dynamic partition pruning:
SHUFFLE [RS_55]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_54] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_53] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_52] (rows=306061 width=112)
Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:_col0
Select Operator [SEL_44] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_43] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809 width=12)
Output:["_col0"],keys:_col0
Select Operator [SEL_46] (rows=5619 width=12)
Output:["_col0"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112)
Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:_col0
Select Operator [SEL_41] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_40] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_49] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Setting hive.tez.dynamic.partition.pruning=true adds dynamic partitioning event operators to the plan (see HIVE-7826 for details). The new plan runs in 31.35s.
At run time, the dynamic partitioning event operator sends the values needed for pruning to the application master, where splits are generated and tasks are submitted. Using these values, unneeded partitions can be stripped out dynamically while the query is running.
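In a session this is a single setting, sketched below. It only helps when the large table is partitioned on the join key, as store_sales is here.

```sql
-- Enable dynamic partition pruning on Tez.
set hive.tez.dynamic.partition.pruning=true;

-- Re-run EXPLAIN on Q3 and look for "Dynamic Partitioning Event Operator"
-- on the date_dim branch of the plan.
```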
Performance debugging summary (TPC-DS Q3, 1TB)
[Bar chart: query execution time (s) at each step, showing continuous improvement]
Step                        Query execution time (s)
Original                    163.33
Join re-order               143.97
Join selection              45.84
Dynamic partition pruning   31.35
Integration with Apache Ambari
1. Type the query here
2. Click “explain”
3. explain plan will be shown
Runtime stats
• “set hive.tez.exec.print.summary=true;”
• Get more insights on query performance
• Can be used along with Tez vertex runtime stats
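A minimal sketch of using the setting: unlike explain, the summary is printed after a query has actually run, so the statement that follows it is a real query. The count query is only an illustrative stand-in for the query under investigation.

```sql
-- Print the Tez execution summary after each query in this session.
set hive.tez.exec.print.summary=true;

-- Illustrative query; substitute the query being debugged.
select count(*) from item where i_manufact_id = 436;
```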
Summary
• Showed that the old-style Hive explain plan is hard to read
  • It is verbose with much redundant information, it is hard to follow how data flows, and the cost of each operator is unclear
  • Compared it with Postgres over a body of 500+ realistic SQL queries and identified candidate points for improvement
• Introduced the new-style Hive explain plan
  • Used a concrete example to help understand the explain: execution cost, join sequence and orchestration of the operator tree
• Used the new Hive explain plan to performance-debug TPC-DS Q3
  • Showed the improvement after join re-ordering, join selection, and dynamic partition pruning
• Integration/interaction with other systems/tools
Future work -- Some gaps remain after HIVE-9780
• Put the real schema, table and column names in the explain plan, e.g., no more _col0 etc.
  • This will help users understand the plan better
  • HIVE-8681: CBO: Column names are missing from join expression in Map join with CBO enabled
• Get an equivalent of “EXPLAIN ANALYZE”, such as operator-level runtime stats and warnings
  • This will help users find the gap between estimated cost and real cost
  • HIVE-14362: Support explain analyze in Hive
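On a Hive release that includes HIVE-14362, the request might look like the sketch below. This assumes the feature is available in the installed version; it was still future work when these slides were written. Note that the statement executes the query in order to collect actual row counts.

```sql
-- Assumption: the installed Hive version supports EXPLAIN ANALYZE (HIVE-14362).
explain analyze
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
  join LU_MONTH a12
    on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
```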
Example of the intended output, with real column names and row counts given as pairs (e.g., rows=5619/6034):
SHUFFLE [RS_55]
PartitionCols:d_year, i_brand, i_brand_id
Group By Operator [GBY_54] (rows=9/10 width=116)
Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id
Select Operator [SEL_53] (rows=306061/324651 width=112)
Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112)
Conds:MAPJOIN_51.ss_sold_date_sk=RS_45.d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:d_date_sk
Select Operator [SEL_44] (rows=5619/6034 width=12)
Output:["d_date_sk","d_year"]
Filter Operator [FIL_43] (rows=5619/6034 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049/73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809/2324 width=12)
Output:["d_date_sk"],keys:d_date_sk
Select Operator [SEL_46] (rows=5619/6034 width=12)
Output:["d_date_sk"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112)
Conds:SEL_50.ss_item_sk=RS_42.i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price","ss_sold_date_sk","i_brand","i_brand_id"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:i_item_sk
Select Operator [SEL_41] (rows=434/453 width=111)
Output:["i_item_sk","i_brand_id","i_brand"]
Filter Operator [FIL_40] (rows=434/453 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000/300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11)
Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"]
Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156/2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
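The plan above annotates most operators with a pair of row counts, such as `rows=9/10`. A minimal Python sketch of how one might rank operators by the gap between the two counts follows. The function name `row_count_gaps` and the regular expression are illustrative assumptions, not part of Hive. The gap is computed symmetrically, so the result does not depend on which side of the slash is the estimate.

```python
import re

# Matches operator header lines such as
#   "Filter Operator [FIL_43] (rows=5619/6034 width=12)"
# optionally prefixed with "<-". The pattern is an assumption based on the
# plan text shown above, not an official grammar of the explain output.
ROWS_RE = re.compile(
    r"^(?:<-)?(?P<name>.+?) \[(?P<op_id>\w+)\] "
    r"\(rows=(?P<left>\d+)/(?P<right>\d+) width=\d+\)"
)


def row_count_gaps(plan_text):
    """Return (operator id, left count, right count, relative gap),
    sorted with the largest gap first."""
    gaps = []
    for line in plan_text.splitlines():
        match = ROWS_RE.match(line.strip())
        if match is None:
            # Property lines, edges, and operators that carry only one
            # row count (no slash) are skipped.
            continue
        left, right = int(match["left"]), int(match["right"])
        larger = max(left, right)
        gap = abs(left - right) / larger if larger else 0.0
        gaps.append((match["op_id"], left, right, gap))
    return sorted(gaps, key=lambda item: item[3], reverse=True)


if __name__ == "__main__":
    sample = """\
Group By Operator [GBY_54] (rows=9/10 width=116)
<-Map 5 [BROADCAST_EDGE] vectorized
TableScan [TS_6] (rows=73049/73049 width=12)
Group By Operator [GBY_47] (rows=2809/2324 width=12)
"""
    for op_id, left, right, gap in row_count_gaps(sample):
        print(f"{op_id}: {left} vs {right} (relative gap {gap:.1%})")
```

Operators near the top of the sorted list are the ones where the optimizer's statistics deviate most from what actually ran, so they are a reasonable place to start when checking whether a join order or join algorithm was chosen on bad estimates.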
Acknowledgement
• We thank the anonymous reviewers whose votes gave us this
opportunity to share our work.
• Some of the slides are borrowed or adapted from Carter
Shanklin’s and Rajesh Balamohan’s slides.
• We thank Gunther Hagleitner for all his support and input.
• We thank Sapin Amin for setting up the testing cluster.
Thank you! Questions?
How to understand and analyze Apache Hive query execution plan for performance debugging

  • 1. How to understand and analyze Apache Hive query execution plan for performance debugging © Hortonworks Inc. 2011 – 2015. All Rights Reserved Pengcheng Xiong and Ashutosh Chauhan Hortonworks Inc., Apache Hive Community {pxiong,ashutosh}@hortonworks.com
  • 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Goals: • WHY old style Hive explain plan is hard to read • Compare the old style explain with Postgres over a body of 500+ realistic SQL queries. • WHAT is new style Hive explain plan • Show orchestration of the tasks and operator trees, join sequences and algorithms, operator execution costs • HOW to performance debug a real query by analyzing the new Hive explain plan • Identify the potential improvement by changing join sequence, join algorithm and etc • Show the real improvement by running the query in real cluster • Integration/interaction with other system/tools • Future work
  • 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHY old style Hive explain plan is hard to read • M**** company’s schema and queries. • Comparison of explain plans between Hive and Postgres for the 528 queries they can both execute. • Hive: “explain”, Postgres: “explain verbose” Hive “old style”, 233.5 postgres, 53.8 Hive “old style”, 1289 postgres, 328.6 Average Lines Per Explain Plan Average Words Per Explain Plan N = 528 queries
  • 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved We can see that Hive old style explain is quite verbose, is it necessary? select a11.PBTNAME PBTNAME from PMT_INVENTORY a11 join LU_MONTH a12 on (a11.QUARTER_ID = a12.QUARTER_ID) where a12.MONTH_ID in (200607, 200606) group by a11.PBTNAME;
  • 5. Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High level plan comparison: Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) 16 lines
  • 6. Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High level plan comparison: Hive STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez Edges: Map 2 <- Map 1 (BROADCAST_EDGE) Reducer 3 <- Map 2 (SIMPLE_EDGE) DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946 Vertices: Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column 
stats: NONE Execution mode: vectorized Reducer 3 Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Too verbose, need a magnifier! 80+ lines
  • 7. Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 2 Operator Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Data flows from top to bottom Each operator has 0 or 1 child
  • 8. Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 2 Operator Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Must scroll to another part of the plan to see what this is
  • 9. Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 1 Operator Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized Actual table name (PMT_INVENTORY) not mentioned anywhere, only the alias
  • 10. Map 1 Operator (same plan as slide 9). Note: How much of this information is really necessary to SQL users?
  • 11. Back to Postgres
      QUERY PLAN
      ---------------------------------------------------------------------------------
      Group (cost=3.83..3.84 rows=2 width=18)
        Output: a11.pbtname
        Group Key: a11.pbtname
        -> Sort (cost=3.83..3.84 rows=2 width=18)
             Output: a11.pbtname
             Sort Key: a11.pbtname
             -> Hash Join (cost=2.62..3.83 rows=2 width=18)
                  Output: a11.pbtname
                  Hash Cond: (a11.quarter_id = a12.quarter_id)
                  -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
                       Output: a11.quarter_id, a11.pbtname
                  -> Hash (cost=2.60..2.60 rows=2 width=4)
                       Output: a12.quarter_id
                       -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
                            Output: a12.quarter_id
                            Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
      Note: Data flows bottom to top.
  • 12. Back to Postgres (same plan as slide 11). Note: Operators have multiple children when it makes sense.
  • 13. Back to Postgres (same plan as slide 11). Note: The join is done using a scan of pmt_inventory and a hash following a scan of lu_month. All this information is available without referring to a stage plan; in other words, you don't have to jump around in the plan.
  • 14. Back to Postgres (same plan as slide 11). Note: Actual schema and table names are visible.
  • 15. Back to Postgres (same plan as slide 11). Notes: Cost is mentioned once per operator. Cost monotonically increases as you go up.
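A plan of the kind shown on slides 11 to 15 comes from Postgres's EXPLAIN VERBOSE, the command the deck names for the Postgres side of the comparison. The following is a minimal sketch, assuming the pmt_inventory and lu_month tables from the example query exist in the current schema; the costs and row counts printed will depend on the local data and statistics.

```sql
-- Postgres: VERBOSE adds the per-node Output lists and the
-- schema-qualified table names seen in the plan above.
EXPLAIN VERBOSE
SELECT a11.pbtname AS pbtname
FROM pmt_inventory a11
JOIN lu_month a12
  ON (a11.quarter_id = a12.quarter_id)
WHERE a12.month_id IN (200607, 200606)
GROUP BY a11.pbtname;
```

Reading the output from the innermost node upward follows the data flow: the two sequential scans feed the hash join, which feeds the sort and then the group.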
  • 16. WHAT is new style Hive explain plan (HIVE-9780)
      • set hive.explain.user=true; (the default). Use with Tez, LLAP, etc.
      Stage-1
        Reducer 3
        File Output Operator [FS_14]
          Group By Operator [GBY_12] (rows=8 width=101)
            Output:["_col0"],keys:KEY._col0
          <-Map 2 [SIMPLE_EDGE]
            SHUFFLE [RS_11]
              PartitionCols:_col0
              Group By Operator [GBY_10] (rows=8 width=101)
                Output:["_col0"],keys:_col1
                Map Join Operator [MAPJOIN_19] (rows=33 width=101)
                  Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
                <-Map 1 [BROADCAST_EDGE]
                  BROADCAST [RS_6]
                    PartitionCols:_col0
                    Select Operator [SEL_2] (rows=12 width=105)
                      Output:["_col0","_col1"]
                      Filter Operator [FIL_17] (rows=12 width=105)
                        predicate:quarter_id is not null
                        TableScan [TS_0] (rows=12 width=105)
                          m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
                <-Select Operator [SEL_5] (rows=48 width=8)
                    Output:["_col1"]
                    Filter Operator [FIL_18] (rows=48 width=8)
                      predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
                      TableScan [TS_3] (rows=48 width=8)
                        m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
      Immediate notes: 1. Much smaller. 2. Can be read in order.
  • 17. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16). Note: Data flows bottom to top.
  • 18. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16). Note: Operators have multiple children when it makes sense.
  • 19. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16). Note: The join's information is clear: pmt_inventory is broadcast to lu_month and a MapJoin is done.
  • 20. WHAT is new style Hive explain plan (HIVE-9780) (same plan as slide 16). Note: Cost is mentioned once per operator.
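The two explain styles compared so far can be switched with the hive.explain.user setting named on slide 16. The sketch below assumes a Hive session running on Tez with the PMT_INVENTORY and LU_MONTH tables from the example query available; the operator ids and row estimates in the output will differ from the slides.

```sql
-- Assumption: Tez is the execution engine, as the slides suggest.
set hive.execution.engine=tez;

-- New style (user-level) explain: the compact operator tree.
set hive.explain.user=true;
explain
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
  on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;

-- Old style explain: the verbose per-vertex operator listing.
set hive.explain.user=false;
explain
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
  on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
```

Running both against the same query is a quick way to see the size difference the deck measures.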
  • 21. HOW to performance debug a real query (TPC-DS Q3, 1TB)
      select dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
      from date_dim dt, store_sales, item
      where dt.d_date_sk = store_sales.ss_sold_date_sk
        and store_sales.ss_item_sk = item.i_item_sk
        and item.i_manufact_id = 436
        and dt.d_moy = 12
      group by dt.d_year, item.i_brand, item.i_brand_id
      order by dt.d_year, sum_agg desc, brand_id
      limit 10
      Note: store_sales is partitioned by ss_sold_date_sk.
  • 22.
      <-Reducer 3 [SIMPLE_EDGE]
        SHUFFLE [RS_15]
          PartitionCols:_col0, _col1, _col2
          Group By Operator [GBY_14] (rows=1 width=116)
            Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64
            Select Operator [SEL_13] (rows=76515 width=128)
              Output:["_col6","_col65","_col64","_col45"]
              Filter Operator [FIL_23] (rows=76515 width=128)
                predicate:((_col0 = _col53) and (_col32 = _col57))
                Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128)
                  Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"]
                <-Map 7 [SIMPLE_EDGE] vectorized
                  SHUFFLE [RS_34]
                    PartitionCols:i_item_sk
                    Filter Operator [FIL_33] (rows=434 width=111)
                      predicate:(i_item_sk is not null and (i_manufact_id = 436))
                      TableScan [TS_2] (rows=300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
                <-Reducer 2 [SIMPLE_EDGE]
                  SHUFFLE [RS_8]
                    PartitionCols:_col32
                    Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20)
                      Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"]
                    <-Map 1 [SIMPLE_EDGE] vectorized
                      SHUFFLE [RS_30]
                        PartitionCols:d_date_sk
                        Filter Operator [FIL_29] (rows=5619 width=12)
                          predicate:(d_date_sk is not null and (d_moy = 12))
                          TableScan [TS_0] (rows=73049 width=12)
                            tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
                    <-Map 6 [SIMPLE_EDGE] vectorized
                      SHUFFLE [RS_32]
                        PartitionCols:ss_sold_date_sk
                        Filter Operator [FIL_31] (rows=2750387156 width=11)
                          predicate:ss_item_sk is not null
                          TableScan [TS_1] (rows=2750387156 width=11)
                            tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
      The original plan runs in 163.33 s. Column pruning and predicate pushdown appear to be working fine. However, the join sequence store_sales✖date_dim✖item is not good enough: the filter on item is far more selective than the filter on date_dim, so joining item first yields a smaller intermediate result. A better sequence is store_sales✖item✖date_dim.
      Table      Cardinality   Cardinality after filter   Selectivity
      date_dim   73K           5619                       7.7%
      item       300K          434                        0.14%
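The selectivity column above is the ratio of rows surviving the filter to the table's cardinality, using the optimizer's estimates (5619 of 73049 for date_dim, 434 of 300000 for item). As a hedged sketch, the same ratios can be measured on the actual data with two aggregate queries; the column aliases are illustrative, and measured counts may differ from the estimates in the plan.

```sql
-- Actual selectivity of the date_dim filter (d_moy = 12).
select count(*) as total_rows,
       sum(case when d_moy = 12 then 1 else 0 end) as rows_after_filter,
       sum(case when d_moy = 12 then 1 else 0 end) / count(*) as selectivity
from date_dim;

-- Actual selectivity of the item filter (i_manufact_id = 436).
select count(*) as total_rows,
       sum(case when i_manufact_id = 436 then 1 else 0 end) as rows_after_filter,
       sum(case when i_manufact_id = 436 then 1 else 0 end) / count(*) as selectivity
from item;
```

The dimension table with the smaller ratio is the better candidate to join with the fact table first, which is the reasoning behind preferring store_sales✖item✖date_dim.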
  • 23.
      <-Reducer 3 [SIMPLE_EDGE]
        SHUFFLE [RS_17]
          PartitionCols:_col0, _col1, _col2
          Group By Operator [GBY_16] (rows=9 width=116)
            Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
            Select Operator [SEL_15] (rows=306061 width=112)
              Output:["_col8","_col4","_col5","_col1"]
              Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112)
                Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"]
              <-Map 7 [SIMPLE_EDGE] vectorized
                SHUFFLE [RS_38]
                  PartitionCols:_col0
                  Select Operator [SEL_37] (rows=5619 width=12)
                    Output:["_col0","_col1"]
                    Filter Operator [FIL_36] (rows=5619 width=12)
                      predicate:((d_moy = 12) and d_date_sk is not null)
                      TableScan [TS_6] (rows=73049 width=12)
                        tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
              <-Reducer 2 [SIMPLE_EDGE]
                SHUFFLE [RS_12]
                  PartitionCols:_col2
                  Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112)
                    Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"]
                  <-Map 1 [SIMPLE_EDGE] vectorized
                    SHUFFLE [RS_32]
                      PartitionCols:_col0
                      Select Operator [SEL_31] (rows=2750387156 width=11)
                        Output:["_col0","_col1","_col2"]
                        Filter Operator [FIL_30] (rows=2750387156 width=11)
                          predicate:ss_item_sk is not null
                          TableScan [TS_0] (rows=2750387156 width=11)
                            tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
                  <-Map 6 [SIMPLE_EDGE] vectorized
                    SHUFFLE [RS_35]
                      PartitionCols:_col0
                      Select Operator [SEL_34] (rows=434 width=111)
                        Output:["_col0","_col1","_col2"]
                        Filter Operator [FIL_33] (rows=434 width=111)
                          predicate:((i_manufact_id = 436) and i_item_sk is not null)
                          TableScan [TS_3] (rows=300000 width=111)
                            tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
      With CBO on, the new plan runs in 143.97 s with the new join sequence store_sales✖item✖date_dim. The input of one branch of each join is quite small (434 rows from item, 5619 rows from date_dim), so the plan should use a map join rather than a merge join.
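The slide only says "CBO on", so the following is a sketch under assumptions rather than the authors' exact session: it enables the cost-based optimizer and makes sure the table and column statistics it relies on are present (the plans above already report Tbl:COMPLETE and Col:COMPLETE, so the analyze statements may be unnecessary on that cluster).

```sql
-- Turn on the cost-based optimizer and let it read column statistics.
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;

-- Gather column statistics for the two dimension tables.
analyze table date_dim compute statistics for columns;
analyze table item compute statistics for columns;

-- The fact table is partitioned by ss_sold_date_sk; gathering its
-- statistics scans every partition and can be slow at 1 TB scale.
analyze table store_sales partition(ss_sold_date_sk) compute statistics for columns;
```

After this, re-running explain on the TPC-DS Q3 query from slide 21 shows whether the join sequence has changed.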
  • 24.
      SHUFFLE [RS_44]
        PartitionCols:_col0, _col1, _col2
        Group By Operator [GBY_43] (rows=9 width=116)
          Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
          Select Operator [SEL_42] (rows=306061 width=112)
            Output:["_col8","_col4","_col5","_col1"]
            Map Join Operator [MAPJOIN_41] (rows=306061 width=112)
              Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
            <-Map 5 [BROADCAST_EDGE] vectorized
              BROADCAST [RS_37]
                PartitionCols:_col0
                Select Operator [SEL_36] (rows=5619 width=12)
                  Output:["_col0","_col1"]
                  Filter Operator [FIL_35] (rows=5619 width=12)
                    predicate:((d_moy = 12) and d_date_sk is not null)
                    TableScan [TS_6] (rows=73049 width=12)
                      tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
            <-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112)
                Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
              <-Map 4 [BROADCAST_EDGE] vectorized
                BROADCAST [RS_34]
                  PartitionCols:_col0
                  Select Operator [SEL_33] (rows=434 width=111)
                    Output:["_col0","_col1","_col2"]
                    Filter Operator [FIL_32] (rows=434 width=111)
                      predicate:((i_manufact_id = 436) and i_item_sk is not null)
                      TableScan [TS_3] (rows=300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
              <-Select Operator [SEL_39] (rows=2750387156 width=11)
                  Output:["_col0","_col1","_col2"]
                  Filter Operator [FIL_38] (rows=2750387156 width=11)
                    predicate:ss_item_sk is not null
                    TableScan [TS_0] (rows=2750387156 width=11)
                      tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
      After increasing hive.auto.convert.join.noconditionaltask.size to 1,359,688,499, the plan uses map join operators. The new plan runs in 45.84 s. Note that store_sales is partitioned on ss_sold_date_sk, which is also the key of its join with date_dim.
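The join-selection step above can be sketched as the following session settings. The threshold value is the one quoted on the slide, written without thousands separators; the two boolean settings are an assumption about what else needs to be on for automatic map-join conversion, since the slide does not list them.

```sql
-- Allow Hive to convert joins with small inputs into map joins.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;

-- Size threshold (in bytes) for the small-table side; value from slide 24.
set hive.auto.convert.join.noconditionaltask.size=1359688499;
```

Re-running explain on the Q3 query afterwards should show Map Join Operator and BROADCAST_EDGE in place of Merge Join Operator and SIMPLE_EDGE if the conversion applies. The right threshold depends on the memory available to each task, so the slide's value should not be copied blindly to other clusters.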
  • 25.
      SHUFFLE [RS_55]
        PartitionCols:_col0, _col1, _col2
        Group By Operator [GBY_54] (rows=9 width=116)
          Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
          Select Operator [SEL_53] (rows=306061 width=112)
            Output:["_col8","_col4","_col5","_col1"]
            Map Join Operator [MAPJOIN_52] (rows=306061 width=112)
              Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
            <-Map 5 [BROADCAST_EDGE] vectorized
              BROADCAST [RS_45]
                PartitionCols:_col0
                Select Operator [SEL_44] (rows=5619 width=12)
                  Output:["_col0","_col1"]
                  Filter Operator [FIL_43] (rows=5619 width=12)
                    predicate:((d_moy = 12) and d_date_sk is not null)
                    TableScan [TS_6] (rows=73049 width=12)
                      tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
              Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
                Group By Operator [GBY_47] (rows=2809 width=12)
                  Output:["_col0"],keys:_col0
                  Select Operator [SEL_46] (rows=5619 width=12)
                    Output:["_col0"]
                    Please refer to the previous Select Operator [SEL_44]
            <-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112)
                Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
              <-Map 4 [BROADCAST_EDGE] vectorized
                BROADCAST [RS_42]
                  PartitionCols:_col0
                  Select Operator [SEL_41] (rows=434 width=111)
                    Output:["_col0","_col1","_col2"]
                    Filter Operator [FIL_40] (rows=434 width=111)
                      predicate:((i_manufact_id = 436) and i_item_sk is not null)
                      TableScan [TS_3] (rows=300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
              <-Select Operator [SEL_50] (rows=2750387156 width=11)
                  Output:["_col0","_col1","_col2"]
                  Filter Operator [FIL_49] (rows=2750387156 width=11)
                    predicate:ss_item_sk is not null
                    TableScan [TS_0] (rows=2750387156 width=11)
                      tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
      After setting hive.tez.dynamic.partition.pruning=true, the plan contains dynamic partitioning event operators (see HIVE-7826 for details). The new plan runs in 31.35 s. At run time, the dynamic partitioning event operator sends the values needed for pruning to the application master, where splits are generated and tasks are submitted. Using these values, unneeded partitions are stripped out dynamically while the query is running.
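A minimal sketch of this last step, assuming a Tez session against the TPC-DS schema used in the slides: enable dynamic partition pruning and re-run explain on Q3 to look for a Dynamic Partitioning Event Operator under the date_dim branch.

```sql
-- Let Tez prune store_sales partitions at run time using the
-- d_date_sk values that survive the date_dim filter.
set hive.tez.dynamic.partition.pruning=true;

explain
select dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
       sum(ss_ext_sales_price) sum_agg
from date_dim dt, store_sales, item
where dt.d_date_sk = store_sales.ss_sold_date_sk
  and store_sales.ss_item_sk = item.i_item_sk
  and item.i_manufact_id = 436
  and dt.d_moy = 12
group by dt.d_year, item.i_brand, item.i_brand_id
order by dt.d_year, sum_agg desc, brand_id
limit 10;
```

Pruning helps here because store_sales is partitioned by ss_sold_date_sk, the same column it is joined on, so only the partitions for the selected dates need to be read.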
  • 26. Performance debugging summary (TPC-DS Q3, 1TB). [Bar chart: query execution time (s), axis 0 to 180, for Original, Join re-order, Join selection and Dynamic partition pruning; the times reported on slides 22 to 25 are 163.33 s, 143.97 s, 45.84 s and 31.35 s.] Continuous improvement.
  • 27. Integration with Apache Ambari 1. Type the query here 2. Click “explain” 3. The explain plan will be shown
  • 28. Runtime stats • “set hive.tez.exec.print.summary=true;” • Get more insights on query performance • Can be used along with Tez vertex runtime stats
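As a hedged illustration of the setting on slide 28: the summary is printed after a query actually runs on Tez, so it complements explain rather than replacing it. The query below is only a small stand-in against the TPC-DS item table.

```sql
-- Assumption: the session runs on Tez.
set hive.execution.engine=tez;

-- Print an execution summary after each query finishes.
set hive.tez.exec.print.summary=true;

-- Any query will do; this one is a hypothetical example.
select count(*) from item where i_manufact_id = 436;
```

Comparing the runtime figures in the summary with the row estimates in the explain plan is one way to spot where the optimizer's estimates are off.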
  • 29. Summary
      • Showed that the old style Hive explain plan is hard to read.
      • It is verbose with too much redundant information, it is hard to follow how data flows, and the cost of each operator is unclear.
      • Compared it with Postgres over a body of 500+ realistic SQL queries and identified candidate points for improvement.
      • Introduced the new style Hive explain plan.
      • Used a concrete example to help understand the explain: execution cost, join sequence and orchestration of the operator tree.
      • Used the new Hive explain plan to performance debug TPC-DS Q3.
      • Showed the improvement after join re-ordering, join selection, and dynamic partition pruning.
      • Integration/interaction with other systems/tools.
  • 30. Future work: some gaps remain after HIVE-9780
      • Put the real schema, table and column names in the explain plan, e.g., no more _col0.
      • This will help users understand the plan better.
      • HIVE-8681: CBO: Column names are missing from join expression in Map join with CBO enabled
      • Provide an equivalent of “EXPLAIN ANALYZE”, with operator-level runtime stats and warnings.
      • This will help users find the gap between estimated cost and real cost.
      • HIVE-14362: Support explain analyze in Hive
  • 31.
      SHUFFLE [RS_55]
        PartitionCols:d_year, i_brand, i_brand_id
        Group By Operator [GBY_54] (rows=9/10 width=116)
          Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id
          Select Operator [SEL_53] (rows=306061/324651 width=112)
            Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
            Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112)
              Conds:MAPJOIN_51.ss_sold_date_sk=RS_45.d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
            <-Map 5 [BROADCAST_EDGE] vectorized
              BROADCAST [RS_45]
                PartitionCols:d_date_sk
                Select Operator [SEL_44] (rows=5619/6034 width=12)
                  Output:["d_date_sk","d_year"]
                  Filter Operator [FIL_43] (rows=5619/6034 width=12)
                    predicate:((d_moy = 12) and d_date_sk is not null)
                    TableScan [TS_6] (rows=73049/73049 width=12)
                      tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
              Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
                Group By Operator [GBY_47] (rows=2809/2324 width=12)
                  Output:["d_date_sk"],keys:d_date_sk
                  Select Operator [SEL_46] (rows=5619/6034 width=12)
                    Output:["d_date_sk"]
                    Please refer to the previous Select Operator [SEL_44]
            <-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112)
                Conds:SEL_50.ss_item_sk=RS_42.i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price","ss_sold_date_sk","i_brand","i_brand_id"]
              <-Map 4 [BROADCAST_EDGE] vectorized
                BROADCAST [RS_42]
                  PartitionCols:i_item_sk
                  Select Operator [SEL_41] (rows=434/453 width=111)
                    Output:["i_item_sk","i_brand_id","i_brand"]
                    Filter Operator [FIL_40] (rows=434/453 width=111)
                      predicate:((i_manufact_id = 436) and i_item_sk is not null)
                      TableScan [TS_3] (rows=300000/300000 width=111)
                        tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
              <-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11)
                  Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"]
                  Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11)
                    predicate:ss_item_sk is not null
                    TableScan [TS_0] (rows=2750387156/2750387156 width=11)
                      tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
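The paired row counts in the plan above (for example rows=5619/6034) read as an estimate next to an actual count, which is what the “EXPLAIN ANALYZE” item under future work asks for. In Hive releases that include HIVE-14362, a statement of the following form is available. This is a sketch under that assumption, and unlike plain explain it executes the query in order to collect the actual counts, so it costs as much as running the query. The query itself is a hypothetical stand-in against the TPC-DS item table.

```sql
-- Assumption: the Hive version in use supports EXPLAIN ANALYZE (HIVE-14362).
explain analyze
select i_brand, count(*) as cnt
from item
where i_manufact_id = 436
group by i_brand;
```

Operators where the two numbers diverge widely are the places to check statistics first.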
  • 32. Acknowledgement • We thank all the anonymous reviewers whose votes gave us this opportunity to share our work. • Some of the slides are borrowed or adapted from Carter Shanklin's and Rajesh Balamohan's slides. • We thank Gunther Hagleitner for all his support and input. • We thank Sapin Amin for setting up the testing cluster.
  • 33. Thank you! Questions?

Editor's Notes

  • #2: Hive contributors have striven to improve Hive in terms of both performance and functionality. We assert that understanding and analyzing Apache Hive query execution plans is crucial for performance debugging. In this talk, we study why Apache Hive's query execution plans are currently so difficult to analyze. We identify a set of pain points reported by our Apache Hive performance engineers and development engineers as well as by real users and customers. We propose and demonstrate a new presentation data model that addresses these pain points. The three most critical parts of the presentation are (1) the estimated query execution cost, which is the planner's guess at how long it will take to run the query (measured in #rows); (2) the orchestration of the operator tree across consecutive M/R (Tez) jobs; and (3) integration and extension support with other presentation tools, e.g., Apache Ambari.
  • #4: We can see that the Hive old style explain is quite verbose. Is that necessary?
  • #5: We pick a real query out of the 500 queries.
  • #6: 16 lines
  • #7: 80 lines
  • #11: Although the Reduce Output Operator is crucial because it defines the boundary between a map task and a reduce task, we question how much of the Reduce Output Operator's information is.... SQL users care more about relational operators than about MR operators.
  • #13: We can also see more details about this join.
  • #16: It seems that Postgres sets a good example for Hive to improve on.
  • #21: So this is the basic introduction to the new style Hive explain plan.
  • #22: This query involves 3 tables.
  • #25: splits
  • #29: For Query 87
  • #32: The next version of the Hive explain plan will look like this. We highlight the changes that we want to make in red.
  • #33: We thank the audience for your attention.
  • #34: I am ready to take any questions.