Hive on Spark
Szehon Ho 
Software Engineer at Cloudera, Apache Hive Committer 
October 2014
Background (Hive) 
• Apache Hive: a data query and management tool for a 
distributed dataset, exposed via a SQL-like query language 
called HiveQL 
Background (Hive) 
• 2007-2013: MapReduce was the only distributed processing engine
• Map() and Reduce() primitives were not designed for long data pipelines
• Complex SQL-like queries are expressed inefficiently as many MR stages:
• Disk I/O between MR stages
• Shuffle-sort between the map and reduce phases
[Figure: a Hive query run as a chain of Map() -> Red() stages, with HDFS between stages]
Background (Hive) 
• 2013: the Hive community started work on Hive on Tez
• Tez runs a query as a DAG execution graph
[Figure: a Hive query run as a Tez DAG of Map() and Red() vertices over HDFS]
Background (Spark) 
• Generalized distributed processing framework created in ~2011 
by UC Berkeley AMPLab 
• Many advantages (community, ease of use); on track to succeed
MapReduce
Background (Spark) 
• Community Momentum: 
• Already the most active project in the Hadoop ecosystem
• June 2014: 255 contributors from 50 companies
• First half of 2014: ~1200 commits, 250,000 LOC changed
• Integration with many Hadoop components, e.g. Pig, Flume,
Mahout, Crunch, Solr, and now Hive.
Background (Spark) 
• Clean programming abstraction: Resilient Distributed Dataset
(RDD):
• A fault-tolerant dataset that can serve as a stage in a data pipeline.
• Created from an existing data set such as an HDFS file, or by
transforming another RDD (RDDs can be chained)
• Expressive APIs, much richer than MapReduce
• Transformations: map, filter, groupBy 
• Actions: count, save (RDDs can also be cached for reuse)
• => More efficient representation of Hive queries
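As a rough illustration of the RDD idea on this slide, here is a minimal pure-Python sketch. ToyRDD and its methods are hypothetical stand-ins written for this example, not the Spark API. The sketch only shows how transformations chain lazily and how an action triggers evaluation.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: a lazily evaluated, chainable dataset."""

    def __init__(self, source: Callable[[], Iterable]):
        # Keep a recipe (lineage) for producing the data, not the data itself.
        self._source = source

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy.
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    def collect(self) -> List:
        # Action: walks the whole chain and materializes the result.
        return list(self._source())


# Assumed sample data standing in for lines of an HDFS file.
lines = ToyRDD(lambda: ["us,1", "eu,2", "us,3"])
regions = lines.map(lambda s: s.split(",")[0]).filter(lambda r: r == "us")
result = regions.collect()
```

Because each ToyRDD stores only how to recompute itself from its parent, calling collect() a second time rebuilds the same result from the chain, which is loosely how lineage supports fault tolerance.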
Hive on Spark 
• Shark Project:
• AMPLab GitHub project, a fork of Hive
• Not maintained by the Hive community; sunset in 2014
• Hive on Spark:
• Done in the Hive community
• Architecturally compatible: keeps the same physical abstraction for Hive on
Spark as for Hive on Tez/MR.
• Code maintenance
• Maximize re-use of common functionality across execution engines
Hive on Spark 
• Hive on Spark, User Benefits
• Another seamless execution option (MR, Tez, Spark)
• Leverage Spark clusters already coming into use for ML, graph processing,
streaming, etc.
• Continued efficiency and performance improvements from the strong Spark
community.
High-Level Design 
Common across engines:
• HiveQL syntax
• Tool integrations (auditing plugins, authorization,
drivers, Thrift clients, UDFs, StorageHandlers)
• Logical optimizations
[Figure: Hive Query -> Logical Op Tree -> TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler) -> Task (MapRedTask, TezTask, SparkTask) -> Work (MapRedWork, TezWork, SparkWork)]
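One way to picture this design is a small dispatch table: the logical plan is shared, and only the last step picks an engine-specific compiler. The sketch below is illustrative only. The class names are modeled on the labels in the diagram, while the engine keys, the dictionary layout, and the placeholder optimizer are assumptions made for the example, not Hive's actual code.

```python
from typing import Dict, List, Type


class TaskCompiler:
    """Hypothetical base class: wraps a logical operator tree in engine-specific work."""

    task_name = "Task"
    work_name = "Work"

    def compile(self, op_tree: List[str]) -> Dict[str, object]:
        # A real compiler would split the tree at reduce boundaries;
        # this sketch only labels the operators with the engine's Task/Work names.
        return {"task": self.task_name, "work": self.work_name, "operators": list(op_tree)}


class MapRedCompiler(TaskCompiler):
    task_name, work_name = "MapRedTask", "MapRedWork"


class TezCompiler(TaskCompiler):
    task_name, work_name = "TezTask", "TezWork"


class SparkCompiler(TaskCompiler):
    task_name, work_name = "SparkTask", "SparkWork"


# Assumed engine keys for the example.
COMPILERS: Dict[str, Type[TaskCompiler]] = {
    "mr": MapRedCompiler,
    "tez": TezCompiler,
    "spark": SparkCompiler,
}


def logical_optimize(op_tree: List[str]) -> List[str]:
    # Placeholder for the logical optimizations shared by all engines.
    return list(op_tree)


def plan(op_tree: List[str], engine: str) -> Dict[str, object]:
    # Shared front end, engine-specific back end.
    return COMPILERS[engine]().compile(logical_optimize(op_tree))


ops = ["TableScan", "Filter", "Select", "GroupBy", "Select"]
plans = {engine: plan(ops, engine) for engine in COMPILERS}
```

In this sketch every engine sees the same operator list, and only the task and work labels differ, which is the re-use the slide describes.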
Simple Example 
Hive Query:
SELECT COUNT(*) FROM status_updates WHERE
ds = '2014-10-01' GROUP BY region;
Operator Tree:
TableScan (status_updates) -> Filter (ds='2014-10-01') -> Select (region) -> Group-By (count) -> Select
• The Group-By (GBY) triggers a reduce boundary
Simple Example 
MapRed Work Tree:
• Map->Reduce
Mapper: TableScan -> Filter -> Select -> Group-By -> ReduceSink
ShuffleSort
Reducer: GroupBy -> Select -> FileOutput
Simple Example 
Spark Work Tree:
• RDD Chain
mapPartitions(): TableScan -> Filter -> Select -> Group-By -> ReduceSink
groupBy() (no sorting)
mapPartitions(): GroupBy -> Select -> FileOutput
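The difference between the two work trees for the simple example can be sketched in plain Python. This is not Hive or Spark code. The sample rows and function names are made up for illustration, sorted() plus itertools.groupby stands in for the MapReduce shuffle-sort, and a hash-based Counter stands in for a grouping step that needs no sort.

```python
from collections import Counter
from itertools import groupby

# Assumed sample rows for the status_updates table.
rows = [
    {"ds": "2014-10-01", "region": "us"},
    {"ds": "2014-10-01", "region": "eu"},
    {"ds": "2014-09-30", "region": "us"},
    {"ds": "2014-10-01", "region": "us"},
]


def map_side(table):
    # TableScan -> Filter -> Select: emit the region of each matching row.
    return [r["region"] for r in table if r["ds"] == "2014-10-01"]


def count_with_shuffle_sort(keys):
    # MapReduce style: sort all keys, then group equal neighbours.
    return {k: sum(1 for _ in grp) for k, grp in groupby(sorted(keys))}


def count_without_sort(keys):
    # Grouping by hash: no ordering of keys is required.
    return dict(Counter(keys))


sorted_counts = count_with_shuffle_sort(map_side(rows))
hashed_counts = count_without_sort(map_side(rows))
```

Both paths give the same counts per region. The second simply skips the sort, which the query never asked for.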
Join Example 
Hive Query:
SELECT * FROM
(SELECT * FROM src WHERE src.key < 10) src1
JOIN
(SELECT * FROM src WHERE src.key < 10) src2
ORDER BY src1.key;
• Operator Tree:
TableScan -> Filter -> Select (one branch per subquery) -> Join -> Select -> Sort -> Select
• Join/Sort trigger Reduce
boundary
Join Example 
MapRed Work Tree
• 2 MapReduce Works
Work 1: Map (TableScan -> Filter -> Select -> ReduceSink), one per join input -> ShuffleSort -> Reduce (Join -> Select -> FileOutput)
Disk IO (HDFS) between the two works
Work 2: Map (TableScan -> ReduceSink (Sort)) -> ShuffleSort -> Reduce (Select -> FileOutput)
Join Example 
Spark Work Tree:
RDD Transform Chain
mapPartitions(): TableScan -> Filter -> Select -> ReduceSink (one per join input)
union()
Partition/Sort()
mapPartitions(): Join -> Select -> ReduceSink
sortBy()
mapPartitions(): Select -> FileOutput
No spill to disk
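The logical steps of the join example can be traced in a few lines of plain Python. This is only a sketch of the data flow, not of how Hive or Spark execute it. The sample table, the tagging scheme, and the function name are assumptions, and the partitioning step is left out. Since the query has no ON clause, the join is treated as a cross join.

```python
from itertools import product

# Assumed sample contents of the src table: (key, value) rows.
src = [(k, f"val_{k}") for k in (12, 3, 7, 3)]


def scan_filter(table, tag):
    # TableScan -> Filter -> Select -> ReduceSink: keep key < 10, tag the join side.
    return [(tag, row) for row in table if row[0] < 10]


# union(): both tagged inputs travel through one combined dataset.
tagged = scan_filter(src, "src1") + scan_filter(src, "src2")

# Join: split by tag and pair every src1 row with every src2 row.
left = [row for tag, row in tagged if tag == "src1"]
right = [row for tag, row in tagged if tag == "src2"]
joined = [l + r for l, r in product(left, right)]

# Sort: ORDER BY src1.key, which is the first column of each joined row.
ordered = sorted(joined, key=lambda row: row[0])
```

Everything here stays in memory from scan to final sort, which mirrors the point of the Spark work tree: the join result feeds the sort directly instead of being written out between two jobs.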
Improvements to Spark 
• Reduce-side join: SPARK-2978
• Spark had group() and sort(), but no partition+sort equivalent to the MR-style shuffle-sort.
• Can help other apps migrate from MapReduce to Spark
• Remote SparkContext (pushed down to the AM)
• Concurrent SparkContexts are not allowed in the client application process.
• SparkContext is heavyweight
• Spark monitoring APIs
• Elastic scaling of Spark applications: SPARK-3174
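To make the "partition+sort" idea concrete, here is a small pure-Python sketch of an MR-style shuffle-sort: records are routed to a partition by key, then each partition is sorted by key on its own. The function name, the modulo partitioner, and the sample data are assumptions for illustration. This is not the Spark implementation tracked in SPARK-2978.

```python
from typing import List, Tuple

Pair = Tuple[int, str]


def partition_and_sort(pairs: List[Pair], num_partitions: int) -> List[List[Pair]]:
    """Route each (key, value) pair to a partition, then sort within each partition."""
    partitions: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # Assumed partitioner: integer key modulo partition count.
        partitions[key % num_partitions].append((key, value))
    # Sorting happens per partition; there is no global order across partitions.
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


pairs = [(4, "d"), (1, "a"), (3, "c"), (2, "b"), (1, "z")]
shuffled = partition_and_sort(pairs, 2)
```

All records with the same key land in the same partition and sit next to each other after the sort, which is what a reduce-side join needs to match rows from both inputs.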
Community 
• Thanks to contributors from many organizations: 
• Follow our progress on HIVE-7292 
• Thank you!

October 2014 HUG : Hive On Spark
