A Container-based Sizing Framework
for Apache Hadoop/Spark Clusters
October 27, 2016
Hokkaido University
Akiyoshi SUGIKI, Phyo Thandar Thant
Agenda
Hokkaido University Academic Cloud
A Docker-based Sizing Framework for Hadoop
Multi-objective Optimization of Hadoop
Information Initiative Center, Hokkaido University
Founded in 1962 as a national supercomputing center
A member of
– HPCI (High Performance Computing Infrastructure): 12 institutes
– JHPCN (Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructure): 8 institutes
University R&D center for supercomputing, cloud computing, networking, and cyber security
Operating HPC twins
– Supercomputer (172 TFLOPS) and Academic Cloud System (43 TFLOPS)
Hokkaido University Academic Cloud (2011-)
Japan’s largest academic cloud system
– > 43 TFLOPS (> 114 nodes)
– ~2,000 VMs
– Supercomputer: SR16000 M1, 172 TF / 176 nodes, 22 TB memory (128 GB/node); AMS2500 file system: 600 TB (SAS, RAID5) and 300 TB (SATA, RAID6)
– Cloud System: BS2000, 44 TF / 114 nodes, 14 TB memory (128 GB/node), CloudStack 3.x; cloud storage 1.96 PB; AMS2300 boot file system: 260 TB (SAS, RAID6); VFP500N+AMS2500 NAS: 500 TB (near-line NAS, RAID6)
– Data-science Cloud System (added 2013-): HA8000/RS210HM, 80 GB x 25 nodes and 32 GB x 2 nodes, CloudStack 4.x; Hadoop package for “Big Data” (Hadoop, Hive, Mahout, and R)
Supporting “Big Data”
“Big Data” cluster package
• Hadoop, Hive, Mahout, and R
• MPI, OpenMP, and Torque
– Automatic deployment of VM-based clusters
– Custom scheduling policy
• Spread I/O across multiple disks
[Figure: a Hadoop cluster of four VMs (#1-#4), each with virtual disks spread across four storage systems; the Big Data package layers Hadoop, Hive, Mahout, and R on top.]
Lessons Learned (So Far)
No single Hadoop (a little like silos)
– Hadoop instance for each group of users
Version problems
– Upgrades and expansion of the Hadoop ecosystem
Strong demand for a “middle person”
– Someone who gives advice with a deep understanding of research domains,
statistical analysis, and Hadoop-based systems
[Figure: separate Hadoop clusters (#1-#3) of VMs, one per research group, all drawing on shared research data.]
Going Next
A new system will be installed in April 2018
– 2x CPU cores, 5x storage space
– Bare-metal, accelerating performance at every layer
– Supports both interclouds and hybrid clouds
Will still support Hadoop as well as Spark
– Cluster templates
– Building a user community
[Figure: the supercomputer system at Hokkaido U. connects to regions (Tokyo, Osaka, Okinawa) and to cloud systems in other universities and public clouds, sharing cluster templates (Hadoop, Spark, …).]
Requirements
Run Hadoop on multiple clouds
– Academic Clouds (Community Clouds)
• Hokkaido University Academic Cloud, ...
– Public Clouds
• Amazon AWS, Microsoft Azure, Google Cloud, …
Offer the best choice for researchers (our users)
– Under multiple criteria
• Cost
• Performance (time constraints)
• Energy
…
Our Solution
A Container-based Sizing Framework for Hadoop Clusters
– Docker-based
• Lightweight; migrates easily to other clouds
– Emulation (rather than simulation)
• Execution times close to actual runs on multiple clouds
– Output:
• Instance type
• Number of instances
• Hadoop configuration (*-site.xml files)
Architecture
[Figure: the emulation engine interposes on the Docker runtime’s CPU, memory, disk I/O, and network I/O for applications (HPC, Big Data, …), collects metrics, runs profiles against instance profiles (t2, m4, r3, c4), and feeds a cost estimator for public clouds.]
Why Docker?
Virtual Machines vs. OS Containers:
– Size: large vs. small
– Machine emulation: complete vs. partial (containers share the OS kernel)
– Launch time: long vs. short
– Migration: sometimes requires image conversion vs. easy
– Software: Xen, KVM, VMware vs. Docker, rkt, …
[Figure: VMs each run a full OS on a hypervisor; containers share a single OS kernel beneath their apps and libraries.]
Container Execution
Cluster Management
– Docker Swarm
– Multi-host (VXLAN-based) networking mode
Container
– Resources
• CPUs, memory, disk, and network I/O
– Regulation
• Docker run options, cgroups, and tc
– Monitoring
• Docker remote API and cgroups
Docker Image
“Hadoop all in the box”
– Hadoop
– Spark
– HiBench
The same image for master/slaves
Exports
– (Environment variables)
– File mounts
• *-site.xml files
– (Data volumes)
[Figure: one “Hadoop all in the box” image (Hadoop, Spark, HiBench) is reused on master and slaves; core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are supplied via volume mounts.]
Resources
– CPU cores: change the CPU set (docker run/cgroups)
– CPU clock rate: change quota & period (docker run/cgroups)
– Memory size: set a memory limit (docker run/cgroups)
– Out-of-memory (OOM): change out-of-memory handling (docker run/cgroups)
– Disk IOPS: throttle read/write IOPS (docker run/cgroups)
– Disk bandwidth: throttle read/write bytes/sec (docker run/cgroups)
– Network IOPS: throttle TX/RX IOPS (docker run/cgroups)
– Network bandwidth: throttle TX/RX bytes/sec (docker run/cgroups)
– Network latency: insert latency (> 1 ms) (tc)
– Freezer: suspend/resume (cgroups)
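As a sketch of how these controls map onto `docker run` options (a hypothetical helper; the flag values are illustrative, and network shaping must be done outside `docker run`, e.g. with tc):

```python
def docker_run_flags(cpus=None, cpu_quota=None, mem=None,
                     read_bps=None, write_bps=None, device="/dev/sda"):
    """Build `docker run` flags that emulate a target instance type.

    Network IOPS/bandwidth/latency are not `docker run` options; they are
    regulated separately with tc (e.g. netem for latency > 1 ms).
    """
    flags = []
    if cpus is not None:                      # CPU cores: pin the CPU set
        flags.append(f"--cpuset-cpus=0-{cpus - 1}")
    if cpu_quota is not None:                 # clock rate: quota & period
        flags += [f"--cpu-quota={cpu_quota}", "--cpu-period=100000"]
    if mem is not None:                       # memory size limit
        flags.append(f"--memory={mem}")
    if read_bps is not None:                  # disk bandwidth throttling
        flags.append(f"--device-read-bps={device}:{read_bps}")
    if write_bps is not None:
        flags.append(f"--device-write-bps={device}:{write_bps}")
    return flags

# A "medium"-like container: 4 cores, 12 GB memory, 100 MB/s disk reads
print(" ".join(docker_run_flags(cpus=4, mem="12g", read_bps="100mb")))
```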
Hadoop Configuration
Must be adjusted according to
– Instance type (CPU, memory, disk, and network)
– Number of instances
Targeting all parameters in *-site.xml
Dependent parameters
– (Instance type)
– YARN container size
– JVM heap size
– Map task size
– Reduce task size
[Figure: sizing chain: machine instance size → YARN container size → JVM heap size → map/reduce task size.]
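The dependency chain can be sketched as a small derivation. The 0.8 heap-to-container ratio matches the Group II/III values later in the deck; the OS reservation and one-container-per-vcore policy are assumptions for illustration:

```python
def derive_sizes(instance_mem_mb, instance_vcores, os_reserve_mb=2048):
    """Derive dependent Hadoop settings from a machine instance size."""
    # YARN gets whatever the OS does not reserve
    nm_mem = instance_mem_mb - os_reserve_mb
    # one YARN container per vcore (an illustrative policy)
    container_mb = nm_mem // max(instance_vcores, 1)
    # JVM heap is conventionally ~80% of its container
    heap_mb = int(container_mb * 0.8)
    return {
        "yarn.nodemanager.resource.memory-mb": nm_mem,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.child.java.opts": f"-Xmx{heap_mb}m",
    }

# "medium" instance from the price table: 12 GB memory, 4 cores
print(derive_sizes(12 * 1024, 4))
```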
Optimization
Multi-objective GAs
– Trading cost and performance (time constraints)
– Other factors: energy, …
– Future: multi-objective to many-objective (> 3)
Generate “Pareto-optimal Front”
Technique: non-dominated sorting
[Figure: a scatter of candidate solutions over Objective 1 vs. Objective 2, with the Pareto-optimal front along the lower-left boundary.]
(Short) Summary
A Sizing Framework for Hadoop/Spark Clusters
– OS container-based approach
– Combined with Genetic Algorithms
• Multi-objective optimization (cost & perf.)
Future Work
– Docker Container Executor (DCE)
• DCE runs YARN containers inside Docker containers
• Designed to provide a custom environment for each application
• We believe DCE can also be used for slowing down and speeding up Hadoop tasks
Slow Down - Torturing Hadoop
Make stragglers
No intervention is required
[Figure: a master coordinating map tasks (Map 1-5) and reduce tasks (Red 1-4), with two tasks slowed into stragglers.]
Speeding up - Accelerating Hadoop
Balance resource usage of tasks on the same node
[Figure: the same map/reduce layout; balancing per-node resource usage removes the stragglers.]
MHCO: Multi-Objective Hadoop
Configuration Optimization Using
Steady-State NSGA-II
Introduction
◦ The increasing use of connected devices driven by the Internet of Things, together with data growth from scientific research, will lead to an exponential increase in data
◦ A large portion of these data is underutilized or underexploited
◦ Hadoop MapReduce is a very popular programming model for large-scale data analytics
Problem Definition I
◦ Objective 1 → Parameter Tuning for Minimizing Execution Time
Configuration files:
– core-site.xml: configuration settings for the HDFS core, such as I/O settings
– hdfs-site.xml: configuration settings for HDFS daemons
– mapred-site.xml: configuration settings for MapReduce daemons
– yarn-site.xml: configuration settings for YARN daemons
◦ Hadoop provides tunable options that have a significant effect on application performance
◦ Practitioners and administrators lack the expertise to tune them
◦ Appropriate parameter configuration is a key factor in Hadoop performance
Problem Definition II
◦ Appropriate machine instance selection for the Hadoop cluster
◦ Objective 2 → Instance Type Selection for Minimizing Hadoop Cluster Deployment Cost
[Figure: an application sends a request to the service provider and receives a result; the provider offers machine instance types (small, medium, large, x-large) on a pay-per-use basis.]
Proposed Search-based Approach
ssNSGA-II:
1. Performance optimization – Hadoop parameter tuning
2. Deployment cost optimization – cluster instance type selection
◦ Chromosome encoding copes with the dynamic nature of Hadoop across version changes
◦ Uses a steady-state approach to reduce the computational overhead of a generational GA
◦ Bi-objective optimization (execution time, cluster deployment cost)
Objective Function
min t(p), min c(p)
where p = [p1, p2, …, pm] is the list of configuration parameters and the instance type, t(p) is the execution time of the MapReduce job, and c(p) is the machine instance usage cost.

t(p) = twc, the workload execution time
c(p) = (SP × NS) × t(p), where SP is the instance price and NS is the number of machine instances.

Assumptions:
– the two objective functions are black-box functions
– the number of instances in the cluster is static

Instance type: Mem (GB) / CPU cores, price per second (yen)
– X-large: 128/40, 0.0160
– Large: 30/10, 0.0039
– Medium: 12/4, 0.0016
– Small: 3/1, 0.0004
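Plugging the price table into c(p) = (SP × NS) × t(p):

```python
PRICE_PER_SEC = {"small": 0.0004, "medium": 0.0016,
                 "large": 0.0039, "x-large": 0.0160}

def deployment_cost(instance_type, num_instances, exec_time_sec):
    """c(p) = (SP * NS) * t(p): instance price x cluster size x runtime."""
    return PRICE_PER_SEC[instance_type] * num_instances * exec_time_sec

# e.g. 5 medium instances running a 100-second workload: about 0.8 yen
print(deployment_cost("medium", 5, 100))
```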
Parameter Grouping
I. HDFS and MapReduce parameters: 17
II. YARN parameters: 6
III. YARN-related MapReduce parameters: 7
(30 parameters in total)
Group II and III values follow the machine instance type specification (CPU, memory); Group I value ranges are taken from previous research.
Group I Parameter Values
Parameter name: value range
– dfs.namenode.handler.count: 10, 20
– dfs.datanode.handler.count: 10, 20
– dfs.blocksize: 134217728, 268435456
– mapreduce.map.output.compress: true, false
– mapreduce.job.jvm.numtasks: 1 (limited), -1 (unlimited)
– mapreduce.map.sort.spill.percent: 0.8, 0.9
– mapreduce.reduce.shuffle.input.buffer.percent: 0.7, 0.8
– mapreduce.reduce.shuffle.memory.limit.percent: 0.25, 0.5
– mapreduce.reduce.shuffle.merge.percent: 0.66, 0.9
– mapreduce.reduce.input.buffer.percent: 0.0, 0.5
– dfs.datanode.max.transfer.threads: 4096, 5120, 6144, 7168
– dfs.datanode.balance.bandwidthPerSec: 1048576, 2097152, 4194304, 8388608
– mapreduce.task.io.sort.factor: 10, 20, 30, 40
– mapreduce.task.io.sort.mb: 100, 200, 300, 400
– mapreduce.tasktracker.http.threads: 40, 45, 50, 60
– mapreduce.reduce.shuffle.parallelcopies: 5, 10, 15, 20
– mapreduce.reduce.merge.inmem.threshold: 1000, 1500, 2000, 2500
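One way to sample a configuration point from these value ranges (a sketch using a subset of the Group I parameters quoted above):

```python
import random

GROUP_I = {
    "dfs.namenode.handler.count": [10, 20],
    "dfs.blocksize": [134217728, 268435456],
    "mapreduce.map.output.compress": [True, False],
    "mapreduce.task.io.sort.factor": [10, 20, 30, 40],
    "mapreduce.task.io.sort.mb": [100, 200, 300, 400],
    "mapreduce.reduce.shuffle.parallelcopies": [5, 10, 15, 20],
}

def random_configuration(space, rng=random):
    """Pick one value per parameter, i.e. one point in the search space."""
    return {name: rng.choice(values) for name, values in space.items()}

print(random_configuration(GROUP_I))
```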
Group II and III Parameter Values
YARN parameters (x-large / large / medium / small):
– yarn.nodemanager.resource.memory-mb: 102400 / 26624 / 10240 / 3072
– yarn.nodemanager.resource.cpu-vcores: 39 / 9 / 3 / 1
– yarn.scheduler.maximum-allocation-mb: 102400 / 26624 / 10240 / 3072
– yarn.scheduler.minimum-allocation-mb: 5120 / 2048 / 2048 / 1024
– yarn.scheduler.maximum-allocation-vcores: 39 / 9 / 3 / 1
– yarn.scheduler.minimum-allocation-vcores: 10 / 3 / 1 / 1
– mapreduce.map.memory.mb: 5120 / 2048 / 2048 / 1024
– mapreduce.reduce.memory.mb: 10240 / 4096 / 2048 / 1024
– mapreduce.map.cpu.vcores: 10 / 3 / 1 / 1
– mapreduce.reduce.cpu.vcores: 10 / 3 / 1 / 1
– mapreduce.child.java.opts: 8192 / 3277 / 1638 / 819
– yarn.app.mapreduce.am.resource.mb: 10240 / 4096 / 2048 / 1024
– yarn.app.mapreduce.am.command-opts: 8192 / 3277 / 1638 / 819
Chromosome Encoding
A binary chromosome encodes the HDFS and MapReduce parameters plus the machine instance type: a single bit or two consecutive bits represent each parameter value and the instance type. The YARN parameters and YARN-related MapReduce parameters are dependent parameters, derived from the selected instance type (e.g., small).
Chromosome length = 26 bits
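One consistent reading of the 26-bit layout is 1 bit for each two-valued parameter, 2 bits for each four-valued parameter, and 2 trailing bits for the instance type (10 × 1 + 7 × 2 + 2 = 26); the exact bit order below is an assumption:

```python
INSTANCE_TYPES = ["small", "medium", "large", "x-large"]

def decode(bits, one_bit_params, two_bit_params):
    """Decode a binary chromosome into (configuration, instance type).

    Layout (assumed): 1 bit per two-valued parameter, then 2 bits per
    four-valued parameter, then 2 trailing bits for the instance type.
    """
    assert len(bits) == len(one_bit_params) + 2 * len(two_bit_params) + 2
    conf, i = {}, 0
    for name, values in one_bit_params:
        conf[name] = values[bits[i]]
        i += 1
    for name, values in two_bit_params:
        conf[name] = values[2 * bits[i] + bits[i + 1]]
        i += 2
    instance = INSTANCE_TYPES[2 * bits[i] + bits[i + 1]]
    return conf, instance

# Tiny 5-bit example: one 1-bit parameter, one 2-bit parameter, 2 type bits
one_bit = [("dfs.namenode.handler.count", [10, 20])]
two_bit = [("mapreduce.task.io.sort.factor", [10, 20, 30, 40])]
conf, inst = decode([1, 1, 0, 0, 0], one_bit, two_bit)
print(conf, inst)   # handler.count = 20, sort.factor = 30, instance "small"
```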
System Architecture
[Figure: the ssNSGA-II optimizer submits a workload with candidate configurations to the Hadoop cluster (ResourceManager plus NodeManagers), measures execution time and cluster deployment cost, and outputs a list of optimal settings.]
ssNSGA-II Based Hadoop Configuration Optimization
1. Generate n sample configuration chromosomes C1, C2, …, Cn
2. Select 2 random parents P1, P2
3. Perform 2-point crossover on P1, P2 (probability Pc = 1)
4. Generate offspring Coffspring
5. Perform mutation on Coffspring (probability Pm = 0.1)
6. Calculate the fitness of Coffspring
7. Update population P and perform non-dominated sorting
8. Repeat from step 2 until the stop condition holds, then output the Pareto solution list Copt
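The loop above can be sketched compactly. The survival rule below (the offspring replaces any individual it dominates) is a simplification of full NSGA-II ranking with crowding distance, so this shows the steady-state control flow, not the exact algorithm; the toy `evaluate` stands in for actually running a workload:

```python
import random

def evaluate(ch):
    """Stand-in for running the workload: two objectives to minimize."""
    time = sum(ch) + 1                  # pretend more 1-bits means slower
    cost = len(ch) - sum(ch) + 1        # and fewer 1-bits means pricier
    return (time, cost)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def ss_loop(n_bits=26, pop_size=30, evaluations=180, pm=0.1, rng=random):
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fit = [evaluate(c) for c in pop]
    for _ in range(evaluations):
        p1, p2 = rng.sample(range(pop_size), 2)          # two random parents
        a, b = sorted(rng.sample(range(1, n_bits), 2))   # 2-point crossover, Pc = 1
        child = pop[p1][:a] + pop[p2][a:b] + pop[p1][b:]
        child = [bit ^ (rng.random() < pm) for bit in child]  # mutation, Pm = 0.1
        cf = evaluate(child)
        # steady state: the child replaces one individual it dominates, if any
        for i in range(pop_size):
            if dominates(cf, fit[i]):
                pop[i], fit[i] = child, cf
                break
    # non-dominated sorting of the final population -> Pareto list
    return [f for f in fit if not any(dominates(g, f) for g in fit)]

print(ss_loop())
```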
Experiment Benchmark
– MicroBenchmark: Sort, TeraSort, WordCount; input size 2.98023 GB; measures cluster performance (intrinsic behavior of the cluster)
– Web Search: PageRank; 5,000 pages with 3 iterations; measures execution performance for real-world big data applications
Benchmark used: HiBench benchmark suite version 4.0,
https://github.com/intel-hadoop/HiBench/releases
Experiment Environment
– CPU: Intel Xeon E7-8870 (40 cores)
– Memory: 128 GB RAM
– Storage: 400 TB
– Hadoop version: 2.7.1
– JDK: 1.8.0
6-node cluster: 1 NameNode and 5 DataNodes, accessed over the public network; the ssNSGA-II optimization drives the cluster.
Experimental Results
[Figure: Pareto plots of cost (yen) vs. execution time (sec) for the Sort workload (about 0-200 sec, up to 8 yen) and the TeraSort workload (about 30-80 sec, up to 6 yen), with one series per instance type (small, medium, large, x-large).]
Population size = 30, number of evaluations = 180, number of objectives = 2, mutation probability = 0.1, crossover probability = 1.0
* These workloads show significant effects from the HDFS and MapReduce parameters
Experimental Results Cont’d
[Figure: Pareto plots of cost (yen) vs. execution time (sec) for the PageRank workload (about 50-300 sec, up to 18 yen) and the WordCount workload (about 0-600 sec, up to 8 yen), with one series per instance type (small, medium, large, x-large).]
* These workloads depend on the YARN and YARN-related parameters more than on the HDFS and MapReduce parameters
Population size = 30, number of evaluations = 180, number of objectives = 2, mutation probability = 0.1, crossover probability = 1.0
Conclusion & Continuing Work
◦ Offline Hadoop configuration optimization using the ssNSGA-II based search strategy
◦ An x-large instance type cluster is not a suitable option for the current workloads and input data size
◦ Large or medium instance type clusters show the best balance for our objective functions
◦ Continuing work: dynamic cluster resizing through containers and online configuration optimization of MapReduce workloads for scientific workflow applications, for effective big data processing
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 

Recently uploaded (20)

Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Safe Software
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
 
2025_06_18 - OpenMetadata Community Meeting.pdf
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
Quantum AI: Where Impossible Becomes Probable
Quantum AI: Where Impossible Becomes Probable
Saikat Basu
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Daily Lesson Log MATATAG ICT TEchnology 8
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
The Future of Product Management in AI ERA.pdf
The Future of Product Management in AI ERA.pdf
Alyona Owens
 
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
Cyber Defense Matrix Workshop - RSA Conference
Cyber Defense Matrix Workshop - RSA Conference
Priyanka Aash
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
Safe Software
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
Tech-ASan: Two-stage check for Address Sanitizer - Yixuan Cao.pdf
caoyixuan2019
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Smarter Aviation Data Management: Lessons from Swedavia Airports and Sweco
Safe Software
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
 
2025_06_18 - OpenMetadata Community Meeting.pdf
2025_06_18 - OpenMetadata Community Meeting.pdf
OpenMetadata
 
Quantum AI: Where Impossible Becomes Probable
Quantum AI: Where Impossible Becomes Probable
Saikat Basu
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Daily Lesson Log MATATAG ICT TEchnology 8
Daily Lesson Log MATATAG ICT TEchnology 8
LOIDAALMAZAN3
 
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
CapCut Pro Crack For PC Latest Version {Fully Unlocked} 2025
pcprocore
 
The Future of Product Management in AI ERA.pdf
The Future of Product Management in AI ERA.pdf
Alyona Owens
 
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
Cyber Defense Matrix Workshop - RSA Conference
Cyber Defense Matrix Workshop - RSA Conference
Priyanka Aash
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
AI Agents and FME: A How-to Guide on Generating Synthetic Metadata
Safe Software
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 

A Container-based Sizing Framework for Apache Hadoop/Spark Clusters

  • 1. A Container-based Sizing Framework for Apache Hadoop/Spark Clusters October 27, 2016 Hokkaido University Akiyoshi SUGIKI, Phyo Thandar Thant
  • 2. Agenda Hokkaido University Academic Cloud A Docker-based Sizing Framework for Hadoop Multi-objective Optimization of Hadoop 1
  • 3. Information Initiative Center, Hokkaido University Founded in 1962 as a national supercomputing center A member of – HPCI (High Performance Computing Infrastructure) - 12 institutes – JHPCN (Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructure) - 8 institutes University R&D center for supercomputing, cloud computing, networking, and cyber security Operating HPC twins – Supercomputer (172 TFLOPS) and Academic Cloud System (43 TFLOPS) 2
  • 4. Hokkaido University Academic Cloud (2011-) Japan’s largest academic cloud system – > 43 TFLOPS (> 114 nodes) – ~2,000 VMs
    Supercomputer: SR16000 M1, 172 TF/176 nodes, 22 TB memory (128 GB/node); AMS2500 file system, 600 TB (SAS, RAID5) + 300 TB (SATA, RAID6)
    Cloud System: BS2000, 44 TF/114 nodes, 14 TB memory (128 GB/node); cloud storage 1.96 PB; AMS2300 boot file system, 260 TB (SAS, RAID6); VFP500N+AMS2500 (NAS), 500 TB (near-line NAS, RAID6); CloudStack 3.x
    Data-science Cloud System (added 2013-): HA8000/RS210HM, 80 GB x 25 nodes + 32 GB x 2 nodes; CloudStack 4.x; Hadoop package for “Big Data” (Hadoop, Hive, Mahout, and R)
  • 5. Supporting “Big Data” “Big Data” cluster package • Hadoop, Hive, Mahout, and R • MPI, OpenMP, and Torque – Automatic deployment of VM-based clusters – Custom scheduling policy • Spread I/O on multiple disks
    [Figure: a Hadoop cluster (Hadoop, Hive, Mahout, R) of four VMs, with each VM’s virtual disk placed on a different storage unit]
  • 6. Lessons Learned (So Far) No single Hadoop (a little like silos) – Hadoop instance for each group of users Version problem – Upgrades and expansion of Hadoop ecosystem Strong demand of a middle person – Gives advice with deep understanding of research domains, statistical analysis, and Hadoop-based systems
    [Figure: three research groups, each with its own VM-based Hadoop cluster and its own research data]
  • 7. Going Next A new system will be installed in April 2018 – x2 CPU cores, x5 storage space – Bare-metal, accelerating performance at every layer – Supports both interclouds and hybrid clouds Still supports Hadoop as well as Spark – Cluster templates – Build user community
    [Figure: the supercomputer system at Hokkaido U. connects to cluster templates (Hadoop, Spark, …) and to cloud systems in other regions (Tokyo, Osaka, Okinawa), other universities, and public clouds]
  • 8. Requirements Run Hadoop on multiple Clouds – Academic Clouds (Community Clouds) • Hokkaido University Academic Cloud, ... – Public Clouds • Amazon AWS, Microsoft Azure, Google Cloud, … Offer the best choice for researchers (our users) – Under multiple criteria • Cost • Performance (time constraints) • Energy … 7
  • 9. Our Solution A Container-based Sizing Framework for Hadoop Clusters – Docker-based • Light-weight, easily migrate to other clouds – Emulation (rather than simulation) • Close to actual execution times on multiple clouds – Output: • Instance type • Number of instances • Hadoop configuration (*-site.xml files) 8
  • 10. Architecture
    [Figure: applications (HPC, big data, …) run on the Docker runtime; the emulation engine interposes on CPU, memory, disk I/O, and network I/O and collects metrics; run profiles and instance profiles (t2, m4, r3, c4) feed a cost estimator for public clouds]
  • 11. Why Docker?
    – Size: virtual machines are large; OS containers are small
    – Machine emulation: complete for VMs; partial for containers (they share the OS kernel)
    – Launch time: long for VMs; short for containers
    – Migration: VMs sometimes require image conversion; containers migrate easily
    – Software: Xen, KVM, VMware for VMs; Docker, rkt, … for containers
  • 12. Container Execution Cluster Management – Docker Swarm – Multi-host (VXLAN-based) networking mode Container – Resources • CPUs, memory, disk, and network I/O – Regulation • Docker run options, cgroups, and tc – Monitoring • Docker remote API and cgroups 11
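The regulation via docker run options mentioned above can be sketched as a small helper that assembles the resource flags for a container. A minimal illustration: the flag names (`--cpuset-cpus`, `--cpu-quota`, `--cpu-period`, `--memory`, `--device-read-bps`) are real Docker CLI options, but the helper itself is hypothetical, not part of the framework.

```python
def docker_run_flags(cpus=None, cpuset=None, memory=None, read_bps=None):
    """Build `docker run` resource flags from a resource spec (illustrative)."""
    flags = []
    if cpuset is not None:
        flags += ["--cpuset-cpus", cpuset]       # pin to specific CPU cores
    if cpus is not None:
        # CFS quota/period: e.g. cpus=2 -> quota 200000 us per 100000 us period
        flags += ["--cpu-quota", str(int(cpus * 100000)),
                  "--cpu-period", "100000"]
    if memory is not None:
        flags += ["--memory", memory]            # hard memory limit (OOM beyond)
    if read_bps is not None:
        # throttle read bandwidth on a device, e.g. "/dev/sda:10mb"
        flags += ["--device-read-bps", read_bps]
    return flags
```

The same limits can also be set after launch by writing to the container's cgroup files, which is what the table on the next slide refers to.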
  • 13. Docker Image “Hadoop all in the box” – Hadoop – Spark – HiBench The same image for master/slaves Exports – (Environment variables) – File mounts • *-site.xml files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) – (Data volumes)
    [Figure: the “Hadoop all in the box” image (Hadoop, Spark, HiBench) with the *-site.xml files provided through volume mounts]
  • 14. Resources
    – CPU: cores (change CPU set), clock rate (change quota & period); via docker run/cgroups
    – Memory: size (set memory limit), out-of-memory (change OOM handling); via docker run/cgroups
    – Disk: IOPS (throttle read/write IOPS), bandwidth (throttle read/write bytes/sec); via docker run/cgroups
    – Network: IOPS (throttle TX/RX IOPS), bandwidth (throttle TX/RX bytes/sec); via docker run/cgroups; latency (insert latency > 1 ms) via tc
    – Freezer: freeze (suspend/resume); via cgroups
  • 15. Hadoop Configuration Must be adjusted according to – Instance type (CPU, memory, disk, and network) – Number of instances Targeting all parameters in *-site.xml Dependent parameters – (Instance type) – YARN container size – JVM heap size – Map task size – Reduce task size
    [Figure: dependency chain from machine instance size to YARN container size, JVM heap size, and map/reduce task size]
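The dependency chain above can be sketched as a sizing helper that derives the dependent parameters from an instance type. This is a minimal sketch: the 2 GB OS reservation, the one-container-per-vcore split, the 2x reducer memory, and the 0.8 heap-to-container ratio are common rules of thumb, not values from the talk.

```python
def size_hadoop_params(node_mem_mb, node_vcores, reserved_mb=2048):
    """Derive dependent Hadoop parameters from an instance's memory and cores.

    Illustrative only: the ratios are conventional defaults, not the
    framework's tuned values."""
    yarn_mem = node_mem_mb - reserved_mb               # memory YARN may allocate
    container_mb = max(1024, yarn_mem // node_vcores)  # one container per vcore
    return {
        "yarn.nodemanager.resource.memory-mb": yarn_mem,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.reduce.memory.mb": 2 * container_mb,   # reducers often get 2x
        "mapreduce.map.java.opts": "-Xmx%dm" % int(0.8 * container_mb),
    }
```

For example, a "medium"-like node with 12 GB of memory and 4 vcores yields a 10240 MB YARN pool and 2560 MB map containers with a 2048 MB heap.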
  • 16. Optimization Multi-objective GAs – Trading cost and performance (time constraints) – Other factors: energy, … – Future: multi-objective to many-objective (> 3) Generate “Pareto-optimal Front” Technique: non-dominated sorting
    [Figure: a Pareto-optimal front of points in a two-objective space]
  • 17. (Short) Summary A Sizing Framework for Hadoop/Spark Clusters – OS container-based approach – Combined with Genetic Algorithms • Multi-objective optimization (cost & perf.) Future Work – Docker Container Executor (DCE) • DCE runs YARN containers inside Docker ones • Designed to provide a custom environment for each app. • We believe DCE can also be utilized for slowing down and speeding up Hadoop tasks 16
  • 18. Slow Down - Torturing Hadoop Make stragglers No intervention is required
    [Figure: map and reduce tasks on a cluster, with two tasks marked as stragglers]
  • 19. Speeding up - Accelerating Hadoop Balance resource usage of tasks on the same node
    [Figure: map and reduce tasks on a cluster, with two tasks marked as stragglers]
  • 20. MHCO: Multi-Objective Hadoop Configuration Optimization Using Steady-State NSGA-II
  • 21. Introduction BIG DATA ◦ The increasing use of connected devices driven by the Internet of Things, and data growth from scientific research, will lead to an exponential increase in data ◦ A portion of these data is underutilized or underexploited ◦ Hadoop MapReduce is a very popular programming model for large-scale data analytics
  • 22. Problem Definition I ◦ Objective 1: Parameter Tuning for Minimizing Execution Time
    – core-site.xml: configuration settings for Hadoop core, such as I/O settings
    – hdfs-site.xml: configuration settings for HDFS daemons
    – mapred-site.xml: configuration settings for MapReduce daemons
    – yarn-site.xml: configuration settings for YARN daemons
    ◦ Hadoop provides tunable options that have a significant effect on application performance ◦ Practitioners and administrators lack the expertise to tune them ◦ Appropriate parameter configuration is the key factor in Hadoop
  • 23. Problem Definition II ◦ Appropriate machine instance selection for the Hadoop cluster ◦ Objective 2: Instance Type Selection for Minimizing Hadoop Cluster Deployment Cost
    [Figure: a user sends an application request to the service provider and receives the result; machine instance types (small, medium, large, x-large) are billed pay-per-use]
  • 24. Proposed Search-based Approach ssNSGA-II ◦ Objective 1: performance optimization through Hadoop parameter tuning ◦ Objective 2: deployment cost optimization through cluster instance type selection ◦ Chromosome encoding can handle the dynamic nature of Hadoop across version changes ◦ Uses the steady-state approach to reduce the computation overhead of the generational GA approach ◦ Bi-objective optimization (execution time, cluster deployment cost)
  • 25. Objective Function
    min t(p), min c(p)
    where p = [p1, p2, …, pm] is the configuration parameter list and instance type, t(p) is the execution time of the MapReduce job, and c(p) is the machine instance usage cost
    t(p) = twc
    c(p) = (SP * NS) * t(p)
    where twc = workload execution time, SP = instance price, and NS = number of machine instances
    Assumptions: the two objective functions are black-box functions; the number of instances in the cluster is static
    Instance prices (per second, in yen): x-large (128 GB / 40 cores): 0.0160; large (30 GB / 10 cores): 0.0039; medium (12 GB / 4 cores): 0.0016; small (3 GB / 1 core): 0.0004
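The cost model follows directly from the formulas and the price table above; a minimal sketch, with prices taken from the slide:

```python
# Per-second instance prices (yen), from the slide.
PRICE = {"small": 0.0004, "medium": 0.0016, "large": 0.0039, "x-large": 0.0160}

def objectives(t_wc, instance_type, n_instances):
    """Return (t(p), c(p)): execution time and deployment cost,
    with c(p) = SP * NS * t(p) as defined on the slide."""
    cost = PRICE[instance_type] * n_instances * t_wc
    return t_wc, cost
```

For example, a 100-second run on five medium instances costs 0.0016 * 5 * 100 = 0.8 yen.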
  • 26. Parameter Grouping 30 parameters in total, selected with reference to previous research:
    I. HDFS and MapReduce parameters (17)
    II. YARN parameters (6)
    III. YARN-related MapReduce parameters (7)
    Groups II and III follow the machine instance type specification (CPU, memory)
  • 27. Group I Parameter Values
    Parameters with two candidate values:
    – dfs.namenode.handler.count: 10, 20
    – dfs.datanode.handler.count: 10, 20
    – dfs.blocksize: 134217728, 268435456
    – mapreduce.map.output.compress: true, false
    – mapreduce.job.jvm.numtasks: 1 (limited), -1 (unlimited)
    – mapreduce.map.sort.spill.percent: 0.8, 0.9
    – mapreduce.reduce.shuffle.input.buffer.percent: 0.7, 0.8
    – mapreduce.reduce.shuffle.memory.limit.percent: 0.25, 0.5
    – mapreduce.reduce.shuffle.merge.percent: 0.66, 0.9
    – mapreduce.reduce.input.buffer.percent: 0.0, 0.5
    Parameters with four candidate values:
    – dfs.datanode.max.transfer.threads: 4096, 5120, 6144, 7168
    – dfs.datanode.balance.bandwidthPerSec: 1048576, 2097152, 4194304, 8388608
    – mapreduce.task.io.sort.factor: 10, 20, 30, 40
    – mapreduce.task.io.sort.mb: 100, 200, 300, 400
    – mapreduce.tasktracker.http.threads: 40, 45, 50, 60
    – mapreduce.reduce.shuffle.parallelcopies: 5, 10, 15, 20
    – mapreduce.reduce.merge.inmem.threshold: 1000, 1500, 2000, 2500
  • 28. Group II and III Parameter Values (x-large / large / medium / small)
    – yarn.nodemanager.resource.memory-mb: 102400 / 26624 / 10240 / 3072
    – yarn.nodemanager.resource.cpu-vcores: 39 / 9 / 3 / 1
    – yarn.scheduler.maximum-allocation-mb: 102400 / 26624 / 10240 / 3072
    – yarn.scheduler.minimum-allocation-mb: 5120 / 2048 / 2048 / 1024
    – yarn.scheduler.maximum-allocation-vcores: 39 / 9 / 3 / 1
    – yarn.scheduler.minimum-allocation-vcores: 10 / 3 / 1 / 1
    – mapreduce.map.memory.mb: 5120 / 2048 / 2048 / 1024
    – mapreduce.reduce.memory.mb: 10240 / 4096 / 2048 / 1024
    – mapreduce.map.cpu.vcores: 10 / 3 / 1 / 1
    – mapreduce.reduce.cpu.vcores: 10 / 3 / 1 / 1
    – mapreduce.child.java.opts: 8192 / 3277 / 1638 / 819
    – yarn.app.mapreduce.am.resource.mb: 10240 / 4096 / 2048 / 1024
    – yarn.app.mapreduce.am.command-opts: 8192 / 3277 / 1638 / 819
  • 29. Chromosome Encoding
    Binary chromosome: a single bit or two consecutive bits represent each parameter value and the instance type
    HDFS and MapReduce parameters are encoded directly; the YARN and YARN-related MapReduce parameters are dependent parameters determined by the encoded machine instance type
    Chromosome length = 26 bits
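A decoder for such a chromosome might look as follows. Only the single-bit/two-bit scheme and the instance-type selection come from the slides; the exact bit layout (single-bit parameters first, then two-bit parameters, then two instance-type bits) and the small parameter subset shown are illustrative assumptions.

```python
ONE_BIT = [  # 2-valued parameters: one bit each (subset, for illustration)
    ("mapreduce.map.output.compress", ("false", "true")),
]
TWO_BIT = [  # 4-valued parameters: two bits each (subset, for illustration)
    ("mapreduce.task.io.sort.mb", (100, 200, 300, 400)),
]
INSTANCES = ("small", "medium", "large", "x-large")

def decode(bits):
    """Map a bit list to a configuration dict plus an instance type.
    Assumed layout: one-bit params, then two-bit params, then 2 instance bits."""
    conf, i = {}, 0
    for name, values in ONE_BIT:
        conf[name] = values[bits[i]]
        i += 1
    for name, values in TWO_BIT:
        conf[name] = values[2 * bits[i] + bits[i + 1]]
        i += 2
    instance = INSTANCES[2 * bits[i] + bits[i + 1]]
    return conf, instance
```

With the full 17 Group I parameters (10 one-bit + 7 two-bit) plus the 2 instance-type bits, the layout reaches the 26-bit length stated on the slide.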
  • 31. ssNSGA-II Based Hadoop Configuration Optimization
    1. Generate n sample configuration chromosomes C1, C2, …, Cn
    2. Select 2 random parents P1, P2
    3. Perform 2-point crossover on P1, P2 (probability Pc = 1) to generate offspring Coffspring
    4. Perform mutation on Coffspring (probability Pm = 0.1)
    5. Calculate the fitness of Coffspring and update population P
    6. Perform non-dominated sorting and update population P
    7. While the repeat condition holds, go to step 2; otherwise output the Pareto solutions list Copt
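The steady-state update (steps 5-6) can be sketched as follows. This is a simplification: it removes the solution dominated by the most others, whereas the real ssNSGA-II uses full non-dominated sorting plus crowding distance to pick the member to discard.

```python
def dominates(a, b):
    """True if objective vector a dominates b (both objectives minimized:
    execution time and deployment cost)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def steady_state_step(pop, offspring, evaluate):
    """One steady-state update: insert the offspring into the population,
    then drop the solution dominated by the most others (simplified)."""
    pool = pop + [offspring]
    scores = [evaluate(p) for p in pool]
    dom = [sum(dominates(scores[j], scores[i]) for j in range(len(pool)))
           for i in range(len(pool))]
    worst = max(range(len(pool)), key=dom.__getitem__)
    return [p for i, p in enumerate(pool) if i != worst]
```

Here `evaluate` would run the HiBench workload in the emulated cluster and return (time, cost); in the sketch it can be any function mapping a solution to an objective pair.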
  • 32. Experiment Benchmark
    Benchmark used: HiBench benchmark suite version 4.0 (https://github.com/intel-hadoop/HiBench/releases)
    – MicroBenchmark (Sort, TeraSort, Wordcount; input size 2.98023 GB): measures cluster performance (intrinsic behavior of the cluster)
    – Web Search (PageRank; 5,000 pages with 3 iterations): measures execution performance for real-world big data applications
  • 33. Experiment Environment
    Setup: Intel Xeon E7-8870 (40 cores), 128 GB RAM, 400 TB storage; Hadoop 2.7.1, JDK 1.8.0
    6-node cluster: 1 NameNode and 5 DataNodes; ssNSGA-II optimization runs on the NameNode; users access the cluster over a public network
  • 34. Experimental Results
    [Plots: cost (yen) vs. execution time (sec) for the Sort and TeraSort workloads, with Pareto points for the small, medium, large, and x-large instance types]
    Population size = 30, number of evaluations = 180, number of objectives = 2, mutation probability = 0.1, crossover probability = 1.0
    * Significant effects from the HDFS and MapReduce parameters
  • 35. Experimental Results Cont’d
    [Plots: cost (yen) vs. execution time (sec) for the PageRank and Wordcount workloads, with Pareto points for the small, medium, large, and x-large instance types]
    Population size = 30, number of evaluations = 180, number of objectives = 2, mutation probability = 0.1, crossover probability = 1.0
    * Results depend on the YARN and related parameters more than on the HDFS and MapReduce parameters
  • 36. Conclusion & Continuing Work ◦ Offline Hadoop configuration optimization using the ssNSGA-II based search strategy ◦ An x-large instance type cluster is not a suitable option for the current workloads and input data size ◦ Large or medium instance type clusters show the best balance for our objective functions ◦ Continuing work: dynamic cluster resizing through containers and online configuration optimization of MapReduce workloads for scientific workflow applications, for effective big data processing

Editor's Notes

  • #23: Tell about the configuration files a little. The slaves file contains a list of hosts, one per line, that host the DataNode and TaskTracker servers. The masters file contains a list of hosts, one per line, that host Secondary NameNode servers; the masters file at the master server informs the Hadoop daemon of the Secondary NameNode location. core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster; it contains the configuration settings for Hadoop core, such as I/O settings common to HDFS and MapReduce. hdfs-site.xml contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes; here we can configure hdfs-site.xml to specify default block replication and permission checking on HDFS. mapred-site.xml contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. yarn-site.xml is for the ResourceManager and NodeManager.
  • #24: The service provider provides services on a pay-per-use basis. Instance prices differ according to the instance type.
  • #25: In business and health, big data allows us to leverage all types of data to gain insights and add value.
  • #27: In order to optimize these two objectives, we need to select sensitive parameters; a total of 30 parameters are selected. 17 parameters are for general Hadoop configuration optimization for execution performance; these 17 parameters are encoded. The other 13 parameters are dependent parameters set according to the encoded machine instance type, for dynamic machine instance type optimization during execution.
  • #29: Group 2 and Group 3 parameters differ according to the instance type; the table shows the associated parameter values for the various machine instance types.
  • #31: Why is the steady-state algorithm selected? Nebro [1] states that ssNSGA-II outperforms generational NSGA-II in terms of quality, convergence speed, and computing time. Specify the cloud in this case.
  • #33: Description of the workloads (what kind of tasks they perform). Other benchmarks only include workloads for measuring cluster performance. HiBench is a realistic and comprehensive benchmark suite for Hadoop, developed by Intel, to properly evaluate and characterize the Hadoop framework: it supports dynamic input size changes and evaluates both hardware and software.
  • #34: Specify that the genetic algorithm runs on the NameNode of the Hadoop cluster.
  • #35: Why is the large instance type's execution time shorter than the x-large instance type's? How long does it take to run the experiment for each of these workloads? Because Hadoop MapReduce workloads are costly to execute, we could only obtain intermediate optimized solution results; for each workload, 150 evaluations take 1 or 2 days to produce these intermediate results.
  • #36: The plots show overlapping points; there is a big difference only for a single machine type, so further experiments are necessary.
  • #37: In business and health, big data allows us to leverage all types of data to gain insights and add value.