Debunking the Myths of
HDFS Erasure Coding Performance
Replication is Expensive
 HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
 200% storage overhead
 Secondary replicas rarely accessed
Erasure Coding Saves Storage
 Simplified Example: storing 2 bits
 Same data durability
- can lose any 1 bit
 Half the storage overhead
 Slower recovery
Replication: store 1 0, plus a copy 1 0 (2 extra bits)
XOR Coding: store 1 0, plus parity 1 ⊕ 0 = 1 (1 extra bit)
Erasure Coding Saves Storage
 Facebook
- f4 stores 65PB of BLOBs in EC
 Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
 Google File System
- Large portion of data stored in EC
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
3-way Replication: Data Durability = 2, Storage Efficiency = 1/3 (33%)
(one replica is useful data; the two extra replicas are redundant data)
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
XOR: Data Durability = 1, Storage Efficiency = 2/3 (67%)
(two cells of useful data, one parity cell of redundant data)
X  Y  X ⊕ Y
0  0    0
0  1    1
1  0    1
1  1    0
Recovery example: if Y is lost, Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
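To make the XOR recovery concrete, here is a minimal, self-contained sketch in Java (illustrative only, not HDFS code): the parity cell is the XOR of the data cells, and any single lost cell is rebuilt by XOR-ing the survivors with the parity.

```java
// Minimal illustration of XOR parity (not HDFS code): one parity array
// protects the data arrays against the loss of any single array.
public class XorParityDemo {
    // Encode: parity[i] = data0[i] ^ data1[i] ^ ... for every byte position.
    static byte[] encode(byte[][] data) {
        byte[] parity = new byte[data[0].length];
        for (byte[] cell : data) {
            for (int i = 0; i < cell.length; i++) {
                parity[i] ^= cell[i];
            }
        }
        return parity;
    }

    // Recover a single lost cell by XOR-ing the surviving cells with the parity.
    static byte[] recover(byte[][] survivingData, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] cell : survivingData) {
            for (int i = 0; i < cell.length; i++) {
                lost[i] ^= cell[i];
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        byte[] x = {0};                                  // X = 0
        byte[] y = {1};                                  // Y = 1
        byte[] parity = encode(new byte[][]{x, y});      // X ⊕ Y = 1
        // Pretend Y is lost; rebuild it: Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
        byte[] recovered = recover(new byte[][]{x}, parity);
        System.out.println("Recovered Y = " + recovered[0]);   // prints 1
    }
}
```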
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!
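As the speaker notes for this slide explain, RS multiplies a vector of k data cells by a generator matrix to produce k data cells plus m parity cells. A schematic of the systematic form for the RS example above (k = 4 data cells, m = 2 parity cells; the actual matrix and field arithmetic used by HDFS are not shown here):

```latex
% Systematic RS encoding: the generator matrix is the identity stacked on a
% parity matrix P, so the data cells appear unchanged in the codeword.
\underbrace{\begin{pmatrix} I_{4} \\ P_{2\times 4} \end{pmatrix}}_{G^{T}}
\cdot
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ p_1 \\ p_2 \end{pmatrix}
```

Because any 4 of the 6 codeword rows are linearly independent (the MDS property), any 2 lost cells can be recovered by inverting a 4x4 system, which is why this code tolerates 2 failures at 4/6 = 67% efficiency.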
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
                        Data Durability   Storage Efficiency
Single Replica                 0                100%
3-way Replication              2                 33%
XOR with 6 data cells          1                 86%
RS (6,3)                       3                 67%
RS (10,4)                      4                 71%
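A tiny sketch, in Java, of where these figures come from: a code with k data and m parity cells tolerates m failures at k/(k+m) efficiency, while r-way replication tolerates r-1 failures at 1/r efficiency.

```java
// Reproduces the table above from the two simple formulas.
public class RedundancyMath {
    static String ec(int k, int m) {
        return String.format("durability=%d, efficiency=%.0f%%", m, 100.0 * k / (k + m));
    }
    static String replication(int r) {
        return String.format("durability=%d, efficiency=%.0f%%", r - 1, 100.0 / r);
    }
    public static void main(String[] args) {
        System.out.println("3-way replication: " + replication(3)); // durability=2, efficiency=33%
        System.out.println("XOR(6,1):          " + ec(6, 1));       // durability=1, efficiency=86%
        System.out.println("RS(6,3):           " + ec(6, 3));       // durability=3, efficiency=67%
        System.out.println("RS(10,4):          " + ec(10, 4));      // durability=4, efficiency=71%
    }
}
```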
EC in Distributed Storage
Contiguous Block Layout:
Data Locality 👍🏻  Small Files 👎🏻
(diagram: each 128 MB range of the file, 0~128M, 128~256M, …, 640~768M, maps one-to-one to block0…block5 on DataNodes 0…5, with parity blocks on additional DataNodes such as DataNode 6)
EC in Distributed Storage
Striped Block Layout:
Data Locality 👎🏻  Small Files 👍🏻  Parallel I/O 👍🏻
(diagram: the file is split into small cells, 0~1M, 1~2M, …, 5~6M, 6~7M, …, striped round-robin across block0…block5 on DataNodes 0…5, with parity blocks on additional DataNodes; logical blocks still cover 0~128M, 128~256M, …)
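A simplified sketch of the striped-layout arithmetic, assuming a 1 MB cell size and 6 data blocks per group as in the diagram above (the real HDFS client also handles block groups, parity generation, and failure handling):

```java
// Simplified sketch of striped layout math (illustrative, not the HDFS client):
// with 6 data blocks and 1 MB cells, logical byte ranges are dealt round-robin
// across block0..block5; parity blocks are added per block group.
public class StripedLayoutDemo {
    static final int DATA_BLOCKS = 6;
    static final long CELL_SIZE = 1L << 20;   // 1 MB striping cell

    public static void main(String[] args) {
        long[] offsets = {0, CELL_SIZE, 5 * CELL_SIZE, 6 * CELL_SIZE, 7 * CELL_SIZE};
        for (long offset : offsets) {
            long cellIndex = offset / CELL_SIZE;            // which 1 MB cell
            long blockIndex = cellIndex % DATA_BLOCKS;      // which data block / DataNode
            long offsetInBlock = (cellIndex / DATA_BLOCKS) * CELL_SIZE + offset % CELL_SIZE;
            System.out.printf("logical %4d MB -> block%d, offset %d MB in block%n",
                offset / CELL_SIZE, blockIndex, offsetInBlock / CELL_SIZE);
        }
        // Cells 0~1M and 6~7M both land on block0 (round-robin wrap-around),
        // matching the diagram above; cell 5~6M lands on block5.
    }
}
```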
EC in Distributed Storage
Spectrum:
(diagram: systems placed along two axes, replication vs. erasure coding and contiguous vs. striped layout; Ceph and Quantcast File System appear on both sides of the spectrum, HDFS uses replication with the contiguous layout, and Facebook f4 and Windows Azure use erasure coding)
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Choosing Block Layout
Assuming (6,3) coding: Small files are < 1 block, Medium files are 1~6 blocks, Large files are > 6 blocks (more than one full group). (A sketch of this classification follows the cluster profiles below.)
Cluster A Profile: file count 96.29% small, 1.86% medium, 1.85% large; space usage 26.06% small, 9.33% medium, 64.61% large. Top 2% of files occupy ~65% of space.
Cluster B Profile: file count 86.59% small, 11.38% medium, 2.03% large; space usage 23.89% small, 36.03% medium, 40.08% large. Top 2% of files occupy ~40% of space.
Cluster C Profile: file count 99.64% small, 0.36% medium, 0.00% large; space usage 76.05% small, 20.75% medium, 3.20% large. Dominated by small files.
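For reference, a minimal sketch of the size classification used in these profiles, assuming a 128 MB block size and (6,3) coding as stated above:

```java
// Classify a file for the profiles above: with 128 MB blocks and (6,3) coding,
// "small" is under one block, "medium" is 1~6 blocks, and "large" is more
// than one full group of 6 data blocks.
public class FileSizeClass {
    static final long BLOCK_SIZE = 128L << 20;       // 128 MB
    static final int DATA_BLOCKS_PER_GROUP = 6;      // (6,3) coding

    static String classify(long fileBytes) {
        if (fileBytes < BLOCK_SIZE) {
            return "small";
        } else if (fileBytes <= DATA_BLOCKS_PER_GROUP * BLOCK_SIZE) {
            return "medium";
        } else {
            return "large";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(10L << 20));      // small  (10 MB)
        System.out.println(classify(500L << 20));     // medium (500 MB)
        System.out.println(classify(2L << 30));       // large  (2 GB)
    }
}
```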
Choosing Block Layout
(diagram; recoverable label: "Current HDFS")
Generalizing the Block in the NameNode
(diagram: mapping logical and storage blocks; too many storage blocks? handled by a hierarchical block naming protocol, sketched below)
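The speaker notes for this slide describe the hierarchical block naming protocol: the block ID carries a layout flag, and for striped blocks the remaining bits split into the logical block ID and the index of the storage block within it. The sketch below illustrates that idea; the field widths chosen here are assumptions for illustration, not the exact HDFS encoding.

```java
// Rough sketch of the hierarchical block naming idea from the speaker notes.
// The exact bit layout in HDFS may differ; here we assume the top bit flags a
// striped block and the low 4 bits carry the storage-block index within its
// block group (these widths are illustrative assumptions).
public class HierarchicalBlockId {
    static final long STRIPED_FLAG = 1L << 63;   // layout flag: striped = 1
    static final int INDEX_BITS = 4;             // assumed width of the index field
    static final long INDEX_MASK = (1L << INDEX_BITS) - 1;

    // Build the ID of the i-th storage block of a striped logical block (group).
    static long storageBlockId(long logicalBlockId, int indexInGroup) {
        return STRIPED_FLAG | (logicalBlockId << INDEX_BITS) | indexInGroup;
    }

    // The NameNode tracks only the logical block; mask off the index to find it.
    static long logicalBlockIdOf(long storageBlockId) {
        return (storageBlockId & ~STRIPED_FLAG) >>> INDEX_BITS;
    }

    static int indexOf(long storageBlockId) {
        return (int) (storageBlockId & INDEX_MASK);
    }

    public static void main(String[] args) {
        long id = storageBlockId(42, 7);            // 8th storage block of logical block 42
        System.out.println(logicalBlockIdOf(id));   // 42
        System.out.println(indexOf(id));            // 7
    }
}
```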
Client Parallel Writing
(diagram: one streamer and queue per storage block, managed by a Coordinator)
Client Parallel Reading
(diagram: the client reads data blocks in parallel and uses parity blocks for reconstruction when needed)
Reconstruction on DataNode
 Important to avoid delay on the critical path
- Especially if original data is lost
 Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
 New ErasureCodingWorker component on DataNode
Data Checksum Support
 Supports getFileChecksum for EC striped files
- Checksums are comparable between striped files with the same content
- Checksums of a contiguous file and a striped file cannot be compared
- Missing blocks can be reconstructed on the fly while computing the checksum
 Planning to introduce a new version of getFileChecksum
- To make checksums comparable between contiguous and striped files
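For reference, a minimal client-side sketch of getFileChecksum (standard Hadoop FileSystem API; the paths are hypothetical). As noted above, the resulting checksum is only comparable with that of another striped file holding the same content.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: compare checksums of two EC (striped) files with the same
// content. Paths are hypothetical; comparing against a replicated (contiguous)
// file would not work with the current getFileChecksum, as noted above.
public class EcChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            FileChecksum a = fs.getFileChecksum(new Path("/ec-dir/copy1.dat"));
            FileChecksum b = fs.getFileChecksum(new Path("/ec-dir/copy2.dat"));
            System.out.println("Checksums match: " + a.equals(b));
        }
    }
}
```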
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Acceleration with Intel ISA-L
 1 legacy coder
- From Facebook’s HDFS-RAID project
 2 new coders
- Pure Java — code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
Why is ISA-L Fast?
- Pre-computed and reused
- Parallel operation
- Direct ByteBuffer
Microbenchmark: Codec Calculation
(charts; results summarized in the speaker notes below)
Microbenchmark: HDFS I/O
(charts; results summarized in the speaker notes below)
DFSIO / MapReduce
Hive-on-MR — locality sensitive
Hive-on-Spark — locality sensitive
Conclusion
 Erasure coding expands effective storage space by ~50%!
 HDFS-EC phase I implements erasure coding in striped block layout
 Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
 Phase II will support contiguous block layout for better locality
Acknowledgements
 Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
 Intel
- Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
 Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
 Huawei
- Vinayakumar B, Walter Su, Xinwei Qin
 Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Questions?
Zhe Zhang, LinkedIn
zhz@apache.org | @oldcap
http://zhe-thoughts.github.io/
Uma Gangumalla, Intel
umamahesh@apache.org
@UmaMaheswaraG
http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
Come See us at Intel - Booth 305
“Amazing Analytics from Silicon to Software”
• Intel powers analytics solutions that are optimized for
performance and security from silicon to software
• Intel unleashes the potential of Big Data to enable
advancement in healthcare/life sciences, retail,
manufacturing, telecom and financial services
• Intel accelerates advanced analytics and machine learning
solutions
Twitter #HS16SJ
LinkedIn Hadoop
Dali: LinkedIn’s Logical
Data Access Layer for
Hadoop
Meetup Thu 6/30
6~9PM @LinkedIn
2nd floor, Unite room
2025 Stierlin Ct
Mountain View
Dr. Elephant: performance
monitoring and tuning.
SFHUG in Aug
Backup
Editor's Notes
  • #2: Simply put, it doubles the storage capacity of your cluster. This talk explains how it happens. Blog post link.
  • #3: When the GFS paper was published more than a decade ago, the objective was to store a massive amount of data on a large number of cheap commodity machines. A breakthrough design was to rely on machine-level replication to protect against machine failures, instead of xxx.
  • #4: A more efficient approach to reliably store data is erasure coding. Here’s a simplified example
  • #6: In this talk I will introduce how we implemented erasure coding in HDFS.
  • #9: RS uses more sophisticated linear algebra operations to generate multiple parity cells, and thus can tolerate multiple failures per group. It works by multiplying a vector of k data cells with a Generator Matrix (GT) to generate an extended codeword vector with k data cells and m parity cells. In this particular example, it combines the strong durability of replication and the high efficiency of simple XOR. More importantly, it is flexible.
  • #11: To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
  • #12: To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
  • #13: Non-trivial trade-offs between x and x, y and y.
  • #15: In all cases, the savings from EC would be significantly lower if it were applied only to large files. In some cases, there would be no savings at all.
  • #17: The former represents a logical byte range in a file, while the latter is the basic unit of data chunks stored on a DataNode. In the example, the file /tmp/foo is logically divided into 13 striping cells (cell_0 through cell_12). Logical block 0 represents the logical byte range of cells 0~8, and logical block 1 represents cells 9~12. Cells 0, 3, 6 form a storage block, which will be stored as a single chunk of data on a DataNode. To reduce this overhead we have introduced a new hierarchical block naming protocol. Currently HDFS allocates block IDs sequentially based on block creation time. This protocol instead divides each block ID into 2~3 sections, as illustrated in Figure 7. Each block ID starts with a flag indicating its layout (contiguous=0, striped=1). For striped blocks, the rest of the ID consists of two parts: the middle section with ID of the logical block and the tail section representing the index of a storage block in the logical block. This allows the NameNode to manage a logical block as a summary of its storage blocks. Storage block IDs can be mapped to their logical block by masking the index; this is required when the NameNode processes DataNode block reports.
  • #25: Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
  • #26: Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
  • #27: We also compared end-to-end HDFS I/O performance with these different coders against HDFS’s default scheme of three-way replication. The tests were performed on a cluster with 11 nodes (1 NameNode, 9 DataNodes, 1 client node) interconnected with a 10 GigE network. Figure 9 shows the throughput results of 1) a client writing a 12GB file to HDFS; and 2) a client reading a 12GB file from HDFS. In the reading tests we manually killed two DataNodes so the results include decoding overhead. As shown in Figure 9, in both the sequential write and read benchmarks, throughput is greatly constrained by the pure Java coders (HDFS-RAID and our own implementation). The ISA-L implementation is much faster than the pure Java coders because of its excellent CPU efficiency. It also outperforms replication by 2-3x because the striped layout allows the client to perform I/O with multiple DataNodes in parallel, leveraging the aggregate bandwidth of their disk drives. We have also tested read performance without any DataNode failure: HDFS-EC is roughly 5x faster than three-way replication.
  • #33: Phase 2 backup slide
  • #36: Accelerating Advanced Analytics and Machine Learning Solutions “Accelerated Machine Learning and Big Data Analytics Applications (through or with) the Trusted Analytics Platform” Need tagline for Machine Learning: need ML tagline from Nidhi Accelerating Machine Learning Applications and Big Data deployments through Trusted Analytics Platform Remove TAP tagline. Or say “Accelerate analytics on Big Data with Trusted Analytics Platform” Last line lessen words: just say “visit us at booth 409”