Debunking the Myths of
HDFS Erasure Coding Performance
Replication is Expensive
 HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
 200% storage overhead
 Secondary replicas rarely accessed
Erasure Coding Saves Storage
 Simplified Example: storing 2 bits
 Same data durability
- can lose any 1 bit
 Half the storage overhead
 Slower recovery
Replication: store 1 0, plus a copy 1 0 (2 extra bits)
XOR Coding: store 1 0, plus parity 1 ⊕ 0 = 1 (1 extra bit)
Erasure Coding Saves Storage
 Facebook
- f4 stores 65PB of BLOBs in EC
 Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
 Google File System
- Large portion of data stored in EC
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
3-way Replication: Data Durability = 2, Storage Efficiency = 1/3 (33%)
(one replica is useful data; the two extra replicas are redundant data)
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
XOR: Data Durability = 1, Storage Efficiency = 2/3 (67%)
(two cells of useful data, one parity cell of redundant data)
X  Y  X ⊕ Y
0  0    0
0  1    1
1  0    1
1  1    0
Recovery example: if Y is lost, Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
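To make the XOR recovery concrete, here is a minimal, self-contained sketch in Java (illustrative only, not HDFS code): the parity cell is the XOR of the data cells, and any single lost cell is rebuilt by XOR-ing the survivors with the parity.

```java
// Minimal illustration of XOR parity (not HDFS code): one parity array
// protects the data arrays against the loss of any single array.
public class XorParityDemo {
    // Encode: parity[i] = data0[i] ^ data1[i] ^ ... for every byte position.
    static byte[] encode(byte[][] data) {
        byte[] parity = new byte[data[0].length];
        for (byte[] cell : data) {
            for (int i = 0; i < cell.length; i++) {
                parity[i] ^= cell[i];
            }
        }
        return parity;
    }

    // Recover a single lost cell by XOR-ing the surviving cells with the parity.
    static byte[] recover(byte[][] survivingData, byte[] parity) {
        byte[] lost = parity.clone();
        for (byte[] cell : survivingData) {
            for (int i = 0; i < cell.length; i++) {
                lost[i] ^= cell[i];
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        byte[] x = {0};                                  // X = 0
        byte[] y = {1};                                  // Y = 1
        byte[] parity = encode(new byte[][]{x, y});      // X ⊕ Y = 1
        // Pretend Y is lost; rebuild it: Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
        byte[] recovered = recover(new byte[][]{x}, parity);
        System.out.println("Recovered Y = " + recovered[0]);   // prints 1
    }
}
```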
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
Reed-Solomon (RS):
Data Durability = 2
Storage Efficiency = 4/6 (67%)
Very flexible!
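As the speaker notes for this slide explain, RS multiplies a vector of k data cells by a generator matrix to produce k data cells plus m parity cells. A schematic of the systematic form for the RS example above (k = 4 data cells, m = 2 parity cells; the actual matrix and field arithmetic used by HDFS are not shown here):

```latex
% Systematic RS encoding: the generator matrix is the identity stacked on a
% parity matrix P, so the data cells appear unchanged in the codeword.
\underbrace{\begin{pmatrix} I_{4} \\ P_{2\times 4} \end{pmatrix}}_{G^{T}}
\cdot
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ p_1 \\ p_2 \end{pmatrix}
```

Because any 4 of the 6 codeword rows are linearly independent (the MDS property), any 2 lost cells can be recovered by inverting a 4x4 system, which is why this code tolerates 2 failures at 4/6 = 67% efficiency.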
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What portion of storage is useful data?
                        Data Durability   Storage Efficiency
Single Replica                 0                100%
3-way Replication              2                 33%
XOR with 6 data cells          1                 86%
RS (6,3)                       3                 67%
RS (10,4)                      4                 71%
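A tiny sketch, in Java, of where these figures come from: a code with k data and m parity cells tolerates m failures at k/(k+m) efficiency, while r-way replication tolerates r-1 failures at 1/r efficiency.

```java
// Reproduces the table above from the two simple formulas.
public class RedundancyMath {
    static String ec(int k, int m) {
        return String.format("durability=%d, efficiency=%.0f%%", m, 100.0 * k / (k + m));
    }
    static String replication(int r) {
        return String.format("durability=%d, efficiency=%.0f%%", r - 1, 100.0 / r);
    }
    public static void main(String[] args) {
        System.out.println("3-way replication: " + replication(3)); // durability=2, efficiency=33%
        System.out.println("XOR(6,1):          " + ec(6, 1));       // durability=1, efficiency=86%
        System.out.println("RS(6,3):           " + ec(6, 3));       // durability=3, efficiency=67%
        System.out.println("RS(10,4):          " + ec(10, 4));      // durability=4, efficiency=71%
    }
}
```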
EC in Distributed Storage
Contiguous Block Layout:
Data Locality 👍🏻  Small Files 👎🏻
(diagram: each 128 MB range of the file, 0~128M, 128~256M, …, 640~768M, maps one-to-one to block0…block5 on DataNodes 0…5, with parity blocks on additional DataNodes such as DataNode 6)
EC in Distributed Storage
Striped Block Layout:
Data Locality 👎🏻  Small Files 👍🏻  Parallel I/O 👍🏻
(diagram: the file is split into small cells, 0~1M, 1~2M, …, 5~6M, 6~7M, …, striped round-robin across block0…block5 on DataNodes 0…5, with parity blocks on additional DataNodes; logical blocks still cover 0~128M, 128~256M, …)
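A simplified sketch of the striped-layout arithmetic, assuming a 1 MB cell size and 6 data blocks per group as in the diagram above (the real HDFS client also handles block groups, parity generation, and failure handling):

```java
// Simplified sketch of striped layout math (illustrative, not the HDFS client):
// with 6 data blocks and 1 MB cells, logical byte ranges are dealt round-robin
// across block0..block5; parity blocks are added per block group.
public class StripedLayoutDemo {
    static final int DATA_BLOCKS = 6;
    static final long CELL_SIZE = 1L << 20;   // 1 MB striping cell

    public static void main(String[] args) {
        long[] offsets = {0, CELL_SIZE, 5 * CELL_SIZE, 6 * CELL_SIZE, 7 * CELL_SIZE};
        for (long offset : offsets) {
            long cellIndex = offset / CELL_SIZE;            // which 1 MB cell
            long blockIndex = cellIndex % DATA_BLOCKS;      // which data block / DataNode
            long offsetInBlock = (cellIndex / DATA_BLOCKS) * CELL_SIZE + offset % CELL_SIZE;
            System.out.printf("logical %4d MB -> block%d, offset %d MB in block%n",
                offset / CELL_SIZE, blockIndex, offsetInBlock / CELL_SIZE);
        }
        // Cells 0~1M and 6~7M both land on block0 (round-robin wrap-around),
        // matching the diagram above; cell 5~6M lands on block5.
    }
}
```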
EC in Distributed Storage
Spectrum:
(diagram: systems placed along two axes, replication vs. erasure coding and contiguous vs. striped layout; Ceph and Quantcast File System appear on both sides of the spectrum, HDFS uses replication with the contiguous layout, and Facebook f4 and Windows Azure use erasure coding)
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Choosing Block Layout
Assuming (6,3) coding: Small files are < 1 block, Medium files are 1~6 blocks, Large files are > 6 blocks (more than one full group). (A sketch of this classification follows the cluster profiles below.)
Cluster A Profile: file count 96.29% small, 1.86% medium, 1.85% large; space usage 26.06% small, 9.33% medium, 64.61% large. Top 2% of files occupy ~65% of space.
Cluster B Profile: file count 86.59% small, 11.38% medium, 2.03% large; space usage 23.89% small, 36.03% medium, 40.08% large. Top 2% of files occupy ~40% of space.
Cluster C Profile: file count 99.64% small, 0.36% medium, 0.00% large; space usage 76.05% small, 20.75% medium, 3.20% large. Dominated by small files.
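For reference, a minimal sketch of the size classification used in these profiles, assuming a 128 MB block size and (6,3) coding as stated above:

```java
// Classify a file for the profiles above: with 128 MB blocks and (6,3) coding,
// "small" is under one block, "medium" is 1~6 blocks, and "large" is more
// than one full group of 6 data blocks.
public class FileSizeClass {
    static final long BLOCK_SIZE = 128L << 20;       // 128 MB
    static final int DATA_BLOCKS_PER_GROUP = 6;      // (6,3) coding

    static String classify(long fileBytes) {
        if (fileBytes < BLOCK_SIZE) {
            return "small";
        } else if (fileBytes <= DATA_BLOCKS_PER_GROUP * BLOCK_SIZE) {
            return "medium";
        } else {
            return "large";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(10L << 20));      // small  (10 MB)
        System.out.println(classify(500L << 20));     // medium (500 MB)
        System.out.println(classify(2L << 30));       // large  (2 GB)
    }
}
```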
Choosing Block Layout
(diagram; recoverable label: "Current HDFS")
Generalizing the Block in the NameNode
(diagram: mapping logical and storage blocks; too many storage blocks? handled by a hierarchical block naming protocol, sketched below)
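The speaker notes for this slide describe the hierarchical block naming protocol: the block ID carries a layout flag, and for striped blocks the remaining bits split into the logical block ID and the index of the storage block within it. The sketch below illustrates that idea; the field widths chosen here are assumptions for illustration, not the exact HDFS encoding.

```java
// Rough sketch of the hierarchical block naming idea from the speaker notes.
// The exact bit layout in HDFS may differ; here we assume the top bit flags a
// striped block and the low 4 bits carry the storage-block index within its
// block group (these widths are illustrative assumptions).
public class HierarchicalBlockId {
    static final long STRIPED_FLAG = 1L << 63;   // layout flag: striped = 1
    static final int INDEX_BITS = 4;             // assumed width of the index field
    static final long INDEX_MASK = (1L << INDEX_BITS) - 1;

    // Build the ID of the i-th storage block of a striped logical block (group).
    static long storageBlockId(long logicalBlockId, int indexInGroup) {
        return STRIPED_FLAG | (logicalBlockId << INDEX_BITS) | indexInGroup;
    }

    // The NameNode tracks only the logical block; mask off the index to find it.
    static long logicalBlockIdOf(long storageBlockId) {
        return (storageBlockId & ~STRIPED_FLAG) >>> INDEX_BITS;
    }

    static int indexOf(long storageBlockId) {
        return (int) (storageBlockId & INDEX_MASK);
    }

    public static void main(String[] args) {
        long id = storageBlockId(42, 7);            // 8th storage block of logical block 42
        System.out.println(logicalBlockIdOf(id));   // 42
        System.out.println(indexOf(id));            // 7
    }
}
```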
Client Parallel Writing
(diagram: one streamer and queue per storage block, managed by a Coordinator)
Client Parallel Reading
(diagram: the client reads data blocks in parallel and uses parity blocks for reconstruction when needed)
Reconstruction on DataNode
 Important to avoid delay on the critical path
- Especially if original data is lost
 Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
 New ErasureCodingWorker component on DataNode
Data Checksum Support
 Supports getFileChecksum for EC striped files
- Checksums are comparable between striped files with the same content
- Checksums of a contiguous file and a striped file cannot be compared
- Missing blocks can be reconstructed on the fly while computing the checksum
 Planning to introduce a new version of getFileChecksum
- To make checksums comparable between contiguous and striped files
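For reference, a minimal client-side sketch of getFileChecksum (standard Hadoop FileSystem API; the paths are hypothetical). As noted above, the resulting checksum is only comparable with that of another striped file holding the same content.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: compare checksums of two EC (striped) files with the same
// content. Paths are hypothetical; comparing against a replicated (contiguous)
// file would not work with the current getFileChecksum, as noted above.
public class EcChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            FileChecksum a = fs.getFileChecksum(new Path("/ec-dir/copy1.dat"));
            FileChecksum b = fs.getFileChecksum(new Path("/ec-dir/copy2.dat"));
            System.out.println("Checksums match: " + a.equals(b));
        }
    }
}
```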
Roadmap
 Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
 HDFS-EC architecture
 Hardware-accelerated Codec Framework
 Performance Evaluation
Acceleration with Intel ISA-L
 1 legacy coder
- From Facebook’s HDFS-RAID project
 2 new coders
- Pure Java — code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
Why is ISA-L Fast?
- Pre-computed and reused
- Parallel operation
- Direct ByteBuffer
Microbenchmark: Codec Calculation
(charts; results summarized in the speaker notes below)
Microbenchmark: HDFS I/O
(charts; results summarized in the speaker notes below)
DFSIO / MapReduce
Hive-on-MR — locality sensitive
Hive-on-Spark — locality sensitive
Conclusion
 Erasure coding expands effective storage space by ~50%!
 HDFS-EC phase I implements erasure coding in striped block layout
 Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
 Phase II will support contiguous block layout for better locality
Acknowledgements
 Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
 Intel
- Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
 Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
 Huawei
- Vinayakumar B, Walter Su, Xinwei Qin
 Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Questions?
Zhe Zhang, LinkedIn
zhz@apache.org | @oldcap
http://zhe-thoughts.github.io/
Uma Gangumalla, Intel
umamahesh@apache.org
@UmaMaheswaraG
http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/
Come See us at Intel - Booth 305
“Amazing Analytics from Silicon to Software”
• Intel powers analytics solutions that are optimized for
performance and security from silicon to software
• Intel unleashes the potential of Big Data to enable
advancement in healthcare/life sciences, retail,
manufacturing, telecom and financial services
• Intel accelerates advanced analytics and machine learning
solutions
Twitter #HS16SJ
LinkedIn Hadoop
Dali: LinkedIn’s Logical
Data Access Layer for
Hadoop
Meetup Thu 6/30
6~9PM @LinkedIn
2nd floor, Unite room
2025 Stierlin Ct
Mountain View
Dr. Elephant: performance
monitoring and tuning.
SFHUG in Aug
Backup
Editor's Notes
  • #2: Simply put, it doubles the storage capacity of your cluster. This talk explains how it happens. Blog post link.
  • #3: When the GFS paper was published more than a decade ago, the objective was to store a massive amount of data on a large number of cheap commodity machines. A breakthrough design was to rely on machine-level replication to protect against machine failures, instead of xxx.
  • #4: A more efficient approach to reliably store data is erasure coding. Here’s a simplified example
  • #6: In this talk I will introduce how we implemented erasure coding in HDFS.
  • #9: RS uses more sophisticated linear algebra operations to generate multiple parity cells, and thus can tolerate multiple failures per group. It works by multiplying a vector of k data cells with a Generator Matrix (GT) to generate an extended codeword vector with k data cells and m parity cells. In this particular example, it combines the strong durability of replication and the high efficiency of simple XOR. More importantly, it is flexible.
  • #11: To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
  • #12: To manage potentially very large files, distributed storage systems usually divide files into fixed-size logical byte ranges called logical blocks. These logical blocks are then mapped to storage blocks on the cluster, which reflect the physical layout of data on the cluster. The simplest mapping between logical and storage blocks is a contiguous block layout, which maps each logical block one-to-one to a storage block. Reading a file with a contiguous block layout is as easy as reading each storage block linearly in sequence.
  • #13: Non-trivial trade-offs between x and x, y and y.
  • #15: In all cases, the savings from EC would be significantly lower if it were applied only to large files. In some cases, there would be no savings at all.
  • #17: The former represents a logical byte range in a file, while the latter is the basic unit of data chunks stored on a DataNode. In the example, the file /tmp/foo is logically divided into 13 striping cells (cell_0 through cell_12). Logical block 0 represents the logical byte range of cells 0~8, and logical block 1 represents cells 9~12. Cells 0, 3, 6 form a storage block, which will be stored as a single chunk of data on a DataNode. To reduce this overhead we have introduced a new hierarchical block naming protocol. Currently HDFS allocates block IDs sequentially based on block creation time. This protocol instead divides each block ID into 2~3 sections, as illustrated in Figure 7. Each block ID starts with a flag indicating its layout (contiguous=0, striped=1). For striped blocks, the rest of the ID consists of two parts: the middle section with ID of the logical block and the tail section representing the index of a storage block in the logical block. This allows the NameNode to manage a logical block as a summary of its storage blocks. Storage block IDs can be mapped to their logical block by masking the index; this is required when the NameNode processes DataNode block reports.
  • #25: Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
  • #26: Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.
  • #27: We also compared end-to-end HDFS I/O performance with these different coders against HDFS’s default scheme of three-way replication. The tests were performed on a cluster with 11 nodes (1 NameNode, 9 DataNodes, 1 client node) interconnected with a 10 GigE network. Figure 9 shows the throughput results of 1) a client writing a 12GB file to HDFS; and 2) a client reading a 12GB file from HDFS. In the reading tests we manually killed two DataNodes so the results include decoding overhead. As shown in Figure 9, in both the sequential write and read benchmarks, throughput is greatly constrained by the pure Java coders (HDFS-RAID and our own implementation). The ISA-L implementation is much faster than the pure Java coders because of its excellent CPU efficiency. It also outperforms replication by 2-3x because the striped layout allows the client to perform I/O with multiple DataNodes in parallel, leveraging the aggregate bandwidth of their disk drives. We have also tested read performance without any DataNode failure: HDFS-EC is roughly 5x faster than three-way replication.
  • #33: Phase 2 backup slide
  • #36: Accelerating Advanced Analytics and Machine Learning Solutions “Accelerated Machine Learning and Big Data Analytics Applications (through or with) the Trusted Analytics Platform” Need tagline for Machine Learning: need ML tagline from Nidhi Accelerating Machine Learning Applications and Big Data deployments through Trusted Analytics Platform Remove TAP tagline. Or say “Accelerate analytics on Big Data with Trusted Analytics Platform” Last line lessen words: just say “visit us at booth 409”