Dr. C.V. Suresh Babu
DESIGN OF HADOOP DISTRIBUTED FILE SYSTEM
(Centre for Knowledge Transfer) Institute
DISCUSSION TOPICS
 Hadoop Distributed File System (HDFS)
 How does HDFS work?
 HDFS Architecture
 Features of HDFS
 Benefits of using HDFS
 Examples: Target Marketing
 HDFS data replication
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
 The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop
applications.
 HDFS employs a NameNode and DataNode architecture to implement a distributed file system that
provides high-performance access to data across highly scalable Hadoop clusters.
 Hadoop itself is an open source distributed processing framework that manages data processing and
storage for big data applications.
 HDFS is a key component of the broader Hadoop ecosystem of technologies.
 It provides a reliable means for managing pools of big data and supporting related big data analytics
applications.
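As a concrete illustration of the NameNode/DataNode split, the minimal Java sketch below connects a client to HDFS through the standard org.apache.hadoop.fs.FileSystem API. The NameNode address (namenode-host:8020) is a placeholder assumption, not something from the original slides; the client asks the NameNode for metadata, while file bytes travel directly between the client and DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode-host:8020" is a placeholder for your cluster's
        // NameNode address; the client contacts the NameNode for
        // metadata, then reads/writes blocks directly on DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```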
HOW DOES HDFS WORK?
 HDFS enables the rapid transfer of data between compute nodes.
 It is closely coupled with MapReduce, a data processing framework that filters and divides up work among the nodes in a cluster, then organizes and condenses the results into a cohesive answer to a query.
 Similarly, when HDFS takes in data, it breaks the information down into separate blocks and distributes
them to different nodes in a cluster.
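To make the block-splitting behavior concrete, here is a minimal Java sketch that writes a file to HDFS. The NameNode address and file path are illustrative assumptions; files larger than the configured block size (128 MB is the default in recent Hadoop releases) are transparently split into blocks and spread across DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 128 MB is the HDFS default block size; it is set explicitly
        // here only to make the splitting behavior visible.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // Anything written past the block-size boundary lands in a new
        // block, which HDFS places on (possibly different) DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            out.writeBytes("HDFS splits this file into blocks behind the scenes.\n");
        }
        fs.close();
    }
}
```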
HDFS ARCHITECTURE
[Architecture diagram slide: a single NameNode manages the file system namespace and block metadata, while DataNodes store and serve the replicated file blocks.]
FEATURES OF HDFS
Data replication. Replication ensures that data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within the cluster, so processing continues while the data is recovered.
Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across nodes in a large
cluster ensures fault tolerance and reliability.
High availability. As mentioned earlier, because of replication across nodes, data is available even if the NameNode or a DataNode fails.
Scalability. Because HDFS stores data on various nodes in the cluster, a cluster can scale to hundreds or even thousands of nodes as requirements increase.
High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, combined with data locality (see next bullet), cuts processing time and enables high throughput.
Data locality. With HDFS, computation happens on the DataNodes where the data resides, rather than
having the data move to where the computational unit is. By minimizing the distance
between the data and the computing process, this approach decreases network
congestion and boosts a system's overall throughput.
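The data locality point can be seen directly through the FileSystem.getFileBlockLocations() call, which reports the hosts holding each block of a file; schedulers use exactly this information to place computation on the nodes that already hold the data. The cluster address and path in this sketch are placeholder assumptions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // List where each block of a file physically lives; a scheduler
        // such as YARN uses this to run tasks near the data.
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```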
BENEFITS OF USING HDFS
 Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-shelf hardware,
which cuts storage costs. Also, because HDFS is open source, there's no licensing fee.
 Large data set storage. HDFS stores a variety of data of any size -- from megabytes to petabytes --
and in any format, including structured and unstructured data.
 Fast recovery from hardware failure. HDFS is designed to detect faults and automatically recover on its
own.
 Portability. HDFS is portable across hardware platforms, and it is compatible with several operating systems, including Windows, Linux and macOS.
 Streaming data access. HDFS is built for high data throughput, which makes it well suited to streaming access to large data sets.
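As a sketch of the streaming access pattern, the following Java snippet reads a file from HDFS sequentially with a fixed buffer, which is the access style HDFS is optimized for; the NameNode address and path are placeholder assumptions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // Sequential, buffered reads are what HDFS is tuned for;
        // many small random reads are comparatively expensive.
        byte[] buffer = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, bytesRead);
            }
        }
        fs.close();
    }
}
```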
EXAMPLES: TARGET MARKETING
 Targeted marketing campaigns depend on marketers knowing a lot about their
target audiences.
 Marketers can get this information from several sources, including CRM systems,
direct mail responses, point-of-sale systems, Facebook and Twitter.
 Because much of this data is unstructured, an HDFS cluster is a cost-effective place to land data before analyzing it.
HDFS DATA REPLICATION
 Data replication is an important part of HDFS's design, as it ensures data remains available if there's a node or hardware failure.
 As previously mentioned, the data is divided into blocks and replicated
across numerous nodes in the cluster.
 Therefore, when one node goes down, the user can access the data that
was on that node from other machines.
 HDFS monitors block replication continuously and re-replicates any blocks that fall below the target replication factor.
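A minimal Java sketch of working with the replication factor, assuming a placeholder NameNode address and file path: it reads a file's current replication factor (the HDFS default is 3) and raises it, after which the NameNode schedules the additional copies in the background.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        Path file = new Path("/data/sample.txt");

        // Read the current per-file replication factor (default is 3),
        // then raise it; the NameNode schedules the extra copies.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);
        fs.setReplication(file, (short) 4);
        fs.close();
    }
}
```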