Dr. C.V. Suresh Babu
DESIGN OF HADOOP DISTRIBUTED FILE SYSTEM
(Centre for Knowledge Transfer) Institute
DISCUSSION TOPICS
 Hadoop Distributed File System (HDFS)
 How does HDFS work?
 HDFS Architecture
 Features of HDFS
 Benefits of using HDFS
 Examples: Target Marketing
 HDFS data replication
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
 The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop
applications.
 HDFS employs a NameNode and DataNode architecture to implement a distributed file system that
provides high-performance access to data across highly scalable Hadoop clusters.
 Hadoop itself is an open source distributed processing framework that manages data processing and
storage for big data applications.
 HDFS is a key component of the broader Hadoop ecosystem of technologies.
 It provides a reliable means for managing pools of big data and supporting related big data analytics
applications.
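As a concrete illustration of the NameNode/DataNode split, the minimal Java sketch below connects a client to HDFS through the standard org.apache.hadoop.fs.FileSystem API. The NameNode address (namenode-host:8020) is a placeholder assumption, not something from the original slides; the client asks the NameNode for metadata, while file bytes travel directly between the client and DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsConnect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode-host:8020" is a placeholder for your cluster's
        // NameNode address; the client contacts the NameNode for
        // metadata, then reads/writes blocks directly on DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```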
HOW DOES HDFS WORK?
 HDFS enables the rapid transfer of data between compute nodes.
 It is closely coupled with MapReduce, a data processing framework that filters and divides up work among the nodes in a cluster, then organizes and condenses the results into a cohesive answer to a query.
 Similarly, when HDFS takes in data, it breaks the information down into separate blocks and distributes
them to different nodes in a cluster.
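To make the block-splitting behavior concrete, here is a minimal Java sketch that writes a file to HDFS. The NameNode address and file path are illustrative assumptions; files larger than the configured block size (128 MB is the default in recent Hadoop releases) are transparently split into blocks and spread across DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 128 MB is the HDFS default block size; it is set explicitly
        // here only to make the splitting behavior visible.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // Anything written past the block-size boundary lands in a new
        // block, which HDFS places on (possibly different) DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
            out.writeBytes("HDFS splits this file into blocks behind the scenes.\n");
        }
        fs.close();
    }
}
```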
HDFS ARCHITECTURE
[Architecture diagram slide: a single NameNode manages the file system namespace and block metadata, while DataNodes store and serve the replicated file blocks.]
FEATURES OF HDFS
Data replication. Replication ensures that data is always available and prevents data loss. For example, when a node crashes or there is a hardware failure, replicated data can be pulled from elsewhere within the cluster, so processing continues while the data is recovered.
Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them across nodes in a large
cluster ensures fault tolerance and reliability.
High availability. As mentioned earlier, because of replication across nodes, data is available even if the NameNode or a DataNode fails.
Scalability. Because HDFS stores data on various nodes in the cluster, a cluster can scale to hundreds or even thousands of nodes as requirements increase.
High throughput. Because HDFS stores data in a distributed manner, the data can be processed in parallel on a cluster of nodes. This, combined with data locality (see next bullet), cuts processing time and enables high throughput.
Data locality. With HDFS, computation happens on the DataNodes where the data resides, rather than
having the data move to where the computational unit is. By minimizing the distance
between the data and the computing process, this approach decreases network
congestion and boosts a system's overall throughput.
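The data locality point can be seen directly through the FileSystem.getFileBlockLocations() call, which reports the hosts holding each block of a file; schedulers use exactly this information to place computation on the nodes that already hold the data. The cluster address and path in this sketch are placeholder assumptions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // List where each block of a file physically lives; a scheduler
        // such as YARN uses this to run tasks near the data.
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```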
BENEFITS OF USING HDFS
 Cost effectiveness. The DataNodes that store the data rely on inexpensive off-the-shelf hardware,
which cuts storage costs. Also, because HDFS is open source, there's no licensing fee.
 Large data set storage. HDFS stores a variety of data of any size -- from megabytes to petabytes --
and in any format, including structured and unstructured data.
 Fast recovery from hardware failure. HDFS is designed to detect faults and automatically recover on its
own.
 Portability. HDFS is portable across hardware platforms, and it is compatible with several operating systems, including Windows, Linux and macOS.
 Streaming data access. HDFS is built for high data throughput, which makes it well suited to streaming access to large data sets.
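As a sketch of the streaming access pattern, the following Java snippet reads a file from HDFS sequentially with a fixed buffer, which is the access style HDFS is optimized for; the NameNode address and path are placeholder assumptions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // Sequential, buffered reads are what HDFS is tuned for;
        // many small random reads are comparatively expensive.
        byte[] buffer = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, bytesRead);
            }
        }
        fs.close();
    }
}
```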
EXAMPLES: TARGET MARKETING
 Targeted marketing campaigns depend on marketers knowing a lot about their
target audiences.
 Marketers can get this information from several sources, including CRM systems,
direct mail responses, point-of-sale systems, Facebook and Twitter.
 Because much of this data is unstructured, an HDFS cluster is a cost-effective place to land data before analyzing it.
HDFS DATA REPLICATION
 Data replication is an important part of HDFS's design, as it ensures data remains available if there's a node or hardware failure.
 As previously mentioned, the data is divided into blocks and replicated
across numerous nodes in the cluster.
 Therefore, when one node goes down, the user can access the data that
was on that node from other machines.
 HDFS monitors block replication continuously and re-replicates any blocks that fall below the target replication factor.
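A minimal Java sketch of working with the replication factor, assuming a placeholder NameNode address and file path: it reads a file's current replication factor (the HDFS default is 3) and raises it, after which the NameNode schedules the additional copies in the background.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        Path file = new Path("/data/sample.txt");

        // Read the current per-file replication factor (default is 3),
        // then raise it; the NameNode schedules the extra copies.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);
        fs.setReplication(file, (short) 4);
        fs.close();
    }
}
```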