MANAGING BIG DATA WITH HADOOP
Presented by: Nalini Mehta
Student (MLVTEC Bhilwara)
Email: nalinimehta52@gmail.com
Introduction
Big Data:
• Big data is a term used to describe voluminous amounts of unstructured and semi-structured data.
• It is data that would take too much time and cost too much money to load into a relational database for analysis.
• Big data does not refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
Managing Big data with Hadoop
General framework of Big Data Networking
 The driving forces behind the implementation of Big Data are infrastructure and analytics, which together constitute the software.
 Hadoop is the Big Data management software used to distribute, catalogue, manage and query data across multiple, horizontally scaled server nodes.
Managing Big Data
Overview of Hadoop
• Hadoop is a platform for processing large amounts of data in a distributed fashion.
• It provides a scheduling and resource management framework to execute the map and reduce phases in a cluster environment.
• The Hadoop Distributed File System (HDFS) is Hadoop’s data storage layer, designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
Hadoop Cluster
• DataNode - The DataNodes are the repositories for the data; each consists of multiple smaller database infrastructures.
• Client - The client represents the user interface to the big data implementation and query engine. The client could be a server or a PC with a traditional user interface.
• NameNode - The NameNode is the equivalent of an address router, tracking the location of every DataNode.
• Job Tracker - The JobTracker is the software tracking mechanism that distributes and aggregates search queries across multiple nodes for ultimate client analysis.
Apache Hadoop
• Apache Hadoop is an open source distributed software platform for storing and processing data.
• It is a framework for running applications on large clusters built of commodity hardware.
• A common way of avoiding data loss is replication: redundant copies of the data are kept by the system so that, in the event of a failure, another copy is available. The Hadoop Distributed File System (HDFS) takes care of this problem.
• MapReduce is a simple programming model for processing and generating large data sets.
What is MapReduce?
 MapReduce is a programming model.
 Programs written in this style are automatically parallelized and executed on a large cluster of commodity machines.
 Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

MapReduce
MAP: the map function processes a key/value pair to generate a set of intermediate key/value pairs.
REDUCE: the reduce function merges all intermediate values associated with the same intermediate key.
The Programming Model Of MapReduce 
 Map, written by the user, takes an input pair and produces a set of 
intermediate key/value pairs. The MapReduce library groups 
together all intermediate values associated with the same 
intermediate key and passes them to the Reduce function.
 The Reduce function, also written by the user, accepts 
an intermediate key and a set of values for that key. 
It merges together these values to form a possibly 
smaller set of values.
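To make the model concrete, below is a minimal word-count job sketched against Hadoop's org.apache.hadoop.mapreduce API. It is an illustrative sketch rather than anything shown on the slides: the class names, the choice of word counting, and the command-line input/output paths are all assumptions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1) intermediate pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: merge all intermediate values for the same key by summing them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework parallelizes the map calls over input splits, groups the intermediate (word, 1) pairs by key, and invokes reduce once per distinct word.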
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
 Apache Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.
 HDFS is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
 HDFS is designed for scalability and fault tolerance and provides APIs for MapReduce applications to read and write data in parallel.
 The capacity and performance of HDFS can be scaled by adding DataNodes, while a single NameNode manages data placement and monitors server availability.
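As a hedged illustration of how a client reads and writes data through HDFS, the sketch below uses Hadoop's FileSystem API; the file path is made up for the example, and the cluster address is assumed to come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS is read from the cluster configuration (core-site.xml),
    // so the same code runs against a local or a fully distributed cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/sample.txt");  // hypothetical path

    // Write: the client streams bytes, the NameNode chooses block placement,
    // and the DataNodes store the replicated blocks.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: blocks are fetched from whichever DataNodes hold replicas.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```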
Assumptions and Goals
1. Hardware Failure
• An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data.
• There are a huge number of components, and each component has a non-trivial probability of failure.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
2. Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• HDFS is designed more for batch processing than for interactive use by users.
• The emphasis is on high throughput of data access rather than low latency of data access.
3. Large Data Sets
• A typical file in HDFS is gigabytes to terabytes in size.
• Thus, HDFS is tuned to support large files.
• It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
4. Simple Coherency Model 
• HDFS applications need a write-once-read-many access model for files. 
• A file once created, written, and closed need not be changed. 
• This assumption simplifies data coherency issues and enables high 
throughput data access. 
5. “Moving Computation is Cheaper than Moving 
Data” 
• A computation requested by an application is much more efficient if it is 
executed near the data it operates on when the size of the data set is huge. 
• This minimizes network congestion and increases the overall throughput of 
the system. 
6. Portability across Heterogeneous Hardware and 
Software Platforms 
• HDFS has been designed to be easily portable from one platform to 
another. This facilitates widespread adoption of HDFS as a platform of 
choice for a large set of applications.
Concepts of HDFS:
NameNode and DataNodes 
 An HDFS cluster has two 
types of node operating in 
a master-slave pattern: a 
NameNode (the master) 
and a number of 
DataNodes (slaves). 
 The NameNode manages 
the file system 
namespace. It maintains 
the file system tree and 
the metadata for all the 
files and directories in the 
tree. 
 Internally, a file is split into 
one or more blocks and 
these blocks are stored in 
a set of DataNodes.
 The NameNode executes file system namespace 
operations like opening, closing, and renaming 
files and directories. 
 DataNodes store and retrieve blocks when they 
are told to (by clients or the NameNode), and they 
report back to the NameNode periodically with lists 
of blocks that they are storing. 
 The DataNodes also perform block creation, 
deletion, and replication upon instruction from the 
NameNode. 
 Without the NameNode, the file system cannot be 
used. In fact, if the machine running the 
NameNode were destroyed, all the files on the file 
system would be lost since there would be no way 
of knowing how to reconstruct the files from the 
blocks on the DataNodes.
File System Namespace 
 HDFS supports a traditional hierarchical file 
organization. A user or an application can create 
and remove files, move a file from one directory to 
another, rename a file, create directories and store 
files inside these directories. 
 The NameNode maintains the file system 
namespace. Any change to the file system 
namespace or its properties is recorded by the 
NameNode. 
 An application can specify the number of replicas of 
a file that should be maintained by HDFS. The 
number of copies of a file is called the replication 
factor of that file. This information is stored by the 
NameNode.
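The namespace operations listed above (creating directories, renaming and removing files, listing contents) correspond to calls on the same FileSystem API; a brief sketch follows, with directory and file names invented purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Create a directory, move (rename) a file into it, then list the contents.
    // These are metadata operations recorded by the NameNode.
    fs.mkdirs(new Path("/user/demo/reports"));
    fs.rename(new Path("/user/demo/sample.txt"),
              new Path("/user/demo/reports/sample.txt"));

    for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
      System.out.println(status.getPath() + "  replication=" + status.getReplication());
    }

    // Remove a file; the second argument enables recursive deletion for directories.
    fs.delete(new Path("/user/demo/reports/sample.txt"), false);
  }
}
```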
Data Replication 
 The blocks of a file are replicated for fault 
tolerance. 
 The block size and replication factor are configurable per file. 
 The NameNode makes all decisions regarding 
replication of blocks. 
 A Block report contains a list of all blocks on a 
DataNode.
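As an illustration of the per-file replication factor, it can also be changed after a file is written through the FileSystem API; the path and the value 3 below are assumptions for the example, not figures from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/reports/sample.txt");  // hypothetical path

    // Ask the NameNode to keep 3 replicas of every block of this file;
    // it then schedules extra copies (or removals) on the DataNodes.
    boolean accepted = fs.setReplication(file, (short) 3);
    System.out.println("Replication change accepted: " + accepted);
  }
}
```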
Hadoop as a Service in the Cloud (HaaS):
 Hadoop is economical for large-scale, data-driven companies like Yahoo or Facebook.
 The ecosystem around Hadoop nowadays offers various tools like Hive and Pig that make Big Data processing accessible, focusing on what to do with the data while avoiding the complexity of programming.
 Consequently, a minimal Hadoop as a Service offering provides a managed Hadoop cluster, ready to use, without the need to configure or install any Hadoop-related services (JobTracker, TaskTracker, NameNode or DataNode) on any cluster node.
 Depending on the level of service, abstraction and tools provided, Hadoop as a Service (HaaS) can be placed in the cloud stack as a Platform or Software as a Service solution, between infrastructure services and cloud clients.
Limitations:
It places several requirements on the network:
 Data locality
 Distributed Hadoop nodes running jobs in parallel cause east-west network traffic that can be adversely affected by suboptimal network connectivity.
 The network should provide high bandwidth, low latency and any-to-any connectivity between the nodes for optimal Hadoop performance.
 Scale out
 Deployments might start with a small cluster and then scale out over time as the customer realizes initial success and the need grows.
 The underlying network architecture should also scale seamlessly with the Hadoop cluster and should provide predictable performance.
Conclusion
 The growth of communication and connectivity has led to the emergence of Big Data. Apache Hadoop is an open source framework that has become a de facto standard for big data platforms deployed today.
 To sum up, we conclude that promising progress has been made in the area of Big Data, but much remains to be done. Almost all proposed approaches have been evaluated only at a limited scale, and further research is required for large-scale evaluations.
References:
 White paper: Introduction to Big Data – Infrastructure and Network Considerations
 MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html
 White paper: Big Data Analytics [http://hadoop.intel.com]
 The Hadoop Distributed File System: Architecture and Design, by Dhruba Borthakur
 Big Data in the Enterprise, Cisco White Paper
 Cloudera capacity planning recommendations: http://www.cloudera.com/blog/2010/08/Hadoop-HBase-capacity-planning/
 Apache Hadoop Wiki Website: http://en.wikipedia.org/wiki/Apache_Hadoop
 Towards a Big Data Reference Architecture [www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf]