Sunderdeep Engineering College
Department of Computer Science
Session-2017-18
Topic: Hadoop
Submitted to: Mr. Ashutosh Rao, H.O.D., Dept. of CSE
Submitted by: Kamran Khan, B.Tech III Year
Contents
 Introduction
 What’s Big Data?
 3 V's of Big Data
 Problem & Solution
 What’s Hadoop?
 HDFS
 MapReduce
 Architecture of Hadoop
 Applications of Hadoop
 Pros & Cons of Hadoop
 Conclusion
 References
Introduction
 Apache Hadoop is an open-source, scalable, and fault-tolerant
framework written in Java. It efficiently processes large
volumes of data (Big Data) on a cluster of commodity hardware.
Hadoop is not only a storage system but a platform for both
large-scale data storage and processing.
 Created by Doug Cutting and Mike Cafarella in 2005.
 Doug named it after his son's toy elephant.
 Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
What’s the Problem?
What is Big Data?
Data that is very large in size is called Big
Data. Normally we work with data of MB size
(Word, Excel files) or at most GB (movies,
code), but data on the scale of petabytes, i.e.
10^15 bytes, is called Big Data. It is often
stated that almost 90% of today's data has
been generated in the past 5 years.
3 V's of Big Data
 Velocity: Data is being generated at a very fast rate. It is
estimated that the volume of data doubles every two years.
 Variety: Nowadays, data does not come only in rows and columns.
Data is both structured and unstructured: log files and CCTV
footage are unstructured data, while data that can be stored in
tables, such as a bank's transaction records, is structured.
 Volume: The amount of data we deal with is very large, on the
scale of petabytes.
So what is the problem??
Processing such large data in a relational
database is very difficult: it takes too much
time and costs too much.
Traditional Approach
 In this approach, an enterprise has a computer to store and process
big data. Data is stored in an RDBMS such as Oracle Database, MS
SQL Server, or DB2, and sophisticated software is written to interact
with the database, process the required data, and present it to users
for analysis.
 This approach works well with smaller volumes of data that standard
database servers can accommodate, or up to the limit of the
processor doing the processing. But when it comes to huge amounts
of data, processing it through a traditional database server becomes
a tedious task.
Google's Solution!!
 Google solved this problem using an algorithm called MapReduce.
This algorithm divides the task into small parts and assigns those
parts to many computers connected over the network, and collects
the results to form the final result dataset.
What is Hadoop?
The Apache Hadoop software library is a framework
that allows for the distributed processing of large
data sets across clusters of computers using simple
programming models.
 The Apache Software Foundation released Hadoop 1.0 in 2011.
 Written in Java.
Hadoop is open-source software that provides:
 A framework
 Massive storage
 Processing power
We can solve this problem through Distributed
Computing.
But distributed computing brings problems of its own:
 Hardware failure
There is always a chance of hardware failure.
 Combining the data after analysis
Data from all the disks has to be combined, which is messy.
Hadoop came to solve these problems.
It has two main parts:
 Hadoop Distributed File System (HDFS)
 A data processing framework: MapReduce
Hadoop Distributed File System
 It ties many small, reasonably priced machines together
into a single cost-effective computer cluster.
 Data and application processing are protected against
hardware failure.
 If a node goes down, jobs are automatically redirected to
other nodes so that the distributed computation does not
fail.
 It automatically stores multiple copies of all data.
 It provides a simplified programming model that lets users
quickly read and write files on the distributed system, as the
sketch after this list shows.
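As a concrete illustration of that model, here is a minimal sketch that writes a file into HDFS and reads it back through the standard Java FileSystem API. It assumes a running cluster whose address is picked up from core-site.xml on the classpath; the path /demo/hello.txt is invented for this example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // handle to the distributed file system

        Path file = new Path("/demo/hello.txt");   // hypothetical path for illustration

        // Write: HDFS transparently splits the file into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client fetches blocks from whichever DataNodes hold replicas.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```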
HDFS Architecture
 NameNode in HDFS Architecture is also known as the Master node. The HDFS NameNode
stores metadata, i.e. the number of data blocks, their replicas, and other details. This
metadata is kept in memory in the master for fast retrieval (the sketch after this list
queries it for a file's block locations). The NameNode maintains and manages the slave
nodes and assigns tasks to them. It should be deployed on reliable hardware, as it is the
centerpiece of HDFS.
 DataNode in HDFS Architecture is also known as a Slave. In the Hadoop HDFS architecture,
DataNodes store the actual data in HDFS and perform read and write operations as
requested by clients. DataNodes can be deployed on commodity hardware.
 In HDFS, when the NameNode starts, it first reads the HDFS state from an image file, the
FsImage. After that, it applies the edits from the edits log file, writes the new HDFS state
back to the FsImage, and then starts normal operation with an empty edits file. Because
the NameNode merges the FsImage and edits files only at start-up, the edits file can grow
very large over time, and a side effect of a large edits file is that the next restart of the
NameNode takes longer.
 The Secondary NameNode solves this issue. It downloads the FsImage and EditLogs from
the NameNode and merges the EditLogs with the FsImage (FileSystem Image), keeping the
edits log size within a limit. It stores the modified FsImage in persistent storage, which
can also be used in case of NameNode failure.
 The Secondary NameNode thus performs a regular checkpoint in HDFS.
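As a small illustration of that metadata, the following sketch asks the NameNode, via the standard FileSystem API, which blocks a file consists of and which DataNodes hold each replica. The path /demo/big-input.txt is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/big-input.txt"));

        // The NameNode answers this query from its in-memory metadata.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```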
 The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific
nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack.
 Client applications submit jobs to the JobTracker (see the driver sketch after this list).
 The JobTracker talks to the NameNode to determine the location of the data.
 The JobTracker locates TaskTracker nodes with available slots at or near the data.
 The JobTracker submits the work to the chosen TaskTracker nodes.
 The TaskTracker nodes are monitored. If they do not submit heartbeat signals often
enough, they are deemed to have failed and the work is scheduled on a
different TaskTracker.
 A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to
do then: it may resubmit the job elsewhere, it may mark that specific record as something
to avoid, or it may even blacklist the TaskTracker as unreliable.
 When the work is completed, the JobTracker updates its status.
 Client applications can poll the JobTracker for information.
 The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes
down, all running jobs are halted.
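To make the first step above concrete, here is a minimal driver sketch of what a client submits: it describes the job and hands it to the cluster's scheduling service (the JobTracker in classic MapReduce; YARN took over this role in Hadoop 2). The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative; the mapper and reducer themselves are sketched in the MapReduce section that follows.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Describe the job: which classes run, and the output key/value types.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // sketched below
        job.setReducerClass(WordCountReducer.class);   // sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths in HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and poll for its status until it finishes (or fails).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```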
MapReduce
MapReduce is a programming model and an associated
implementation for processing and generating large data
sets with a parallel, distributed algorithm on a cluster.
The user specifies two functions (sketched below):
 A MAP function that processes a key/value pair to generate
a set of intermediate key/value pairs.
 A REDUCE function that merges all intermediate values
associated with the same intermediate key.
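A minimal sketch of the two functions, using the canonical word count example (these pair with the hypothetical WordCountDriver above): map emits (word, 1) for every word in its input line, and reduce sums the counts that arrive grouped under the same word.

```java
// WordCountMapper.java -- map: (byte offset, line of text) -> (word, 1) pairs
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit intermediate pair (word, 1)
        }
    }
}
```

```java
// WordCountReducer.java -- reduce: (word, [1, 1, ...]) -> (word, total count)
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();             // merge all values for the same key
        }
        context.write(word, new IntWritable(sum));
    }
}
```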
Applications
Pros of Hadoop
 Computing power
 Flexibility
 Fault Tolerance
 Low Cost
 Scalability
Cons of Hadoop
 1. Integration with existing systems
Hadoop is not optimised for ease of use. Installing it and integrating it with existing
databases can prove difficult, especially since there is no software support provided.
 2. Administration and ease of use
Hadoop requires knowledge of MapReduce, while most data practitioners use SQL. This
means significant training may be required to administer Hadoop clusters.
 3. Security
Hadoop lacks the level of security functionality needed for safe enterprise deployment,
especially where sensitive data is concerned.
Conclusion:
Hadoop has been a very effective solution for companies
dealing with data in petabytes.
It has solved many problems in industry related to huge
data management and distributed systems.
As it is open source, it is widely adopted by companies.
References
https://www.knowledgehut.com/blog/bigdata-hadoop/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/hadoop-hdfs-architecture/
https://www.dezyre.com/article/hadoop-architecture-explained-what-it-is-and-why-it-matters/317
https://www.tutorialspoint.com/hadoop/index.htm
https://www.edureka.co/blog/hadoop-tutorial/
THANK YOU!!!