Krishnendu P
CONTENTS:
 Data and Big Data
 Problems with Big Data
 Hadoop
 Small History of Hadoop
 What problems can Hadoop solve?
 Components of Hadoop - HDFS, MapReduce
 Hadoop Cluster
 High Level Architecture of Hadoop
 Hadoop Core Components
 Features of Hadoop
 Limitations of Hadoop
 Users of Hadoop
 Conclusion
 References
Data:
➔ Any real-world symbol (a character, numeral, or special character) or a group of them is said to be data.
➔ It may be visual, audio, textual, etc.
Big Data
Big Data is a collection of large datasets that cannot be processed using on-hand database management tools or traditional computing techniques.
Big Data
Big Data involves huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:
Structured data : relational data.
Semi-structured data : XML data.
Unstructured data : Word, PDF, text.
Problems with Big Data:
➔ About 0.5 petabytes of updates, including 40 million photos, are made to Facebook daily.
➔ YouTube is loaded daily with enough video to be watched continuously for one year.
➔ Limitations are encountered due to large data sets in many areas, including genomics, complex physics simulations, and biological and environmental research.
Cont...
➔ Big Data also affects Internet search, finance, and business informatics.
➔ The challenges include capture, retrieval, storage, search, sharing, analysis, and visualization.
What could be the solution for
Big Data ?
Hadoop
What is hadoop ?
➔Hadoop is an open source, Java-based
programming framework developed by Doug
Cutting and Mike Cafarella in 2005.
➔It is part of the Apache project sponsored by the
Apache Software Foundation.
➔ It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Cont...
➔It is used for distributed storage and distributed
processing of very large data sets on computer
clusters built from commodity hardware.
Small History
➔Hadoop was inspired by Google's MapReduce, a
software framework in which an application is
broken down into numerous small parts.
➔ Any of these parts (also called fragments or blocks) can be run on any node in the cluster.
➔Doug Cutting, Hadoop's creator, named the
framework after his child's stuffed toy elephant.
Small History
➔ Started with building a web search engine
- Nutch in 2002
- Aim was to index billions of pages.
- The architecture couldn't support billions of pages.
➔ Google's GFS in 2003 solved the storage problem.
- Nutch Distributed File System in 2004.
➔ Google's MapReduce in 2004
- MapReduce implemented in 2005.
Doug Cutting with Hadoop
Mike Cafarella
2005: Doug Cutting and Mike Cafarella developed Hadoop
to support distribution for the Nutch search engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to Apache
Software Foundation.
Now Apache Hadoop is a registered trademark of the
Apache Software Foundation.
What problems can Hadoop solve?
The Hadoop platform was designed to solve problems where you have a lot of data (perhaps a mixture of complex and structured data) that doesn't fit well into tables.
Components Of Hadoop
Hadoop consists of MapReduce, the Hadoop
distributed file system (HDFS) and a number of
related projects such as Apache Hive, HBase and
Zookeeper.
HADOOP
HDFS MapReduce
HDFS (Hadoop Distributed File System)
➔The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on
commodity hardware.
➔ It is a sub-project of the Apache Hadoop project.
➔ HDFS is highly fault-tolerant and is designed to
be deployed on low-cost hardware.
➔HDFS provides high throughput access to
application data and is suitable for applications
that have large data sets.
Cont...
➔ HDFS takes care of storing and managing the data within the Hadoop cluster.
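The storage idea above can be sketched as a toy simulation: HDFS splits a file into fixed-size blocks and replicates each block across several data nodes. This is an illustration only, not the real HDFS API; the 64 MB block size, replication factor of 3, round-robin placement, and node names (dn1, dn2, ...) are simplifying assumptions (real clusters configure these via dfs.blocksize and dfs.replication and use rack-aware placement).

```python
# Toy model of HDFS block splitting and replica placement.
# Not the real HDFS API; sizes, names, and placement are assumptions.

def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes of the blocks a file would be split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200)            # a 200 MB file
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(len(blocks), nodes)
print(blocks)                              # [64, 64, 64, 8]
print(placement[0])                        # ['dn1', 'dn2', 'dn3']
```

Note that the last block is smaller than the block size: HDFS does not pad files out to a full block.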
MapReduce
➔ MapReduce is a programming model used for processing large data sets.
➔Programs written in this functional style are
automatically parallelized and executed on a large
cluster of commodity machines.
➔MapReduce is an associated implementation for
processing and generating large data sets.
MapReduce
A MapReduce program executes in two stages, namely the map stage and the reduce stage.
Map stage :
The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
MapReduce
A MapReduce program executes in two stages, namely the map stage and the reduce stage.
Reduce stage :
The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
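The two stages above can be simulated in a few lines of pure Python using the classic word-count example. Real Hadoop jobs are typically written in Java against the MapReduce API; this sketch only illustrates the data flow (map, then a shuffle that groups values by key, then reduce), and the input lines are made up for illustration.

```python
# Minimal simulation of the MapReduce data flow (word count).
from collections import defaultdict

def mapper(line):
    # Map stage: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Between the stages the framework groups all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce stage: combine the values for one key into a final result.
    return (key, sum(values))

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in mapper(line)]
results = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(results)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the mappers and reducers run in parallel on many machines, with the shuffle moving data between them over the network.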
MapReduce
Hadoop Core components
MASTER NODE
SLAVE NODE
Name node
Data node
Job tracker
Task tracker
Storage node Compute node
Cont...
Node :
A technical term for a machine or computer that is present in a cluster.
Daemon :
A technical term for a background process running on a Linux machine.
Cont...
➔ The Master node is responsible for running the Name node and Job tracker daemons.
➔ The Slave node is responsible for running the Data node and Task tracker daemons.
Cont...
➔Name node and Data node are responsible
for storing and managing the data, and they
are commonly referred to as Storage Node.
➔Job Tracker and Task Tracker are
responsible for processing and computing the
data, and they are commonly referred to as
Compute Node.
Cont..
➔ Usually the Name node and Job tracker are configured on a single machine.
➔ The Data node and Task tracker are configured on multiple machines, and can have instances running on more than one machine at the same time.
Hadoop Cluster
➔ Normally any set of loosely connected or tightly
connected computers that work together as a single
system is called Cluster.
➔ In simple words, a computer cluster used for Hadoop
is called Hadoop Cluster.
Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing vast amounts of unstructured data in a distributed computing environment. These clusters run on low-cost commodity computers.
Hadoop Cluster
➔ Hadoop clusters are often referred to as "shared nothing" systems because the only thing shared between nodes is the network that connects them.
➔ Clustering improves the system's availability to users.
Hadoop Cluster
A Real Time Example:
Here is a picture of Yahoo's Hadoop cluster. They
have more than 10,000 machines running Hadoop
and nearly 1 petabyte of user data.
Features of Hadoop
● Scalability :
Scalability refers to the ability to add or remove nodes without bringing down or affecting cluster operation.
Features of Hadoop
● Cost effective :
Hadoop does not require any expensive specialized hardware. In other words, it can be implemented on simple hardware; these components are technically called commodity hardware.
Features of Hadoop
● Large Cluster of Nodes :
A Hadoop cluster can be made up of hundreds or thousands of nodes. One of the main advantages of a large cluster is that it offers more computing power and a huge storage system to clients.
Features of Hadoop
● Parallel Processing of Data :
The data can be processed simultaneously across all the nodes within the cluster, thus saving a lot of time.
Features of Hadoop
● Automatic Failover Management :
If any of the nodes within the cluster fails, the Hadoop framework will replace that particular machine with another machine.
● Flexible :
Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources.
● Fault-tolerant :
When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
Features of Hadoop
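The fault-tolerance feature can be illustrated with a toy read path: because every block is replicated, a reader that finds one data node down simply falls back to another replica. The node and block names here are hypothetical and the logic is a deliberate simplification of what HDFS actually does.

```python
# Toy illustration of fault tolerance via replication.
# Hypothetical node/block names; not real Hadoop behavior.
replicas = {"block0": ["dn1", "dn2", "dn3"]}
failed = {"dn1"}  # pretend this data node has crashed

def read_block(block):
    """Read a block from the first replica whose node is still alive."""
    for node in replicas[block]:
        if node not in failed:
            return f"read {block} from {node}"
    raise IOError(f"all replicas of {block} unavailable")

print(read_block("block0"))  # read block0 from dn2
```

A real cluster goes further: the Name node notices the dead node via missed heartbeats and re-replicates its blocks elsewhere to restore the replication factor.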
Limitations of Hadoop
● Security concerns
● Vulnerable by nature
● Not fit for Small data
● Potential stability issues
What is Hadoop used for?
● Search
– Yahoo, Amazon, Zvents
● Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm
● Recommendation Systems
– Facebook
● Data Warehouse
– Facebook, AOL (America Online)
● Video and Image Analysis
– New York Times, Eyealike
Conclusion
➔ Hadoop has been very effective for companies dealing with data in petabytes.
➔ It has solved many problems in industry related to huge data management and distributed systems.
➔ As it is open source, it has been widely adopted by companies.
References
● www.dezyre.com/Big-Data-and-Hadoop
● www.cloudera.com/content/www/...hadoop/hdfs-mapreduce-yarn.html
● www.ufaber.com/hadoop/bigbata/free
● www.psgtech.edu/yrgcc/attach/haoop_architecture.ppt