Machine
Learning Basics
An Introduction
In a farm far away…
Jack harvests grapes and sells them in the nearby town.
After harvesting, he stores the produce in a storage room.
Soon there is high demand for other fruits, so he starts harvesting apples and oranges as well.
He then realizes that harvesting all the fruits by himself is difficult and time-consuming.
So he hires two more people to work with him, and the harvesting is now done simultaneously.
Now the single storage room becomes a bottleneck: everyone has to store and fetch fruit in the same place.
Jack decides to distribute the storage area and give each worker a separate storage space.
A customer orders: "Hello, I want a fruit basket of 3 grapes, 2 apples and 3 oranges."
To complete the order on time, all three work in parallel, each from their own storage space.
This solution helps them deliver the fruit basket on time without any hassle.
All of them are happy, and they are prepared for an increase in demand in the future.
So, how does this story
relate to Big Data?
The rise of Big Data
Earlier, with limited and mostly structured data, one processor and one storage unit were enough.
Soon, data generation increased, leading to a high volume of data in different formats: structured, semi-structured and unstructured.
A single processor could not process such a high volume of varied data in a reasonable time.
Hence, multiple processors were used to process the data in parallel, which saved time.
The single storage unit then became the bottleneck, because every processor had to reach it over the network, which generated network overhead.
The solution was to give each processor its own distributed storage, which made it easy to store and access data.
This method worked, and the network overhead of the shared storage unit was avoided.
This is known as parallel processing with distributed storage.
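The idea of parallel processing with distributed storage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the worker names and fruit data are hypothetical, and threads on one machine stand in for separate machines that each hold their own partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each worker owns its own partition ("distributed storage").
partitions = {
    "worker-1": ["grape", "grape", "apple"],
    "worker-2": ["orange", "apple", "grape"],
    "worker-3": ["orange", "orange", "apple"],
}


def count_local(items):
    """Count the items held in one worker's own partition."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


def merge(partials):
    """Combine the small per-worker results into one total."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total


# Each worker processes only its local partition, at the same time as the others.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(count_local, partitions.values()))

totals = merge(partial_counts)
print(totals)  # {'grape': 3, 'apple': 3, 'orange': 3}
```

Only the small per-worker counts are combined at the end; the raw items never leave the partition that stores them.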
What’s in it for you?
1. Big Data and its challenges
2. Hadoop as a solution
3. What is Hadoop?
4. Components of Hadoop
5. Use case of Hadoop
What is Big Data?
Big Data is a massive amount of data that cannot be stored, processed and analyzed using traditional methods.
The five V's of Big Data: Volume, Velocity, Variety, Value, Veracity
Big Data challenges and solution
Challenge: Single central storage
Solution: Distributed storage
Challenge: Serial processing (one input at a time)
Solution: Parallel processing (several inputs at once)
Challenge: Lack of ability to process unstructured data
Solution: Ability to process every type of data
Hadoop as a solution
Hadoop addresses all three challenges: it offers distributed storage, parallel processing, and the ability to process every type of data.
What is Hadoop?
Hadoop is a framework that stores Big Data in a distributed way and processes it in parallel.
It covers three tasks: storing, processing and analyzing Big Data.
Components of Hadoop
HDFS: storage unit of Hadoop
MapReduce: processing unit of Hadoop
What is HDFS?
Hadoop Distributed File System (HDFS) is designed for storing huge datasets on commodity hardware, using distributed storage.
HDFS has two core components: NameNode and DataNode.
In the basic setup there is only one NameNode.
There can be multiple DataNodes.
Master/slave nodes typically form the HDFS cluster: one master (the NameNode) and several slaves (the DataNodes).
The NameNode maintains and manages the DataNodes. It also stores the metadata.
The DataNodes store the actual data and handle reading, writing and processing. They also perform replication.
A heartbeat is the signal that each DataNode sends to the NameNode at regular intervals. It tells the NameNode the status of the DataNode.
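The heartbeat mechanism can be illustrated with a toy model. This is a minimal sketch, not Hadoop code: the class `NameNodeSketch`, its method names, and the 30-second timeout are hypothetical and chosen only for illustration (they are not Hadoop defaults).

```python
HEARTBEAT_TIMEOUT_S = 30.0  # illustrative value, not a Hadoop default


class NameNodeSketch:
    """Toy model: remembers when each DataNode last sent a heartbeat."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode id -> time of last heartbeat (seconds)

    def receive_heartbeat(self, datanode_id, now):
        """Record that a DataNode reported in at time `now`."""
        self.last_seen[datanode_id] = now

    def live_datanodes(self, now):
        """DataNodes whose last heartbeat is within the timeout."""
        return sorted(d for d, t in self.last_seen.items()
                      if now - t <= self.timeout_s)

    def dead_datanodes(self, now):
        """DataNodes that have been silent for longer than the timeout."""
        return sorted(d for d, t in self.last_seen.items()
                      if now - t > self.timeout_s)


namenode = NameNodeSketch()
namenode.receive_heartbeat("datanode-1", now=0.0)
namenode.receive_heartbeat("datanode-2", now=0.0)
namenode.receive_heartbeat("datanode-1", now=25.0)  # datanode-2 goes silent

live = namenode.live_datanodes(now=40.0)
dead = namenode.dead_datanodes(now=40.0)
print(live, dead)  # ['datanode-1'] ['datanode-2']
```

Once a DataNode is considered dead, a real cluster would re-replicate the blocks it held; that step is left out of this sketch.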
In HDFS, data is stored in a distributed manner.
Example: 30 TB of data is loaded into HDFS.
The data is divided into blocks of 128 MB each.
The blocks are then replicated among the DataNodes.
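The arithmetic behind the 30 TB example can be worked out as follows. This is a rough sketch under stated assumptions: the replication factor of 3 and the four DataNode names are not given in the slides, and the round-robin placement in `place_replicas` is a deliberate simplification (real HDFS placement also takes racks into account).

```python
MB = 1024 ** 2
TB = 1024 ** 4

file_size = 30 * TB        # the 30 TB file from the example
block_size = 128 * MB      # block size stated in the slides
replication_factor = 3     # assumption: not stated in the slides

# Ceiling division: a partially filled last block still counts as a block.
num_blocks = -(-file_size // block_size)
total_replicas = num_blocks * replication_factor
raw_storage_tb = file_size * replication_factor // TB


def place_replicas(block_index, datanodes, replication):
    """Pick `replication` distinct DataNodes for a block, round-robin."""
    return [datanodes[(block_index + i) % len(datanodes)]
            for i in range(replication)]


datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names

print(num_blocks)       # 245760
print(total_replicas)   # 737280
print(raw_storage_tb)   # 90
print(place_replicas(0, datanodes, replication_factor))  # ['dn1', 'dn2', 'dn3']
print(place_replicas(2, datanodes, replication_factor))  # ['dn3', 'dn4', 'dn1']
```

Because every block lives on several DataNodes, losing one node does not lose the data, at the cost of storing each byte several times.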
Features of HDFS
Provides distributed storage
Implemented on commodity hardware
Provides data security
Highly fault tolerant
What is MapReduce?
Hadoop MapReduce is a programming technique in which huge amounts of data are processed in a parallel and distributed fashion.
MapReduce is used for parallel processing of the Big Data stored in HDFS.
In the MapReduce approach, processing is done at the slave nodes and the final result is sent to the master node.
Traditional approach: data is processed at the master node.
MapReduce approach: data is processed at the slave nodes.
Input:
Bus Car Train
Ship Ship Train
Bus Ship Car
Split: the input dataset is first split into chunks of data, here one line per chunk.
Map phase: these chunks are then processed by map tasks in parallel. Each map task emits one "word, 1" pair per word:
Bus Car Train -> Bus, 1 / Car, 1 / Train, 1
Ship Ship Train -> Ship, 1 / Ship, 1 / Train, 1
Bus Ship Car -> Bus, 1 / Ship, 1 / Car, 1
Shuffle and sort: pairs with the same key are brought together:
Bus, 1 / Bus, 1
Car, 1 / Car, 1
Ship, 1 / Ship, 1 / Ship, 1
Train, 1 / Train, 1
Reduce phase: at the reduce task, the aggregation takes place and the final output is obtained:
Bus, 2
Car, 2
Ship, 3
Train, 2
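The word-count walkthrough above can be reproduced with a small, single-process Python sketch. It only mimics the split, map, shuffle-and-sort and reduce steps on one machine; the function names are illustrative and this is not the Hadoop MapReduce API.

```python
from collections import defaultdict

# The three input lines from the example; each line is treated as one split.
input_lines = ["Bus Car Train", "Ship Ship Train", "Bus Ship Car"]


def map_phase(line):
    """Emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(mapped):
    """Group all values by key and order the keys alphabetically."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Aggregate the grouped values into one count per word."""
    return {key: sum(values) for key, values in groups.items()}


mapped = [map_phase(line) for line in input_lines]
grouped = shuffle_and_sort(mapped)
result = reduce_phase(grouped)

print(grouped)  # {'Bus': [1, 1], 'Car': [1, 1], 'Ship': [1, 1, 1], 'Train': [1, 1]}
print(result)   # {'Bus': 2, 'Car': 2, 'Ship': 3, 'Train': 2}
```

In a real cluster the map calls would run on different nodes, close to the blocks they read, and the shuffle would move the intermediate pairs over the network to the reducers.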
Components of Hadoop version 2.0
HDFS: storage unit of Hadoop
MapReduce: processing unit of Hadoop
YARN: resource management unit of Hadoop
What is YARN?
YARN stands for Yet Another Resource Negotiator.
It acts like an operating system for Hadoop 2.
It is responsible for managing cluster resources.
It does job scheduling.
Clients submit job requests to the Resource Manager.
The Resource Manager is responsible for resource allocation and management.
A Node Manager manages its node and monitors resource usage.
A container is a collection of physical resources such as RAM and CPU.
The App Master requests containers from the Resource Manager and then asks the Node Managers to launch them.
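The division of labour in YARN can be sketched with a toy model. In YARN, container requests are granted by the Resource Manager, and the containers themselves run on the nodes that the Node Managers look after. Everything below is an illustrative assumption rather than the YARN API: the class names, the memory and vcore figures, and the first-fit allocation rule are hypothetical, and real YARN schedulers are considerably more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy node: tracks free resources and the containers running on it."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


@dataclass
class ResourceManager:
    """Toy cluster-wide allocator with a simple first-fit rule."""
    nodes: list

    def allocate(self, app_id, memory_mb, vcores):
        """Grant a container on the first node with enough free resources."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                container = (app_id, node.name, memory_mb, vcores)
                node.containers.append(container)
                return container
        return None  # no node can host the container right now


nodes = [NodeManager("node-1", 4096, 2), NodeManager("node-2", 8192, 4)]
rm = ResourceManager(nodes)

# An App Master for the hypothetical job "app-1" asks for three containers.
first = rm.allocate("app-1", memory_mb=2048, vcores=2)
second = rm.allocate("app-1", memory_mb=2048, vcores=1)
third = rm.allocate("app-1", memory_mb=16384, vcores=1)

print(first)   # ('app-1', 'node-1', 2048, 2)
print(second)  # ('app-1', 'node-2', 2048, 1)
print(third)   # None
```

The second request lands on node-2 because node-1 has no free vcores left, and the third is refused because no node has 16 GB free.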
Hadoop use case – Combating fraudulent activities
Detecting fraudulent transactions is one of the many problems every bank faces.
Challenge: Zions’ main challenge was to combat the fraudulent activities that were taking place.
Approaches used by Zions’ security team to combat fraudulent activities:
Security information management (SIM) tools. Problem: they were based on an RDBMS and could not store the huge amount of data that needed to be analyzed.
Parallel processing system. Problem: analyzing unstructured data was not possible.
How Hadoop solved the problems:
Storing: Zions could now store massive amounts of data using Hadoop.
Processing: processing of unstructured data (such as server logs, customer data and customer transactions) was now possible.
Analyzing: in-depth analysis of different data formats became easy and time-efficient.
Detecting: the team could now detect everything from malware and spear phishing attempts to account takeovers.
Key Takeaways
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop Tutorial | Simplilearn

What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop Tutorial | Simplilearn

  • 2. In a farm far away…
  • 3. Jack harvests grapes and then sells it in the nearby town
  • 4. After harvesting, he then stores the produce in a storage room
  • 5. Soon there was a high demand for other fruits. So, he started harvesting apples and oranges as well
  • 6. He then realizes that it is time consuming and difficult to harvest all the fruits by himself
  • 7. So, he hires 2 more people to work with him. With this, harvesting is done simultaneously
  • 8. Now, the storage room becomes a bottleneck to store and access all the fruits in a single storage area
  • 9. Jack now decides to distribute the storage area and give each one of them a separate storage space
  • 10. Hello, I want a fruit basket of 3 grapes, 2 apples and 3 oranges
  • 11. To complete the order on time, all of them work parallelly with their own storage space Hello, I want a fruit basket of 3 grapes, 2 apples and 3 oranges
  • 12. This solution helps them to complete the order on time without any hassles Fruit basket
  • 13. All of them are happy and they are prepared for an increase in demand in the future
  • 14. All of them are happy and they are prepared for an increase in demand in the future So, how does this story relate to Big Data?
  • 15. The rise of Big Data Structured data Earlier with limited data, only one processor and one storage unit was needed
  • 16. The rise of Big Data Structured data Semi structured data Unstructured data Soon, data generation increased leading to high volume of data along with different data formats
  • 17. The rise of Big Data Structured data Semi structured data Unstructured data A single processor was not enough to process such high volume of different kinds of data as it was very time consuming
  • 18. The rise of Big Data Structured data Semi structured data Unstructured data Hence, multiple processors were used to process high volume of data and this saved time
  • 19. The rise of Big Data Structured data Semi structured data Unstructured data The single storage unit became the bottleneck due to which network overhead was generated
  • 20. The rise of Big Data Structured data Semi structured data Unstructured data The solution was to use distributed storage for each processor. This enabled easy access to store and access data
  • 21. The rise of Big Data Structured data Semi structured data Unstructured data This method worked and there was no network overhead generated
  • 22. The rise of Big Data Structured data Semi structured data Unstructured data This is known as parallel processing with distributed storage
  • 23. The rise of Big Data Structured data Semi structured data Unstructured data This is known as parallel processing with distributed storage Parallel processing
  • 24. The rise of Big Data Structured data Semi structured data Unstructured data This is known as parallel processing with distributed storage Parallel processing Distributed storage
  • 25. What’s in it for you?
  • 26. What’s in it for you? 1. Big Data and it’s challenges1
  • 27. What’s in it for you? 1. Big Data and it’s challenges1 1. Hadoop as a solution2
  • 28. What’s in it for you? 1. Big Data and it’s challenges1 1. Hadoop as a solution2 1. What is Hadoop?3
  • 29. What’s in it for you? 1. Big Data and it’s challenges1 1. Hadoop as a solution2 1. What is Hadoop?3 1. Components of Hadoop4
  • 30. What’s in it for you? 1. Big Data and it’s challenges1 1. Hadoop as a solution2 1. What is Hadoop?3 1. Components of Hadoop4 1. Use case of Hadoop5
  • 31. What is Big Data?
  • 32. What is Big Data? Massive amount of data which cannot be stored, processed and analyzed using the traditional ways
  • 33. What is Big Data? Massive amount of data which cannot be stored, processed and analyzed using the traditional ways VERACITY BIG DATA VELOCITY VOLUME VARIETYVALUE VERACITY
  • 34. What is Big Data? Massive amount of data which cannot be stored, processed and analyzed using the traditional ways VERACITY BIG DATA VELOCITY VOLUME VARIETYVALUE VERACITY
  • 35. What is Big Data? Massive amount of data which cannot be stored, processed and analyzed using the traditional ways VERACITY BIG DATA VELOCITY VOLUME VARIETYVALUE VERACITY
  • 36. What is Big Data? Massive amount of data which cannot be stored, processed and analyzed using the traditional ways VERACITY BIG DATA VELOCITY VOLUME VARIETYVALUE VERACITY
  • 37. What is Big Data? Massive amount of data which cannot be stored, processed and analyzed using the traditional ways VERACITY BIG DATA VELOCITY VOLUME VARIETYVALUE VERACITY
  • 43. Big Data challenges and solutions. Challenge: single central storage; solution: distributed storage. Challenge: serial processing (one input is processed at a time to produce an output); solution: parallel processing (multiple inputs are processed at the same time). Challenge: lack of ability to process unstructured data; solution: ability to process every type of data.
  • 44. Hadoop as a solution: Hadoop provides all three solutions: distributed storage, parallel processing, and the ability to process every type of data.
  • 46. What is Hadoop? Hadoop is a framework that stores big data in a distributed way and processes it in parallel. It covers storing, processing, and analyzing Big Data.
  • 49. Components of Hadoop: a storage unit (HDFS) and a processing unit (MapReduce).
  • 51. What is HDFS? The Hadoop Distributed File System (HDFS) is designed for distributed storage of huge datasets on commodity hardware.
  • 54. What is HDFS? HDFS has two core components: the NameNode and the DataNode. In a basic cluster there is only one NameNode, while there can be multiple DataNodes.
  • 60. What is HDFS? Master/slave nodes typically form the HDFS cluster: one master (the NameNode) and several slaves (the DataNodes). The NameNode maintains and manages the DataNodes and also stores the metadata. The DataNodes store the actual data, handle reading, writing, and processing, and perform replication as well. A heartbeat is the signal that each DataNode continuously sends to the NameNode; it shows the status of the DataNode.
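The heartbeat idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not HDFS code: the class `NameNodeMonitor`, its method names, and the 30-second timeout are all hypothetical and chosen for illustration. It shows one plausible way a master could decide which workers are alive from the time of their last signal.

```python
class NameNodeMonitor:
    """Toy model: tracks the last heartbeat time reported by each DataNode."""

    def __init__(self, timeout_seconds=30.0):
        # Assumption: a node is treated as dead after this many silent seconds.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def receive_heartbeat(self, datanode_id, now):
        # Record the time at which this DataNode last reported in.
        self.last_seen[datanode_id] = now

    def live_datanodes(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_seconds)

    def dead_datanodes(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_seconds)


monitor = NameNodeMonitor(timeout_seconds=30.0)
monitor.receive_heartbeat("datanode-1", now=0.0)
monitor.receive_heartbeat("datanode-2", now=0.0)
monitor.receive_heartbeat("datanode-1", now=25.0)  # datanode-2 stays silent

print(monitor.live_datanodes(now=40.0))  # ['datanode-1']
print(monitor.dead_datanodes(now=40.0))  # ['datanode-2']
```

Passing `now` explicitly instead of reading the system clock keeps the sketch deterministic and easy to test.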
  • 65. What is HDFS? In HDFS, data is stored in a distributed manner. Example: a 30 TB file is loaded. The data is divided into blocks of 128 MB each, the blocks are stored on the DataNodes, and the blocks are then replicated among the DataNodes.
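The arithmetic behind the 30 TB example can be checked with a short calculation. The sketch below assumes binary units (1 TB = 1024^4 bytes) and a replication factor of 3; the slide does not state a replication factor, so that value is an assumption, as is the helper name `hdfs_block_count`.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def hdfs_block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed to hold a file; the last block may be partial."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-file_size_bytes // block_size_bytes)


file_size = 30 * TB
blocks = hdfs_block_count(file_size)
replication = 3  # assumption: not given on the slide

print(blocks)                          # 245760 blocks of 128 MB
print(blocks * replication)            # 737280 block replicas in the cluster
print(file_size * replication // TB)   # 90 TB of raw storage
```

Under these assumptions the 30 TB file becomes 245,760 blocks, and storing three copies of each block takes 90 TB of raw disk across the DataNodes.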
  • 69. What is HDFS? Features of HDFS: provides distributed storage; implemented on commodity hardware; provides data security; highly fault tolerant.
  • 70. Components of Hadoop: HDFS is the storage unit of Hadoop; MapReduce is the processing unit of Hadoop.
  • 76. What is MapReduce? Hadoop MapReduce is a programming technique in which huge amounts of data are processed in a parallel and distributed fashion. MapReduce is used for parallel processing of the Big Data stored in HDFS: the data goes to the processors, which produce the output.
  • 79. What is MapReduce? In the MapReduce approach, processing is done at the slave nodes and the final result is sent to the master node. Traditional approach: data is processed at the master node. MapReduce approach: data is processed at the slave nodes.
  • 84. What is MapReduce? Word-count example. Input: three lines, "Bus Car Train", "Ship Ship Train", and "Bus Ship Car". Split: the input dataset is first split into chunks of data, here one chunk per line. Map phase: the chunks are processed by map tasks in parallel, and each task emits a (word, 1) pair for every word, for example (Ship, 1), (Ship, 1), (Train, 1) for the second chunk. Shuffle and sort: pairs with the same key are grouped together, giving Bus: 1, 1; Car: 1, 1; Ship: 1, 1, 1; Train: 1, 1. Reduce phase: at the reduce task, the aggregation takes place and the final output is obtained: Bus, 2; Car, 2; Ship, 3; Train, 2.
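The word-count walkthrough can be reproduced on a single machine. The sketch below is plain Python, not the Hadoop MapReduce API: the function names `map_task`, `shuffle_and_sort`, `reduce_task`, and `word_count` are hypothetical, and a thread pool stands in for map tasks running on separate nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Map phase: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle_and_sort(mapped):
    """Group the values of all map outputs by key, with keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce phase: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(chunks):
    # The thread pool is a stand-in for map tasks running in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, chunks))
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


chunks = ["Bus Car Train", "Ship Ship Train", "Bus Ship Car"]  # the split input
result = word_count(chunks)
print(result)  # {'Bus': 2, 'Car': 2, 'Ship': 3, 'Train': 2}
```

Each map task sees only its own chunk, so the chunks can be processed independently; only the shuffle step needs to see the output of every map task.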
  • 85. Components of Hadoop version 2.0: a storage unit (HDFS), a processing unit (MapReduce), and a resource management unit (YARN).
  • 87. What is YARN? YARN (Yet Another Resource Negotiator) acts like an operating system for Hadoop 2. It does job scheduling and is responsible for managing cluster resources.
  • 96. What is YARN? The client submits the job request. The Resource Manager is responsible for resource allocation and management. Each Node Manager manages its node and monitors resource usage. A container is a collection of physical resources such as RAM and CPU. The App Master requests containers from the Resource Manager and works with the Node Managers to launch them.
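The roles of the Resource Manager, Node Managers, and containers can be made concrete with a toy model. This is a sketch only and does not use or mirror the YARN API: the classes, the `allocate` method, and the first-fit placement rule are hypothetical simplifications (real YARN schedulers apply queue and fairness policies that are not modeled here).

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy node: tracks the free resources and the containers it hosts."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


@dataclass
class ResourceManager:
    """Toy allocator: grants containers using a simple first-fit rule."""
    nodes: list

    def allocate(self, app_id, memory_mb, vcores):
        """Return the name of the node that hosts the new container, or None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                # A container here is just a bundle of RAM and CPU for one app.
                node.containers.append((app_id, memory_mb, vcores))
                return node.name
        return None


rm = ResourceManager(nodes=[
    NodeManager("node-1", free_memory_mb=4096, free_vcores=2),
    NodeManager("node-2", free_memory_mb=8192, free_vcores=4),
])

# An application asks for three containers of 2 GB and 1 vcore each.
placements = [rm.allocate("app-1", memory_mb=2048, vcores=1) for _ in range(3)]
print(placements)                                       # ['node-1', 'node-1', 'node-2']
print(rm.allocate("app-1", memory_mb=16384, vcores=1))  # None: no node is large enough
```

The point of the sketch is the division of labor: one component holds the cluster-wide view and decides placement, while each node only accounts for its own RAM and CPU.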
  • 98. Hadoop use case: combating fraudulent activities. Detecting fraudulent transactions is one of the many problems that any bank faces.
  • 99. Hadoop use case: combating fraudulent activities. Challenge: Zions' main challenge was to combat the fraudulent activities that were taking place.
  • 103. Hadoop use case: combating fraudulent activities. Approaches used by Zions' security team to combat fraudulent activities: 1. Security information management (SIM) tools. Problem: they were based on an RDBMS and were unable to store the huge amount of data that needed to be analyzed. 2. A parallel processing system. Problem: analyzing unstructured data was not possible.
  • 107. Hadoop use case: combating fraudulent activities. How Hadoop solved the problems. Storing: Zions could now store massive amounts of data using Hadoop. Processing: processing of unstructured data (such as server logs, customer data, and customer transactions) was now possible. Analyzing: in-depth analysis of different data formats became easy and time efficient. Detecting: the team could now detect everything from malware and spear phishing attempts to account takeovers.

Editor's Notes