What’s in it for you?
Hadoop Ecosystem
• Data storage
• Cluster resource management
• Data processing
• Data collection and ingestion
• Scripting and SQL queries
• Real-time data analysis
• Machine learning
• Management and monitoring
• Streaming
• Security
• Workflow system
HDFS
HDFS stands for Hadoop Distributed File System.
• Stores data of different formats across multiple machines
• Has 2 major components: the NameNode (master) and the DataNodes (slaves)
• Splits the data into multiple blocks (128 MB by default); for example, a 300 MB file is stored as blocks of 128 MB, 128 MB and 44 MB
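The block arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration of the splitting rule, not HDFS code; the function name `split_into_blocks` is made up for the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would occupy.

    Hypothetical helper for illustration; 128 MB is the default
    block size quoted on the slide.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        blocks.append(remainder)
    return blocks


print(split_into_blocks(300))  # [128, 128, 44]
```

Note that the last block is smaller than the block size: it holds only the remaining 44 MB.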
YARN
YARN stands for Yet Another Resource Negotiator.
• Handles the cluster of nodes
• Allocates CPU, memory and other resources to different applications
• Has 2 major components: the ResourceManager (master) and the NodeManager (slave)
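As a rough picture of the master/slave split, the sketch below has a toy resource manager hand out containers against the free capacity reported by each node. It is not the YARN API: the class names mirror the slide, but the first-fit placement rule and the sample node sizes are assumptions made for the example (real YARN uses pluggable schedulers).

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Toy stand-in for one worker node and its free capacity."""
    name: str
    free_memory_mb: int
    free_vcores: int


class ResourceManager:
    """Toy master that places container requests on nodes (first fit, an assumption)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, memory_mb, vcores):
        """Reserve resources on the first node that can fit the request.

        Returns the node name, or None if no node has enough capacity.
        """
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return node.name
        return None


rm = ResourceManager([NodeManager("node-1", 4096, 2), NodeManager("node-2", 8192, 4)])
placements = [
    rm.allocate(2048, 1),
    rm.allocate(4096, 2),
    rm.allocate(2048, 1),
    rm.allocate(8192, 1),  # too large for what is left on either node
]
print(placements)  # ['node-1', 'node-2', 'node-1', None]
```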
MapReduce
MapReduce processes large volumes of data in parallel across a distributed cluster.
• Processing flow: Big Data -> Map() tasks -> Shuffle and sort -> Reduce() tasks -> Output
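The map, shuffle-and-sort and reduce stages can be imitated on a single machine with a word count, the usual teaching example. This is a sketch of the programming model only; the function names are made up here, and a real job would run many map and reduce tasks in parallel on the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group all values by key and return the groups in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce(): collapse all counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


result = word_count(["big data big cluster", "data node"])
print(result)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```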
Sqoop
Sqoop is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses.
• Imports data from external datastores into HDFS, Hive and HBase
• Data flow: a command starts map tasks, which copy data from relational databases, enterprise data warehouses and document-based systems into HDFS/HBase/Hive
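One way to picture how several map tasks can share an import is to divide the source table's key range among them. The sketch below does that with a hypothetical helper, `split_key_range`; it assumes a contiguous integer key and is not Sqoop's actual splitting code.

```python
def split_key_range(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into near-equal slices.

    Hypothetical helper: each returned (start, end) pair is the slice
    one map task would import.
    """
    if num_mappers <= 0 or max_id < min_id:
        raise ValueError("need at least one mapper and a non-empty range")
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            continue  # more mappers than rows: leave the spare mappers idle
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_key_range(1, 10, 4))  # [(1, 3), (4, 6), (7, 8), (9, 10)]
```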
Flume
Flume is a distributed service for collecting, aggregating and moving large amounts of log data.
• Ingests unstructured and semi-structured data into HDFS
• Ingests online streaming data from social media, log files and web servers into HDFS
• Data flow: Web server / Cloud / Social media data -> Source -> Channel -> Sink -> HDFS
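The source, channel and sink roles can be sketched with an in-memory queue. None of this is Flume's API: the `Channel` class, the two functions and the list standing in for HDFS are assumptions made to show how events move from one stage to the next.

```python
from collections import deque


class Channel:
    """Toy buffer that holds events between the source and the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None when the channel is empty."""
        return self._events.popleft() if self._events else None


def run_source(log_lines, channel):
    """Source: read raw log lines and push non-empty ones into the channel."""
    for line in log_lines:
        line = line.strip()
        if line:
            channel.put(line)


def run_sink(channel, store):
    """Sink: drain the channel into the destination (a list standing in for HDFS)."""
    while (event := channel.take()) is not None:
        store.append(event)


channel = Channel()
hdfs_stand_in = []
run_source(["GET /index.html 200\n", "\n", "POST /login 302\n"], channel)
run_sink(channel, hdfs_stand_in)
print(hdfs_stand_in)  # ['GET /index.html 200', 'POST /login 302']
```

The channel decouples the two ends: the source can keep accepting events even when the sink is temporarily slower.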
Pig
Pig is used to analyze data in Hadoop. It provides a high-level data processing language to perform numerous operations on the data.
Apache Pig consists of:
• Pig Latin: the language for scripting
• Pig Latin Compiler: converts Pig Latin code to executable code
Key points:
• Provides a platform for building data flows for ETL
• About 10 lines of Pig Latin script correspond to around 200 lines of MapReduce code
• Architecture: Pig Latin scripts (via Grunt Shell or Pig Server) -> Parser -> Optimizer -> Compiler -> Execution Engine -> MapReduce -> HDFS
Hive
Hive facilitates reading, writing and managing large datasets residing in distributed storage using SQL (Hive Query Language).
• Has 2 major components: the Hive Command Line and the JDBC/ODBC driver
• Supports User Defined Functions (UDFs) for data mining, document indexing, log processing, etc.
• Architecture: clients (CLI, JDBC/ODBC through the Hive Thrift Server, Hive Web Interface) -> Driver (Compiler, Optimizer, Executor) -> Job Tracker and Namenode -> Datanode + Task Tracker
Spark
Spark is an open-source distributed computing engine for processing and analyzing huge volumes of real-time data.
• Written in Scala
• Can run up to 100x faster than MapReduce
• Provides in-memory computation of data
• Used to process and analyze real-time streaming data such as stock market and banking data
• Architecture: Driver Program (SparkContext) -> Cluster Manager -> Worker Nodes, each running an Executor with Tasks and a Cache
Mahout
Mahout is used to create scalable and distributed machine learning algorithms.
• The Mahout environment is used to build machine learning applications
• Has a library that contains built-in algorithms for collaborative filtering, classification and clustering
Ambari
Ambari is an open-source tool responsible for keeping track of running applications and their statuses.
• Manages, monitors and provisions Hadoop clusters
• Provides a central management service to start, stop and configure Hadoop services
• Architecture: Ambari Web -> Ambari Server (backed by a Database) -> an Agent on each Host Server
Kafka
Kafka is a distributed streaming platform to store and process streams of records.
• Written in Scala and Java
• Builds real-time streaming data pipelines that reliably move data between applications
• Builds real-time streaming applications that transform data into streams
• Uses a messaging system to transfer data from one application to another: Sender -> Message queue -> Receiver
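The sender, message queue and receiver picture can be sketched as an append-only log that the receiver reads from a remembered position. This is a toy model, not the Kafka client API: the `Topic` class, its method names and the sample records are assumptions made for the example.

```python
class Topic:
    """Toy append-only log of records, standing in for the message queue."""

    def __init__(self):
        self._log = []

    def send(self, record):
        """Sender side: append a record and return its position (offset) in the log."""
        self._log.append(record)
        return len(self._log) - 1

    def poll(self, offset, max_records=10):
        """Receiver side: read up to max_records starting at the given offset."""
        return self._log[offset:offset + max_records]


topic = Topic()
for trade in ["AAPL:190", "MSFT:410", "GOOG:140"]:
    topic.send(trade)

# The receiver tracks its own offset, so records stay in the log after being read.
offset = 0
batch = topic.poll(offset, max_records=2)
offset += len(batch)
rest = topic.poll(offset)
print(batch, rest)  # ['AAPL:190', 'MSFT:410'] ['GOOG:140']
```

Because records are stored rather than deleted on delivery, a second receiver could read the same log from offset 0 independently.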
Storm
Storm is a processing engine that processes real-time streaming data at very high speed.
• Written in Clojure
• Can process over a million tuples per second per node
• Integrates with Hadoop to achieve higher throughput
Ranger
Ranger is a framework to enable, monitor and manage data security across the Hadoop platform.
1. Provides centralized security administration to manage all security-related tasks
2. Standardizes authorization across all Hadoop components
3. Offers enhanced support for different authorization methods: role-based access control, attribute-based access control, etc.
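Role-based access control, one of the authorization methods named above, can be illustrated with a central policy list that every request is checked against. The policy layout, resource names and users below are invented for the example and do not reflect Ranger's policy format or API.

```python
# Hypothetical central policy store: each policy grants actions on a resource to roles.
POLICIES = [
    {"resource": "hive:sales_db", "roles": {"analyst", "admin"}, "actions": {"select"}},
    {"resource": "hdfs:/data/raw", "roles": {"admin"}, "actions": {"read", "write"}},
]

# Hypothetical user-to-role assignments.
USER_ROLES = {"alice": {"analyst"}, "bob": {"admin"}}


def is_allowed(user, resource, action):
    """Return True if any policy grants the action on the resource to one of the user's roles.

    Anything not explicitly granted is denied.
    """
    roles = USER_ROLES.get(user, set())
    return any(
        policy["resource"] == resource
        and action in policy["actions"]
        and bool(roles & policy["roles"])
        for policy in POLICIES
    )


print(is_allowed("alice", "hive:sales_db", "select"))  # True
print(is_allowed("alice", "hdfs:/data/raw", "write"))  # False
print(is_allowed("bob", "hdfs:/data/raw", "write"))    # True
```

Keeping the policies in one place is what lets the same check apply to Hive, HDFS and any other component.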
Knox
Knox is an application gateway for interacting with the REST APIs and UIs of Hadoop deployments.
Knox delivers 3 groups of user-facing services:
1. Proxying Services: provide access to Hadoop by proxying HTTP requests
2. Authentication Services: authentication for REST API access and WebSSO flow for user interfaces
3. Client Services: client development can be done with scripting through the DSL or by using the Knox shell classes
Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs.
Consists of 2 parts:
1. Workflow engine: Directed Acyclic Graphs (DAGs) that specify a sequence of actions to be executed
2. Coordinator engine: workflow jobs triggered by time and data availability
Example workflow:
• Start -> MapReduce Program [Action Node]
• On success: Notify client of success [Email Action Node] -> End (successful completion)
• On error: Notify client of error [Email Action Node] -> Kill (unsuccessful termination)
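The example workflow can be modelled as a small graph in which every action node has an "ok" and an "error" transition. The sketch below is plain Python written for illustration; real Oozie workflows are defined in XML, and the node names and the `run_workflow` helper here are assumptions.

```python
# Hypothetical encoding of the workflow on the slide: node -> next node per outcome.
WORKFLOW = {
    "start": {"ok": "mapreduce"},
    "mapreduce": {"ok": "email_success", "error": "email_error"},
    "email_success": {"ok": "end", "error": "kill"},
    "email_error": {"ok": "kill", "error": "kill"},
    "end": {"terminal": True},   # successful completion
    "kill": {"terminal": True},  # unsuccessful termination
}


def run_workflow(nodes, actions, start="start"):
    """Walk the graph from the start node and return the list of nodes visited.

    Each action is a callable returning True on success; the result picks
    the 'ok' or 'error' transition. The graph is acyclic, so the walk ends.
    """
    path = [start]
    current = nodes[start]["ok"]
    while True:
        path.append(current)
        node = nodes[current]
        if node.get("terminal"):
            return path
        succeeded = actions[current]()
        current = node["ok"] if succeeded else node["error"]


def make_actions(mapreduce_ok):
    """Build stand-in actions; only the MapReduce step can be made to fail."""
    return {
        "mapreduce": lambda: mapreduce_ok,
        "email_success": lambda: True,
        "email_error": lambda: True,
    }


success_path = run_workflow(WORKFLOW, make_actions(True))
failure_path = run_workflow(WORKFLOW, make_actions(False))
print(success_path)  # ['start', 'mapreduce', 'email_success', 'end']
print(failure_path)  # ['start', 'mapreduce', 'email_error', 'kill']
```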
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners | Simplilearn
