HADOOP VS. APACHE SPARK
Hadoop and Spark are popular Apache projects in the big data
ecosystem.
Apache Spark is an open-source processing engine that builds on and
generalizes the MapReduce model introduced by Hadoop.
 The Apache Software Foundation developed the Hadoop project as
open-source software for reliable, scalable, distributed computing.
 Hadoop is a framework that allows distributed processing of large
datasets across clusters of computers using simple programming
models.
 Hadoop scales easily from single machines up to clusters of many
machines, each offering local storage and computation.
 The Hadoop libraries are designed to detect and handle failures at
the application layer, rather than relying on hardware for fault
tolerance.
The project includes these modules:
 Hadoop Common: Java libraries and utilities required by the other
Hadoop modules. They provide OS-level and filesystem abstractions
and contain the Java files and scripts needed to start and run Hadoop.
 Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
 Hadoop YARN: A framework for job scheduling and cluster
resource management.
 Hadoop MapReduce: A YARN-based system for parallel
processing of large datasets.
Hadoop MapReduce, HDFS and YARN provide a scalable, fault-tolerant,
distributed platform for storing and processing very large datasets across
clusters of commodity computers. Hadoop uses the same set of nodes for data
storage and for computation, which improves the performance of large-scale
jobs by moving the computation to the data rather than moving data to the
computation.
Hadoop vs Apache Spark
Hadoop Distributed File System – HDFS
HDFS is a distributed filesystem designed to store large volumes of
data reliably.
HDFS splits a single large file into blocks and distributes them across
the nodes of a cluster of commodity machines.
HDFS overlays the native filesystem on each node. Data is stored in
fixed-size blocks, with a default block size of 128 MB.
HDFS also stores redundant copies of each data block (three replicas by
default) on multiple nodes to ensure reliability and fault tolerance. In
short, HDFS is a distributed, reliable and scalable file system.
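To make the block and replication figures concrete, here is a small Python sketch of the storage arithmetic, using the defaults above (an illustration only, not part of any HDFS API):

```python
# Rough storage math for one file under the HDFS defaults above.
import math

BLOCK_SIZE_MB = 128   # default HDFS block size
REPLICATION = 3       # default HDFS replication factor

def hdfs_footprint(file_size_mb: float):
    """Return (block count, approximate raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Blocks occupy only their actual size on disk, so raw usage is
    # roughly the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb

print(hdfs_footprint(1024))  # a 1 GB file -> (8, 3072.0)
```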
Hadoop YARN
YARN (Yet Another Resource Negotiator), a central component of the
Hadoop ecosystem, is a framework for job scheduling and cluster resource
management. The basic idea of YARN is to split the functions of resource
management and job scheduling/monitoring into separate daemons: a global
ResourceManager and a per-application ApplicationMaster.
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and
generating large datasets with a parallel, distributed algorithm on a cluster. The Mapper maps
each input key/value pair to a set of intermediate pairs. The Reducer takes these intermediate
pairs and processes them to produce the required output values. Mappers run in parallel across
the cluster nodes, and Reducers run on whichever nodes YARN assigns.
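As an illustration, here is a minimal word-count sketch in the Hadoop Streaming style (Streaming lets any executable act as the mapper or reducer; the script names are illustrative, not from the original deck):

```python
# mapper.py - emits "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py - sums the counts per word; Hadoop sorts mapper output
# by key before it reaches the reducer, so equal words arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

In a real deployment these two scripts would be handed to the hadoop-streaming JAR via its -mapper and -reducer options.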
 It is a framework for analysing data analytics on a distributed
computing cluster.
 It provides in-memory computations for increasing speed and
data processing over MapReduce.
 It utilizes the Hadoop Distributed File System (HDFS) and runs
on top of existing Hadoop cluster.
 It can also process both structured data in Hive and streaming
data from different sources like HDFS, Flume, Kafka, and
Twitter.
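For comparison with the two-script MapReduce version above, here is a minimal PySpark sketch of the same word count, with the intermediate results cached in memory (the HDFS path and app name are placeholders):

```python
# Minimal PySpark word count; one short pipeline instead of two scripts.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")           # local or YARN cluster assumed
lines = sc.textFile("hdfs:///data/input.txt")    # placeholder HDFS path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())                         # keep results in memory for reuse
print(counts.take(10))
sc.stop()
```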
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant processing of live data streams.
Input data can come from sources such as TCP sockets, Flume, or Kafka,
and can be processed using complex algorithms expressed with high-level
functions like map, reduce, and join. Finally, processed data can be pushed
out to filesystems (HDFS), databases, and live dashboards.
Spark's machine learning and graph processing algorithms can also be
applied to data streams.
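A minimal sketch of the classic streaming word count using the DStream API, assuming a test text source on a local socket at port 9999:

```python
# Count words arriving on a TCP socket in 1-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="NetworkWordCount")
ssc = StreamingContext(sc, 1)                    # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)  # assumed local test source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print a sample of each batch

ssc.start()
ssc.awaitTermination()
```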
Spark SQL
 Apache Spark provides a separate module, Spark SQL, for processing
structured data.
 The Spark SQL interfaces give Spark additional information about the
structure of the data and of the computation being performed.
 Internally, Spark SQL uses this additional information to perform extra
optimizations.
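A minimal Spark SQL sketch: load structured data, register it as a view, and query it with SQL (the JSON path is a placeholder):

```python
# Register a DataFrame as a temporary view and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
people = spark.read.json("hdfs:///data/people.json")  # placeholder path
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
spark.stop()
```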
Datasets and DataFrames
 In Apache Spark, a Dataset is a distributed collection of data. A
Dataset provides the benefits of RDDs while also taking advantage of
Spark SQL's optimized execution engine.
 A Dataset can be constructed from JVM objects and then manipulated
using functional transformations.
 A DataFrame is a Dataset organized into named columns. It is
conceptually equivalent to a table in a relational database or a
data frame in R/Python, but with richer optimizations under the hood.
 A DataFrame can be constructed from various data sources, such as
structured data files, Hive tables, external databases, or existing
RDDs.
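A small sketch of building a DataFrame directly from in-memory rows (the column names and rows are illustrative):

```python
# Build a DataFrame from a local collection and inspect its named columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],   # rows
    ["name", "age"],                # named columns
)
df.printSchema()
df.filter(df.age > 40).show()
spark.stop()
```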
Resilient Distributed Datasets (RDDs)
Spark is built around fault-tolerant collections of elements that can be
operated on in parallel, a concept called the resilient distributed dataset
(RDD). RDDs can be created in two ways: by parallelizing an existing
collection in the driver program, or by referencing a dataset in an external
storage system such as a shared filesystem, HDFS, or HBase.
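Both creation paths in a minimal sketch (the HDFS path is a placeholder):

```python
# The two ways to create an RDD described above.
from pyspark import SparkContext

sc = SparkContext(appName="RDDExample")

# 1. Parallelize an existing collection in the driver program.
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

# 2. Reference a dataset in an external storage system (HDFS here).
lines = sc.textFile("hdfs:///data/input.txt")  # placeholder path
print(lines.count())
sc.stop()
```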
Here is a quick comparison before concluding.

Difficulty
Hadoop: MapReduce is difficult to program and needs higher-level
abstractions (such as Pig or Hive) on top.
Apache Spark: Spark is easy to program and does not require any extra
abstraction layer.

Interactive mode
Hadoop: There is no built-in interactive mode, apart from tools such as
Pig and Hive.
Apache Spark: Spark has an interactive shell.

Streaming
Hadoop: Hadoop MapReduce can only process batches of stored data.
Apache Spark: Spark can process data in real time through Spark
Streaming.

Performance
Hadoop: MapReduce does not leverage the memory of the Hadoop cluster to
the maximum.
Apache Spark: Spark has been reported to execute batch processing jobs
about 10 to 100 times faster than Hadoop MapReduce.

Latency
Hadoop: MapReduce is completely disk-oriented.
Apache Spark: Spark ensures lower-latency computations by caching partial
results in the memory of its distributed workers.

Ease of coding
Hadoop: Writing Hadoop MapReduce pipelines is a complex and lengthy
process.
Apache Spark: Equivalent Spark code is generally far more compact.
CONTACT US
Write to us: business@altencalsoftlabs.com
Visit our website: https://www.altencalsoftlabs.com
USA | FRANCE | UK | INDIA | SINGAPORE