www.edureka.co/r-for-analytics
www.edureka.co/apache-spark-scala-training
Apache Spark: Beyond Hadoop MapReduce
Presenter: Vishal
What will you learn today?
• Strength of MapReduce
• Limitations of MapReduce
• How MapReduce limitations can be overcome
• How Spark fits the bill
• Other exciting features in Spark
Strength of MapReduce
• Simple: jobs can be written in several programming languages, such as Java, C++ or Python.
• Scalable: it can process petabytes of data stored in HDFS on a single cluster.
• Fault tolerant: MapReduce handles failures by using the replicated copies of the data.
• Minimal data motion: processing moves to the data, which minimizes the data shipped between nodes.
Limitations of MapReduce
• Real time
• Complex algorithms
• Re-reading and parsing data
• Minimal data motion
• Graph processing
• Iterative tasks
• Random access
Feature Comparison with Spark

Hadoop MapReduce     | Spark
---------------------|----------------------------------
Fast                 | Up to 100x faster than MapReduce
Batch processing     | Batch and real-time processing
Stores data on disk  | Stores data in memory
Written in Java      | Written in Scala

Source: Databricks
What are the MR limitations, and how does Spark overcome them?
Overcoming MR limitations
• Real time: by cutting down on the number of reads and writes to disk
Spark Cuts Down Read/Write I/O To Disk
Spark tries to keep data in the memory of its distributed workers, which allows significantly faster, lower-latency computations, whereas MapReduce keeps writing intermediate data to disk and reading it back.
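The contrast above can be sketched with a toy model in plain Python. This is not the Spark API: `DiskDataset`, `iterate_from_disk` and `iterate_from_memory` are hypothetical names invented for illustration, and the "disk" is just a list whose scans are counted. It only shows why an iterative job benefits from parsing its input once and reusing the in-memory result.

```python
# Toy model only: plain Python, not Spark. All names here are hypothetical.

class DiskDataset:
    """Stands in for a file on disk and counts how often it is scanned."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.scans = 0

    def read(self):
        # Each call simulates re-reading and re-parsing the raw data.
        self.scans += 1
        return [int(line) for line in self._lines]


def iterate_from_disk(dataset, iterations):
    """MapReduce-style loop: every iteration goes back to disk."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.read())
    return total


def iterate_from_memory(dataset, iterations):
    """Spark-style loop: parse once, keep the result in memory, reuse it."""
    cached = dataset.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_run = DiskDataset(["1", "2", "3"])
memory_run = DiskDataset(["1", "2", "3"])

disk_total = iterate_from_disk(disk_run, iterations=10)
memory_total = iterate_from_memory(memory_run, iterations=10)

print("from disk:  ", disk_total, "scans:", disk_run.scans)
print("from memory:", memory_total, "scans:", memory_run.scans)
```

Both loops compute the same total, but the first scans the input once per iteration while the second scans it once overall. In real Spark the in-memory copy is distributed across workers; this sketch leaves that out.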
Overcoming MR limitations
• Graph processing, complex algorithms: libraries for machine learning and streaming
Libraries For ML, Graph Programming …
• MLlib: machine learning library
• GraphX: graph programming
• Spark SQL: Spark interface for RDBMS lovers
• Spark Streaming: utility for continuous ingestion of data
Overcoming MR limitations
• Random access: cyclic data flows
Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
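The idea of combining operators can be illustrated with a small plain-Python sketch. It is a toy under stated assumptions, not Spark's scheduler or optimizer: `map_op`, `filter_op`, `fuse` and `run` are hypothetical helpers, and the "DAG" is the simplest possible case, a linear chain of per-record operators. The point is that a chain of maps and filters can be fused into one function and applied in a single pass, with no intermediate dataset between steps.

```python
# Toy model only: plain Python, not Spark. All names here are hypothetical.

def map_op(fn):
    """Describe a per-record transformation without running it yet."""
    return ("map", fn)


def filter_op(predicate):
    """Describe a per-record filter without running it yet."""
    return ("filter", predicate)


def fuse(operators):
    """Combine a linear chain of map/filter operators into one function.

    The fused function returns (value, keep), where keep is False if any
    filter in the chain rejected the record.
    """
    def fused(record):
        for kind, fn in operators:
            if kind == "map":
                record = fn(record)
            elif not fn(record):
                return None, False
        return record, True
    return fused


def run(data, operators):
    """Execute the whole chain in a single pass over the data."""
    fused = fuse(operators)
    output = []
    for record in data:
        value, keep = fused(record)
        if keep:
            output.append(value)
    return output


pipeline = [
    map_op(lambda x: x * 2),
    filter_op(lambda x: x > 4),
    map_op(lambda x: x + 1),
]
result = run([1, 2, 3, 4], pipeline)
print(result)  # [7, 9]
```

Because the operators are only described up front and executed later, the runner is free to inspect and combine them before touching any data, which is what building a DAG first makes possible.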
Spark's features make its architecture better than MR's
Other Spark Features In Demand
Spark Features/Modules In Demand
Source: Typesafe
New Features In 2015
Data Frames
• Similar API to data frames in R and Pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs and MLlib in R
Machine Learning Pipelines
• High-level API
• Featurization
• Evaluation
• Model tuning
External Data Sources
• Platform API to plug data sources into Spark
• Pushes logic into sources
Source: Databricks
Get Certified in Spark from Edureka
Edureka's Spark and Scala course:
• Learn large-scale data processing by mastering the concepts of Scala, RDD, Traits, OOPS and Spark SQL
• Online Live Courses: 24 hours
• Assignments: 32 hours
• Project: 20 hours
• Lifetime Access + 24 X 7 Support
Go to www.edureka.co/apache-spark-scala-training
Batch starts from 10th October (Weekend Batch)
Thank You
Questions/Queries/Feedback/Survey
Recording and presentation will be made available to you within 24 hours

