B I G D A T A W O R K G R O U P . I R
WHAT IS SPARK
Apache Spark is an open source big data
processing framework built around speed, ease
of use, and sophisticated analytics. It was
originally developed in 2009 in UC Berkeley’s
AMPLab, open sourced in 2010, and donated to the
Apache Software Foundation in 2013.
WHAT IS SPARK
Advantages: In Memory
• Spark enables applications in Hadoop clusters to run up to 100
times faster in memory, and up to 10 times faster even when running
on disk, than with Hadoop MapReduce.
WHAT IS SPARK
Advantages: Generic API
• Spark lets you quickly write applications in Java, Scala, or
Python. It comes with a built-in set of over 80 high-level
operators, and you can use it interactively to query data from
the shell.
WHAT IS SPARK
Advantages: Many Applications
• Spark gives us a comprehensive, unified framework for managing
big data processing requirements across data sets that are
diverse both in nature (text data, graph data, etc.) and in
source (batch vs. real-time streaming data).
WHAT IS SPARK
Advantages: Many Applications
• In addition to Map and Reduce operations, Spark supports SQL
queries, streaming data, machine learning, and graph data
processing. Developers can use these capabilities stand-alone
or combine them in a single data pipeline.
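As a reminder of what the Map and Reduce operations mentioned above look like, here is a minimal word-count sketch in plain Python. It does not use Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    # One map step followed by one reduce step: a single-pass computation.
    return reduce_phase(shuffle(map_phase(lines)))


sample_lines = ["spark is fast", "spark is general"]  # hypothetical input
counts = word_count(sample_lines)
print(counts)
```

In a real cluster each phase would run in parallel over partitions of the data; this sketch only shows the shape of the computation.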
HADOOP AND SPARK
Hadoop
• MapReduce is suitable for one-pass computations.
• Clusters are hard to set up and manage.
• Needs to integrate with Mahout (machine learning) and Storm
(streaming data processing).
Spark
• Supports multi-step data pipelines using the directed acyclic
graph (DAG) pattern.
• Supports in-memory data sharing across DAGs.
• Can be used as an alternative to Hadoop MapReduce.
SPARK FEATURES
• Less expensive shuffles in data processing, along with capabilities
like in-memory data storage.
• Lazy evaluation of big data queries, which helps optimize the
steps in data processing workflows.
• A higher-level API to improve developer productivity, and a consistent
architectural model for big data solutions.
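Lazy evaluation, listed among the features above, can be pictured with a small toy model in plain Python. This is only a sketch under stated assumptions: `LazyPipeline` and `traced_square` are hypothetical names, not part of Spark's API. Transformations are merely recorded, and nothing runs until an action (`collect` here) is called.

```python
class LazyPipeline:
    """Toy model: transformations are recorded; only an action runs them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the step and return a new pipeline; nothing is computed yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: run all recorded steps in a single pass over the source."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_square(x):
    calls.append(x)  # remember when the function is actually invoked
    return x * x


pipeline = LazyPipeline(range(5)).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: no work has been done
result = pipeline.collect()       # the action triggers the computation
print(calls_before_action, result)
```

Because the whole chain is known before anything runs, the steps can be fused into one pass over the data, which is the kind of optimization the slide refers to.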
SPARK FEATURES
• Spark holds intermediate results in memory rather than writing them to
disk.
• Spark can also be used to process datasets that are larger than the
aggregate memory in a cluster.
SPARK ECOSYSTEM
Spark Streaming
• Micro-batch style of computing and processing (DStream).
Spark SQL
• JDBC API, SQL-like queries, ETL.
Spark MLlib
• Machine learning algorithms including classification, regression,
clustering, collaborative filtering, and dimensionality reduction, as
well as underlying optimization primitives.
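The micro-batch style listed under Spark Streaming can be sketched in plain Python: incoming events are grouped into small batches by time interval, and each batch is then processed like an ordinary batch job. The function name `micro_batches`, the event format, and the one-second interval are assumptions made for this illustration, not Spark's API.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals that received no events are skipped in this toy version.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical stream of (seconds since start, payload) events.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
batch_sizes = [len(batch) for batch in batches]
print(batches, batch_sizes)
```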
SPARK ECOSYSTEM
Spark GraphX
• GraphX extends the Spark RDD by introducing the
Resilient Distributed Property Graph.
• It provides a set of fundamental operators (e.g., subgraph,
joinVertices, and aggregateMessages).
SPARK ECOSYSTEM
BlinkDB
• Trades off query accuracy for response time.
Tachyon
• Caches working set files in memory.
Spark Cassandra Connector
• Accesses data stored in a Cassandra database.
SparkR
SPARK ARCHITECTURE
RESILIENT DISTRIBUTED DATASETS
• Fault tolerance: an RDD knows how to recreate and recompute its
datasets.
• RDDs are immutable.
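Both points can be pictured with a toy model in plain Python. The class `ToyRDD` and its methods are hypothetical and greatly simplified; they are not Spark's implementation. Each dataset remembers its parent and the function that derives it (its lineage), so lost results can be recomputed, and a transformation returns a new dataset instead of changing the existing one.

```python
class ToyRDD:
    """Toy model of lineage: a dataset knows its parent and how to derive itself."""

    def __init__(self, data=None, parent=None, fn=None):
        # Tuples are used so the stored data cannot be modified in place.
        self._cache = tuple(data) if data is not None else None
        self._parent = parent
        self._fn = fn

    def map(self, fn):
        # A transformation never modifies this dataset; it returns a new one.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Recompute from the parent if the result is missing.
        if self._cache is None:
            self._cache = tuple(self._fn(x) for x in self._parent.compute())
        return self._cache

    def lose_data(self):
        """Simulate losing computed data, for example after a node failure."""
        if self._parent is not None:  # source data is assumed to be durable
            self._cache = None


base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
first = doubled.compute()
doubled.lose_data()
recomputed = doubled.compute()  # rebuilt from lineage, not from a backup
print(first, recomputed, base.compute())
```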
RDD OPERATIONS
HOW TO RUN SPARK
HOW TO INTERACT WITH SPARK
spark-shell.cmd
SPARK WEB CONSOLE
http://localhost:4040
SHARED VARIABLES
Broadcast Variables
Accumulators
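The slide only names the two kinds of shared variables, so the following is a rough sketch of the general idea in plain Python, not Spark code. A broadcast variable is a read-only value made available to every task, and an accumulator is a value that tasks only add to while the driver reads the total. The classes `Broadcast` and `Accumulator`, the function `run_task`, and the sample data are all hypothetical; in PySpark the corresponding objects are created with `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
from types import MappingProxyType


class Broadcast:
    """Toy broadcast variable: a read-only value shared with every task."""

    def __init__(self, value):
        self.value = MappingProxyType(dict(value))  # read-only view


class Accumulator:
    """Toy accumulator: tasks may only add to it; the driver reads the total."""

    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount


def run_task(partition, lookup, bad_records):
    """Translate codes using the broadcast lookup; count unknown codes."""
    out = []
    for code in partition:
        if code in lookup.value:
            out.append(lookup.value[code])
        else:
            bad_records.add(1)
    return out


country_names = Broadcast({"IR": "Iran", "US": "United States"})
bad_records = Accumulator(0)
partitions = [["IR", "US"], ["XX", "IR"]]  # hypothetical input partitions
results = [run_task(p, country_names, bad_records) for p in partitions]
print(results, bad_records.value)
```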

Big data processing with apache spark part1
