Big Data Processing With Spark and Scala
http://www.edureka.co/apache-spark-scala-training
Objectives of this Session
• What is Big Data?
• What is Spark?
• Why Spark?
• Spark Ecosystem
• A note about Scala
• Why Scala?
• MapReduce vs Spark
• Hello Spark!
Big Data
• Lots of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
[Figure: word cloud of Big Data terms, including cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing and mobile]
What is Spark?
• Apache Spark is a general-purpose, in-memory cluster computing system
• Provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs
• Provides various high-level tools, such as Spark SQL for structured data processing and MLlib for machine learning
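To make the high-level API point concrete, here is a minimal word-count sketch against Spark's Scala RDD API. It is not taken from the slides: it assumes the `spark-core` library is on the classpath, runs on a local master, and the object name `HelloSpark` and the input lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; assumes spark-core is on the classpath.
object HelloSpark {
  // Counts word occurrences by distributing the lines as an RDD.
  def wordCount(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)            // turn a local collection into an RDD
      .flatMap(_.split("\\s+"))      // split each line into words
      .filter(_.nonEmpty)            // drop empty tokens
      .map(word => (word, 1))        // pair each word with a count of 1
      .reduceByKey(_ + _)            // sum the counts per word
      .collect()                     // bring the results back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      wordCount(sc, Seq("hello spark", "hello scala")).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The chain of `flatMap`, `map` and `reduceByKey` reads much like ordinary Scala collection code, which is the sense in which the API is high level.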
Why Spark?
• The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark's own cluster manager.
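In practice, the deployment choice usually shows up as the master URL handed to Spark. The sketch below is a hypothetical helper, not part of Spark: it maps each of the three cluster managers to the master URL format described in Spark's documentation. Host names are placeholders and the ports are the commonly used defaults; older Spark releases used `yarn-client` and `yarn-cluster` where later ones use `yarn`.

```scala
// Hypothetical model of the three deployment options named on the slide.
sealed trait ClusterManager
final case class Standalone(host: String, port: Int = 7077) extends ClusterManager
final case class Mesos(host: String, port: Int = 5050) extends ClusterManager
case object Yarn extends ClusterManager

object MasterUrl {
  // Builds the master URL string for a given cluster manager.
  def of(manager: ClusterManager): String = manager match {
    case Standalone(host, port) => s"spark://$host:$port" // Spark's own cluster manager
    case Mesos(host, port)      => s"mesos://$host:$port" // Apache Mesos
    case Yarn                   => "yarn"                 // Hadoop YARN; cluster found via Hadoop config
  }
}
```

For example, `MasterUrl.of(Standalone("master-host"))` yields `spark://master-host:7077`, which could then be passed to `SparkConf.setMaster` or to `spark-submit --master`.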
Why Spark?
• The Spark framework is polyglot: it can be programmed in several languages (currently Scala, Java and Python are supported).
Why Spark?
• Shark: a fully Apache Hive-compatible data warehousing system that can run up to 100x faster than Hive
• Up to 100x faster for certain applications
Why Spark?
• Provides powerful caching and disk persistence capabilities
• Interactive data analysis
• Faster batch processing
• Iterative algorithms
• Real-time stream processing
• Faster decision-making
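Why caching matters for iterative work can be shown without a cluster. The sketch below is only an analogy in plain Scala collections, not Spark code: a lazy `view` recomputes its elements on every traversal, much as an uncached RDD is recomputed from its lineage, while a materialized collection is computed once, much as an RDD is after `cache()` or `persist()`. The `expensive` function and the call counter are made up for illustration.

```scala
object CachingAnalogy {
  // Returns (calls made without caching, calls made with caching)
  // when the same derived data is traversed twice.
  def run(): (Int, Int) = {
    var calls = 0
    def expensive(x: Int): Int = { calls += 1; x * x }
    val data = Vector(1, 2, 3, 4)

    // Lazy view: elements are recomputed on each traversal.
    val uncached = data.view.map(expensive)
    val sum1 = uncached.sum
    val max1 = uncached.max
    val uncachedCalls = calls

    // Materialized collection: elements are computed once and reused.
    calls = 0
    val cached = data.map(expensive)
    val sum2 = cached.sum
    val max2 = cached.max
    val cachedCalls = calls

    require(sum1 == sum2 && max1 == max2) // both routes give the same answers
    (uncachedCalls, cachedCalls)
  }
}
```

An iterative algorithm that passes over the same data many times pays the recomputation cost on every pass unless the intermediate result is kept, which is the case that in-memory caching targets.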
Spark Community is Super Active!
Spark Ecosystem
Spark Core Engine, with components built on top of it:
• Shark (SQL)
• Spark Streaming (Streaming)
• MLlib (Machine learning)
• GraphX (Graph computation)
• SparkR (R on Spark)
• BlinkDB (Approximate SQL)
(The diagram's legend marks some components as Alpha/Pre-alpha.)
Spark Ecosystem (Contd.)
Components built on the Spark Core Engine:
• Shark (SQL): used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
• Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data.
• MLlib (Machine learning): machine learning library being built on top of Spark. Aims to support many machine learning algorithms at speeds up to 100 times faster than MapReduce.
• GraphX (Graph computation): graph computation engine (similar to Giraph).
• SparkR (R on Spark): package for the R language that lets R users leverage Spark from the R shell.
• BlinkDB (Approximate SQL): an approximate query engine that runs over the Spark Core Engine.
(The diagram's legend marks some components as Alpha/Pre-alpha.)
A Note on Scala
• Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way
• Scala supports both object-oriented programming and functional programming
• Scala is woven into the fabric of present and future Big Data frameworks such as Scalding, Spark and Akka
» All Spark examples in class will be in Scala
» Scala will be covered before Spark as part of the course!
Why Scala?
• Scala is a pure object-oriented language. Conceptually, every value is an object and every operation is a method call. The language supports advanced component architectures through classes and traits
• Scala is also a functional language. It supports first-class functions and immutable data structures, and prefers immutability over mutation
• Seamlessly integrated with Java
• Used heavily in current and upcoming Big Data and development frameworks such as Spark, Akka, Scalding and Play
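The points above can be sketched in a few lines of Scala. Everything here is illustrative: the `Describable` trait, the `Framework` case class and its `priority` field are made-up names, and the framework names are simply those listed on the slide.

```scala
import java.util.{ArrayList => JArrayList}

// A trait bundles reusable behaviour that classes can mix in.
trait Describable {
  def name: String
  def describe: String = s"framework:$name"
}

// An immutable case class; `priority` is a made-up field for illustration.
final case class Framework(name: String, priority: Int) extends Describable

object WhyScalaSketch {
  val frameworks: List[Framework] =
    List(Framework("Spark", 1), Framework("Akka", 2), Framework("Scalding", 3))

  // Operators are ordinary method calls: 1 + 2 is sugar for (1).+(2).
  val three: Int = (1).+(2)

  // Functional style: transform immutable data with higher-order functions.
  val topTwo: List[String] = frameworks.filter(_.priority <= 2).map(_.describe)

  // "Updating" yields a new value; the original is left untouched.
  val renamed: Framework = frameworks.head.copy(name = "Apache Spark")

  // Java interop: use a java.util collection directly from Scala.
  def javaListSize(items: String*): Int = {
    val list = new JArrayList[String]()
    items.foreach(item => list.add(item))
    list.size()
  }
}
```

Note that `renamed` is a new object: `frameworks.head` still has the name `Spark`, which is what preferring immutability over mutation looks like in everyday code.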
Real Time Analytics
• If you need real-time analytics, where results are expected quickly, Hadoop should not be used directly
• Hadoop works on batch processing, so response time is high
[Figure: two Day 1 to Day n timelines showing input data and processing data; when data is processed using MapReduce (MR), there is a time lag between input and processing]
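The difference between the batch and real-time styles can be illustrated with plain Scala collections. This is a toy sketch, not how Hadoop or Spark Streaming is implemented: the object name and the integer events are made up, and the point is only that a batch job answers once after all input has arrived, while a streaming-style computation has an up-to-date answer after every event.

```scala
object BatchVsStreaming {
  // Batch style: wait until all events are collected, then compute one result.
  def batchTotal(events: Seq[Int]): Int = events.sum

  // Streaming style: keep a running total and emit it after each event.
  def runningTotals(events: Seq[Int]): Seq[Int] =
    events.scanLeft(0)(_ + _).tail // drop the initial 0 seed
}
```

With events `3, 1, 4`, the batch version returns a single `8` at the end, whereas the running version yields `3, 4, 8`, one result per event as it arrives.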
Real Time Analytics – Accepted Way
[Figure: pipeline diagram with stages labeled Streaming Data and Storing]
MapReduce vs Spark
[Figure: MapReduce vs Spark timing comparison, labeled 14 sec and 0.6 sec]
Spark Demo!
Questions?