Introduction to Apache Spark
Contents
1 Introduction to Spark
2 Resilient Distributed Datasets
3 RDD Operations
4 Workshop
1. Introduction
What is Apache Spark?
● Extends MapReduce
● Cluster computing platform
● Runs in memory
Fast
❏ Up to 10x faster than MapReduce on disk
❏ Up to 100x faster in memory
Ease of development
❏ Easy code
❏ Interactive shell
Unified stack
❏ Batch
❏ Streaming
Multi-language support
❏ Scala, Python, Java, R
Deployment flexibility
❏ Deployment: Mesos, YARN, standalone, local
❏ Storage: HDFS, S3, local FS
Why Spark
Rise of the data center
Huge amounts of data spread out across many commodity servers
MapReduce
Lots of data → scale out
Data Processing Requirements
Network bottleneck → Distributed Computing
Hardware failure → Fault Tolerance
Abstraction to organize parallelizable tasks
MapReduce
MapReduce word count example
Input → Split → Map → [Combine] → Shuffle & Sort → Reduce → Output
Input:
  AA BB AA
  AA CC DD
  AA EE DD
  BB FF AA
Split:
  Split 1: AA BB AA
  Split 2: AA CC DD
  Split 3: AA EE DD
  Split 4: BB FF AA
Map:
  Split 1: (AA, 1) (BB, 1) (AA, 1)
  Split 2: (AA, 1) (CC, 1) (DD, 1)
  Split 3: (AA, 1) (EE, 1) (DD, 1)
  Split 4: (BB, 1) (FF, 1) (AA, 1)
[Combine]:
  Split 1: (AA, 2) (BB, 1)
  Split 2: (AA, 1) (CC, 1) (DD, 1)
  Split 3: (AA, 1) (EE, 1) (DD, 1)
  Split 4: (BB, 1) (FF, 1) (AA, 1)
Shuffle & Sort:
  AA: (AA, 2) (AA, 1) (AA, 1) (AA, 1)
  BB: (BB, 1) (BB, 1)
  CC: (CC, 1)
  DD: (DD, 1) (DD, 1)
  EE: (EE, 1)
  FF: (FF, 1)
Reduce:
  (AA, 5) (BB, 2) (CC, 1) (DD, 2) (EE, 1) (FF, 1)
Output:
  AA, 5
  BB, 2
  CC, 1
  DD, 2
  EE, 1
  FF, 1
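The phases of the word count above can be mimicked with ordinary Scala collections. This is only a sketch of the data flow: it runs in a single process, and the names MapReduceWordCount, mapPhase, combinePhase, shuffleAndSort and reducePhase are made up for illustration and belong to neither Hadoop nor Spark.

```scala
// Single-process sketch of the MapReduce phases using plain Scala collections.
// All names here are hypothetical; nothing is distributed.
object MapReduceWordCount {
  type Pair = (String, Int)

  // Map: emit (word, 1) for every word in one input split.
  def mapPhase(split: String): Seq[Pair] =
    split.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Combine: pre-aggregate the pairs of a single split before the shuffle.
  def combinePhase(pairs: Seq[Pair]): Seq[Pair] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }.toSeq

  // Shuffle & sort: gather all counts of the same key together, ordered by key.
  def shuffleAndSort(pairs: Seq[Pair]): Seq[(String, Seq[Int])] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }.toSeq.sortBy(_._1)

  // Reduce: sum the counts collected for each key.
  def reducePhase(grouped: Seq[(String, Seq[Int])]): Seq[Pair] =
    grouped.map { case (word, counts) => (word, counts.sum) }

  def wordCount(splits: Seq[String]): Seq[Pair] =
    reducePhase(shuffleAndSort(splits.flatMap(s => combinePhase(mapPhase(s)))))

  def main(args: Array[String]): Unit = {
    val splits = Seq("AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA")
    wordCount(splits).foreach { case (word, n) => println(s"$word, $n") }
  }
}
```

The combine step is optional, as the brackets in the diagram suggest: removing the call to combinePhase leaves the final counts unchanged and only increases the number of pairs that reach the shuffle.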
Spark Components
[Diagram: the Driver Program (holding the SparkContext) connects to the Cluster Manager and to the Worker Nodes; each Worker Node runs an Executor, and each Executor runs Tasks]
Spark Components
SparkContext
● Main entry point for Spark functionality
● Represents the connection to a Spark cluster
● Tells Spark how & where to access a cluster
● Can be used to create RDDs, accumulators and
broadcast variables on that cluster
Driver program
● “Main” process, coordinated by the SparkContext object
● Lets you configure any Spark process with specific parameters
● Spark actions are triggered from the driver, and their results are returned to it
● spark-shell is itself a driver program
● Application → driver program + executors
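A minimal driver program could look like the sketch below. It assumes the spark-core library is on the classpath; the application name, the local[2] master and the object name DriverSketch are illustrative choices, not taken from the slides. It uses the SparkContext to create the three things listed above: an RDD, an accumulator and a broadcast variable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; app name and master are assumptions for a local run.
object DriverSketch {
  // Returns (how many values were even, the values scaled by the broadcast factor).
  def run(): (Long, Seq[Int]) = {
    val conf = new SparkConf()
      .setAppName("intro-to-spark") // assumed application name
      .setMaster("local[2]")        // assumed master: local mode with 2 threads
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 10)   // an RDD
      val evens = sc.longAccumulator("evens") // an accumulator
      val factor = sc.broadcast(10)           // a broadcast variable

      // foreach is an action: it runs on the executors and updates the accumulator.
      numbers.foreach(n => if (n % 2 == 0) evens.add(1L))
      // collect is an action: its result comes back to the driver.
      val scaled = numbers.map(n => n * factor.value).collect().toSeq
      (evens.value.longValue, scaled)
    } finally {
      sc.stop() // release the resources held by the context
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Pointing setMaster at a real cluster manager instead of local[2] is what changes where the executors run; the rest of the driver code stays the same.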
Spark Components
Cluster Manager
● External service for acquiring resources on the cluster
● Variety of cluster managers
○ Local
○ Standalone
○ YARN
○ Mesos
● Deploy mode:
○ Cluster → the framework launches the driver inside the cluster
○ Client → the submitter launches the driver outside the cluster
Spark Components
Worker Node
● Any node that can run application code in the cluster
● Key Terms
○ Executor: a process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
○ Task: a unit of work that is sent to one executor
○ Job: a parallel computation consisting of multiple tasks that is spawned in response to a Spark action (e.g. save, collect)
○ Stage: a smaller set of tasks within a job; each job is divided into stages
2. Resilient Distributed Datasets
RDD
Resilient Distributed Datasets
● Collection of objects distributed across the nodes of a cluster
● Data operations are performed on RDDs
● Once created, RDDs are immutable
● RDDs can be persisted in memory or on disk
● Fault tolerant
Example: numbers = RDD[1,2,3,4,5,6,7,8,9,10], partitioned across three worker nodes
Worker Node 1, Executor: [1,5,6,9]
Worker Node 2, Executor: [2,7,8]
Worker Node 3, Executor: [3,4,10]
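The partitioning of the numbers RDD can be inspected with glom(), which turns each partition into an array. The sketch below assumes spark-core on the classpath and a local master with three threads standing in for the three worker nodes; PartitionSketch is a made-up name. Spark decides which element lands in which partition, so the groups need not match the ones drawn on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: three local threads stand in for three worker nodes.
object PartitionSketch {
  // Returns the contents of each partition of the numbers RDD.
  def run(): Seq[Seq[Int]] = {
    val conf = new SparkConf().setAppName("rdd-partitions").setMaster("local[3]")
    val sc = new SparkContext(conf)
    try {
      // Ask for 3 partitions explicitly.
      val numbers = sc.parallelize(1 to 10, numSlices = 3)
      // glom() groups the elements of each partition into one array.
      numbers.glom().collect().map(_.toSeq).toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString("[", ",", "]")}")
    }
}
```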
RDD
Main Concepts
● Lazy Evaluation
● Operation: Transformation / Action
● Lineage
● Base RDD
● Partition
● Task
● Level of Parallelism
RDD
Internally, each RDD is characterized by five main properties
● A list of partitions
● A function for computing each split
● A list of dependencies on other RDDs
● Optionally, a Partitioner for key-value RDDs
● Optionally, a list of preferred locations to compute each split on
Method             Location   Input       Output
getPartitions()    Driver     -           [Partition]
compute()          Worker     Partition   Iterable
getDependencies()  Driver     -           [Dependency]
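Most of these properties are visible through public members of an RDD: partitions, dependencies, partitioner and preferredLocations. The sketch below prints them for a small key-value RDD, assuming spark-core on the classpath and a local master; RddPropertiesSketch and the sample data are invented for illustration. The compute function is left out because it needs a running task to be called.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical example that inspects the properties of an RDD from the driver.
object RddPropertiesSketch {
  // Returns (partitions of pairs, partitions of counts,
  //          pairs has a partitioner?, counts has a partitioner?).
  def run(): (Int, Int, Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("rdd-properties").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("AA", 1), ("BB", 1), ("AA", 1)), numSlices = 2)
      // reduceByKey with an explicit partitioner yields a key-value RDD that has one.
      val counts = pairs.reduceByKey(new HashPartitioner(3), (a: Int, b: Int) => a + b)

      println(s"partitions:          ${counts.partitions.length}")
      println(s"dependencies:        ${counts.dependencies}")
      println(s"partitioner:         ${counts.partitioner}")
      println(s"preferred locations: ${counts.preferredLocations(counts.partitions.head)}")

      (pairs.partitions.length, counts.partitions.length,
        pairs.partitioner.isDefined, counts.partitioner.isDefined)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```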
RDD
Creating RDDs
● Text file: a file or set of files (local or distributed)
    val textFile = sc.textFile("README.md")
● Collection: data already in memory
    val input = sc.parallelize(List(1, 2, 3, 4))
● Database: Spark can load data from and write data to a database
    val casRdd = sc.newAPIHadoopRDD(
      job.getConfiguration(),
      classOf[ColumnFamilyInputFormat],
      classOf[ByteBuffer],
      classOf[SortedMap[ByteBuffer, IColumn]])
● Transformation: from another RDD
    val input = rddFather.map(value => value.toString)
RDD
Data Operations
[Diagram: RDD → Transformations → RDD → Action → Value]
3. RDD Operations
Data Operations
Transformations
❏ Create a new dataset from an existing one
❏ Lazily evaluated (a transformed RDD is computed only when an action runs on it)
❏ Examples: filter(), map(), flatMap()
Actions
❏ Return a value to the driver program after a computation on the dataset
❏ Examples: count(), reduce(), take(), collect()
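Lazy evaluation can be observed by counting how often the function passed to map is called. In the sketch below (spark-core assumed on the classpath, local master, LazySketch being a made-up name) an accumulator records those calls: it is still zero after the transformation is declared and only changes once the reduce action runs. The lineage that Spark recorded for the transformed RDD is returned through toDebugString.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example showing that transformations do no work until an action runs.
object LazySketch {
  // Returns (map calls seen before the action, result of the action, lineage description).
  def run(): (Long, Int, String) = {
    val conf = new SparkConf().setAppName("lazy-evaluation").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val mapCalls = sc.longAccumulator("map calls")
      val base = sc.parallelize(1 to 5) // base RDD

      // Transformation: only recorded in the lineage, nothing is computed here.
      val doubled = base.map { x => mapCalls.add(1L); x * 2 }
      val callsBeforeAction = mapCalls.value.longValue

      // Action: triggers the computation and returns a value to the driver.
      val total = doubled.reduce(_ + _)

      (callsBeforeAction, total, doubled.toDebugString)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, total, lineage) = run()
    println(s"map calls before action: $before")
    println(s"sum of doubled values:   $total")
    println(lineage)
  }
}
```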
Transformations
Commonly Used Transformations
map(func): Return a new distributed dataset formed by passing each element of the source through the function func
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
distinct(): Return a new dataset that contains the distinct elements of the source dataset
Transformations
map(func)
Input: {1, 2, 3, 3}
rdd.map(x => x + 1)
Result: {2, 3, 4, 4}
Transformations
filter(func)
Input: {1, 2, 3, 3}
rdd.filter(x => x != 1)
Result: {2, 3, 3}
Transformations
flatMap(func)
Input: {1, 2, 3, 3}
rdd.flatMap(x => x.to(3))
Result: {1, 2, 3, 2, 3, 3, 3}
Transformations
distinct()
Input: {1, 2, 3, 3}
rdd.distinct()
Result: {1, 2, 3}
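The four transformations can be tried together on the same input {1, 2, 3, 3}. This is a sketch that assumes spark-core on the classpath and a local master; TransformationsSketch is a made-up name. The result of distinct() is sorted before it is returned because Spark does not guarantee its order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example applying the common transformations to {1, 2, 3, 3}.
object TransformationsSketch {
  def run(): Map[String, Seq[Int]] = {
    val conf = new SparkConf().setAppName("transformations").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(1, 2, 3, 3))
      // collect() is the action that brings each transformed RDD back to the driver.
      Map(
        "map"      -> rdd.map(x => x + 1).collect().toSeq,
        "filter"   -> rdd.filter(x => x != 1).collect().toSeq,
        "flatMap"  -> rdd.flatMap(x => x.to(3)).collect().toSeq,
        "distinct" -> rdd.distinct().collect().sorted.toSeq
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach { case (name, values) => println(s"$name: ${values.mkString("{", ", ", "}")}") }
}
```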
Transformations
Mathematical set operations
union(otherRDD): Return a new RDD that contains the union of the elements in the source dataset and the argument
intersection(otherRDD): Return a new RDD that contains the intersection of elements in the source dataset and the argument
Transformations
union(otherRDD)
rdd: {1, 2, 3}
other: {3, 4, 5}
rdd.union(other)
Result: {1, 2, 3, 3, 4, 5} (duplicates are kept)
Transformations
intersection(otherRDD)
rdd: {1, 2, 3}
other: {3, 4, 5}
rdd.intersection(other)
Result: {3}
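Both set operations can be checked on the same two RDDs. The sketch assumes spark-core on the classpath and a local master; SetOperationsSketch is a made-up name. Results are sorted on the driver so that they do not depend on partition order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example of union and intersection on {1, 2, 3} and {3, 4, 5}.
object SetOperationsSketch {
  // Returns (sorted union, sorted intersection).
  def run(): (Seq[Int], Seq[Int]) = {
    val conf = new SparkConf().setAppName("set-operations").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(1, 2, 3))
      val other = sc.parallelize(Seq(3, 4, 5))

      // union keeps duplicates: the 3 appears once per source RDD.
      val union = rdd.union(other).collect().sorted.toSeq
      // intersection returns only the elements present in both RDDs, without duplicates.
      val intersection = rdd.intersection(other).collect().sorted.toSeq
      (union, intersection)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If set semantics are wanted for the union, distinct() can be applied to its result.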
Actions
Commonly Used Actions
count(): Returns the number of elements in the dataset
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data
take(n): Returns an array with the first n elements
first(): Returns the first element of the dataset
takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either their natural order or a custom ordering
Actions
count()
Input: {1, 2, 3, 3}
rdd.count()
Result: 4
Actions
reduce(func)
Input: {1, 2, 3, 3}
rdd.reduce((x, y) => x + y)
Result: 9
Actions
collect()
Input: {1, 2, 3, 3}
rdd.collect()
Result: {1, 2, 3, 3}
Actions
take(n)
Input: {1, 2, 3, 3}
rdd.take(2)
Result: {1, 2}
Actions
first()
Input: {1, 2, 3, 3}
rdd.first()
Result: 1
Actions
takeOrdered(n, [ordering])
Input: {1, 2, 3, 3}
rdd.takeOrdered(2)(myOrdering)
Result: {3, 3} (with a myOrdering that sorts in descending order)
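The actions above can be run side by side on {1, 2, 3, 3}. The sketch assumes spark-core on the classpath and a local master; ActionsSketch and the Results case class are made-up names, and the descending myOrdering is an assumption chosen so that takeOrdered(2) yields {3, 3} as on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example running the common actions on {1, 2, 3, 3}.
object ActionsSketch {
  final case class Results(
      count: Long,
      sum: Int,
      all: Seq[Int],
      firstTwo: Seq[Int],
      first: Int,
      topTwo: Seq[Int])

  def run(): Results = {
    val conf = new SparkConf().setAppName("actions").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(1, 2, 3, 3))
      // Assumed custom ordering: descending, so the "smallest" elements are the largest numbers.
      val myOrdering: Ordering[Int] = Ordering[Int].reverse

      Results(
        count = rdd.count(),
        sum = rdd.reduce((x, y) => x + y), // addition is commutative and associative
        all = rdd.collect().toSeq,
        firstTwo = rdd.take(2).toSeq,
        first = rdd.first(),
        topTwo = rdd.takeOrdered(2)(myOrdering).toSeq
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```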
4. Workshop
WORKSHOP
In order to practice the main concepts, please complete the exercises
proposed at our Github repository by clicking the following link:
○ Homework
THANKS!
Any questions?
@datiobd
datio-big-data
Special thanks to Stratio for its theoretical contribution
academy@datiobd.com
