Introduction to Spark
Lambda World
Who are we?
Juan Pedro Moreno
Scala Software Engineer at 47Degrees
@juanpedromoreno
Fran Pérez
Scala Software Engineer at 47Degrees
@FPerezP
Workshop repo: https://github.com/47deg/spark-workshop
Roadmap
‱ Intro Big Data and Spark
‱ Spark Architecture
‱ Resilient Distributed Datasets (RDDs)
‱ Transformations and Actions on Data using RDDs
‱ Overview Spark SQL and DataFrames
‱ Overview Spark Streaming
‱ Spark Architecture and Cluster Deployment
Apache Spark Overview
https://github.com/apache/spark
‱ Fast and general engine for large-scale data processing
‱ Speed
‱ Ease of Use
‱ Generality
‱ Runs Everywhere
http://spark.apache.org
Spark Architecture
Scala, Java, Python, R
Spark SQL, Spark Streaming, MLlib, GraphX
DataFrames API
RDD API
Spark Core
Data sources: Hadoop HDFS, Cassandra, JSON, MySQL, 

Spark Core Concepts
Driver Program (SparkContext)
Cluster Manager: Hadoop YARN, Standalone, Apache Mesos
Worker Node: Executor (Cache, Task, Task)
Worker Node: Executor (Cache, Task, Task)
Spark Core Concepts
‱ Executor: A process launched for an application on a worker node. Each application has its own executors.
‱ Jobs: A parallel computation consisting of one or more stages that is spawned in response to a Spark action.
‱ Stages: Smaller sets of tasks that each job is divided into.
‱ Tasks: A unit of work that will be sent to one executor.
Resilient Distributed Datasets
‱ Immutable.
‱ Partitioned collection.
‱ Operates in parallel.
‱ Customizable.
RDDs - Partitions
‱ A partition is one of the chunks that an RDD is split into; each chunk is sent to a node
‱ The more partitions we have, the more parallelism we get (up to the number of available cores)
‱ Each partition is a candidate to be placed on a different worker node
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
RDD with 4 partitions
RDDs - Partitions
RDD with 8 partitions: P1 P2 P3 P4 P5 P6 P7 P8
Worker Node (Executor): P1, P5
Worker Node (Executor): P2, P6
Worker Node (Executor): P3, P7
Worker Node (Executor): P4, P8
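
The placement in the diagram above can be modelled in a few lines of plain Python. This is only a sketch: the function name `assign_partitions` is hypothetical, and round-robin placement is an assumption chosen to reproduce the picture. Spark's real scheduler also takes data locality and free cores into account.

```python
# Plain-Python model of the diagram: 8 partitions spread over 4 executors.
# Round-robin placement is an assumption made to match the picture,
# not a description of Spark's actual scheduling policy.

def assign_partitions(num_partitions, num_executors):
    """Map each executor index to the partition labels it would hold."""
    assignment = {executor: [] for executor in range(num_executors)}
    for partition in range(num_partitions):
        assignment[partition % num_executors].append(f"P{partition + 1}")
    return assignment


layout = assign_partitions(num_partitions=8, num_executors=4)
for executor, partitions in layout.items():
    print(f"Executor {executor + 1}: {' '.join(partitions)}")
```

With fewer partitions than executors, some executors would stay idle, which is one reason the partition count matters for parallelism.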
RDDs - Operations
Transformations
‱ Lazy operations. They don't compute a value; they return a reference to a new RDD.
Actions
‱ Non-lazy operations. They apply an operation to an RDD and return a value or write data to an external storage system.
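
The difference between the two kinds of operation can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this illustration, not part of Spark: its `map` and `filter` only record what should happen, and `collect` is the action that actually runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source  # underlying data, here a plain list
        self._steps = steps    # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also just records the step.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: applies every recorded step and returns a concrete list.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # record that real work happened
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double).filter(lambda x: x > 4)
work_before_action = len(calls)  # nothing has run yet
result = doubled.collect()       # the action triggers the computation
```

Note that `numbers` is never modified: each transformation returns a new object, which mirrors the immutability of RDDs.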
RDDs - Transformations
A set of some of the most popular Spark transformations:
‱ map
‱ flatMap
‱ filter
‱ groupByKey
‱ reduceByKey
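
What each of these transformations means can be shown on a few of the log lines from the partitions slide, using only plain Python collections. This is a sketch of the semantics, not Spark code: the helper `reduce_by_key` is hypothetical, and everything runs eagerly on a single machine.

```python
from collections import defaultdict

logs = ["Error, ts, msg1", "Warn, ts, msg2", "Error, ts, msg1", "Info, ts, msg8"]

# map: exactly one output element per input element
levels = [line.split(", ")[0] for line in logs]

# flatMap: each input may yield several outputs, flattened into one collection
fields = [field for line in logs for field in line.split(", ")]

# filter: keep only the elements that satisfy a predicate
errors = [line for line in logs if line.startswith("Error")]

# groupByKey: collect all values that share a key
pairs = [(line.split(", ")[0], line.split(", ")[2]) for line in logs]
grouped = defaultdict(list)
for level, message in pairs:
    grouped[level].append(message)


def reduce_by_key(key_value_pairs, combine):
    """Merge the values of each key with a two-argument function."""
    merged = {}
    for key, value in key_value_pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# reduceByKey: count the lines per log level
counts = reduce_by_key([(level, 1) for level in levels], lambda a, b: a + b)
```

In Spark, reduceByKey is generally preferred over groupByKey followed by a reduction, because values are combined inside each partition before any data moves across the network.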
RDDs - Actions
A set of some of the most popular Spark actions:
‱ reduce
‱ collect
‱ foreach
‱ saveAsTextFile
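
The four actions can be modelled in plain Python as well. The sketch below assumes a hypothetical dataset that is already split into two partitions and uses a temporary directory as the output location. It imitates the behaviour only, including the fact that saveAsTextFile writes a directory with one part file per partition.

```python
import os
import tempfile
from functools import reduce

partitions = [[1, 2, 3], [4, 5, 6]]  # hypothetical dataset, already split in two

# collect: bring every element back as one local list
collected = [x for part in partitions for x in part]

# reduce: combine within each partition, then combine the partial results
partials = [reduce(lambda a, b: a + b, part) for part in partitions]
total = reduce(lambda a, b: a + b, partials)

# foreach: run a side effect for each element, returning nothing
seen = []
for part in partitions:
    for x in part:
        seen.append(x)

# saveAsTextFile: one output file per partition inside a directory
out_dir = tempfile.mkdtemp()
for index, part in enumerate(partitions):
    with open(os.path.join(out_dir, f"part-{index:05d}"), "w") as handle:
        handle.write("\n".join(str(x) for x in part) + "\n")
written = sorted(os.listdir(out_dir))
```

Because reduce combines partial results from different partitions, the function passed to it should be associative and commutative.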
Transformations and Actions
With Visual Mnemonics, better.
Thanks to Jeffrey Thompson
‱ http://data-frack.blogspot.com.es/2015/01/visual-mnemonics-for-pyspark-api.html
‱ https://github.com/jkthompson/pyspark-pictures
‱ http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb
Practice - Part 1 && Part 2
Overview Spark SQL and DataFrames
‱ Works with structured and semi-structured data
‱ DataFrame simplifies working with structured data
‱ Read/write structured data such as JSON, Hive tables, Parquet, etc.
‱ SQL inside your Spark App
‱ Better performance and a more powerful operations API
Practice - Part 3
Overview Spark Streaming
‱ Streaming Applications
‱ DStreams or Discretized Streams
‱ Continuous Series of RDDs, grouped by batches
Input sources: Kafka, Flume, HDFS/S3, Kinesis, Twitter
Spark Streaming (receivers) produces batches of input data for Spark Core
Outputs: HDFS, Database, Dashboard
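
The idea of a discretized stream, a continuous series of small batches, can be sketched in plain Python. The function `discretize`, the sample events, and the one-second batch interval are all assumptions made for this illustration. In Spark Streaming each batch would be an RDD processed by Spark Core.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    buckets = {}
    for timestamp, value in events:
        buckets.setdefault(int(timestamp // batch_interval), []).append(value)
    if not buckets:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]

batches = discretize(events, batch_interval=1)

# Each batch is then processed like an ordinary dataset, here a simple count.
sizes = [len(batch) for batch in batches]
```

A shorter batch interval lowers latency but produces more, smaller batches to schedule.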
Resources
‱ Official docs - http://spark.apache.org/docs/latest
‱ Learning Spark - http://shop.oreilly.com/product/0636920028512.do
‱ Databricks Spark Knowledge Base - https://goo.gl/wMy7Se
‱ Community packages for Spark - http://spark-packages.org/
‱ Apache Spark Youtube channel - https://www.youtube.com/user/TheApacheSpark
‱ API through pictures - https://goo.gl/JMDeqJ
‱ 47 Degrees Blog - http://www.47deg.com/blog/tags/spark
Thanks!
47deg.com
Q&A
