Parallelizing Existing R Packages with SparkR
Hossein Falaki
@mhfalaki
About me
• Former Data Scientist at Apple Siri
• Software Engineer at Databricks
• Have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR & the Databricks R Notebook feature
What is SparkR?
An R package distributed with Apache Spark:
- Provides an R frontend to Spark
- Exposes Spark DataFrames (inspired by R and Pandas)
- Convenient interoperability between R and Spark DataFrames

Spark brings distributed, robust processing, data sources, and off-memory data structures; R brings a dynamic environment, interactivity, packages, and visualization.
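A minimal sketch of this interoperability (assuming Spark 2.0+ with the SparkR package on the library path):

library(SparkR)
sparkR.session()

# Local R data.frame -> distributed Spark DataFrame
faithfulDF <- createDataFrame(faithful)
head(filter(faithfulDF, faithfulDF$waiting > 70))

# Spark DataFrame -> local R data.frame
localDF <- collect(faithfulDF)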
SparkR architecture
(Diagram: on the Spark driver, the R process communicates with the driver JVM through the RBackend; the driver JVM coordinates JVM processes on the worker machines, which read from the data sources.)
SparkR architecture (since 2.0)
(Diagram: as before, the driver-side R process talks to the driver JVM through the RBackend; since 2.0, each worker JVM can also spawn R processes, so user functions run next to the data on the workers. Data sources are still read by the worker JVMs.)
Overview of SparkR API
IO: read.df / write.df / createDataFrame / collect
Caching: cache / persist / unpersist / cacheTable / uncacheTable
SQL: sql / table / saveAsTable / registerTempTable / tables
MLlib: glm / kmeans / naïve Bayes / survival regression
DataFrame API: select / subset / groupBy / head / avg / column / dim
UDF functionality (since 2.0): spark.lapply / dapply / gapply / dapplyCollect

http://spark.apache.org/docs/latest/api/R/
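A short sketch touching several of these areas (the file path and column names are hypothetical):

# IO + caching
df <- read.df("/tmp/flights.csv", source = "csv",
              header = "true", inferSchema = "true")
cache(df)

# SQL: register the DataFrame as a table and query it
registerTempTable(df, "flights")
delays <- sql("SELECT origin, avg(delay) AS avg_delay FROM flights GROUP BY origin")

# DataFrame API: the same aggregation without SQL
head(agg(groupBy(df, "origin"), avg(df$delay)))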
SparkR UDF API
• spark.lapply: runs a function over a list of elements (spark.lapply())
• dapply: applies a function to each partition of a SparkDataFrame (dapply(), dapplyCollect())
• gapply: applies a function to each group within a SparkDataFrame (gapply(), gapplyCollect())
spark.lapply
The simplest SparkR UDF pattern. For each element of a list, it:
1. Sends the function to an R worker
2. Executes the function
3. Returns the results of all workers as a list to the R driver

spark.lapply(1:100, function(x) {
  runBootstrap(x)
})
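A self-contained variant of the same pattern (runBootstrap above is a placeholder from the slide; here each task computes a simple simulated statistic):

# Run 100 independent simulations in parallel; the result is an R list on the driver
results <- spark.lapply(1:100, function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
})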
spark.lapply control flow
Participants: the R driver and driver JVM on the driver machine; a JVM and an R worker on each worker machine.

1. The R driver serializes the R closure
2. The closure is transferred over a local socket to the driver JVM
3. The serialized closure is transferred over the network to the worker JVMs
4. Each worker JVM transfers it over a local socket to its R worker
5. The R worker deserializes and executes the closure
6. The R worker serializes the result
7. The result is transferred over a local socket back to the worker JVM
8. The serialized result is transferred over the network to the driver JVM
9. The driver JVM transfers it over a local socket to the R driver
10. The R driver deserializes the result
dapply
For each partition of a Spark DataFrame, it:
1. Collects the partition as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

dapply(sparkDF, func, schema) combines the results as a DataFrame with the provided schema.
dapplyCollect(sparkDF, func) combines the results as an R data.frame.
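A minimal sketch of both forms, using the built-in faithful data set (the schema must describe the rows the function returns):

df <- createDataFrame(faithful)
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("ratio", "double"))

# Add a column to every partition; results stay distributed, shaped by the schema
withRatio <- dapply(df, function(pdf) {
  pdf$ratio <- pdf$eruptions / pdf$waiting
  pdf
}, schema)
head(withRatio)

# Same function, but results are collected to the driver as an R data.frame
localResult <- dapplyCollect(df, function(pdf) {
  pdf$ratio <- pdf$eruptions / pdf$waiting
  pdf
})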
dapply control & data flow
(Diagram: the R driver and driver JVM are connected by a local socket, the driver JVM and worker JVMs by the cluster network, and each worker JVM and its R worker by a local socket. Input data is serialized, transferred, and deserialized on the way to the R workers; result data takes the same ser/de path back into the worker JVMs.)
dapplyCollect control & data flow
(Diagram: same topology and input path as dapply, but the results are transferred all the way back to the driver, where they are deserialized into a local R data.frame.)
gapply
Groups a Spark DataFrame on one or more columns, then for each group:
1. Collects the group as an R data.frame
2. Sends the R function to the R worker
3. Executes the function

gapply(sparkDF, cols, func, schema) combines the results as a DataFrame with the provided schema.
gapplyCollect(sparkDF, cols, func) combines the results as an R data.frame.
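A minimal sketch, grouping the built-in mtcars data by cylinder count (the grouping key is passed to the function alongside each group's data.frame):

df <- createDataFrame(mtcars)
schema <- structType(structField("cyl", "double"),
                     structField("avg_mpg", "double"))

avgMpg <- gapply(df, "cyl", function(key, pdf) {
  data.frame(key, mean(pdf$mpg))   # columns are matched to the schema by position
}, schema)
head(avgMpg)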
gapply control & data flow
(Diagram: same topology as dapply, with one extra step: a data shuffle first co-locates the rows of each group before the input data is serialized and sent to the R workers; result data is serialized and transferred back as in dapply.)
dapply vs. gapply
                          gapply                             dapply
signature                 gapply(df, cols, func, schema)     dapply(df, func, schema)
                          gapply(gdf, func, schema)
user function signature   function(key, data)                function(data)
data partitioning         controlled by grouping             not controlled
Parallelizing data
• Do not use spark.lapply() to distribute large data sets
• Do not pack data in the closure
• Watch for skew in data
– Are partitions evenly sized?
• Auxiliary data
– Can be joined with the input DataFrame (see the sketch below)
– Can be distributed to all the workers
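A sketch of the join approach for auxiliary data (all names here are hypothetical):

# Small reference table as a local R data.frame, parallelized once
lookupDF <- createDataFrame(lookupTable)

# Join with the large input DataFrame instead of packing the table into a closure
joined <- join(inputDF, lookupDF, inputDF$key == lookupDF$key, "left_outer")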
Packages on workers
• SparkR closure capture does not include packages
• You need to load packages on each worker, inside your function
• If a package is not installed, install it on the workers out of band
• spark.lapply() can be used to install packages
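A sketch of both approaches, reusing df and schema from the dapply example above (the package name and list length are arbitrary):

# Load an already-installed package inside the function; library() runs on the worker
result <- dapply(df, function(pdf) {
  library(data.table)   # must already be installed on the worker
  pdf                   # use data.table functions on pdf here
}, schema)

# Out-of-band installation via spark.lapply(); there is no guarantee of exactly
# one task per machine, so use enough list elements to cover all the workers
invisible(spark.lapply(1:32, function(i) {
  install.packages("data.table", repos = "https://cloud.r-project.org")
}))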
Debugging user code
1. Verify your code on the driver
2. Interactively execute the code on the cluster
– When an R worker fails, the Spark driver throws an exception with the R error text
3. Inspect the failure details of the failed job in the Spark UI
4. Inspect the stdout/stderr of the workers
Demo
http://bit.ly/2krYMwC
http://bit.ly/2ltLVKs
Thank you!
Editor's Notes
• #4: Syntax is closely similar to R data frames
• #5, #6: "Worker" refers to the worker machine. Mention that all Spark data sources work.
• #9: Designed for parameter search, e.g.
• #20: Add references