June 2017
Yanbo Liang
Apache Spark committer
Hortonworks
SparkR best practices for R data scientists
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Outline
→ Introduction to R and SparkR.
→ Typical data science workflow.
→ SparkR + R for typical data science problems.
  – Big data, small learning.
  – Partition aggregate.
  – Large scale machine learning.
→ Future directions.
R for data scientists
→ Pros
  – Open source.
  – Rich ecosystem of packages.
  – Powerful visualization infrastructure.
  – Data frames make data manipulation convenient.
  – Taught by many schools to statistics and computer science students.
→ Cons
  – Single-threaded.
  – Everything has to fit in the memory of a single machine.
SparkR = Spark + R
→ An R frontend for Apache Spark, a widely deployed cluster computing engine.
→ Wrappers over DataFrames and DataFrame-based APIs (MLlib).
  – A complete DataFrame API that behaves just like an R data.frame.
  – ML APIs mimic the methods implemented in R or R packages, rather than the Scala/Python APIs.
→ The data frame concept is the cornerstone of both Spark and R.
→ Convenient interoperability between R and Spark DataFrames.
SparkR architecture
Data science workflow
R for Data Science (http://r4ds.had.co.nz/)
Why SparkR + R
→ There are thousands of community packages on CRAN.
  – It is impossible for SparkR to match all existing features.
→ Not every dataset is large.
  – Many people work with small/medium datasets.
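Big data, small learning
[Diagram: Table1 through Table5 are combined by join, reduced by select/where/aggregate/sample, then brought to the local machine with collect for model/analytics. The steps before collect run in SparkR; the steps after it run in R.]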
12 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Data wrangling with SparkR
Operation/Transformation                                     Function
Join different data sources or tables                        join
Pick observations by their value                             filter/where
Reorder the rows                                             arrange
Pick variables by their names                                select
Create new variable with functions of existing variables     mutate/withColumn
Collapse many values down to a single summary                summary/describe
Aggregation                                                  groupBy
13 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Data wrangle
# Read both CSV files as SparkDataFrames, inferring column types.
airlines <- read.df(path="/data/2008.csv", source="csv",
                    header="true", inferSchema="true")
planes <- read.df(path="/data/plane-data.csv", source="csv",
                  header="true", inferSchema="true")
# Join flights to aircraft on the tail number.
joined <- join(airlines, planes, airlines$TailNum == planes$tailnum)
# Keep the columns used later and drop rows with missing values.
df1 <- select(joined, "aircraft_type", "Distance", "ArrDelay", "DepDelay")
df2 <- dropna(df1)
14 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
SparkR performance
15 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Sampling Algorithms
→ Bernoulli sampling (without replacement)
  – df3 <- sample(df2, FALSE, 0.1)
→ Poisson sampling (with replacement)
  – df3 <- sample(df2, TRUE, 0.1)
→ Stratified sampling
  – df3 <- sampleBy(df2, "aircraft_type", list("Fixed Wing Multi-Engine"=0.1, "Fixed Wing Single-Engine"=0.2, "Rotorcraft"=0.3), 0)
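The difference between the two `sample` modes can be illustrated locally. The following is a minimal base-R sketch of the sampling semantics only, not SparkR's implementation; the helper names `bernoulli_sample` and `poisson_sample` and the toy data are hypothetical. Bernoulli sampling keeps each row independently with probability `fraction`, so no row appears twice. Poisson sampling draws a copy count per row from a Poisson distribution with mean `fraction`, so a row may appear more than once.

```r
# Local illustration of the two sampling schemes (not SparkR code).
bernoulli_sample <- function(df, fraction) {
  # Keep each row independently with probability `fraction`.
  keep <- runif(nrow(df)) < fraction
  df[keep, , drop = FALSE]
}

poisson_sample <- function(df, fraction) {
  # Draw how many copies of each row to emit; 0 means the row is dropped.
  copies <- rpois(nrow(df), lambda = fraction)
  df[rep(seq_len(nrow(df)), times = copies), , drop = FALSE]
}

set.seed(42)
toy <- data.frame(id = 1:1000, delay = rnorm(1000))  # made-up rows
s1 <- bernoulli_sample(toy, 0.1)
s2 <- poisson_sample(toy, 0.1)
```

Both samples have an expected size of about `fraction * nrow(df)`, but neither guarantees an exact row count.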
16 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Big data, small learning
Table1
Table2
Table3 Table4 Table5join
select/
where/
aggregate/
sample collect
model/
analytics
SparkR R
17 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Big data, small learning
[Diagram: the same pipeline (join, select/where/aggregate/sample, collect, model/analytics), labeled by data structure: a SparkDataFrame before collect and a data.frame after it.]
18 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Distributed dataset to local
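Once the sampled SparkDataFrame has been brought to the driver (in SparkR this is `collect(df3)`, which returns an ordinary R data.frame), any local modeling function can be applied to it. The sketch below assumes that step has already happened: `local_df` is synthetic data standing in for the collected sample, and the relationship between its columns is made up for illustration.

```r
# Synthetic stand-in for: local_df <- collect(df3)
set.seed(1)
n <- 200
local_df <- data.frame(
  Distance = sample(100:3000, n, replace = TRUE),
  DepDelay = round(rnorm(n, mean = 10, sd = 20))
)
# Made-up relationship so the fit has something to recover.
local_df$ArrDelay <- local_df$DepDelay + rnorm(n, sd = 5)

# From here on this is plain single-machine R.
fit <- lm(ArrDelay ~ DepDelay + Distance, data = local_df)
print(coef(fit))
```

Because the collected result has to fit in the memory of the driver machine, it makes sense to filter, aggregate, or sample in SparkR before collecting.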
19 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Partition aggregate
→ User Defined Functions (UDFs).
  – dapply
  – gapply
→ Parallel execution of function.
  – spark.lapply
20 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
User Defined Functions (UDFs)
→ dapply
→ gapply
21 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
dapply
> schema <- structType(structField("aircraft_type", "string"),
                       structField("Distance", "integer"),
                       structField("ArrDelay", "integer"),
                       structField("DepDelay", "integer"),
                       structField("DepDelayS", "integer"))
> # Add the departure delay in seconds as a fifth column.
> df4 <- dapply(df3, function(x) { cbind(x, DepDelayS = x$DepDelay * 60L) }, schema)
> head(df4)
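`dapply` runs the function once per partition, passing that partition as a local data.frame, and the columns it returns have to match the declared schema. A rough local analogy in base R follows; the two-way `split` is a hypothetical stand-in for Spark partitions and the rows are made up.

```r
# Local analogy only: split() stands in for Spark partitions.
toy <- data.frame(
  aircraft_type = c("Rotorcraft", "Rotorcraft",
                    "Fixed Wing Single-Engine", "Fixed Wing Multi-Engine"),
  DepDelay = c(5L, -2L, 30L, 0L),
  stringsAsFactors = FALSE
)

# The per-partition function: receives a data.frame, returns a data.frame.
add_seconds <- function(x) {
  cbind(x, DepDelayS = x$DepDelay * 60L)
}

# Apply the function to each chunk, then stitch the results together.
partitions <- split(toy, rep(1:2, length.out = nrow(toy)))
result <- do.call(rbind, lapply(partitions, add_seconds))
print(result)
```

Multiplying by `60L` rather than `60` keeps the new column an integer, which is what an "integer" field in the schema expects.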
22 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
gapply
> schema <- structType(structField("Distance", "integer"),
                       structField("MaxActualDelay", "integer"))
> # For each distance, the largest delay accumulated in the air.
> df5 <- gapply(df3, "Distance", function(key, x) {
      data.frame(key, max(x$ArrDelay - x$DepDelay)) }, schema)
> head(df5)
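`gapply` groups rows by the key column and calls the function once per group, with the key and that group's rows as a local data.frame. The base-R sketch below mimics that contract on made-up rows; `local_gapply` is a hypothetical helper, not part of SparkR, and it passes a simplified scalar key.

```r
# Hypothetical local stand-in for gapply(df, col, func, schema).
local_gapply <- function(df, col, func) {
  groups <- split(df, df[[col]])
  pieces <- lapply(names(groups), function(k) {
    g <- groups[[k]]
    func(g[[col]][1], g)  # key value, then the rows of that group
  })
  do.call(rbind, pieces)
}

toy <- data.frame(
  Distance = c(100L, 100L, 250L, 250L, 250L),
  ArrDelay = c(12L, 3L, 40L, -5L, 7L),
  DepDelay = c(10L, 5L, 15L, 0L, 7L)
)

max_actual_delay <- function(key, x) {
  data.frame(Distance = key, MaxActualDelay = max(x$ArrDelay - x$DepDelay))
}

df5_local <- local_gapply(toy, "Distance", max_actual_delay)
print(df5_local)
```

Each call sees only one group's rows, so the function needs no knowledge of how the data is distributed.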
23 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
spark.lapply
→ An ideal way to distribute existing R functionality and packages.
24 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
spark.lapply
# Sequential grid search on a single machine.
for (lambda in c(0.5, 1.5)) {
  for (alpha in c(0.1, 0.5, 1.0)) {
    model <- glmnet(A, b, lambda=lambda, alpha=alpha)
    pred <- predict(model, A)
    c(coef(model), auc(pred, b))
  }
}
25 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
spark.lapply
# One (lambda, alpha) pair per list element; nested c() would flatten them.
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
               c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  lambda <- value[1]
  alpha <- value[2]
  model <- glmnet(A, b, lambda=lambda, alpha=alpha)
  pred <- predict(model, A)
  c(coef(model), auc(pred, b))
}
models <- spark.lapply(values, train)
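One detail worth checking when building the parameter list: in R, nested `c()` calls flatten into a single numeric vector, so each (lambda, alpha) pair needs to be its own element of a `list`. The sketch below builds the grid with `expand.grid` and runs it with base `lapply`, which has the same calling shape as `spark.lapply(list, func)`. `toy_train` is a hypothetical stand-in for the glmnet fit and only echoes its inputs.

```r
# Nested c() flattens: this is a length-4 numeric vector, not two pairs.
flat <- c(c(0.5, 0.1), c(0.5, 0.5))

# Build every (lambda, alpha) combination, one list element per pair.
grid <- expand.grid(lambda = c(0.5, 1.5), alpha = c(0.1, 0.5, 1.0))
values <- lapply(seq_len(nrow(grid)), function(i) {
  c(grid$lambda[i], grid$alpha[i])
})

# Hypothetical stand-in for the model fit; it only echoes its inputs.
toy_train <- function(value) {
  lambda <- value[1]
  alpha <- value[2]
  c(lambda = lambda, alpha = alpha, penalty = lambda * alpha)
}

# Locally: lapply. On a cluster the equivalent call would be
# spark.lapply(values, toy_train).
models <- lapply(values, toy_train)
```

Developing the function with `lapply` on a small list first makes it easier to debug before distributing it.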
26 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
spark.lapply
[Diagram: a driver holding lambda = c(0.5, 1.5) and alpha = c(0.1, 0.5, 1.0), surrounded by six executors.]
27 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
spark.lapply
[Diagram: the driver hands one (lambda, alpha) pair to each of six executors: (0.5, 0.1), (0.5, 0.5), (0.5, 1.0), (1.5, 0.1), (1.5, 0.5), (1.5, 1.0).]
28 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Virtual environment
[Diagram: the driver and each of six executors, all labeled with the glmnet package.]
29 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Virtual environment
# packagesDir is a local directory chosen by the user.
download.packages("glmnet", packagesDir,
                  repos = "https://cran.r-project.org")
filename <- list.files(packagesDir, "^glmnet")
packagesPath <- file.path(packagesDir, filename)
# Ship the downloaded package file to every executor.
spark.addFile(packagesPath)
30 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Virtual environment
# One (lambda, alpha) pair per list element; nested c() would flatten them.
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
               c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  # Install glmnet on the executor from the file shipped with spark.addFile.
  path <- spark.getSparkFiles(filename)
  install.packages(path, repos = NULL, type = "source")
  library(glmnet)
  lambda <- value[1]
  alpha <- value[2]
  model <- glmnet(A, b, lambda=lambda, alpha=alpha)
  pred <- predict(model, A)
  c(coef(model), auc(pred, b))
}
models <- spark.lapply(values, train)
31 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Large scale machine learning
32 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Large scale machine learning
33 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Large scale machine learning
> model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type,
family = "gaussian", data = df3)
> summary(model)
34 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Outline
à Introduction to	R	and SparkR.
à Typical data science workflow.
à SparkR + R for typical data science problem.
– Big data, small learning.
– Partition aggregate.
– Large scale machine learning.
à Future directions.
35 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Future directions
→ Improve collect/createDataFrame performance in SparkR (SPARK-18924).
→ More scalable machine learning algorithms from MLlib.
→ Better R formula support.
→ Improve UDF performance.