Scaling SparkR in Production.
Lessons from the Field.
Heiko Korndorf
Wireframe, CEO & Founder
About me
Heiko Korndorf
• CEO & Founder Wireframe
• MS in Computer Science
• Application Areas: ERP, CRM, BI, EAI
• Serving companies in
• Manufacturing
• Telecommunications
• Financial Services
• Utilities
• Oil & Gas
• Professional Services
Rapid Application Development
for Hadoop/Spark
Test Data Generation/Simulation
What we’ll talk about
How to classify this talk:
• Data Science: Scaling your R application with SparkR
• Data Engineering: How to bring Data Science applications into
your production pipelines, i.e. adding R to your toolset.
• Management: Integrating Data Science and Data Engineering with
SparkR
Agenda
• SparkR Architecture 1.x/2.x
• Reference Projects I + II
• Approach with Spark 1.5/1.6
• Parallelization via YARN
• Dynamic R Deployment, incl. dependencies/packages
• Approach with Spark 2.0
• Parallelization via SparkR
• R-Graphics: headless environment, concurrency
• Use Spark APIs: SQL, MLlib
• On-Prem vs Cloud (Elasticity/decouple storage and compute)
• Integrating Data Science and Data Engineering
• A Broader Look at the Ecosystem
• Outlook and Next Steps
Data Science with R
• Very popular language
• Designed by statisticians
• Large community
• > 10,000 packages
• plus: integrated package management
• But: Limited as Single-Node platform
• Data has to fit in memory
• Limited concurrency for processing
SparkR Projects
SparkR as seen from R
• Import SparkR-package and initialize SparkSession
• Convert data frames from local R data frames to Spark DataFrame and back
• Read and write data stored in Hadoop HDFS, HBase, Cassandra, and more
• Use Spark Libraries, such as SparkSQL and ML
• Use cluster hardware to distribute data frames and parallelize computation
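The steps listed on this slide can be sketched in a few lines of SparkR 2.x. This is a minimal illustration, not code from the talk: the `local[*]` master, the app name, the use of the built-in `faithful` data set, and the temporary output path are all assumptions chosen so the example runs on a single machine.

```r
library(SparkR)

# Start (or reuse) a Spark session. "local[*]" is an assumption for a
# single-machine test; on a cluster the master is usually set by spark-submit.
sparkR.session(master = "local[*]", appName = "sparkr-intro")

# Local R data frame -> distributed Spark DataFrame
sdf <- as.DataFrame(faithful)

# Transformations on the Spark DataFrame are executed by Spark, not by local R
long_eruptions <- filter(sdf, sdf$eruptions > 3)

# Spark DataFrame -> local R data frame (pulls the result to the driver)
local_df <- collect(long_eruptions)
str(local_df)

# Write to and read back from storage. The path is hypothetical; on a cluster
# it would typically be an HDFS location.
out_path <- file.path(tempdir(), "faithful_parquet")
write.df(long_eruptions, path = out_path, source = "parquet", mode = "overwrite")
restored <- read.df(out_path, source = "parquet")
restored_rows <- count(restored)

sparkR.session.stop()
```

Note that `collect()` brings the whole result into the driver's memory, so it is best applied after filtering or aggregating.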
SparkR Architecture
• Execute R on cluster
• Data Integration
• Spark DataFrame – R data frame
• Access Big Data File Formats
• Parallelization with UDFs
• Use Spark APIs
• SparkSQL
• Spark MLlib
SparkSQL from R
• Execute SQL against
Spark DataFrame
• SELECT
• Specify Projection
• WHERE
• Filter criteria
• GROUP BY
• Group/Aggregate
• JOIN
• Join tables
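The four SQL clauses above can be combined in one query issued from R. The following is a sketch under assumptions: the table names `cars` and `cyl_labels`, the lookup table, and the use of the built-in `mtcars` data are invented for illustration and do not come from the talk.

```r
library(SparkR)

sparkR.session(master = "local[*]", appName = "sparksql-from-r")

# Register a Spark DataFrame as a temporary view so SQL can refer to it by name
cars <- as.DataFrame(mtcars)
createOrReplaceTempView(cars, "cars")

# A small, hypothetical lookup table to demonstrate JOIN
labels <- as.DataFrame(data.frame(
  cyl = c(4, 6, 8),
  label = c("small", "medium", "large"),
  stringsAsFactors = FALSE
))
createOrReplaceTempView(labels, "cyl_labels")

# SELECT (projection), WHERE (filter), GROUP BY (aggregate), JOIN (combine tables)
result <- sql("
  SELECT l.label, COUNT(*) AS n, AVG(c.mpg) AS avg_mpg
  FROM cars c
  JOIN cyl_labels l ON c.cyl = l.cyl
  WHERE c.hp > 100
  GROUP BY l.label
")

# The query result is itself a Spark DataFrame; collect() makes it a local one
local_result <- collect(result)
print(local_result)

sparkR.session.stop()
```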
Native Spark ML
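As a rough illustration of what using Spark's native ML from R can look like, the sketch below fits a generalized linear model with `spark.glm`, which SparkR provides from version 2.0 on. The data set (`mtcars`) and the formula are assumptions for demonstration only; the talk does not state which models were used here.

```r
library(SparkR)

sparkR.session(master = "local[*]", appName = "native-spark-ml")

training <- as.DataFrame(mtcars)

# Fit a Gaussian GLM inside Spark; the formula is illustrative, not tuned
model <- spark.glm(training, mpg ~ wt + hp, family = "gaussian")
print(summary(model))

# predict() returns a Spark DataFrame with an added "prediction" column
predictions <- predict(model, training)
local_pred <- collect(select(predictions, "mpg", "prediction"))
head(local_pred)

sparkR.session.stop()
```

The model is trained by Spark on the distributed data, so the training set does not have to fit into the memory of a single R process.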
Time Series Forecasting
• Time Series: a series of data points indexed in time order
• Methods:
• Exponential Smoothing
• Neural Networks
• ARIMA
• ARIMA(p,d,q)
• AR: p = order of the autoregressive part
• I: d = degree of first differencing involved
• MA: q = order of the moving average part
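To make the (p,d,q) notation concrete, here is a minimal sketch using `arima()` from base R's stats package. The data set (`AirPassengers`), the log transform, and the order (1,1,1) are assumptions for illustration; the talk does not say which package or orders were used in the projects.

```r
# AirPassengers is a monthly time series that ships with base R
y <- log(AirPassengers)

# ARIMA(p = 1, d = 1, q = 1): one autoregressive term, first differencing,
# one moving-average term. The orders are illustrative, not tuned.
fit <- arima(y, order = c(1, 1, 1))

# Forecast 12 months ahead; predict() returns point forecasts ($pred)
# and their standard errors ($se)
fc <- predict(fit, n.ahead = 12)

# Undo the log transform to get back to passenger counts
forecast_passengers <- exp(fc$pred)
print(round(forecast_passengers))
```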
“Pedestrian” Challenges
Installing R (+Pkg’s) on cluster
• Modify some Spark and R (custom-build)
• Submit Spark job with R (incl. packages) as YARN dependency
• Challenge: R not installed on cluster
• R’s installation location is hard-coded in R
Managing Non-Tabular Output
• “R Markdown” produces HTML, PDF, and more
• Complex objects (.RDS) for metadata, KPIs, etc.
• Producing additional output during run
• Creating graphics in headless environments
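Two of the output challenges above, graphics without a display and complex result objects, can be handled with base R alone. The sketch below is one possible approach, not necessarily the one used in the projects: it writes the plot to a file-based `pdf()` device, which needs no X11 display, and stores run metadata with `saveRDS()`. The directory, file names, and the contents of `run_info` are hypothetical.

```r
# Hypothetical output directory for one job run
out_dir <- file.path(tempdir(), "run_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# pdf() is a file-based graphics device, so it works on headless cluster nodes
plot_file <- file.path(out_dir, "passengers.pdf")
pdf(plot_file, width = 7, height = 5)
plot(AirPassengers, main = "Monthly passengers")
dev.off()

# Non-tabular results (metadata, KPIs) as a serialized R object
run_info <- list(
  model = "ARIMA(1,1,1)",           # illustrative value
  n_obs = length(AirPassengers),
  created = Sys.time()
)
rds_file <- file.path(out_dir, "run_info.rds")
saveRDS(run_info, rds_file)

# Later, for example in a reporting step, the object can be restored as-is
restored <- readRDS(rds_file)
```

When many tasks run concurrently, giving each task its own output file name avoids collisions between graphics devices and result files.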
Parallelization with SparkR 1.x
• Sequential computation: > 20 hrs.
• Single-Server, parallelized: > 4.5 hrs
• SparkR 1.6.2, 25 nodes, 4 cores: ca. 12 mins.
Microsoft R Server for Spark
• Microsoft R Server for HDInsight
integrates Spark and R
• Based on Revolution Analytics
• UDFs via rxExec()
• Data Sources
• RxXdfFile
• RxTextFile
• RxHiveData
• RxParquetData
Parallelization with SparkR 2.x
Support for User-Defined Functions
• dapply (dapplyCollect)
• input: DataFrame, func [, Schema]
• output: DataFrame
• gapply (gapplyCollect)
• input: DataFrame | GroupedData, groupBy, func [, Schema]
• output: DataFrame
• spark.lapply
• input: parameters, func
• Access to data/HDFS
• output: List
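The three UDF entry points differ mainly in what the R function receives. The sketch below shows one call of each under assumptions: the `mtcars` data, the derived column `kpl`, the per-group regression, and the list of ARIMA orders are invented for illustration and are not the workloads from the reference projects.

```r
library(SparkR)

sparkR.session(master = "local[*]", appName = "sparkr-udfs")

sdf <- as.DataFrame(mtcars[, c("cyl", "mpg", "wt")])

# dapply: func receives each partition as a local data.frame and must return
# a data.frame that matches the declared output schema.
schema_d <- structType(
  structField("cyl", "double"),
  structField("mpg", "double"),
  structField("wt", "double"),
  structField("kpl", "double")
)
with_kpl <- dapply(sdf, function(part) {
  part$kpl <- part$mpg * 0.425144   # miles per gallon -> km per litre
  part
}, schema_d)

# gapply: func receives the grouping key and all rows of that group.
schema_g <- structType(
  structField("cyl", "double"),
  structField("slope", "double")
)
slopes <- gapply(sdf, "cyl", function(key, grp) {
  fit <- lm(mpg ~ wt, data = grp)
  data.frame(cyl = key[[1]], slope = coef(fit)[["wt"]])
}, schema_g)

# spark.lapply: distributes a list of parameters, runs func once per element
# on the executors, and returns an ordinary R list.
orders <- list(c(0, 1, 1), c(1, 1, 0), c(1, 1, 1))
aics <- spark.lapply(orders, function(ord) {
  AIC(arima(log(AirPassengers), order = ord))
})

local_kpl <- collect(with_kpl)
local_slopes <- collect(slopes)

sparkR.session.stop()
```

`dapplyCollect` and `gapplyCollect` work the same way but return a local R data frame directly and take no schema argument.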
Cultural Integration
The (Data) Science Process
Public Perception of Science vs. Science in Reality
Source: Birth of a Theorem – with Cedric Villani (https://www.youtube.com/watch?v=yYwydG_aHPE)
Integrating Dev and Prod
• No Need to Re-Write Applications
for Production
• Common Environment for
Development, Test and Production
• “Looks like R to Data Science,
looks like Spark to Data
Engineers”
• Oozie-SparkAction vs ShellAction
• Prepare Dev-/Prod-Environment
2-Level Parallelization
(1) Submit multiple jobs to your cluster:
- Cluster Manager (YARN, Spark, Mesos)
- Spark Job: Driver and Executors
(2) Use GPGPU
- Spark Job: Driver and Executor
- Let Executor use GPGPU
(3) Combine 1 and 2
Mix Scala and R
• Call R from Scala
• Add DataScience Module to
your Spark Application
• Use Spark/Scala for ETL, R for
Science code
• Call Spark from R
• Implement high-performance
code in Spark
• More granular control over
cluster resources
SparkR: A Dynamic Ecosystem
Hadoop, Spark & R: Many interesting projects and options
• SparkR (Apache, Databricks)
• R Server for Spark (Microsoft)
• sparklyr (RStudio)
• Oracle R for Analytics, FastR (Oracle)
• SystemML (IBM)
• Renjin (BeDataDriven)
Outlook & Misc
• Organizational: Deepen Integration of Data Engineering & Data Science
• Source Code Control & Versioning (git …)
• Continuous Build
• Test Management (RUnit, testthat…?)
• Multi-Output (Rmarkdown)
• Technical: New Approaches
• Simplify/Unify Data Pipelines (SparkSQL)
• Performance Improvement: use MLlib
• Performance Improvement: move calculation to GPU
Thank You.
Heiko Korndorf
heiko.korndorf@wireframe.li

Using SparkR to Scale Data Science Applications in Production. Lessons from the Field
