Deep Learning on Apache®
Spark™: Workflows and Best
Practices
Tim Hunter (Software Engineer)
Jules S. Damji (Spark Community Evangelist)
May 4, 2017
Agenda
• Logistics
• Databricks Overview
• Deep Learning on Apache® Spark™: Workflows and Best
Practices
• Q & A
Logistics
• We can’t hear you…
• Recording will be available...
• Slides and Notebooks will be available...
• Queue up Questions ….
• Orange Button for Tech Support difficulties...
VISION: Empower anyone to innovate faster with big data.
WHO WE ARE: Founded by the creators of Apache Spark.
Contributes 75% of the open source code, 10x more than any other company.
PRODUCT: A data processing platform for data scientists, data engineers, and data
analysts that simplifies data integration, real-time experimentation,
machine learning, and deployment of production pipelines.
A New Paradigm
FIRST GENERATION: Data warehouses
ETL process is rigid, scaling out is expensive, limited to SQL
SECOND GENERATION: Hadoop + data lake
Hard to centralize data and extract value with disparate tools
THE BEST OF BOTH WORLDS: Virtual analytics
• Holistically analyze data from data warehouses, data lakes,
and other data stores
• Utilize a single engine for batch,
ML, streaming & real-time queries
• Enable enterprise-wide collaboration
VIRTUAL ANALYTICS PLATFORM
• Cluster tuning & management
• Interactive workspace
• Production pipeline automation
• Optimized data access
• Databricks enterprise security
YOUR TEAMS: Data Science, Data Engineering, BI Analysts, many others…
YOUR DATA: Cloud Storage, Data Warehouses, Data Lake
Deep Learning on Apache®
Spark™: Workflows and Best
Practices
Tim Hunter (Software Engineer)
May 4, 2017
About Me
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Author of TensorFrames and GraphFrames
Deep Learning and Apache Spark
Deep Learning frameworks w/ Spark bindings
• Caffe (CaffeOnSpark)
• Keras (Elephas)
• MXNet
• Paddle
• TensorFlow (TensorFlowOnSpark, TensorFrames)
Extensions to Spark for specialized hardware
• Blaze (UCLA & Falcon Computing Solutions)
• IBM Conductor with Spark
Native Spark
• BigDL
• DeepDist
• DeepLearning4J
• MLlib
• SparkCL
• SparkNet
Deep Learning and Apache Spark
2016: the year of emerging solutions for Spark + Deep Learning
No consensus
• Many approaches for libraries: integrate existing ones with Spark, build on
top of Spark, modify Spark itself
• Official Spark MLlib support is limited (perceptron-like networks; see the sketch below)
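For reference, a minimal sketch of that built-in support (PySpark, Spark 2.x); the data path, layer sizes, and the existing SparkSession named spark are illustrative assumptions:
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Placeholder LIBSVM file; layers = [input features, hidden, hidden, output classes]
train = spark.read.format("libsvm").load("data/multiclass_sample.txt")
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 8, 3], maxIter=100, seed=42)
model = mlp.fit(train)
predictions = model.transform(train)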
One Framework to Rule Them All?
Should we look for The One Deep Learning Framework?
Databricks’ perspective
• Databricks: hosted Spark platform on public cloud
• GPUs for compute-intensive workloads
• Customers use many Deep Learning frameworks: TensorFlow, MXNet, BigDL,
Theano, Caffe, and more
This talk
• Lessons learned from supporting many Deep Learning frameworks
• Multiple ways to integrate Deep Learning & Spark
• Best practices for these integrations
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
ML is a small part of data pipelines.
Hidden Technical Debt in Machine Learning Systems,
Sculley et al., NIPS 2015
DL in a data pipeline: Training
Data collection → ETL → Featurization → Deep Learning → Validation → Export, Serving
• IO-intensive stages (ETL, featurization; validation, export/serving): large cluster, high memory/CPU ratio
• Compute-intensive stage (Deep Learning): small cluster, low memory/CPU ratio
DL in a data pipeline: Transformation
Specialized data transforms: feature extraction & prediction
[Figure: example inputs (animal photos) mapped to predicted outputs — cat, dog, dog. Image credit: Saulius Garalevicius, CC BY-SA 3.0]
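A minimal sketch of this transformation pattern: a pre-trained model applied row by row to a DataFrame of images. load_my_model, the predict call, and the images_df DataFrame are hypothetical placeholders, not a specific framework's API.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

_model = None

def predict_label(image_bytes):
    # Load the (hypothetical) model once per Python worker process, then reuse it.
    global _model
    if _model is None:
        _model = load_my_model()
    return _model.predict(image_bytes)  # hypothetical: returns "cat", "dog", ...

predict_udf = udf(predict_label, StringType())
labeled = images_df.withColumn("label", predict_udf("image"))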
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
Recurring patterns
Spark as a scheduler (see the sketch after this list)
• Data-parallel tasks
• Data stored outside Spark
Embedded Deep Learning transforms
• Data-parallel tasks
• Data stored in DataFrames/RDDs
Cooperative frameworks
• Multiple passes over data
• Heavy and/or specialized communication
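A minimal sketch of the "Spark as a scheduler" pattern: each task trains and evaluates one configuration independently, and the data is read directly from external storage by the worker. train_and_evaluate and the S3 path are hypothetical placeholders.
# Each Spark task runs one independent training job; data lives outside Spark.
configs = [{"lr": lr, "layers": n} for lr in (0.1, 0.01) for n in (2, 3)]

def run_one(config):
    score = train_and_evaluate(config, data_path="s3://my-bucket/train/")  # hypothetical
    return (config, score)

results = sc.parallelize(configs, len(configs)).map(run_one).collect()
best_config, best_score = max(results, key=lambda pair: pair[1])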
Streaming data through DL
Primary storage choices:
• Cold layer (HDFS/S3/etc.)
• Local storage: files, Spark’s on-disk persistence layer
• In memory: Spark RDDs or Spark DataFrames
Find out if you are I/O constrained or processor-constrained
• How big is your dataset? MNIST or ImageNet?
If using PySpark:
• All frameworks are heavily optimized for disk I/O
• Use Spark's broadcast for small datasets that fit in memory (sketch below)
• Reading files is fast: use local files when the data does not fit in memory
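A minimal sketch of the broadcast advice above, assuming an MNIST-sized dataset and a hypothetical load_mnist helper that returns in-memory arrays:
# Broadcast the small dataset once instead of re-reading it in every task.
small_data = load_mnist("/local/path/mnist.npz")  # hypothetical loader
data_bc = sc.broadcast(small_data)

def train_partition(_):
    images, labels = data_bc.value  # served from executor memory, no re-read
    # ... run local training steps with the DL framework of choice ...
    return []

sc.parallelize(range(16), 16).mapPartitions(train_partition).count()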
Cooperative frameworks
• Use Spark for data input
• Examples:
• IBM GPU efforts
• Skymind’s DeepLearning4J
• DistML and other Parameter Server efforts
[Diagram: input RDD (partitions 1…n) → black-box framework → output RDD (partitions 1…m)]
Cooperative frameworks
• Bypass Spark for asynchronous / specific communication
patterns across machines
• Lose benefit of RDDs and DataFrames and
reproducibility/determinism
• But these guarantees are usually not needed anyway when doing
deep learning (stochastic gradient descent)
• "Reproducibility is worth a factor of 2" (Léon Bottou, quoted by
John Langford)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
The GPU software stack
• Deep Learning commonly used with GPUs
• A lot of work on Spark dependencies:
• Few dependencies on local machine when compiling Spark
• The build process works well in a large number of configurations (just Scala +
Maven)
• GPUs present challenges: CUDA, support libraries, drivers, etc.
• Deep software stack, requires careful construction (hardware + drivers + CUDA
+ libraries)
• Users expect all of these to just work
• Turnkey stacks just starting to appear
• Provide a Docker image with all the GPU SDK
• Pre-install GPU drivers on the instance
The GPU software stack (bottom to top), typically packaged in a container (nvidia-docker, lxc, etc.):
• GPU hardware
• Linux kernel + NV kernel driver
• NV kernel driver (userspace interface)
• CUDA
• CuBLAS / CuDNN
• Deep learning libraries (TensorFlow, etc.) / JCUDA
• Python / JVM clients
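A small sanity-check sketch for this stack: run one task per executor and shell out to nvidia-smi to confirm the GPUs are actually visible (assumes nvidia-smi is installed in the executor environment).
import socket
import subprocess

def list_gpus(_):
    # Report which GPUs (if any) this executor can see.
    try:
        gpus = subprocess.check_output(["nvidia-smi", "-L"]).decode()
    except Exception as e:
        gpus = "no GPU visible: %s" % e
    return [(socket.gethostname(), gpus)]

for host, gpus in sc.parallelize(range(8), 8).mapPartitions(list_gpus).collect():
    print(host, gpus)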
Using GPUs through PySpark
• Popular choice for many independent tasks
• Many DL packages have Python interfaces: TensorFlow,
Theano, Caffe, MXNet, etc.
• Lifetime of Python packages: the worker process (see the sketch below)
• Requires some configuration tweaks in Spark
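Because the package's lifetime is the worker process, expensive setup should happen once per partition rather than once per record. A hedged sketch in TensorFlow 1.x style; load_graph, run_inference, and the images_rdd RDD are hypothetical placeholders.
def score_partition(rows):
    import tensorflow as tf  # the import is cached for the lifetime of the worker process
    graph = load_graph()     # hypothetical: load a frozen model once per partition
    with tf.Session(graph=graph) as sess:
        for row in rows:
            yield run_inference(sess, row)  # hypothetical per-record inference

predictions = images_rdd.mapPartitions(score_partition).collect()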
PySpark recommendation
• spark.executor.cores = 1 (see the sketch below)
• Gives the DL framework full access to all the resources
• Important for frameworks that optimize processor pipelines
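A minimal sketch of applying this setting when the session is created; the application name is an illustrative placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dl-on-spark")               # placeholder name
         .config("spark.executor.cores", "1")  # one task per executor
         .getOrCreate())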
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
Monitoring
?
Monitoring
• How do you monitor the progress of your tasks?
• It depends on the granularity
• Around tasks
• Inside (long-running) tasks
Monitoring: Accumulators
• Good to check throughput
or failure rate
• Works for Scala
• Limited use for Python
(for now, SPARK-2868)
• No “real-time” update
batchesAcc = sc.accumulator(0)

def processBatch(i):
    global batchesAcc
    batchesAcc += 1
    # Process image batch here

images = sc.parallelize(...)  # placeholder for the image batches
images.map(processBatch).collect()
Monitoring: external system
• Plugs into an external system
• Existing solutions: Grafana, Graphite, Prometheus, etc.
• Most flexible, but more complex to deploy (see the sketch below)
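A minimal sketch of reporting per-batch progress to an external system from inside tasks, using Graphite's plaintext protocol; the host, port, and metric name are placeholders, and images is the RDD from the accumulator example.
import socket
import time

def process_and_report(batches):
    sock = socket.create_connection(("graphite.example.com", 2003))  # placeholder host/port
    for batch in batches:
        # ... process the image batch here ...
        sock.sendall(("dl.batches_done 1 %d\n" % int(time.time())).encode())
        yield batch
    sock.close()

images.mapPartitions(process_and_report).count()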
Conclusion
• Distributed deep learning: exciting and fast-moving space
• Most insights are specific to a task, a dataset and an algorithm:
nothing replaces experiments
• Get started with data-parallel jobs
• Move to cooperative frameworks only when your data are too large.
Challenges to address
For Spark developers
• Monitoring long-running tasks
• Presenting and introspecting intermediate results
For DL developers
• What boundary to put between the algorithm and Spark?
• How to integrate with Spark at the low-level?
Resources
Recent blog posts — http://databricks.com/blog
• TensorFrames
• GPU acceleration
• Getting started with Deep Learning
• Intel’s BigDL
Docs for Deep Learning on Databricks — http://docs.databricks.com
• Getting started
• Spark integration
SPARK SUMMIT 2017
DATA SCIENCE AND ENGINEERING AT SCALE
JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO
ORGANIZED BY spark-summit.org/2017
Thank You!
Questions?
Happy Sparking & Deep Learning!