Build, Scale, and Deploy Deep
Learning Pipelines with Ease
Using Apache Spark
Tim Hunter (Software Engineer)
Sue Ann Hong (Software Engineer)
Spark Meetup - August 22nd, 2017
About Us
• Sue Ann Hong
• Software engineer @ Databricks
• Ph.D. from CMU in Machine Learning
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
Today
• Deep Learning at scale made easy: the vision
• Processing images with DL Pipelines
• Building simple Deep Learning models with transfer learning
• Model deployment via SQL
More advanced topics will be covered during the Q&A and other
meetups.
Deep Learning with ease
What is Deep Learning?
• A set of machine learning techniques that use layers that
transform numerical inputs
• Classification
• Regression
• Arbitrary mapping
• Popular in the 1980s under the name Neural Networks
• Recently came back thanks to advances in data collection,
computation techniques, and hardware.
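To make "layers that transform numerical inputs" concrete, here is a minimal sketch of a two-layer forward pass in plain Python. The weights are hypothetical, hand-picked values chosen for illustration; a real network would learn them during training.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per output."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linearity applied between layers: negative values become zero."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a single number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def forward(inputs):
    # Hypothetical weights and biases, not taken from any trained model.
    hidden = relu(dense(inputs, [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]))
    output = dense(hidden, [[1.0, 2.0]], [-0.5])
    return sigmoid(output[0])


# Two numerical inputs go in, one classification-style score comes out.
score = forward([2.0, 1.0])
print(round(score, 4))
```

Each layer is just arithmetic on a list of numbers; stacking several of them is what makes the model "deep".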
Success of Deep Learning
• Tremendous success for applications with complex data
• AlphaGo
• Image interpretation
• Automatic translation
• Speech recognition
But still requires a lot of effort
• Low-level APIs with a steep learning curve
• Tedious to distribute computations
• Not well integrated with other enterprise tools
• No exact science around deep learning
• Success requires many engineer-hours
Deep Learning in industry
• Currently limited adoption
• Huge potential beyond the industrial giants
• How do we accelerate the road to massive availability?
A typical Deep Learning workflow
• Load data (images, text, time series, …)
• Interactive work
• Train
• Select an architecture for a neural network
• Optimize the weights of the NN
• Evaluate results, potentially re-train
• Apply:
• Pass the data through the NN to produce new features or output
How can Spark help?
• A lot of libraries available for Deep Learning in Spark
• TensorFlowOnSpark, BigDL, …
• Goes from simple to very advanced
• See our previous meetup talks for more detail
• Spark is great at scaling out computations
• Distribute the transforms
• Manage the training computation
• Spark MLlib Pipelines
• Simple, concise API to capture the ML workflow
Deep Learning Pipelines:
Deep Learning with Simplicity
• Open-source Databricks library:
https://github.com/databricks/spark-deep-learning
• Focuses on ease of use and integration, without sacrificing
performance
• Scales out common tasks
• Integrates with Spark APIs
• Primary language: Python
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
• Image loading in Spark
• Deploying models in SQL
• Transfer learning
• Distributed tuning
• Distributed prediction
• Pre-trained models
This talk: ✓ ✓ ✓ ✓
Image processing with DL
Pipelines and Databricks
Adds support for images in Spark
• ImageSchema, reader, conversion functions to/from numpy
arrays
• Most of the tools we’ll describe work on ImageSchema columns
from sparkdl import readImages
# sample_img_dir: path to a directory of image files
image_df = readImages(sample_img_dir)
Applying popular models
• Popular pre-trained models accessible through MLlib
Transformers
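from sparkdl import DeepImagePredictor

# Wrap a pre-trained InceptionV3 model as an MLlib Transformer
predictor = DeepImagePredictor(inputCol="image",
                               outputCol="predicted_labels",
                               modelName="InceptionV3")
predictions_df = predictor.transform(image_df)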
Fast model training via
transfer learning
Example: Identify the James Bond cars
DEMO
Transfer Learning
SoftMax
GIANT PANDA 0.9
RED PANDA 0.05
RACCOON 0.01
…
Classifier
Transfer Learning
DeepImageFeaturizer
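The SoftMax step on this slide turns the raw scores of the last layer into class probabilities. A minimal sketch in plain Python follows; the raw scores are hypothetical values made up for illustration and are not the ones behind the slide's numbers.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three of the labels shown on the slide.
labels = ["giant panda", "red panda", "raccoon"]
probs = softmax([5.0, 2.0, 0.5])

# The predicted label is the one with the highest probability.
best_label = labels[probs.index(max(probs))]
print(best_label, [round(p, 3) for p in probs])
```

In transfer learning, this final classifier is the part that gets replaced, while the layers before it are reused as a featurizer.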
MLlib primer
• MLlib: the machine learning library included with Spark
• Transformer
• Transforms the data: takes a Spark dataframe and appends a new column
• Estimator
• Produces a model (fit)
• Pipeline: sequence of transformers and estimators
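The three roles above can be sketched with a plain-Python analogy. This is not the MLlib API: rows are ordinary dictionaries instead of a Spark dataframe, and every class name below is hypothetical, chosen only to mirror the Transformer / Estimator / Pipeline vocabulary.

```python
class AddLength:
    """Transformer: appends a 'length' column computed from 'text'."""

    def transform(self, rows):
        return [dict(row, length=len(row["text"])) for row in rows]


class ThresholdModel:
    """Transformer produced by fitting: appends a 'prediction' column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, prediction=int(row["length"] > self.threshold))
                for row in rows]


class MeanThreshold:
    """Estimator: learns a threshold (the mean length) and returns a model."""

    def fit(self, rows):
        mean = sum(row["length"] for row in rows) / len(rows)
        return ThresholdModel(mean)


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Sequence of transformers and estimators, fitted stage by stage."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data seen so far; transformers pass through.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"text": "ab"}, {"text": "abcdef"}]
model = Pipeline([AddLength(), MeanThreshold()]).fit(train)
result = model.transform([{"text": "abcdefgh"}, {"text": "a"}])
print(result)
```

The point of the design is that fitting a pipeline yields another transformer, so a trained workflow can be applied to new data with a single `transform` call.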
Transfer Learning as a Pipeline
MLlib Pipeline: Image Loading, Preprocessing, DeepImageFeaturizer, Logistic Regression
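The idea behind this pipeline can be illustrated without Spark: keep a pre-trained featurizer frozen and train only a small logistic regression on its output. The sketch below is a toy in plain Python under stated assumptions. `frozen_featurizer` is a hypothetical stand-in for the role DeepImageFeaturizer plays, the "images" are made-up lists of three pixel values, and the training loop is ordinary stochastic gradient descent rather than MLlib's implementation.

```python
import math


def frozen_featurizer(pixels):
    """Stand-in for a pre-trained network with its last layer removed.

    It is fixed: nothing in here is updated during training.
    """
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, steps=2000, lr=0.5):
    """Fit only the final classifier with stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(steps):
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(pixels, weights, bias):
    x = frozen_featurizer(pixels)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical toy data: bright "images" are class 1, dark ones are class 0.
images = [[0.9, 0.8, 1.0], [0.7, 0.9, 0.8], [0.1, 0.2, 0.0], [0.3, 0.1, 0.2]]
labels = [1, 1, 0, 0]

features = [frozen_featurizer(img) for img in images]
weights, bias = train_logistic_regression(features, labels)

bright_score = predict([0.8, 0.9, 0.9], weights, bias)
dark_score = predict([0.1, 0.1, 0.2], weights, bias)
print(bright_score > 0.5, dark_score < 0.5)
```

Because only the small classifier is trained, a handful of labeled examples is enough here, which is the reason transfer learning makes model training fast.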
DEMO
Sharing and exporting Deep
Learning models
Classifier
Deep Learning Model
Model Export and Sharing
Shipping predictors in SQL
Take a trained model / Pipeline, register a SQL UDF usable by
anyone in the organization
In Spark SQL:
registerKerasUDF("my_object_recognition_function",
                 keras_model_file="/mymodels/007model.h5")
select image, my_object_recognition_function(image) as objects
from traffic_imgs
DEMO
Conclusion
Deep Learning without Deep Pockets
• Simple API for Deep Learning, integrated with MLlib
• Scales common tasks with transformers and estimators
• Embeds Deep Learning models in MLlib and SparkSQL
• Early release of Deep Learning Pipelines
https://github.com/databricks/spark-deep-learning
Deep Learning Pipelines - future
In progress
• Hyper-parameter tuning for Keras models
• Official image support in Spark
• Scala API
(Potential) future work
• Text models
• Support for more backends, e.g. MXNet, PyTorch, BigDL
Resources
Blog posts & webinars — http://databricks.com/blog
• Deep Learning Pipelines
• GPU acceleration in Databricks
• BigDL on Databricks
• Deep Learning and Apache Spark
Docs for Deep Learning on Databricks — http://docs.databricks.com
• Getting started
• Deep Learning Pipelines Example
• Spark integration
Thank You!
Questions?
Happy Sparking & Deep Learning!