Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark

Bay Area Spark Meetup
Nov 8, 2017
Sue Ann Hong, Databricks

This talk
• Deep Learning at scale: current state
• Deep Learning Pipelines: the philosophy
• End-to-end workflow with DL Pipelines
• Future

Deep Learning at Scale: current state

What is Deep Learning?
• A set of machine learning techniques that use layers that
transform numerical inputs
• Classification
• Regression
• Arbitrary mapping
• Popular in the 80’s as Neural Networks
• Recently came back thanks to advances in data collection,
computation techniques, and hardware.
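To make "layers that transform numerical inputs" concrete, here is a minimal sketch of a two-layer classifier's forward pass in plain Python. The weights are hypothetical values chosen for illustration, not taken from any trained model.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]

def relu(values):
    """Element-wise non-linearity: negative values become zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 2 classes.
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, B2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

def forward(x):
    """Pass the input through both layers to get class probabilities."""
    hidden = relu(dense(x, W1, B1))
    return softmax(dense(hidden, W2, B2))

probs = forward([2.0, 1.0])
print(probs)
```

Training would adjust W1, B1, W2, and B2; this sketch only shows how fixed layers map an input to a classification output.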

Success of Deep Learning
Tremendous success for applications with complex data
• AlphaGo
• Image interpretation
• Automatic translation
• Speech recognition

Deep Learning is often challenging
Labeled Data
Compute Resources & Time
Engineer hours
• Tedious or difficult to distribute computations
• No exact science around deep learning → lots of tweaking
• Low level APIs with steep learning curve
• Complex models → need a lot of data

Deep Learning in industry
• Currently limited adoption
• Huge potential beyond the industrial giants
• How do we accelerate the road to massive availability?

Deep Learning Pipelines:
Deep Learning with Simplicity
• Open-source Databricks library
• Focuses on ease of use and integration
• without sacrificing performance

Compute Resources & Time
Engineer hours
Labeled Data

Instead
Common workflows should
• Be easy to scale
• Require little tweaking
• Be easy to write
• Require little or no data

How
Common workflows should
• Be easy to scale: use Apache Spark for scaling out common tasks
• Require little tweaking: leverage well-known model architectures
• Be easy to write: integrate with MLlib Pipelines API to capture ML workflow concisely
• Require little or no data: leverage pre-trained models for common tasks

Demo: Build a visual recommendation AI
10 minutes
7 lines of code
0 labels
Elastic scale-out using Apache Spark
MLlib Pipelines API
Leverage pre-trained models

End-to-End Workflow
with Deep Learning Pipelines

A typical Deep Learning workflow
• Load data (images, text, time series, …)
• Interactive work
• Train
  • Select an architecture for a neural network
  • Optimize the weights of the NN
• Evaluate results, potentially re-train
• Apply
  • Pass the data through the NN to produce new features or output

A typical Deep Learning workflow
Load data
Interactive work
Train
Evaluate
Apply
• Image loading in Spark
• Distributed batch prediction
• Deploying models in SQL
• Transfer learning
• Distributed tuning
• Pre-trained models

• Load data
• Train
• Evaluate model
• Apply

Adds support for images in Spark
• ImageSchema, reader, conversion functions to/from numpy arrays
• Most of the tools we’ll describe work on ImageSchema columns
from sparkdl import readImages
image_df = readImages(sample_img_dir)

Upcoming: built-in support in Spark
• SPARK-21866
• Contributing image format & reading to Spark
• Targeted for Spark 2.3
• Joint work with Microsoft

Applying popular models
• Popular pre-trained models accessible through MLlib Transformers
from sparkdl import DeepImagePredictor
# Apply a pre-trained InceptionV3 model to every row of image_df
predictor = DeepImagePredictor(inputCol="image",
                               outputCol="predicted_labels",
                               modelName="InceptionV3")
predictions_df = predictor.transform(image_df)

• Load data
• Train
• Evaluate model
• Apply
Hyperparameter tuning
Transfer learning

Transfer learning
• Pre-trained models may not be directly applicable
• New domain, e.g. shoes
• Training from scratch requires
• Enormous amounts of data
• A lot of compute resources & time
• Idea: intermediate representations learned for one task may be useful
for other related tasks
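The idea can be sketched in plain Python: a frozen feature function stands in for the pre-trained network, and only a small logistic regression on top of it is trained. The feature function, the toy data, and the training hyperparameters are all hypothetical and chosen for illustration.

```python
import math

def pretrained_features(raw):
    """Stand-in for a frozen pre-trained network: a fixed mapping that is never updated."""
    a, b = raw
    return [a + b, a - b]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_regression(features, labels, lr=0.1, epochs=500):
    """Batch gradient descent on the log loss; only these weights are learned."""
    weights, bias = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            err = score(weights, bias, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, raw):
    """Featurize with the frozen function, then classify with the trained layer."""
    return 1 if score(weights, bias, pretrained_features(raw)) >= 0.5 else 0

# Tiny labeled set for the new domain (hypothetical).
train_raw = [(2.0, 0.0), (3.0, 1.0), (0.0, 2.0), (1.0, 3.0)]
train_labels = [1, 1, 0, 0]

train_features = [pretrained_features(r) for r in train_raw]
weights, bias = train_logistic_regression(train_features, train_labels)
print(predict(weights, bias, (5.0, 1.0)), predict(weights, bias, (1.0, 4.0)))
```

Because the expensive representation is reused as-is, the trainable part is small and needs far less labeled data than training the whole network from scratch.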

Transfer Learning
SoftMax
GIANT PANDA 0.9
RACCOON 0.05
RED PANDA 0.01
…

Transfer Learning
Classifier
Rose: 0.7
Daisy: 0.3

Transfer Learning as a Pipeline
[Diagram: an MLlib Pipeline in which DeepImageFeaturizer (image loading, preprocessing) feeds Logistic Regression]

Transfer Learning as a Pipeline
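from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer
# modelName is cut off on the slide; InceptionV3 is assumed from the earlier example
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)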

Transfer Learning
• Usually for classification tasks
• Similar task, new domain
• But other forms of learning leveraging learned representations
can be loosely considered transfer learning

Featurization for similarity-based ML
Image → DeepImageFeaturizer → one of:
• Logistic Regression
• Clustering: KMeans, GaussianMixture
• Nearest Neighbor: KNN, LSH
• Distance computation
Applications:
• Duplicate detection
• Recommendation
• Anomaly detection
• Search result diversification
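As a rough illustration of similarity-based use of features, the sketch below ranks items by cosine similarity between feature vectors, which is one simple way to do nearest-neighbor recommendation. The item names and vectors are hypothetical stand-ins for featurizer output, and a brute-force scan replaces the KNN or LSH machinery one would use at scale.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query_id, features, k=2):
    """Return the ids of the k items whose features are closest to the query's."""
    query = features[query_id]
    scored = [
        (cosine_similarity(query, vec), item_id)
        for item_id, vec in features.items()
        if item_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Hypothetical feature vectors, as if produced by an image featurizer.
features = {
    "red_sneaker": [0.9, 0.1, 0.0],
    "blue_sneaker": [0.8, 0.2, 0.1],
    "leather_boot": [0.1, 0.9, 0.2],
    "sandal": [0.0, 0.2, 0.9],
}

recommendations = most_similar("red_sneaker", features, k=2)
print(recommendations)
```

No labels are involved: the ranking depends only on distances between feature vectors, which is why the same features can serve duplicate detection or search result diversification.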

• Load data
• Train
• Evaluate model
• Apply
Spark SQL
Batch prediction

Batch prediction as an MLlib Transformer
• A model is a Transformer in MLlib
• DataFrame goes in, DataFrame comes out with output columns
predictor = XXTransformer(inputCol="image",
outputCol="prediction",
modelSpecification={…})
predictions = predictor.transform(test_df)
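The "DataFrame in, DataFrame out" contract can be sketched without Spark. Below, a list of dicts stands in for a DataFrame, and ThresholdPredictor is a hypothetical toy class (not part of MLlib or sparkdl) that mimics the inputCol/outputCol convention.

```python
class ThresholdPredictor:
    """Toy stand-in for an MLlib-style Transformer; not a real pyspark class."""

    def __init__(self, inputCol, outputCol, threshold):
        self.inputCol = inputCol
        self.outputCol = outputCol
        self.threshold = threshold

    def transform(self, rows):
        """Return new rows with the output column appended; the input is left untouched."""
        return [
            {**row, self.outputCol: int(row[self.inputCol] >= self.threshold)}
            for row in rows
        ]

# A list of dicts plays the role of the input DataFrame.
test_rows = [{"id": 1, "score": 0.2}, {"id": 2, "score": 0.9}]

predictor = ThresholdPredictor(inputCol="score", outputCol="prediction", threshold=0.5)
predictions = predictor.transform(test_rows)
print(predictions)
```

Since every stage shares this shape, stages can be chained: the output of one transform call is a valid input to the next.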

Hierarchy of DL transformers for images
Input (Image) → Output (vector)
• DeepImageFeaturizer, DeepImagePredictor (model_name)
• NamedImageTransformer(model_name)
• KerasImageTransformer (keras.Model)
• TFImageTransformer (tf.Graph)

Hierarchy of DL transformers
Images: Input (Image) → Output (vector)
• DeepImageFeaturizer, DeepImagePredictor
• NamedImageTransformer
• TFImageTransformer
Text: Input (Text) → Output (vector)
• DeepTextFeaturizer, DeepTextPredictor
• NamedTextTransformer
• KerasTextTransformer
• TFTextTransformer
Model inputs: Keras.Model, tf.Graph
Shared base: TFTransformer

TFImageTransformer
TFTextTransformer
DeepTextFeaturizer, DeepTextPredictor
NamedTextTransformer
TFTransformer

Common Tasks Easy, Advanced Ones Possible
Transformers:
• TFImageTransformer, DeepImageFeaturizer, DeepImagePredictor
• TFTextTransformer, DeepTextFeaturizer, DeepTextPredictor
• TFTransformer
Solution tiers:
• Pre-built Solutions
• 80%-built Solutions
• Self-built Solutions

Deep Learning Pipelines: Future
In progress
• Text featurization (embeddings)
• TFTransformer for arbitrary vectors
Future directions
• Non-image data domains: video, text, speech, …
• Distributed training
• Support for more backends, e.g. MXNet, PyTorch, BigDL

Questions?
Thank you!

Resources
DL Pipelines GitHub Repo, Spark Summit Europe 2017 Deep Dive
Blog posts & webinars (http://databricks.com/blog)
• Deep Learning Pipelines
• GPU acceleration in Databricks
• BigDL on Databricks
• Deep Learning and Apache Spark
Docs for Deep Learning on Databricks (http://docs.databricks.com)
• Getting started
• Deep Learning Pipelines Example
• Spark integration
