Build, Scale, and Deploy
Deep Learning Pipelines
Using Apache Spark
Tim Hunter, Databricks
Spark Meetup London, March 2018
About Me
Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user (Spark 0.0.2)
• Co-creator of GraphFrames, TensorFrames
Joint work with
Sue Ann Hong
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
Try for free today.
databricks.com
This talk
• Deep Learning at scale: current state
• Deep Learning Pipelines: the vision
• End-to-end workflow with DL Pipelines
• Future
Deep Learning at Scale: current state
What is Deep Learning?
• A set of machine learning techniques that use layers that
transform numerical inputs
• Classification
• Regression
• Arbitrary mapping
• Popular in the 80’s as Neural Networks
• Recently came back thanks to advances in data collection,
computation techniques, and hardware.
Success of Deep Learning
Tremendous success for applications with complex data
• AlphaGo
• Image interpretation
• Automatic translation
• Speech recognition
But requires a lot of effort
• No exact science around deep learning
• Success requires many engineer-hours
• Low level APIs with steep learning curve
• Not well integrated with other enterprise tools
• Tedious to distribute computations
What does Spark offer?
Very little in Apache Spark MLlib itself (multilayer perceptron)
Many Spark packages
Integrations with existing DL libraries
• Deep Learning Pipelines (from Databricks)
• Caffe (CaffeOnSpark)
• Keras (Elephas)
• mxnet
• Paddle
• TensorFlow (TensorFlow on Spark, TensorFrames)
• CNTK (mmlspark)
Implementations of DL on Spark
• BigDL
• DeepDist
• DeepLearning4J
• MLlib
• SparkCL
• SparkNet
Deep Learning in industry
• Currently limited adoption
• Huge potential beyond the industrial giants
• How do we accelerate the road to massive availability?
Deep Learning Pipelines
Deep Learning Pipelines:
Deep Learning with Simplicity
• Open-source Databricks library
• Focuses on ease of use and integration
• without sacrificing performance
• Primary language: Python
• Uses Apache Spark for scaling out common tasks
• Integrates with MLlib Pipelines to capture the ML workflow
concisely
A typical Deep Learning workflow
• Load data (images, text, time series, …)
• Interactive work
• Train
• Select an architecture for a neural network
• Optimize the weights of the NN
• Evaluate results, potentially re-train
• Apply:
• Pass the data through the NN to produce new features or output
[Workflow diagram: Load data → Interactive work → Train → Evaluate → Apply]
A typical Deep Learning workflow
[Workflow diagram: Load data → Interactive work → Train → Evaluate → Apply]
• Image loading in Spark
• Distributed batch prediction
• Deploying models in SQL
• Transfer learning
• Distributed tuning
• Pre-trained models
End-to-End Workflow
with Deep Learning Pipelines
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Built-in support in Spark
• In Spark 2.3
• Collaboration with Microsoft
• ImageSchema, reader, conversion functions to/from numpy arrays
• Most of the tools we’ll describe work on ImageSchema columns
images = spark.readImages(img_dir,
recursive = True,
sampleRatio = 0.1)
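A minimal sketch, not from the deck, of the numpy conversion functions mentioned above: the Spark 2.3 ImageSchema helpers convert between the image struct and numpy arrays (the "image" column comes from readImages above; reading a single row is just for illustration).

from pyspark.ml.image import ImageSchema

row = images.select("image").first()
arr = ImageSchema.toNDArray(row.image)   # image struct -> numpy ndarray (height x width x channels)
img = ImageSchema.toImage(arr)           # numpy ndarray -> image struct, usable in a DataFrame column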
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Applying popular models
• Popular pre-trained models accessible through MLlib Transformers
predictor = DeepImagePredictor(inputCol="image",
outputCol="predicted_labels",
modelName="InceptionV3")
predictions_df = predictor.transform(image_df)
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Hyperparameter tuning
Transfer learning
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Hyperparameter tuning
Transfer learning
Transfer learning
• Pre-trained models may not be directly applicable
• New domain, e.g. shoes
• Training from scratch requires
• Enormous amounts of data
• A lot of compute resources & time
• Idea: intermediate representations learned for one task may be useful
for other related tasks
Transfer Learning
[Diagram: a pre-trained network such as InceptionV3 ends in a SoftMax layer that outputs class probabilities (GIANT PANDA 0.9, RACCOON 0.05, RED PANDA 0.01, …); keeping the earlier layers and replacing the final layer with a new Classifier lets the same network output labels for the new domain (Rose: 0.7, Daisy: 0.3).]
MLlib Pipelines primer
• MLlib: the machine learning library included with Spark
• Transformer
• Takes in a Spark dataframe
• Returns a Spark dataframe with new column(s) containing “transformed” data
• e.g. a Model is a Transformer
• Estimator
• A learning algorithm, e.g. lr = LogisticRegression()
• Produces a Model via lr.fit()
• Pipeline: a sequence of Transformers and Estimators
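A minimal sketch of the Transformer / Estimator / Pipeline pattern just described (the column names and train_df are illustrative assumptions, not from the deck):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Transformer: VectorAssembler adds a "features" column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
# Estimator: LogisticRegression; fit() produces a Model, which is itself a Transformer.
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline: a sequence of Transformers and Estimators; fit() returns a PipelineModel.
pipeline_model = Pipeline(stages=[assembler, lr]).fit(train_df)
scored_df = pipeline_model.transform(train_df)   # adds prediction columns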
Transfer Learning as a Pipeline
[Diagram: DeepImageFeaturizer → Classifier → Rose / Daisy]
[Diagram: an MLlib Pipeline in which DeepImageFeaturizer (image loading + preprocessing) feeds a Logistic Regression classifier]
Transfer Learning as a Pipeline
featurizer = DeepImageFeaturizer(inputCol="image",
outputCol="features",
modelName="InceptionV3")
lr = LogisticRegression(labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)
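A short usage sketch, not from the deck: applying the fitted transfer-learning pipeline to held-out data and scoring it with an MLlib evaluator (test_df and its "image" and "label" columns are assumptions).

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = p_model.transform(test_df)   # adds "prediction" and probability columns
accuracy = MulticlassClassificationEvaluator(labelCol="label",
                                             predictionCol="prediction",
                                             metricName="accuracy").evaluate(predictions)
print("Test accuracy:", accuracy)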
Transfer Learning
• Usually for classification tasks
• Similar task, new domain
• But other forms of learning leveraging learned representations
can be loosely considered transfer learning
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Featurization for similarity-based ML
[Diagram: the Logistic Regression stage after DeepImageFeaturizer (image loading + preprocessing) can be swapped for other MLlib algorithms: Clustering (KMeans, GaussianMixture), Nearest Neighbor (KNN, LSH), or distance computation; this enables duplicate detection, recommendation, anomaly detection, and search result diversification.]
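A minimal sketch of this idea, not from the deck: reuse DeepImageFeaturizer as the featurization stage and cluster the resulting features with MLlib KMeans (the sparkdl import, image_df, and k=10 are assumptions).

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from sparkdl import DeepImageFeaturizer   # Deep Learning Pipelines package

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
kmeans = KMeans(featuresCol="features", k=10)   # k=10 is an illustrative choice

clusters = Pipeline(stages=[featurizer, kmeans]).fit(image_df).transform(image_df)
clusters.select("prediction").show()            # "prediction" holds the cluster id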
Keras
• A popular, declarative interface to build DL models
• High-level, expressive API in Python
• Executes on TensorFlow, Theano, CNTK

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))
Keras Estimator

model = Sequential()
model.add(...)
model.save(model_filename)

estimator = KerasImageFileEstimator(
    kerasOptimizer="adam",
    kerasLoss="categorical_crossentropy",
    kerasFitParams={"batch_size": 100},
    modelFile=model_filename)

model = estimator.fit(dataframe)
Keras Estimator in Model Selection

estimator = KerasImageFileEstimator(
    kerasOptimizer="adam",
    kerasLoss="categorical_crossentropy")

paramGrid = (ParamGridBuilder()
    .addGrid(estimator.kerasFitParams, [{"batch_size": 100}, {"batch_size": 200}])
    .addGrid(estimator.modelFile, [model1, model2]))

cv = CrossValidator(estimator=estimator,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

best_model = cv.fit(train_df)
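A brief follow-up sketch, not from the deck: cv.fit returns a CrossValidatorModel whose transform() applies the best model found during the grid search, and avgMetrics holds the cross-validated scores (test_df is an assumption).

from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = best_model.transform(test_df)        # scores with the best model from the grid
auc = BinaryClassificationEvaluator().evaluate(predictions)
print(auc, best_model.avgMetrics)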
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Spark SQL
Batch prediction
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Spark SQL
Batch prediction
Batch prediction as an MLlib Transformer
• Recall a model is a Transformer in MLlib
predictor = XXTransformer(inputCol="image",
                          outputCol="predictions",
                          modelSpecification={…})
predictions = predictor.transform(test_df)
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Spark SQL
Batch prediction
Shipping predictors in SQL
Take a trained model / Pipeline, register a SQL UDF usable by
anyone in the organization
In Spark SQL:
registerKerasUDF("my_object_recognition_function",
                 keras_model_file="/mymodels/007model.h5")

select image, my_object_recognition_function(image) as objects
from traffic_imgs
This means you can apply deep learning models in streaming!
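A minimal sketch of that streaming use, not from the deck: applying the registered UDF in Structured Streaming (the input path, source format, and schema reuse are assumptions; any streaming source with an "image" column would work).

stream_df = (spark.readStream
             .schema(images.schema)                  # reuse the batch image schema
             .format("parquet")
             .load("/data/incoming_traffic_imgs"))   # hypothetical input path

scored = stream_df.selectExpr("image",
                              "my_object_recognition_function(image) AS objects")

query = scored.writeStream.format("memory").queryName("scored_imgs").start()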
Deep Learning Pipelines : Future
In progress
• Scala API for DeepImageFeaturizer
• Text featurization (embeddings)
• TFTransformer for arbitrary vectors
Future
• Distributed training
• Support for more backends, e.g. MXNet, PyTorch, BigDL
Deep Learning without Deep Pockets
• Simple API for Deep Learning, integrated with MLlib
• Scales common tasks with transformers and estimators
• Embeds Deep Learning models in MLlib and SparkSQL
• Check out https://github.com/databricks/spark-deep-learning !
Resources
Blog posts & webinars (http://databricks.com/blog)
• Deep Learning Pipelines
• GPU acceleration in Databricks
• BigDL on Databricks
• Deep Learning and Apache Spark
Docs for Deep Learning on Databricks (http://docs.databricks.com)
• Getting started
• Deep Learning Pipelines Example
• Spark integration
WWW.DATABRICKS.COM/SPARKAISUMMIT
DATE: June 4-6, 2018
LOCATION: San Francisco - Moscone
TRACKS: Artificial Intelligence, Spark Use Cases, Enterprise, Productionizing ML, Deep Learning, Hardware in the Cloud
ATTENDEES: 4000+ Data Scientists, Data Engineers, Analysts, & VP/CxOs
https://databricks.com/company/careers
Thank You!
Questions?
Happy Sparking & Deep Learning!