Resource-Efficient Deep Learning Model Selection on Apache Spark
Yuhao Zhang and Supun Nakandala
ADALab, University of California, San Diego
About us
▪ PhD students from ADALab at UCSD, advised by
Prof. Arun Kumar
▪ Our research mission: democratize data science
▪ More:
Supun Nakandala
https://scnakandala.github.io/
Yuhao Zhang
https://yhzhang.info/
ADALab
https://adalabucsd.github.io/
Introduction
Artificial Neural Networks (ANNs) are
revolutionizing many domains - “Deep Learning”
Problem: training deep nets is painful!
Batch size? 8, 16, 64, 256 ...
Model architecture? 3-layer CNN, 5-layer CNN, LSTM ...
Learning rate? 0.1, 0.01, 0.001, 0.0001 ...
Regularization? L2, L1, Dropout, Batchnorm ...
4 × 4 × 4 × 4 = 256 different configurations!
Model performance = f(model architecture, hyperparameters, ...)
→ Trial and error
→ Need for speed → $$$ (distributed DL)
→ Better utilization of resources
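That 256 is just the grid size: four options for each of four knobs. A minimal Python sketch of enumerating such a grid (values copied from the slide; the fourth architecture is elided there, so a placeholder string stands in):

import itertools

# hyperparameter grid from the slide; '...' marks the elided fourth architecture
grid = {
    'batch_size': [8, 16, 64, 256],
    'model_arch': ['3-layer CNN', '5-layer CNN', 'LSTM', '...'],
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'regularization': ['L2', 'L1', 'Dropout', 'Batchnorm'],
}

# one trial-and-error training run per combination
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # 4 * 4 * 4 * 4 = 256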
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
Introduction - mini-batch SGD

The most popular algorithm family for training deep nets.

(Diagram: one mini-batch, a small sample of rows (X1, X2, y) from the training data, is fed through the model to produce the update.)

Update rule: updated model = model − η ∇, where η is the learning rate and ∇ is the average of the gradients over the mini-batch.
Introduction - mini-batch SGD

One epoch = one full pass over the data, one mini-batch at a time. The updates are inherently sequential: each mini-batch step starts from the model produced by the previous one. (A minimal sketch follows.)
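As a concrete, non-authoritative illustration of the update rule above, a minimal NumPy sketch of one epoch of mini-batch SGD, with linear least squares standing in for a deep net:

import numpy as np

def sgd_epoch(w, X, y, eta=0.01, batch_size=32):
    n = len(y)
    idx = np.random.permutation(n)             # shuffle once per epoch
    for start in range(0, n, batch_size):      # sequential pass over mini-batches
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # avg. of gradients over the mini-batch
        w = w - eta * grad                     # model <- model - eta * grad
    return w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 2)), rng.normal(size=1000)
w = sgd_epoch(np.zeros(2), X, y)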
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
Task Parallelism - Problem Setting

(Diagram: models (tasks) are dispatched to machines, each holding a full replicated copy of the dataset.)
(Embarrassing) Task Parallelism
Con: wasted storage
(Embarrassing) Task Parallelism
Con: wasted network, since every machine pulls the full dataset from a shared FS or data repo
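A sketch of the embarrassing task-parallel pattern, assuming hypothetical load_full_dataset() and train() helpers; the point is that every task materializes its own full copy of the data:

from concurrent.futures import ProcessPoolExecutor

def run_task(config):
    # load_full_dataset() and train() are hypothetical placeholders
    data = load_full_dataset()   # full replica per task: wasted storage and network
    return train(config, data)   # independent job, no communication between tasks

# one process per machine in the real setting; local processes here for illustration
with ProcessPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_task, configs))  # configs as enumerated earlier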
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
Data Parallelism - Problem Setting
Models (tasks)
Partitioned data
High data scalability
Data Parallelism

(Diagram: a queue of models; the cluster trains one model at a time, each worker working on one mini-batch or its full partition and exchanging updates.)

● Update only per epoch: bulk synchronous parallelism (model averaging; see the sketch below)
○ Bad convergence
● Update per mini-batch: sync parameter server
○ + Async updates: async parameter server
○ + Decentralized: MPI allreduce (Horovod)
○ High communication cost
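To make the first bullet concrete, a minimal sketch (not any system's actual API) of one round of bulk synchronous model averaging, reusing the sgd_epoch() sketch from earlier:

import numpy as np

def model_averaging_epoch(w, partitions, eta=0.01):
    # each "worker" trains its own replica on its local (Xp, yp) partition;
    # sequential here for clarity, parallel across workers in a real system
    replicas = [sgd_epoch(w.copy(), Xp, yp, eta=eta) for (Xp, yp) in partitions]
    return np.mean(replicas, axis=0)  # one sync per epoch: cheap, but converges poorly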
Task Parallelism
+ high throughput
- low data scalability
- memory/storage wastage
Data Parallelism
+ high data scalability
- low throughput
- high communication cost
Model Hopper Parallelism (Cerebro)
+ high throughput
+ high data scalability
+ low communication cost
+ no memory/storage wastage
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
Model Hopper Parallelism - Problem Setting

(Diagram: models (tasks) over partitioned data, as in data parallelism.)
Model Hopper Parallelism

(Animation over four slides.) Each worker trains its current model on its full local partition; that is one sub-epoch. Models then hop to the next worker (model hopping & training) and the process repeats. Once every model has trained on every partition exactly once, one epoch is complete (schedule sketched below).
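Putting the four animation frames into code: a minimal sketch of the plain rotation-based hopping schedule, with a hypothetical train_sub_epoch() callback. With p workers holding one partition each, p sub-epochs make one epoch:

def mop_epoch(models, partitions, train_sub_epoch):
    p = len(partitions)                # one partition per worker; assumes len(models) == p
    for sub_epoch in range(p):
        # conceptually parallel across workers; sequential here for clarity
        for worker in range(p):
            model = models[(worker + sub_epoch) % p]    # the model currently on this worker
            train_sub_epoch(model, partitions[worker])  # full pass over the local partition
        # end of sub-epoch: models "hop" to the next worker; only the small
        # model state moves over the network, never the data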
Heterogeneous Tasks

(Gantt chart over time: with a fixed rotation, unequal task times leave workers queued and idle at a redundant sync barrier after each sub-epoch!)

Randomized Scheduler

(Gantt chart over time: randomized, barrier-free scheduling keeps workers busy; see the sketch below.)
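A rough sketch of the randomized-scheduler idea (not Cerebro's actual implementation): greedily match any idle worker with a model that is neither busy nor done with that worker's partition, so fast models never wait at a per-sub-epoch barrier:

import random

def next_assignment(idle_workers, remaining, busy_models):
    """remaining[m] = set of workers whose partition model m still needs this epoch."""
    workers = list(idle_workers)
    random.shuffle(workers)
    for w in workers:
        candidates = [m for m, todo in remaining.items()
                      if w in todo and m not in busy_models]
        if candidates:
            return random.choice(candidates), w  # run model m on worker w next
    return None  # nothing schedulable right now; wait for a worker or model to free up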
Cerebro -- Data System with MOP
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
MOP (Cerebro) on Spark

(Architecture: the Cerebro Scheduler lives in the Spark Driver; each Spark Worker hosts a Cerebro Worker; all nodes share a Distributed File System (HDFS, NFS).)
Implementation Details
▪ Spark DataFrames converted to partitioned Parquet
and locally cached in workers
▪ TensorFlow threads run training on local data
partitions
▪ Model hopping implemented via shared file system (sketched below)
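A minimal sketch of what hopping over a shared file system looks like in practice; tf.keras save/load are real APIs, while the path, dataset, and function name are illustrative assumptions:

import tensorflow as tf

def run_sub_epoch(model_id, local_dataset, shared_dir='/mnt/shared/models'):
    path = f'{shared_dir}/model_{model_id}.h5'
    model = tf.keras.models.load_model(path)       # the model "hops" onto this worker
                                                   # (assumes it was checkpointed compiled)
    model.fit(local_dataset, epochs=1, verbose=0)  # train on the worker's local partition
    model.save(path)                               # checkpoint back for the next worker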
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
Example: Grid Search on Model Selection + Hyperparameter Search

▪ Two model architectures: {VGG16, ResNet50}
▪ Two learning rates: {1e-4, 1e-6}
▪ Two batch sizes: {32, 256}
▪ 2 × 2 × 2 = 8 configurations in total
Initialization

from pyspark.sql import SparkSession
import cerebro

spark = SparkSession.builder.master(...).getOrCreate()  # initialize Spark
num_workers = 8  # e.g. one Cerebro worker per Spark worker

spark_backend = cerebro.backend.SparkBackend(
    spark_context=spark.sparkContext, num_workers=num_workers
)  # initialize Cerebro

data_store = cerebro.storage.HDFSStore('hdfs://...')  # set the shared data storage
Define the Models

import tensorflow as tf  # needed for the optimizer below

params = {'model_arch': ['vgg16', 'resnet50'],
          'learning_rate': [1e-4, 1e-6],
          'batch_size': [32, 256]}

def estimator_gen_fn(params):
    '''A model factory that returns an estimator, given the input
    hyperparameters as well as the model architecture.'''
    if params['model_arch'] == 'resnet50':
        model = ...  # tf.keras model
    elif params['model_arch'] == 'vgg16':
        model = ...  # tf.keras model
    optimizer = tf.keras.optimizers.Adam(lr=params['learning_rate'])  # choose optimizer
    loss = ...  # define loss
    estimator = cerebro.keras.SparkEstimator(model=model,
                                             optimizer=optimizer,
                                             loss=loss,
                                             batch_size=params['batch_size'])
    return estimator
Run Grid Search

df = ...  # read data in as a Spark DataFrame

grid_search = cerebro.tune.GridSearch(spark_backend,
                                      data_store,
                                      estimator_gen_fn,
                                      params,
                                      epoch=5,
                                      validation=0.2,
                                      feature_columns=['features'],
                                      label_columns=['labels'])
model = grid_search.fit(df)
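For context: with the grid above, this trains all 8 configurations under MOP; validation=0.2 presumably holds out 20% of df for model selection, and, going by the naming in the example, fit() returns the best trained model.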
Outline
1. Background
a. Mini-batch SGD
b. Task Parallelism
c. Data Parallelism
2. Model Hopper Parallelism (MOP)
3. MOP on Apache Spark
a. Implementation
b. APIs
c. Tests
Tests - Setups - Hardware
▪ 9-node cluster, 1 master + 8 workers
▪ On each node:
▪ Intel Xeon 10-core 2.20 GHz CPU x 2
▪ 192 GB RAM
▪ Nvidia P100 GPU x 1
Tests - Setups - Workload
▪ Model selection + hyperparameter tuning on
ImageNet
▪ Adam optimizer
▪ Grid search space (2 × 2 × 2 × 2 = 16 configurations):
▪ Model architecture: {ResNet50, VGG16}
▪ Learning rate: {1e-4, 1e-6}
▪ Batch size: {32, 256}
▪ L2 regularization: {1e-4, 1e-6}
Tests - Results - Learning Curves
Tests - Results - Per Epoch Runtimes
* Horovod uses GPU kernels for communication. Thus, it has high GPU utilization.
Tests - Results - Runtimes
* Horovod uses GPU kernels for communication. Thus, it has high GPU utilization.
System              Runtime (hrs/epoch)    GPU Util. (%)  Storage Footprint (GiB)
                    Train     Validation
TF PS - Async       -         -            8.6            250
Horovod             -         -            92.1           250
Cerebro-Spark       2.63      0.57         42.4           250
TF Model Averaging  1.94      0.03         72.1           250
Celery              1.69      0.03         82.4           2000
Cerebro-Standalone  1.72      0.05         79.8           250
Tests - Cerebro-Spark Gantt Chart
▪ Only overhead: stragglers, caused randomly by TF 2.1 Keras model saving/loading; overheads range from 1% to 300% (visible as stragglers in the chart)
Tests - Cerebro-Spark Gantt Chart
▪ One epoch of training
▪ (Almost) optimal!
Tests - Cerebro-Standalone Gantt Chart
Other Available Hyperparameter Tuning Algorithms

▪ PBT (Population-Based Training)
▪ HyperBand
▪ ASHA (Asynchronous Successive Halving)
▪ Hyperopt
More Features to Come
▪ Grouped learning
▪ API for transfer learning
▪ Model parallelism
References
▪ Cerebro project site
▪ https://adalabucsd.github.io/cerebro-system
▪ GitHub repo
▪ https://github.com/adalabucsd/cerebro-system
▪ Blog post
▪ https://adalabucsd.github.io/research-blog/cerebro.html
▪ Tech report
▪ https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf
Questions?
Thank you!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.