Xiangrui Meng, Databricks
Updates from Project Hydrogen: Unifying
State-of-the-Art AI and Big Data in Apache Spark
#UnifiedAnalytics #SparkAISummit
2
About me
● Software Engineer at Databricks
○ machine learning and data science/engineering
● Committer and PMC member of Apache Spark
○ MLlib, SparkR, PySpark, Spark Packages, etc
3
Announced last June, Project Hydrogen is a major Spark initiative
to unify state-of-the-art AI and big data workloads.
About Project Hydrogen
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
4
Why Spark + AI?
Runtime
Delta
Spark Core Engine
Big Data Processing
ETL + SQL + Streaming
Machine Learning
MLlib + SparkR
Apache Spark:
The First Unified Analytics Engine
5
AI is re-shaping the world
Huge disruptive innovations are affecting most enterprises on the planet:
Digital Personalization, Internet of Things, Healthcare and Genomics, Fraud Prevention, and many more...
6
Better AI needs more data
7
When AI goes distributed ...
When datasets get bigger and bigger, we see more and more
distributed training scenarios and open-source offerings, e.g.,
distributed TensorFlow, Horovod, and distributed MXNet.
This is where Spark and AI cross.
8
9
Why Project Hydrogen?
Two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production
data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and
enrich it with predicted labels.
10
Distributed training
data warehouse load fit model
Required: Be able to read from
Databricks Delta, Parquet,
MySQL, Hive, etc.
Answer: Apache Spark
Required: distributed GPU cluster
for fast training
Answer: Horovod, distributed
TensorFlow, etc.
11
Two separate data and AI clusters?
load using a Spark
cluster
fit on a GPU
cluster
model
save data
required: glue code
12
Streaming model inference
Kafka load predict model
required:
● save to stream sink
● GPU for fast inference
13
A hybrid Spark and AI cluster?
load using a Spark
cluster w/ GPUs
fit a model
distributedly
on the same
cluster
model
load using a Spark
cluster w/ GPUs
predict w/ GPUs as
a Spark task
model
14
Unfortunately, it doesn’t work out of the box.
See a previous demo.
16
Project Hydrogen to fill the major gaps
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
17
Updates from Project Hydrogen
As a Spark contributor, I want to present:
● what features from Project Hydrogen are available,
● what features are in development.
As a Databricks engineer, I want to share:
● how we utilized features from Project Hydrogen,
● lessons learned and best practices.
18
Story #1:
Distributed training
load using a Spark
cluster w/ GPUs
fit a model
distributedly
on the same
cluster
model
19
Project Hydrogen: barrier execution mode
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
20
Different execution models
Task 1
Task 2
Task 3
Spark (MapReduce)
Tasks are independent of each other
Embarrassingly parallel & massively scalable
Distributed training
Complete coordination among tasks
Optimized for communication
Task 1
Task 2 Task 3
21
Barrier execution mode
We introduced gang scheduling to Spark on top of the MapReduce
execution model, so a distributed DL job can run as a Spark job.
● It starts all tasks together.
● It provides sufficient info and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
JIRA: SPARK-24374 (Spark 2.4)
22
API: RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
  val context = BarrierTaskContext.get()
  ...
}
23
API: context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // preparation
context.barrier()
24
API: context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
  val addrs = context.getTaskInfos().map(_.address)
  ... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
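Putting the pieces together, here is a minimal PySpark sketch of the same pattern (for illustration only; sc is an existing SparkContext, and the actual MPI launch is elided):

from pyspark import BarrierTaskContext

def train_partition(iterator):
    context = BarrierTaskContext.get()
    context.barrier()                                   # wait until all tasks have started
    addrs = [info.address for info in context.getTaskInfos()]
    if context.partitionId() == 0:
        pass                                            # e.g., launch the MPI job against addrs here
    context.barrier()                                   # wait until training finishes
    return iter([])

sc.parallelize(range(4), 4).barrier().mapPartitions(train_partition).collect()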
25
Barrier mode integration
26
Horovod (an LF AI hosted project)
Horovod is a distributed training framework for TensorFlow,
Keras, PyTorch, and MXNet. It was originally developed at Uber and
is now an LF AI hosted project at the Linux Foundation.
● Little modification to single-node code.
● High-performance I/O via MPI and NCCL.
● Same convergence theory.
Some limitations:
● Before v0.16, users still had to use mpirun to launch a job,
● … with a Python training script: mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 -bind-to none
-map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
27
Hydrogen integration with Horovod
Databricks released HorovodRunner with Runtime 5.0 ML, built on
top of Horovod and Project Hydrogen.
● Runs Horovod under barrier execution mode.
● Hides cluster setup, scripts, and the MPI command line from users.
def train_hvd():
    hvd.init()
    …  # train using Horovod
HorovodRunner(np=2).run(train_hvd)
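For illustration, a slightly fuller sketch of what the training function might look like (build_model, x_train, and y_train are hypothetical helpers and data; the HorovodRunner import path is the one used on Databricks Runtime ML):

import horovod.tensorflow.keras as hvd
from tensorflow import keras

def train_hvd():
    hvd.init()                                          # one Horovod process per barrier task
    model = build_model()                               # hypothetical model-building helper
    opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0))
    model.compile(optimizer=opt, loss="categorical_crossentropy")
    model.fit(x_train, y_train, batch_size=128, epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])

from sparkdl import HorovodRunner
HorovodRunner(np=2).run(train_hvd)                      # run on 2 workers under barrier mode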
28
Implementation of HorovodRunner
Integrating Horovod with barrier mode is straightforward:
● Pickle and broadcast the train function.
○ Inspect code and warn users about potential issues.
● Launch a Spark job in barrier execution mode.
● In the first executor, use worker addresses to launch the Horovod MPI job.
● Terminate Horovod if the Spark job gets cancelled.
○ Hint: PR_SET_PDEATHSIG
Limitation:
● Tailored for Databricks Runtime ML
○ Horovod built with TensorFlow/PyTorch, SSH, OpenMPI, NCCL, etc.
○ Spark 2.4, GPU cluster configuration, etc.
29
horovod.spark
horovod.spark is a new feature in the Horovod 0.16 release. Similar to
HorovodRunner, it runs Horovod as a Spark job and takes Python
training functions. Its assumptions are more general:
● no dependency on SSH,
● system-independent process termination,
● multiple Spark versions,
● and more … also check out horovodrun:)
30
Collaboration on Horovod + Spark
Engineers at Uber and Databricks are collaborating on improving
the integration between Horovod and Spark. Goals:
● Merge design and code development into horovod.spark.
● HorovodRunner uses horovod.spark implementation with
extra Databricks-specific features.
● Support barrier execution mode and GPU-aware scheduling.
Stay tuned for the announcement from LF/Uber/Databricks!
31
Project Hydrogen: GPU-aware scheduling
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
32
Accelerator-aware scheduling
Accelerators (GPUs, FPGAs) are widely used for accelerating
specialized workloads like deep learning and signal processing.
To utilize accelerators in a Spark cluster, Spark needs to be aware
of the accelerators assigned to the driver and executors and
schedule them according to user requests.
JIRA: SPARK-24615 (ETA: Spark 3.0)
33
● Mesos, YARN, and Kubernetes already support GPUs.
● However, even when GPUs are allocated by a cluster manager for a Spark application,
Spark itself is not aware of them.
● Consider a simple case where one task needs one GPU:
Why does Spark need GPU awareness?
Executor 0
GPU:0
GPU:1
Task 0
Task 1
Executor 1
GPU:0
GPU:1
Task 2
Task 3
Task 4 ?
34
Workarounds (a.k.a. hacks)
● Limit Spark task slots per node to 1.
○ The running task can safely claim all GPUs on the node.
○ It might lead to resource waste if the workload doesn’t need all GPUs.
○ Users also need to write multithreading code to maximize data I/O.
● Let running tasks collaboratively decide which
GPUs to use, e.g., via shared locks.
35
Proposed workflow (User, Spark, and Cluster Manager)
0. Auto-discover resources.
1. Submit an application with resource requests.
2. Pass resource requests to the cluster manager.
3. Allocate executors with resource isolation.
4. Register executors.
5. Submit a Spark job.
6. Schedule tasks on available executors.
7. Dynamic allocation.
8. Retrieve assigned resources and use them in tasks.
9. Monitor and recover failed executors.
36
Discover and request GPUs
Admin can specify a script to auto-discover GPUs (#24406)
● spark.driver.resource.gpu.discoveryScript
● spark.executor.resource.gpu.discoveryScript
● e.g., `nvidia-smi --query-gpu=index ...`
Users can request GPUs at the application level (#24374); see the sketch after this list:
● spark.executor.resource.gpu.count
● spark.driver.resource.gpu.count
● spark.task.resource.gpu.count
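A minimal sketch of requesting GPUs at the application level (the configuration keys are the ones proposed above; the discovery script path is hypothetical, and the final key names may differ in released Spark):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.resource.gpu.count", "2")
    .config("spark.task.resource.gpu.count", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/scripts/getGpus.sh")
    .getOrCreate())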
37
Retrieve assigned GPUs
User can retrieve assigned GPUs from task context (#24374)
context = TaskContext.get()
assigned_gpu = context.getResources()["gpu"][0]
with tf.device(assigned_gpu):
    # training code ...
38
Cluster manager support
● Standalone: SPARK-27360
● YARN: SPARK-27361
● Kubernetes: SPARK-27362
● Mesos: SPARK-27363
39
Jenkins support (SPARK-27365)
To support end-to-end integration tests, we are adding GPU cards
to the Spark Jenkins machines hosted by Berkeley RISELab.
Thanks to NVIDIA for donating the latest Tesla T4 cards!
40
Support other accelerators
We focus on GPU support but keep the interfaces general to
support other types of accelerators in the future, e.g., FPGAs.
A hypothetical lookup is sketched after the list below.
● “GPU” is not a hard-coded resource type.
● spark.executor.resource.{resourceType}.discoveryScript
● context.getResources() returns a map from resourceType to assigned addresses.
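For illustration, a hypothetical lookup following the interface proposed above (the "fpga" resource type is an assumption):

from pyspark import TaskContext

context = TaskContext.get()
resources = context.getResources()          # proposed API: map from resourceType to assigned addresses
fpga_addrs = resources.get("fpga", [])      # empty if no FPGAs were assigned to this task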
41
Features beyond the current SPIP
● Resource request at task level.
● Fine-grained scheduling within one GPU.
● Affinity and anti-affinity.
● ...
More on distributed training: data flow
We recommend the following data flow for training (a minimal sketch follows the list):
● Load and preprocess training data using Spark.
● Save preprocessed training data to a shared storage.
○ What format? TFRecords, Parquet + Petastorm.
○ Which shared storage? S3, Azure Blob Storage, HDFS, NFS, etc.
● Load training data in DL frameworks.
○ But DL frameworks do not work well with remote storage.
42
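A minimal sketch of this flow, assuming Parquet + Petastorm and hypothetical paths and column names:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# 1. Preprocess with Spark and save to shared storage (preprocessing elided).
df = spark.read.parquet("s3://bucket/raw_events")
df.selectExpr("features", "label").write.mode("overwrite").parquet("s3://bucket/train")

# 2. Load the Parquet data into TensorFlow via Petastorm inside each training worker.
with make_batch_reader("s3://bucket/train") as reader:
    dataset = make_petastorm_dataset(reader)   # a tf.data.Dataset
    # pass `dataset` to model.fit(...) in the Horovod training function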
Connect DL frameworks to remote storage
We recommend high-performance FUSE clients to mount remote
storage as local files so DL frameworks can load/save data easily.
43
For example: mount s3://bucket or wasb://container at file:/mnt/... on each worker via FUSE clients such as Goofys or blobfuse, and point TensorFlow + Horovod at the local mount.
44
Story #2:
Streaming model inference
load using a Spark
cluster w/ GPUs
predict w/ GPUs as
a Spark task
model
45
Project Hydrogen: Optimized data exchange
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
46
Optimized data exchange
None of the integrations are possible without exchanging data
between Spark and AI frameworks. And performance matters.
JIRA: SPARK-24579
47
Pandas UDF
Pandas UDFs, introduced in Spark 2.3, use Arrow for
data exchange and Pandas for vectorized computation.
48
Pandas UDF for distributed inference
Pandas UDF makes it simple to apply a model to a data stream.
@pandas_udf(...)
def predict(features):
    ...
spark.readStream(...) \
    .withColumn('prediction', predict(col('features')))
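A slightly more complete sketch for illustration (load_model, the input path, and the column names are hypothetical; shown on a batch DataFrame, but the same UDF applies to a streaming one):

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def predict(features):
    model = load_model()                        # hypothetical helper; loaded once per batch here
    return pd.Series(model.predict(list(features)))

df = spark.read.parquet("/mnt/events")
scored = df.withColumn("prediction", predict(col("features")))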
49
Return StructType from Pandas UDF
We improved scalar Pandas UDFs to support complex return types, so users
can return predicted labels and raw scores together.
JIRA: SPARK-23836 (Spark 3.0)
@pandas_udf(...)
def predict(features):
    # ...
    return pd.DataFrame({'labels': labels, 'scores': scores})
50
Data pipelining
Without pipelining:
        CPU               GPU
t1      fetch batch #1
t2                        process batch #1
t3      fetch batch #2
t4                        process batch #2
t5      fetch batch #3
t6                        process batch #3

With pipelining:
        CPU               GPU
t1      fetch batch #1
t2      fetch batch #2    process batch #1
t3      fetch batch #3    process batch #2
t4                        process batch #3
51
Pandas UDF prefetch
To improve throughput, we prefetch Arrow record batches into
the queue while executing the Pandas UDF on the current batch.
● Enabled by default on Databricks Runtime 5.2.
● Up to 2x speedup for workloads with balanced I/O and compute.
● Observed 1.5x speedup in real workloads.
JIRA: SPARK-27569 (ETA: Spark 3.0)
52
Per-batch initialization overhead
Loading the model per batch introduces a constant overhead. We
propose a new Pandas UDF interface that takes an iterator of
batches so the model only needs to be loaded once.
JIRA: SPARK-26412 (WIP)
@pandas_udf(...)
def predict(batches):
    model = …  # load model once
    for batch in batches:
        yield model.predict(batch)
53
Standardize on the Arrow format
Many accelerated computing libraries now support Arrow. The
community is discussing whether we should expose the Arrow
format in a public interface.
● Simplify data exchange.
● Reduce data copy/conversion overhead.
● Allow pluggable vectorization code.
JIRA: SPARK-27396 (pending vote)
54
Acknowledgement
● Many ideas in Project Hydrogen are based on previous
community work: TensorFrames, BigDL, Apache Arrow, Pandas
UDF, Spark GPU support, MPI, etc.
● We would like to thank the many Spark committers and
contributors who helped with the project proposal, design, and
implementation.
55
Acknowledgement
● Xingbo Jiang
● Thomas Graves
● Andy Feng
● Alex Sergeev
● Shane Knapp
● Xiao Li
● Li Jin
● Bryan Cutler
● Takuya Ueshin
● Wenchen Fan
● Jason Lowe
● Hyukjin Kwon
● Madhukar Korupolu
● Robert Evans
● Yinan Li
● Felix Cheung
● Imran Rashid
● Saisai Shao
● Mark Hamstra
● Sean Owen
● Yu Jiang
● … and many more!
Thank you!