Simplify Data Conversion from Spark to Deep Learning
Liang Zhang
Software Engineer @ Databricks
About Me
▪ Machine Learning Team @ Databricks
▪ Master's degree from Carnegie Mellon University
linkedin.com/in/liangz1/
Agenda
▪ Why should we care about data conversion between Spark and deep learning frameworks?
▪ Pain points
▪ Overview of the Spark Dataset Converter
▪ Demo
▪ Best Practices
Motivation: Data Conversion from Spark to DL
[Diagram: Spark DataFrame -> ? -> TensorFlow / PyTorch]
• Images from driving camera: Detect traffic lights
• Large amount of data - TBs
• New images arriving every day
• Data cleaning and labeling
• Train the model with all available data and periodically re-train with new data
• Predict the label of new images
Pain points: Data Conversion from Spark to Deep Learning frameworks
Pain points: Data Conversion from Spark to DL
• Single-node training:
  • Collect a sample of data to the driver in a pandas DataFrame
• Distributed training:
  • Save the Spark DataFrame to TFRecords files and parse the serialized data in TFRecords using TensorFlow
  • Save the Spark DataFrame to parquet files and write your custom PyTorch DataLoader to load the partitions
• Hard to migrate from single-node to distributed training
• Many lines of extra code to save, load and parse intermediate files
Overview of the Spark Dataset Converter
Spark Dataset Converter API Overview
[Diagram: Spark DataFrame -> Spark Dataset Converter -> TensorFlow Dataset / PyTorch DataLoader]
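from petastorm.spark import make_spark_converter

# df is an existing Spark DataFrame
converter = make_spark_converter(df)
# TensorFlow: the context manager yields a TensorFlow Dataset
with converter.make_tf_dataset() as dataset:
    tf_model.fit(dataset)
# PyTorch: the context manager yields a PyTorch DataLoader
with converter.make_torch_dataloader() as dataloader:
    train(torch_model, dataloader)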
Spark Dataset Converter API
[Diagram: ETL (Spark DataFrame) -> cache on HDFS/DBFS -> Training (tf.data.Dataset / PyTorch DataLoader)]
• Found cached parquet file?
  • No: cache the DataFrame in a parquet file (data.parquet)
  • Yes: load the cached parquet file with petastorm
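The cache-or-load decision in this diagram can be sketched in plain Python. This is only an illustration under stated assumptions, not the converter's actual implementation: `cache_key`, `get_or_create_cache` and `fake_write` are hypothetical names, the query plan is represented as a plain string, and hashing that string is an assumed way of recognizing a DataFrame that has already been cached.

```python
import hashlib
import os
import tempfile


def cache_key(query_plan):
    """Hypothetical key: a short hash of the DataFrame's query plan text."""
    return hashlib.sha256(query_plan.encode("utf-8")).hexdigest()[:16]


def get_or_create_cache(query_plan, cache_root, write_fn):
    """Return (path, was_cached); write_fn(path) materializes data on a miss."""
    path = os.path.join(cache_root, cache_key(query_plan))
    if os.path.isdir(path):
        # "Yes" branch: a cached copy for this plan already exists.
        return path, True
    # "No" branch: create the cache directory and write the data once.
    os.makedirs(path)
    write_fn(path)
    return path, False


def fake_write(path):
    """Stand-in for writing the DataFrame as parquet (writes a placeholder file)."""
    with open(os.path.join(path, "data.parquet"), "wb") as f:
        f.write(b"placeholder")


cache_root = tempfile.mkdtemp()
plan = "Project [image, label] <- Relation[images]"

first = get_or_create_cache(plan, cache_root, fake_write)    # miss: writes the cache
second = get_or_create_cache(plan, cache_root, fake_write)   # hit: reuses the cache
other = get_or_create_cache(plan + " Filter (label = 1)", cache_root, fake_write)
```

The same plan maps to the same directory, so the second call skips the write; a different plan gets its own cache entry.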
Spark Dataset Converter Features
• Cache intermediate files
  ▪ Recognize cached Spark DataFrame by checking the analyzed query plan
  ▪ Automatic cache cleaning at program exit
• Easy migration to distributed
  ▪ Change two arguments to migrate your data loading code from single-node to distributed setting
• MLlib vector handling
  ▪ Convert MLlib vectors to 1D arrays automatically
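The slide does not name the two arguments. Assuming they are petastorm's reader sharding parameters `cur_shard` and `shard_count`, the idea behind them can be sketched in plain Python. The function `shard_row_groups`, the file names, and the round-robin assignment are all illustrative assumptions, not petastorm's actual code.

```python
def shard_row_groups(row_groups, cur_shard=None, shard_count=None):
    """Select the pieces of the cached data that one worker should read.

    With both arguments left as None (single-node), every piece is read.
    With shard_count workers, worker cur_shard reads every shard_count-th piece.
    """
    if shard_count is None:
        return list(row_groups)
    if cur_shard is None or not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in [0, shard_count)")
    return [rg for i, rg in enumerate(row_groups) if i % shard_count == cur_shard]


# Hypothetical cached pieces of one DataFrame.
row_groups = ["part-%05d.parquet" % i for i in range(10)]

# Single-node: no sharding arguments, the one process sees all data.
single_node = shard_row_groups(row_groups)

# Distributed: the same call, with only the two sharding arguments changed per worker.
worker_shards = [
    shard_row_groups(row_groups, cur_shard=rank, shard_count=4) for rank in range(4)
]
```

Note that 10 pieces do not divide evenly among 4 workers, so some workers receive more data than others; this is the situation the `num_epochs=None` best practice addresses.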
How to use the Spark Dataset Converter API?
(demo)
Demo notebooks
• Image Classification
  • Spark to TensorFlow Dataset
    • https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-tensorflow.html
  • Spark to PyTorch DataLoader
    • https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-pytorch.html
Best Practices
Best Practices with Spark Dataset Converter
• Image data decoding and preprocessing
  • Decode image bytes and preprocess in TransformSpec, not in Spark
  • Spark -> TransformSpec -> Dataset.map -> in the model (GPU)
• Generate infinite batches using num_epochs=None
  • In distributed training, this guarantees that every worker gets exactly the same amount of data.
• Manage the lifecycle of cache data
  • On a local laptop or in a scheduled job on Databricks, the cache files are automatically deleted when the Python process exits.
  • In a Databricks notebook, we recommend configuring lifecycle rules for the underlying S3 buckets storing the cache files.
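Why infinite batches help in distributed training can be shown with a small plain-Python sketch. It does not use petastorm; the `batches` generator, the shard contents, and the way `steps_per_epoch` is computed are assumptions made for illustration, with `num_epochs=None` mimicking the setting named above.

```python
import itertools


def batches(rows, batch_size, num_epochs=1):
    """Yield lists of batch_size rows; num_epochs=None repeats the rows forever."""
    if num_epochs is None:
        stream = itertools.cycle(rows)
    else:
        stream = itertools.chain.from_iterable(itertools.repeat(rows, num_epochs))
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch


# Two workers whose shards have different sizes (10 rows vs. 7 rows).
shards = {0: list(range(10)), 1: list(range(10, 17))}
batch_size = 4
total_rows = sum(len(rows) for rows in shards.values())
steps_per_epoch = total_rows // (batch_size * len(shards))

# Finite reading: each worker produces a different number of batches.
finite_steps = {
    rank: len(list(batches(rows, batch_size))) for rank, rows in shards.items()
}

# Infinite reading: every worker takes exactly steps_per_epoch full batches.
infinite_batches = {
    rank: list(
        itertools.islice(batches(rows, batch_size, num_epochs=None), steps_per_epoch)
    )
    for rank, rows in shards.items()
}
```

With finite reading the workers fall out of step (3 batches vs. 2), which can stall synchronized training. With an infinite stream the step count is fixed by the training loop instead, for example through the `steps_per_epoch` argument of Keras `fit`.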

 
apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays New York 2025 - Open Source and disrupting the travel distribution ec...apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays
 
Hypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdfHypothesis Testing Training Material.pdf
Hypothesis Testing Training Material.pdf
AbdirahmanAli51
 
unit- 5 Biostatistics and Research Methodology.pdf
unit- 5 Biostatistics and Research Methodology.pdfunit- 5 Biostatistics and Research Methodology.pdf
unit- 5 Biostatistics and Research Methodology.pdf
KRUTIKA CHANNE
 
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...
apidays
 
apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr)
apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr)apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr)
apidays New York 2025 - CIAM in the wild by Michael Gruen (Layr)
apidays
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
MICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdfMICROSOFT POWERPOINT AND USES(BEST)..pdf
MICROSOFT POWERPOINT AND USES(BEST)..pdf
bathyates
 
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...
apidays
 
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...
apidays
 
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptxSAP_S4HANA_EWM_Food_Processing_Industry.pptx
SAP_S4HANA_EWM_Food_Processing_Industry.pptx
vemulavenu484
 
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays
 
Ch01_Introduction_to_Information_Securit
Ch01_Introduction_to_Information_SecuritCh01_Introduction_to_Information_Securit
Ch01_Introduction_to_Information_Securit
KawukiDerrick
 

Simplify Data Conversion from Spark to TensorFlow and PyTorch

  • 1. Simplify Data Conversion from Spark to Deep Learning. Liang Zhang, Software Engineer @ Databricks
  • 2. About Me ▪ Liang Zhang, linkedin.com/in/liangz1/ ▪ Machine Learning Team @ Databricks ▪ Master's degree from Carnegie Mellon University
  • 3. Agenda ▪ Why should we care about data conversion between Spark and deep learning frameworks? ▪ Pain points ▪ Overview of the Spark Dataset Converter ▪ Demo ▪ Best Practices
  • 4. Motivation: Data Conversion from Spark to DL (diagram: Spark DataFrame -> ? -> TensorFlow / PyTorch) • Images from driving camera: detect traffic lights • Large amount of data (TBs) • New images arriving every day • Data cleaning and labeling • Train the model with all available data and periodically re-train with new data • Predict the label of new images
  • 5. Pain points: Data Conversion from Spark to Deep Learning frameworks
  • 6. Pain points: Data Conversion from Spark to DL • Single-node training: • Collect a sample of data to the driver in a pandas DataFrame • Distributed training: • Save the Spark DataFrame to TFRecords files and load TFRecords using TensorFlow • Save the Spark DataFrame to parquet files and write your custom PyTorch DataLoader to load the partitions
  • 7. Pain points: Data Conversion from Spark to DL • Single-node training: • Collect a sample of data to the driver in a pandas DataFrame • Distributed training: • Save the Spark DataFrame to TFRecords files and parse the serialized data in TFRecords using TensorFlow • Save the Spark DataFrame to parquet files and write your custom PyTorch DataLoader to load the partitions • Hard to migrate from single-node to distributed training • Many lines of extra code to save, load and parse intermediate files
  • 8. Overview of the Spark Dataset Converter
  • 9. Spark Dataset Converter API Overview (diagram: Spark DataFrame -> Spark Dataset Converter -> TensorFlow Dataset / PyTorch DataLoader)

        from petastorm.spark import make_spark_converter

        converter = make_spark_converter(df)

        with converter.make_tf_dataset() as dataset:
            tf_model.fit(dataset)

        with converter.make_torch_dataloader() as dataloader:
            train(torch_model, dataloader)

  • 10. Spark Dataset Converter API (diagram: ETL side holds the Spark DataFrame, training side consumes a tf.data.Dataset / torch.dataloader, cache lives on HDFS/DBFS) • Found cached parquet file? • No: cache the DataFrame in a parquet file (data.parquet) • Yes: load the cached parquet file with petastorm
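
The cache decision on slide 10 can be illustrated with a small standalone sketch. This is not petastorm's implementation: it assumes the query plan is available as a string, hashes it to derive a file name, and uses a JSON file as a stand-in for the parquet file. The helper names `cache_key` and `get_or_create_cache` are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(query_plan: str) -> str:
    """Derive a stable file name from a textual query plan (assumed input)."""
    return hashlib.sha256(query_plan.encode("utf-8")).hexdigest()[:16]


def get_or_create_cache(cache_dir: Path, query_plan: str, compute_rows):
    """Return (path, was_cached); materialize the rows only on a cache miss."""
    # JSON is a stand-in here; the converter described on the slide writes parquet.
    path = cache_dir / f"{cache_key(query_plan)}.json"
    if path.exists():
        return path, True  # "Yes" branch: reuse the cached file
    path.write_text(json.dumps(compute_rows()))  # "No" branch: write the cache
    return path, False


cache_dir = Path(tempfile.mkdtemp())
plan = "Project [id, label] <- Filter (label IS NOT NULL) <- Scan images"
calls = []  # records how often the rows were actually computed


def compute_rows():
    calls.append(1)
    return [{"id": 1, "label": 0}, {"id": 2, "label": 1}]


first_path, first_hit = get_or_create_cache(cache_dir, plan, compute_rows)
second_path, second_hit = get_or_create_cache(cache_dir, plan, compute_rows)
```

The second call with the same plan finds the existing file and skips the computation, which is the behavior the "Found cached parquet file?" branch describes.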
  • 11. Spark Dataset Converter Features • Cache intermediate files ▪ Recognize a cached Spark DataFrame by checking the analyzed query plan ▪ Automatic cache cleaning at program exit • Easy migration to distributed ▪ Change two arguments to migrate your data loading code from a single-node to a distributed setting • MLlib vector handling ▪ Convert MLlib vectors to 1D arrays automatically
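
The "change two arguments" point can be sketched without Spark or petastorm. The sketch below assumes the two arguments are a shard index and a shard count, named `cur_shard` and `shard_count` after petastorm's reader parameters. The modulo assignment of row groups is an illustrative assumption, not necessarily how petastorm distributes them, and `shard_row_groups` is a hypothetical helper.

```python
def shard_row_groups(row_groups, cur_shard=None, shard_count=None):
    """Return the row groups that one worker reads.

    With both arguments left as None (single-node), every row group is read.
    With both set (distributed), worker `cur_shard` reads every
    `shard_count`-th row group, starting at its own index.
    """
    if cur_shard is None and shard_count is None:
        return list(row_groups)
    if cur_shard is None or shard_count is None:
        raise ValueError("cur_shard and shard_count must be set together")
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in [0, shard_count)")
    return [rg for i, rg in enumerate(row_groups) if i % shard_count == cur_shard]


row_groups = [f"part-{i:05d}" for i in range(7)]

# Single-node: no sharding arguments, the worker sees everything.
single_node = shard_row_groups(row_groups)

# Distributed: the same call with two extra arguments per worker.
worker_shards = [
    shard_row_groups(row_groups, cur_shard=rank, shard_count=3) for rank in range(3)
]
```

With 7 row groups and 3 workers the shards have sizes 3, 2 and 2, so workers do not necessarily receive equal amounts of data.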
  • 12. How to use the Spark Dataset Converter API? (demo)
  • 13. Demo notebooks • Image Classification • Spark to TensorFlow Dataset • https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-tensorflow.html • Spark to PyTorch DataLoader • https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-pytorch.html
  • 15. Best Practices with Spark Dataset Converter • Image data decoding and preprocessing ▪ Decode image bytes and preprocess in TransformSpec, not in Spark ▪ Order of stages: Spark -> TransformSpec -> Dataset.map -> in the model (GPU) • Generate infinite batches using num_epochs=None ▪ In distributed training, this guarantees that every worker gets exactly the same amount of data • Manage the lifecycle of cached data ▪ On a local laptop or in a scheduled job on Databricks, the cache files are deleted automatically when the Python process exits ▪ In a Databricks notebook, we recommend configuring lifecycle rules for the underlying S3 buckets that store the cache files
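
Why infinite batches equalize the work per worker can be shown with a plain-Python sketch. It assumes that num_epochs=None behaves like cycling over a worker's shard indefinitely and that each worker runs a fixed number of steps per epoch. The helper `infinite_batches` and the step formula are illustrative assumptions, not petastorm code.

```python
import itertools


def infinite_batches(rows, batch_size):
    """Yield fixed-size batches forever by cycling over the rows."""
    stream = itertools.cycle(rows)
    while True:
        yield list(itertools.islice(stream, batch_size))


batch_size = 2
# Two workers with unequal shards: 5 rows and 3 rows.
shards = [list(range(0, 5)), list(range(5, 8))]
total_rows = sum(len(shard) for shard in shards)

# Every worker runs the same number of steps, derived from the global row count.
steps_per_epoch = total_rows // (batch_size * len(shards))

batches_per_worker = [
    list(itertools.islice(infinite_batches(shard, batch_size), steps_per_epoch))
    for shard in shards
]
```

The worker with the smaller shard wraps around to the start of its data instead of running out, so both workers produce the same number of full batches per epoch.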
  • 16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.