Distributed Models Over Distributed Data
with MLflow, Pyspark, and Pandas
Thunder Shiviah
thunder@databricks.com
Senior Solutions Architect, Databricks
SAIS EUROPE - 2019
Abstract
● More data often improves deep learning models in high-variance problem spaces (semi-structured or unstructured data) such as NLP, images, and video, but does not significantly improve high-bias problem spaces where traditional ML is more appropriate.
● That said, even when more data would help, there is still no reason to go distributed if you can use transfer learning instead.
● Data scientists have pain points:
○ running many models in parallel
○ automating the experimental setup
○ getting others (especially analysts) within an organization to use their models
● Databricks addresses these pain points with pandas UDFs, the ML Runtime, and MLflow.
Why do single node data science?
Is more data always better?
The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira,
Google 2009
Banko and Brill, 2001
Team A came up with a very sophisticated algorithm using the Netflix
data. Team B used a very simple algorithm, but they added in
additional data beyond the Netflix set: information about movie genres
from the Internet Movie Database (IMDB). Guess which team did
better?
Team B got much better results, close to the best results on the Netflix
leaderboard! I'm really happy for them, and they're going to tune their
algorithm and take a crack at the grand prize.
But the bigger point is, adding more, independent data usually
beats out designing ever-better algorithms to analyze an existing
data set. I'm often surprised that many people in the business, and
even in academia, don't realize this. - "More data usually beats better algorithms", Anand
Rajaraman
Okay, case closed. Right?
Not so fast!
● First, Norvig and Banko & Brill were focused specifically on NLP, not general ML.
● Second, Norvig et al. were not saying that ML models trained on more data are better than ML models trained on less data. Norvig was comparing ML (statistical learning) to hand-coded language rules in NLP: “...a deep approach that relies on hand-coded grammars and ontologies, represented as complex networks of relations;” (Norvig, 2009)
● In his 2010 talk in the UBC Department of Computer Science's Distinguished Lecture Series, Norvig mentions that, for NLP, Google is quickly hitting the limits of simple models and needs to shift toward more complex models (deep learning).
● Re: the Netflix dataset: the winning team later responded:
Our experience with the Netflix data is different.
IMDB data and the likes gives more information only about the movies, not about the users ... The test set is
dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra
information about the movies.
Our experiments clearly show that once you have strong CF models, such extra data is redundant
and cannot improve accuracy on the Netflix dataset.
Netflix algorithm in production
Netflix recommendations: beyond the 5 stars, Xavier Amatriain (Netflix)
Oh, and what about the size of those datasets?
● 1-billion-word corpus = ~2 GB
● Netflix prize data = 700 MB compressed
○ 1.5 GB uncompressed (source)
So where does that leave us?
Bias vs Variance
Andrew Ng, AI is the New Electricity
Conclusion: more data makes sense for high-variance (semi-structured or unstructured) problem domains like text and images. Sampling makes sense for high-bias domains, such as structured-data problems.
Should we always use more data with deep learning?
No! Transfer learning on smaller data often beats training nets from scratch on larger datasets.
OpenAI pointed out that while the amount of compute has been a key component of AI progress, “Massive compute is certainly not a requirement to produce important results.” (source)
In a benchmark run by our very own Matei Zaharia at Stanford, Fast.ai won both the fastest and the cheapest image classification entries:
In the Imagenet competition, our results were:
● Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single machine (and faster than Intel’s entry that used a cluster of 128 machines!)
● Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost, as discussed below).
Overall, our findings were:
● Algorithmic creativity is more important than bare-metal performance
(source)
Transfer learning models with a small number of training examples can achieve better
performance than models trained from scratch on 100x the data
Introducing state of the art text classification with universal language models,
Jeremy Howard and Sebastian Ruder
Take-away: Even in the case of deep learning, if an established model exists, it’s better to use transfer learning on small data than to train from scratch on larger data
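As a rough illustration of that take-away (not from the talk), here is a minimal transfer-learning sketch: it freezes an ImageNet-pretrained ResNet50 and trains only a small classification head on a new, smaller dataset. The class count and the `train_images`/`train_labels` tensors are hypothetical placeholders.

```python
# Hedged sketch: reuse a pretrained backbone instead of training from scratch.
import tensorflow as tf

NUM_CLASSES = 10  # hypothetical: number of classes in the small target dataset

# Load an ImageNet-pretrained backbone and freeze its weights.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False

# Only this small head is trained on the new (small) dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_images: (N, 224, 224, 3) floats; train_labels: (N,) ints -- placeholders.
model.fit(train_images, train_labels, epochs=3, batch_size=32)
```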
So where does Databricks fit into this story?
Training models (including hyperparameter search and
cross validation) is embarrassingly parallel
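A hedged sketch of what “embarrassingly parallel” means here, assuming a SparkContext `sc` and an in-memory dataset `X`, `y` small enough to ship to every executor: each Spark task fits and cross-validates one hyperparameter combination, with no communication between tasks.

```python
# Sketch: one independent model fit and CV score per hyperparameter setting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

param_grid = [{"n_estimators": n, "max_depth": d}
              for n in (50, 100, 200)
              for d in (5, 10, None)]

def evaluate(params):
    # Each Spark task trains and scores one configuration on its own.
    model = RandomForestClassifier(**params)
    score = cross_val_score(model, X, y, cv=3).mean()
    return params, score

# X and y are captured in the closure and serialized to executors,
# so this pattern only works when the training data is small.
results = sc.parallelize(param_grid, numSlices=len(param_grid)) \
            .map(evaluate).collect()
best_params, best_score = max(results, key=lambda pair: pair[1])
```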
Shift from distributed data to distributed models
Introducing Pandas UDF for PySpark: How to run your native Python code with PySpark, fast.
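As a sketch of the pattern (using the GROUPED_MAP pandas UDF API from Spark 2.x, current at the time of this talk), the code below trains one independent scikit-learn model per group. The column names (`store_id`, `feature1`, `feature2`, `label`) are hypothetical.

```python
# Sketch: distributed models over distributed data with a grouped-map pandas UDF.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from sklearn.linear_model import LinearRegression

result_schema = StructType([
    StructField("store_id", StringType()),
    StructField("r2", DoubleType()),
])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def train_per_group(pdf):
    # pdf holds all rows for one store_id as a plain pandas DataFrame,
    # so each group's model trains locally on a single executor.
    X, y = pdf[["feature1", "feature2"]], pdf["label"]
    model = LinearRegression().fit(X, y)
    return pd.DataFrame([{"store_id": pdf["store_id"].iloc[0],
                          "r2": model.score(X, y)}])

# One model per store, trained in parallel across the cluster.
results = df.groupBy("store_id").apply(train_per_group)
```

The same shape works for hyperparameter search and cross validation: make each parameter combination (or fold) a group, and every fit runs as its own task.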
Data Scientists spend lots of time setting up their
environment
Rule of frequency: automate the things you do the most
Source: xkcd
Maybe...
Source: xkcd
ML Runtime
Data Scientists have a difficult time tracking models
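This is where MLflow tracking comes in. A minimal, hedged sketch of the tracking API, assuming a fitted scikit-learn `model`; the parameter names and metric values are illustrative:

```python
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="rf-baseline"):
    # Record what was tried...
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    # ...how well it did...
    mlflow.log_metric("val_rmse", 0.78)
    # ...and the fitted model itself, so the run is reproducible later.
    mlflow.sklearn.log_model(model, "model")
```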
Data Scientists don’t know how to get the rest of the
organization to use their models
Key insight: most modern enterprises run on SQL, not REST APIs
Use MLflow + Spark UDFs to democratize ML within the org.
See my MLflow deployment example notebook.
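A hedged sketch of that deployment pattern: load a tracked model back as a Spark UDF so anyone who can write SQL can score with it. The run ID, column names, and view name below are placeholders.

```python
import mlflow.pyfunc

# Wrap a logged MLflow model as a Spark UDF.
predict = mlflow.pyfunc.spark_udf(spark, "runs:/<run_id>/model")

# Score a DataFrame and expose the results to SQL users as a view.
scored = df.withColumn("prediction", predict("feature1", "feature2"))
scored.createOrReplaceTempView("predictions")

# Analysts can now query model output with plain SQL:
spark.sql("SELECT * FROM predictions WHERE prediction > 0.9").show()
```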
Summary
● More data often improves deep learning models in high-variance problem spaces (semi-structured or unstructured data) such as NLP, images, and video, but does not significantly improve high-bias problem spaces where traditional ML is more appropriate.
● That said, even when more data would help, there is still no reason to go distributed if you can use transfer learning instead.
● Data scientists have pain points:
○ running many models in parallel
○ automating the experimental setup
○ getting others (especially analysts) within an organization to use their models
● Databricks addresses these pain points with pandas UDFs, the ML Runtime, and MLflow.
● See my paper for more information.
Thank you!
thunder@databricks.com
