Distributed Models Over Distributed Data
with MLflow, Pyspark, and Pandas
Thunder Shiviah
thunder@databricks.com
Senior Solutions Architect, Databricks
SAIS EUROPE - 2019
Abstract
● More data often improves deep learning models in high-variance problem spaces (with semi-structured or unstructured data) such as NLP, images, and video, but does not significantly improve high-bias problem spaces, where traditional ML is more appropriate.
● That said, even in the limit of more data, there is still no reason to go distributed if you can instead use transfer learning.
● Data scientists have pain points:
○ running many models in parallel
○ automating the experimental setup
○ getting others (especially analysts) within an organization to use their models
● Databricks solves these problems using Pandas UDFs, the ML Runtime, and MLflow.
Why do single node data science?
Is more data always better?
The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira,
Google 2009
Banko and Brill, 2001
Team A came up with a very sophisticated algorithm using the Netflix
data. Team B used a very simple algorithm, but they added in
additional data beyond the Netflix set: information about movie genres
from the Internet Movie Database (IMDB). Guess which team did
better?
Team B got much better results, close to the best results on the Netflix
leaderboard! I'm really happy for them, and they're going to tune their
algorithm and take a crack at the grand prize.
But the bigger point is, adding more, independent data usually
beats out designing ever-better algorithms to analyze an existing
data set. I'm often surprised that many people in the business, and
even in academia, don't realize this.
- "More data usually beats better algorithms", Anand Rajaraman
Okay, case closed. Right?
Not so fast!
● First, Norvig, Banko, and Brill were focused on NLP specifically, not general ML.
● Second, Norvig et al. were not saying that ML models trained on more data are better than ML models trained on less data. Norvig was comparing ML (statistical learning) to hand-coded language rules in NLP: “...a deep approach that relies on hand-coded grammars and ontologies, represented as complex networks of relations” (Norvig, 2009).
● In his 2010 UBC Department of Computer Science Distinguished Lecture Series talk, Norvig mentions that, regarding NLP, Google is quickly hitting the limits of simple models and needs to focus on more complex models (deep learning).
● Re: the Netflix data set: the winning team later commented:
Our experience with the Netflix data is different.
IMDB data and the likes gives more information only about the movies, not about the users ... The test set is
dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra
information about the movies.
Our experiments clearly show that once you have strong CF models, such extra data is redundant
and cannot improve accuracy on the Netflix dataset.
Netflix algorithm in production
Netflix recommendations: beyond the 5 stars, Xavier Amatriain (Netflix)
Oh, and what about the size of those data sets?
● 1 billion word corpus = ~2GB
● Netflix prize data = 700 MB compressed
○ 1.5 GB uncompressed (source)
So where does that leave us?
Bias vs Variance
Andrew Ng, AI is the New Electricity
Conclusion: more data makes sense for high-variance
(semi-structured or unstructured) problem domains like
text and images. Sampling makes sense for high-bias
domains, such as structured-data problems.
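A quick illustrative sketch of the high-bias case (synthetic data, not from the talk): a simple model trained on a 10% sample scores about the same as one trained on all the data, so sampling loses little on structured problems.

```python
# High-bias model on structured data: accuracy plateaus quickly,
# so a sample of the data performs close to the full set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Full training set vs. a 10% sample of the same training set.
model_full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_sample = LogisticRegression(max_iter=1000).fit(X_train[:1500], y_train[:1500])

print(model_full.score(X_test, y_test))
print(model_sample.score(X_test, y_test))  # close to the full-data score
```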
Should we always use more data with deep learning?
No! Transfer learning on smaller data often beats
training nets from scratch on larger datasets.
OpenAI pointed out that while the amount of compute has been a key component of AI
progress, “Massive compute is certainly not a requirement to produce important results.”
(source)
In a benchmark run by our very own Matei Zaharia at Stanford, Fast.ai was able to win both
fastest and cheapest image classification. From their ImageNet competition write-up:
● Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single
machine (and faster than Intel’s entry that used a cluster of 128 machines!)
● Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost,
as discussed below).
Overall, our findings were:
● Algorithmic creativity is more important than bare-metal performance
(source)
Transfer learning models with a small number of training examples can achieve better
performance than models trained from scratch on 100x the data
Introducing state of the art text classification with universal language models,
Jeremy Howard and Sebastian Ruder
Take-away: even in the case of deep learning, if an
established model exists, it’s better to use transfer
learning on small data than to train from scratch on
larger data.
So where does Databricks fit into this story?
Training models (including hyperparameter search and
cross validation) is embarrassingly parallel
Shift from distributed data to distributed models
Introducing Pandas UDF for PySpark: How to run your native Python code with PySpark, fast.
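The distributed-models pattern can be sketched single-node with plain pandas: train one independent model per group. The same function, unchanged, is what a grouped-map Pandas UDF would distribute (in PySpark, roughly `df.groupBy("store").applyInPandas(train_store_model, schema)`); the data and column names here are invented for illustration.

```python
# One model per group: pandas runs the groups serially; a grouped-map
# Pandas UDF hands each group to a separate Spark executor.
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({
    "store": ["a"] * 4 + ["b"] * 4,
    "week":  [1, 2, 3, 4] * 2,
    "sales": [10, 12, 14, 16, 5, 5, 6, 6],
})

def train_store_model(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit an independent model on one store's data."""
    model = LinearRegression().fit(pdf[["week"]], pdf["sales"])
    return pd.DataFrame({"store": [pdf["store"].iloc[0]],
                         "slope": [model.coef_[0]]})

coefs = pd.concat(train_store_model(g) for _, g in sales.groupby("store"))
```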
Data Scientists spend lots of time setting up their
environment
Rule of frequency: automate the things you do the most
Source: xkcd
Maybe...
Source: xkcd
ML Runtime
Data Scientists have a difficult time tracking models
Data Scientists don’t know how to get the rest of the
organization to use their models
Key insight: most modern enterprises run on SQL, not
REST APIs
Use MLflow + Spark UDFs to democratize ML within the
org.
See my MLflow deployment example notebook.
Summary
● More data often improves deep learning models in high-variance problem spaces (with semi-structured or unstructured data) such as NLP, images, and video, but does not significantly improve high-bias problem spaces, where traditional ML is more appropriate.
● That said, even in the limit of more data, there is still no reason to go distributed if you can instead use transfer learning.
● Data scientists have pain points:
○ running many models in parallel
○ automating the experimental setup
○ getting others (especially analysts) within an organization to use their models
● Databricks solves these problems using Pandas UDFs, the ML Runtime, and MLflow.
● See my paper for more information.
Thank you!
thunder@databricks.com
