Accelerating Data Processing in Spark SQL with Pandas UDFs
Michael Tong, Quantcast
Machine Learning Engineer
Agenda
Review of Pandas UDFs
Review what they are and go over some
development tips
Modeling at Quantcast
How we use Spark SQL in production
Example Problem
Introduce a real problem from our model training
pipeline that will be the main focus of our
optimization efforts for this talk.
Optimization tips and tricks.
Iteratively and aggressively optimize this problem
with pandas UDFs
Optimization Tricks
Do more things in memory
In-memory loops beat generating Spark SQL intermediate rows. Look for
ways to do as much work in memory as possible
Aggregate Keys
Try to reduce the number of unique keys in your
data and/or process multiple keys in a single UDF
call.
Use inverted indices
Works especially well with sparse data.
Use python libraries
Pandas is easy to work with but slow; use other
Python libraries for better performance
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Review of Pandas UDFs
What are Pandas UDFs?
▪ UDF = User Defined Function
▪ Pandas UDFs are part of Spark SQL,
which is Apache Spark’s module for
working with structured data.
▪ Pandas UDFs are a great way of
writing custom data-processing
logic in a developer-friendly
environment.
Summary
What are Pandas UDFs?
▪ Scalar UDFs. One-to-one mapping
function that supports simple return
types (no Map/Struct types)
▪ Grouped Map UDFs. Requires a
groupby operation but can return a
variable number of output rows with
complicated return types.
▪ Grouped Agg UDFs. I recommend you
use Grouped Map UDFs instead.
Types of Pandas UDFs
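The deck does not include code for these UDF types, so here is a minimal sketch of a Scalar UDF and a Grouped Map UDF using the Spark 2.4-style pandas_udf API; the DataFrame, schemas, and function names are illustrative, not from the talk.

```python
# Minimal sketch of the two pandas UDF types discussed (illustrative names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

# Scalar UDF: one-to-one mapping with a simple return type.
@pandas_udf("double", PandasUDFType.SCALAR)
def times_two(v):
    return v * 2.0

# Grouped Map UDF: requires a groupby, may return any number of rows per group.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.withColumn("v2", times_two("v")).show()
df.groupby("id").apply(subtract_mean).show()
```

The scalar form maps one batch of values to a same-length batch, while the grouped-map form receives a whole pandas DataFrame per group and returns a DataFrame matching the declared schema.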
Development tips and tricks
▪ Use an interactive development framework.
▪ At Quantcast we use Jupyter notebooks.
▪ Develop with mock data.
▪ Pandas UDFs call Python functions, so develop locally with mock data in your interactive environment to iterate quickly on ideas (a small sketch of this workflow follows this slide).
▪ Use magic commands (if you are using Jupyter)
▪ Useful commands like %timeit, %time and %prun allow for easy profiling and performance tuning to squeeze every bit of performance out of your pandas UDFs.
▪ Use Python’s debugging tool
▪ The module is pdb (the Python debugger).
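As a hypothetical illustration of that workflow (none of this code is from the deck), the UDF body can be exercised as an ordinary Python function on mock data, then profiled with notebook magics and inspected with pdb:

```python
# Hypothetical local-iteration loop: a pandas UDF body is plain Python, so it
# can be developed and profiled against mock data without touching a cluster.
import pandas as pd

def count_unique_ids(pdf):
    # Candidate UDF body under development: unique ids per model id.
    return pdf.groupby("model_id", as_index=False).agg(num_ids=("id", "nunique"))

mock = pd.DataFrame({"id": ["A", "B", "B", "C"], "model_id": [0, 0, 1, 1]})
print(count_unique_ids(mock))

# In a Jupyter cell, the usual magics profile the same function:
#   %time   count_unique_ids(mock)
#   %timeit count_unique_ids(mock)
#   %prun   count_unique_ids(mock)
# and pdb steps through it when something looks wrong:
#   import pdb; pdb.run("count_unique_ids(mock)")
```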
Modeling at Quantcast
Modeling at Quantcast
▪ Train tens of thousands of models
that refresh daily to weekly
▪ Models are trained on first-party data
from global internet traffic.
▪ We have a lot of data
▪ Around 400TB raw logs written/day
▪ Data is cleaned and compressed to
about 4-5 TB/day
▪ Typically train on several days to
months of data for each model.
Scale, scale, and even more scale
Example Problem
Example Problem
▪ We have about 10k models that we want to train.
▪ Each of them covers different geo regions
▪ Some of them are over large regions (e.g. everybody in the US)
▪ Some of them are over specific regions (e.g. everybody in San Francisco)
▪ Some of them are over large regions but exclude specific regions (e.g. everybody in the US except people in San Francisco)
▪ For each model, we want to know how many unique ids (users) were
found in each region over a certain period of time.
A high level overview
Example Problem
▪ Each model will have a set of inclusion regions (e.g. US, San Francisco)
where each id must be in one of these regions to be considered part
of the model.
▪ Each model will have a set of exclusion regions (e.g. San Francisco)
where each id must be in none of these regions to be considered part
of the model.
▪ Each id only needs to satisfy the geo constraints once to be part of
the model (e.g., an id that moves from the US to Canada during the
training timeframe is considered valid for a US model)
More details
Example Problem
With some example tables
Feature Map
Feature        | Feature Id
US             | 0 or 100
San Francisco  | 1 or 101

Model Data and Result
Model                | Geo Incl. | Geo Excl. | # unique ids | Ids
Model-0 (US)         | [0, 100]  | []        | 4            | A, B, C, D
Model-1 (US, not SF) | [0, 100]  | [1, 101]  | 2            | A, B

Feature Store
Id | Timestamp | Feature ids     | Model ids
A  | ts-1      | [0]             | [0, 1]
B  | ts-2      | [0, 1]          | [0, 1]
C  | ts-3      | [100, 101]      | [0]
D  | ts-4      | [0, 1, 2, 3, 4] | [0]
D  | ts-5      | [999]           | []
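To make the later sketches concrete, here is a hedged reconstruction of these example tables as small Spark DataFrames; the variable and column names (feature_store_df, model_data_df, feature_ids, geo_incl, geo_excl) are assumptions reused throughout the sketches below, not names from the talk.

```python
# Mock versions of the example tables above, used by the later sketches.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

feature_store_df = spark.createDataFrame(
    [("A", "ts-1", [0]),
     ("B", "ts-2", [0, 1]),
     ("C", "ts-3", [100, 101]),
     ("D", "ts-4", [0, 1, 2, 3, 4]),
     ("D", "ts-5", [999])],
    "id string, timestamp string, feature_ids array<bigint>")

model_data_df = spark.createDataFrame(
    [(0, [0, 100], []),           # Model-0: US
     (1, [0, 100], [1, 101])],    # Model-1: US, not SF
    "model_id bigint, geo_incl array<bigint>, geo_excl array<bigint>")
```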
Optimization:
Tips and Tricks
Naive approach: Use Spark SQL
▪ Spark has built-in functions to do everything we need
▪ Get all (row, model) pairs using a cross join
▪ Use functions.array_intersect for the inclusion/exclusion logic.
▪ Use groupby and aggregate to get the counts.
▪ The code is really simple (<10 lines of code)
▪ We will test this on a sample of 100k rows.
Naive approach: Use Spark SQL
Source code
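The actual source-code slide is an image and is not reproduced in this transcript, so the following is only a sketch of the naive plan the bullets describe (cross join, array_intersect, then a distinct count), written against the mock DataFrames above:

```python
# Sketch of the naive Spark SQL approach (not the talk's actual code).
from pyspark.sql import functions as F

# Every (feature store row, model) pair.
pairs = feature_store_df.crossJoin(model_data_df)

# A row matches a model if it hits at least one inclusion region
# and none of the exclusion regions.
matches = pairs.where(
    (F.size(F.array_intersect("feature_ids", "geo_incl")) > 0) &
    (F.size(F.array_intersect("feature_ids", "geo_excl")) == 0))

# Count the distinct ids per model.
result = (matches
          .groupBy("model_id")
          .agg(F.countDistinct("id").alias("num_unique_ids")))
result.show()
```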
Naive approach: Use Spark SQL
▪ Only processes about 25
rows/CPU/second
▪ To see why, look at the graph.
▪ We generate about 700x as many
intermediate rows as input rows to
process this.
▪ This is because every row on average
belongs to several models.
▪ There has to be a better way.
This solution is terrible
[Diagram: 100,000 input rows × 10,067 models → 69,697,819 intermediate (row, model) pairs]
Optimization: Use Pandas UDFs for Looping
▪ One reason Spark is so slow here is
the large number of intermediate
rows.
▪ What if we wrote a simple UDF that
would iterate over all of the rows in
memory instead?
▪ For this example problem, it speeds
things up by ~1.8x
Optimization: Use Pandas UDFs for Looping
▪ Store the model data
(model_data_df) in a pandas
dataframe.
▪ Use a pandas GROUPED_MAP UDF to
process the data for each id.
▪ Figure out which models belong to an
id in a nested for loop
▪ This is faster because we do not
have to generate intermediate rows.
The code in a nutshell
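The code slide itself is not in the transcript; the sketch below shows one plausible shape of this looping GROUPED_MAP UDF, reusing the mock DataFrames from earlier (details of the production UDF may differ):

```python
# Sketch of the looping approach: hold the model table in a pandas DataFrame
# and scan it in a nested loop inside a GROUPED_MAP UDF, one call per id.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

model_data_pdf = model_data_df.toPandas()   # ~10k rows, small enough for memory

@pandas_udf("model_id long, num_ids long", PandasUDFType.GROUPED_MAP)
def models_for_id(pdf):
    matched = set()
    for features in pdf["feature_ids"]:                  # every row for this id
        fs = set(features)
        for row in model_data_pdf.itertuples():          # nested loop over models
            if fs & set(row.geo_incl) and not fs & set(row.geo_excl):
                matched.add(row.model_id)
    # One partial-count row per matched model for this id.
    return pd.DataFrame([(m, 1) for m in sorted(matched)],
                        columns=["model_id", "num_ids"])

result = (feature_store_df.groupby("id").apply(models_for_id)
          .groupBy("model_id")
          .agg(F.sum("num_ids").alias("num_unique_ids")))
```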
Optimization: Aggregate keys
▪ In model training, there are some
commonly used filters.
▪ Instead of counting the number of
unique models, count the number of
unique filters.
▪ In this data set, there are 473 unique
model filters, far fewer than the
~10k models.
▪ ~9.82x faster than the previous
solution.
Idea
Most common geo inclusions/exclusions:
Count | Inclusion | Exclusion
2035  | [US]      | []
409   | [GB]      | []
389   | [CA]      | []
358   | [AU]      | []
274   | [DE]      | []
Optimization: Aggregate Keys
▪ Create a UDF that iterates over the
unique filters (by filterId) instead of
model ids.
▪ In order to get the model ids, create
a table that contains the mapping
from model ids to filter ids
(filter_model_id_pairs) and use a
broadcast hash join.
The code in a nutshell
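Again as a hedged sketch rather than the talk's actual code: deduplicate the (inclusion, exclusion) pairs into filter ids on the driver, loop over the unique filters inside the UDF, and recover per-model counts with a broadcast hash join (filter_id and filter_model_id_pairs follow the slide's naming; the rest is assumed):

```python
# Sketch of the aggregate-keys idea, building on the earlier mock DataFrames.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Collapse duplicate (geo_incl, geo_excl) pairs into filter ids on the driver.
model_pdf = model_data_df.toPandas()
keys = model_pdf.apply(
    lambda r: (tuple(sorted(r.geo_incl)), tuple(sorted(r.geo_excl))), axis=1)
model_pdf["filter_id"] = keys.map({k: i for i, k in enumerate(keys.unique())})

filter_model_id_pairs = spark.createDataFrame(model_pdf[["model_id", "filter_id"]])
filters_pdf = model_pdf.drop_duplicates("filter_id")[["filter_id", "geo_incl", "geo_excl"]]

@pandas_udf("filter_id long, num_ids long", PandasUDFType.GROUPED_MAP)
def filters_for_id(pdf):
    matched = set()
    for features in pdf["feature_ids"]:
        fs = set(features)
        for row in filters_pdf.itertuples():     # unique filters, not models
            if fs & set(row.geo_incl) and not fs & set(row.geo_excl):
                matched.add(row.filter_id)
    return pd.DataFrame([(f, 1) for f in sorted(matched)],
                        columns=["filter_id", "num_ids"])

# Broadcast hash join maps filter counts back to the ~10k model ids.
result = (feature_store_df.groupby("id").apply(filters_for_id)
          .join(F.broadcast(filter_model_id_pairs), "filter_id")
          .groupBy("model_id")
          .agg(F.sum("num_ids").alias("num_unique_ids")))
```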
Optimization: Aggregate Keys in Batches
▪ What if we grouped things by
something bigger than an id?
▪ Generate fewer intermediate rows.
▪ Take advantage of Python
vectorization.
▪ We can rewrite the UDF to take in
batches of ~10k ids per call.
▪ ~2.9x faster than the previous
solution.
▪ ~51.3x faster than the naive one.
Idea
Optimization: Aggregate Keys in Batches
▪ Group things into batches based on
the hash of the id.
▪ Have the UDF group each batch by id
and count the number of ids that
satisfy each filter, returning a partial
count for each filter id.
▪ The group by operation becomes a
sum instead of a count because we
do partial counts in the batches.
The code in a nutshell
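A sketch of the batched version, building on the previous one; the bucket count and column names are assumptions, and filters_pdf / filter_model_id_pairs come from the previous sketch:

```python
# Sketch of batching: bucket ids by hash so one GROUPED_MAP call handles roughly
# 10k ids and returns partial counts per filter; the final step sums partials.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

num_buckets = 200   # tune so each bucket holds on the order of 10k ids

batched = feature_store_df.withColumn(
    "bucket", F.abs(F.hash("id")) % num_buckets)

@pandas_udf("filter_id long, num_ids long", PandasUDFType.GROUPED_MAP)
def count_ids_in_batch(pdf):
    counts = {}
    for _, id_rows in pdf.groupby("id"):          # every id in this batch
        matched = set()
        for features in id_rows["feature_ids"]:
            fs = set(features)
            for row in filters_pdf.itertuples():
                if fs & set(row.geo_incl) and not fs & set(row.geo_excl):
                    matched.add(row.filter_id)
        for f in matched:
            counts[f] = counts.get(f, 0) + 1      # partial count per filter
    return pd.DataFrame(list(counts.items()), columns=["filter_id", "num_ids"])

result = (batched.groupby("bucket").apply(count_ids_in_batch)
          .join(F.broadcast(filter_model_id_pairs), "filter_id")
          .groupBy("model_id")
          .agg(F.sum("num_ids").alias("num_unique_ids")))   # sum, not count
```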
Optimization: Inverted Indexes
▪ Each feature store row has relatively
few unique features.
▪ Feature store rows have 10-20
features/row.
▪ There are ~500 unique filters.
▪ Use an inverted index to iterate over
the unique features instead of filters.
▪ Use set operations for
inclusion/exclusion logic
▪ ~6.44x faster than previous solution.
Idea
Optimization: Inverted Indexes
▪ Create maps for filter id to
inclusion/exclusion filters.
▪ Use those maps to get the set of
inclusion/exclusion filters each row
belongs to.
▪ Use set operations to perform the
inclusion/exclusion logic.
▪ Have each UDF call process batches
of ids.
The code in a nutshell
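A sketch of the inverted-index version, again building on the previous sketches: the two index maps are built once on the driver and captured by the UDF, so each row only touches the filters its features point to:

```python
# Sketch of the inverted-index idea: feature -> {filters that include it} and
# feature -> {filters that exclude it}, with set algebra per row.
from collections import defaultdict
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

incl_index, excl_index = defaultdict(set), defaultdict(set)
for row in filters_pdf.itertuples():
    for feat in row.geo_incl:
        incl_index[feat].add(row.filter_id)
    for feat in row.geo_excl:
        excl_index[feat].add(row.filter_id)

@pandas_udf("filter_id long, num_ids long", PandasUDFType.GROUPED_MAP)
def count_ids_inverted(pdf):
    counts = {}
    for _, id_rows in pdf.groupby("id"):
        matched = set()
        for features in id_rows["feature_ids"]:
            # Filters hit by this row's features, via the inverted indexes.
            included = set().union(*(incl_index.get(f, ()) for f in features))
            excluded = set().union(*(excl_index.get(f, ()) for f in features))
            matched |= included - excluded
        for f in matched:
            counts[f] = counts.get(f, 0) + 1
    return pd.DataFrame(list(counts.items()), columns=["filter_id", "num_ids"])

result = (batched.groupby("bucket").apply(count_ids_inverted)
          .join(F.broadcast(filter_model_id_pairs), "filter_id")
          .groupBy("model_id")
          .agg(F.sum("num_ids").alias("num_unique_ids")))
```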
Optimization: Use python libraries
▪ Pandas is optimized for ease of use,
not speed.
▪ Use Python libraries (e.g. itertools) to
make Python code run faster.
▪ functools.reduce and numpy are also good
candidates to consider for other
UDFs.
▪ ~2.6x faster than previous solution.
▪ ~860x faster than naive solution!
Idea
Optimization: Use python libraries
▪ Use .values to extract the columns
from a pandas dataframe.
▪ Use itertools to iterate through loops
faster than plain Python for loops.
▪ itertools.groupby is used to group the
(sorted) data.
▪ itertools.chain.from_iterable is used to iterate
through a nested for loop.
The code in a nutshell
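A final sketch combining .values, itertools.groupby, and itertools.chain.from_iterable; as before, this illustrates the technique rather than reproducing the talk's code, and it reuses the indexes and filter mapping from the previous sketches:

```python
# Sketch of the itertools version: raw numpy arrays via .values, per-id grouping
# with itertools.groupby, and chain.from_iterable to flatten nested lookups.
import itertools
import operator
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("filter_id long, num_ids long", PandasUDFType.GROUPED_MAP)
def count_ids_fast(pdf):
    ids = pdf["id"].values                     # .values: plain numpy arrays
    feature_lists = pdf["feature_ids"].values
    counts = {}
    # itertools.groupby needs input already ordered by the key, so sort first.
    pairs = sorted(zip(ids, feature_lists), key=operator.itemgetter(0))
    for _, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
        matched = set()
        for _, features in group:
            # chain.from_iterable flattens the nested "filters per feature" loop.
            included = set(itertools.chain.from_iterable(
                incl_index.get(f, ()) for f in features))
            excluded = set(itertools.chain.from_iterable(
                excl_index.get(f, ()) for f in features))
            matched |= included - excluded
        for f in matched:
            counts[f] = counts.get(f, 0) + 1
    return pd.DataFrame(list(counts.items()), columns=["filter_id", "num_ids"])

result = (batched.groupby("bucket").apply(count_ids_fast)
          .join(F.broadcast(filter_model_id_pairs), "filter_id")
          .groupBy("model_id")
          .agg(F.sum("num_ids").alias("num_unique_ids")))
```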
Optimization: Summary
▪ Pandas UDFs are extremely flexible
and can be used to speed up Spark
SQL.
▪ We discussed a problem where we
could apply optimization tricks for
almost 1000x speedup.
▪ Apply these tricks to your own
problems and watch things
accelerate.
Key takeaways
Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Thank you
