Improving Pandas and PySpark interoperability with Apache Arrow
Li Jin
PyData NYC
November 2017
• The information presented here is offered for informational purposes only and should not be used for any other
purpose (including, without limitation, the making of investment decisions). Examples provided herein are for
illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell
or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This
presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the
right to require the return of this presentation at any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so,
such copyrights and/or trademarks are most likely owned by the entity that created the material and are used
purely for identification and comment as fair use under international copyright and/or trademark laws. Use of
such image, copyright or trademark does not imply any association with such organization (or endorsement of
such organization) by Two Sigma, nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
IMPORTANT LEGAL INFORMATION
About Me
• Li Jin (@icexelloss)
• Software Engineer @ Two Sigma Investments
• Apache Arrow Committer
• Analytics Tools Smith
• Other Open Source Projects:
• Flint: A Time Series Library on Spark
• Cook: A Fair Scheduler on Mesos
• PySpark Overview
• PySpark UDF: current state and limitations
• Apache Arrow Overview
• Improvements to PySpark UDFs with Apache Arrow
• Future Roadmap
This Talk
PySpark Overview
• A tool for distributed data analysis
• Apache project
• JVM-based with Python interface (PySpark)
• Functionality:
• Relational: Join, group, aggregate …
• Stats and ML: Spark MLlib
• Streaming
• …
Apache Spark
• Bigger Data:
• Pandas: 10 GB
• Spark: 1000 GB
• Better Parallelism:
• Pandas: Single core
• Spark: Hundreds of cores
Why Spark
• Python interface for Spark
• API front-end for built-in Spark functions
• df.withColumn('v2', df.v1 + 1)
• Translated to Java code, running in the JVM
• Interface for native Python code (user-defined function)
• df.withColumn('v2', udf(lambda x: x+1, 'double')(df.v1))
• Running in the Python runtime
PySpark Overview
PySpark UDF: Current state and limitations
• PySpark’s interface to interact with other Python libraries
• Types of UDFs:
• Row UDF
• Group UDF
PySpark User Defined Function (UDF)
• Operates on a row-by-row basis
• Similar to the `map` operator
• Examples:
• String processing
• Timestamp processing
• Poor performance
• 1-2 orders of magnitude slower than the alternatives (built-in Spark functions or vectorized operations)
Row UDF: Current
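To make the row-by-row versus vectorized contrast concrete, here is a minimal sketch in plain pandas (no Spark involved). The column name `v1` follows the deck's `df.v1` example; the data and the helper name `plus_one` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical input column; "v1" mirrors the df.v1 example in the deck.
df = pd.DataFrame({"v1": [1.0, 2.0, 3.0, 4.0]})


def plus_one(x):
    """Scalar function: handles a single value per call, like a row UDF."""
    return x + 1


# Row-at-a-time: one Python function call per element.
row_result = df["v1"].map(plus_one)

# Vectorized: one expression over the whole column.
vectorized_result = df["v1"] + 1
```

Both forms produce the same values; the difference is how many Python-level calls are needed to get there.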
• UDF that operates on multiple rows
• Similar to `groupBy` followed by a `map` operator
• Example:
• Monthly weighted mean
• Not supported out of the box
• Poor performance
Group UDF: Current
• (values - values.mean()) / values.std()
Group UDF: Example
80% of the code is boilerplate
Slow
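The code shown on these slides is not recoverable from the transcript. As a point of comparison, the per-group normalization `(values - values.mean()) / values.std()` can be sketched in plain pandas as follows. The column names `group` and `v` and the sample data are assumptions made for illustration, and this is not the PySpark code from the talk.

```python
import pandas as pd

# Hypothetical data: column names "group" and "v" are illustrative only.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "v": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})


def normalize(values: pd.Series) -> pd.Series:
    """Subtract the group mean, then divide by the group standard deviation."""
    return (values - values.mean()) / values.std()


# transform() calls normalize once per group and keeps the original row order.
df["normalized"] = df.groupby("group")["v"].transform(normalize)
```

Each group ends up centered on zero with unit sample standard deviation, regardless of its original scale.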
• Inefficient data movement between Java and Python (Serialization /
Deserialization)
• Scalar computation model
UDF Issues
Apache Arrow
• In-memory columnar format
• Building on the success of Parquet
• Standard from the start:
• Developers from 13+ major open source projects involved
• Benefits:
• Share the effort
• Create an ecosystem
Apache Arrow
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
High Performance Sharing & Interchange
[Figure panels: Before | With Arrow]
Columnar Data Format
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: ['555-333-3333']
}]
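As a rough illustration of what "columnar" means for the records above, the following sketch pivots the row-oriented list into one list per field. It uses only plain Python containers and is a conceptual stand-in, not Arrow's actual memory layout.

```python
# Row-oriented records, as on the slide (variable-length phone lists).
persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]

# Column-oriented layout: all values of one field are stored together.
columns = {field: [person[field] for person in persons] for field in persons[0]}

# A single-column computation now reads only that column's values.
mean_age = sum(columns["age"]) / len(columns["age"])
```

Here `columns["age"]` holds every age side by side, so computing `mean_age` never touches names or phone numbers.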
Record Batch Construction
Stream layout: Schema, Dictionary Batch, Record Batch, Record Batch, Record Batch
Record batch layout for the record { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }:
• data header (describes offsets into data)
• name (bitmap)
• name (offset)
• name (data)
• age (bitmap)
• age (data)
• phones (bitmap)
• phones (list offset)
• phones (offset)
• phones (data)
Each box (vector) is contiguous memory
The entire record batch is contiguous on the wire
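The offset and data vectors listed above can be sketched in plain Python. This is a simplified model under stated assumptions: it omits the validity bitmaps, the data header, and any alignment or padding, and the helper names `encode_strings` and `decode_string` are hypothetical.

```python
def encode_strings(values):
    """Pack strings into one contiguous byte buffer plus an offsets list.

    The string at index i occupies data[offsets[i]:offsets[i + 1]].
    """
    offsets, data = [0], bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def decode_string(offsets, data, i):
    """Recover the i-th string by slicing between two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


# "name" column: one offsets vector and one data vector.
name_offsets, name_data = encode_strings(["Joe", "Jack"])

# "phones" column: list offsets index into a flattened child string column.
phones = [["555-111-1111", "555-222-2222"], ["555-333-3333"]]
list_offsets = [0]
for phone_list in phones:
    list_offsets.append(list_offsets[-1] + len(phone_list))
flat_phones = [p for phone_list in phones for p in phone_list]
phone_offsets, phone_data = encode_strings(flat_phones)
```

With this encoding, Jack's phone numbers are the child entries from `list_offsets[1]` up to `list_offsets[2]`, and each entry is a byte slice of `phone_data`. No per-value objects are needed to describe the column.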
• Maximize CPU throughput
• Pipelining
• SIMD
• Cache locality
• Scatter/gather I/O
In-Memory Columnar Format for Speed
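The contiguity these points rely on can be illustrated with Python's standard `array` module. This is only a sketch of the layout idea: pure Python does not itself demonstrate SIMD or pipelining, and the field names `age` and `score` are made up for the example.

```python
from array import array

# Hypothetical row-oriented input.
rows = [
    {"age": 18, "score": 1.5},
    {"age": 37, "score": 2.5},
    {"age": 25, "score": 4.0},
]

# Columnar: each field lives in its own contiguous, fixed-width buffer.
ages = array("q", (r["age"] for r in rows))      # signed 64-bit integers
scores = array("d", (r["score"] for r in rows))  # 64-bit floats

# The buffer can be exposed as raw memory without converting values one by one.
view = memoryview(scores)
total = sum(scores)
```

Because `scores` is a single block of 8-byte floats, a consumer written in compiled code could scan it sequentially or hand it to an I/O call as one buffer.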
• PySpark “toPandas” Improvement
• 53x Speedup
• Streaming Arrow Performance
• 7.75GB/s data movement
• Arrow Parquet C++ Integration
• 4GB/s reads
• Pandas Integration
• 9.71GB/s
Results
Read more on http://arrow.apache.org/blog/
Improving PySpark UDF
Vectorizing Row UDF
How PySpark UDF works
Executor -> Rows (Pickle) -> Python Worker (UDF: Row -> Row) -> Rows (Pickle) -> Executor
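The round trip in this diagram can be mimicked in a single Python process. The sketch below is a simulation under stated assumptions: the function names are hypothetical, both "sides" run in the same interpreter, and no sockets or JVM are involved. It only shows where pickling and per-row calls occur.

```python
import pickle


def udf(x):
    """The user's scalar function, as in the deck's `lambda x: x+1`."""
    return x + 1


def executor_send(rows):
    """Stand-in for the executor side: serialize a batch of rows with pickle."""
    return pickle.dumps(rows)


def python_worker(payload):
    """Stand-in for the Python worker: unpickle, call the UDF per row, re-pickle."""
    rows = pickle.loads(payload)
    results = [udf(row) for row in rows]  # one Python call per row
    return pickle.dumps(results)


def executor_receive(payload):
    """Stand-in for the executor side: deserialize the results."""
    return pickle.loads(payload)


output = executor_receive(python_worker(executor_send([1.0, 2.0, 3.0])))
```

Every value is serialized and deserialized twice and passes through one Python function call, which is where a trivial UDF spends most of its time.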
• Inefficient data movement (Serialization / Deserialization)
• Scalar computation model
Recap: Current issues with UDF
Profile lambda x: x+1
8 Mb/s
91.8% in Ser/Deser
Vectorized UDF
Executor (Rows -> RB) -> Python Worker (UDF: pd.DataFrame -> pd.DataFrame) -> Executor (RB -> Rows)
RB: Arrow record batch
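A vectorized UDF receives a whole batch per call instead of a single row. The following pandas-only sketch shows that calling pattern. The driver `run_in_batches` is a hypothetical stand-in for the executor, DataFrame slices stand in for Arrow record batches, and no Spark or Arrow API is called.

```python
import pandas as pd


def vectorized_udf(batch: pd.DataFrame) -> pd.DataFrame:
    """Process a whole batch in one call instead of one call per row."""
    return pd.DataFrame({"v2": batch["v1"] + 1})


def run_in_batches(column, batch_size):
    """Hypothetical driver: slice rows into batches, apply the UDF, concatenate."""
    outputs = []
    for start in range(0, len(column), batch_size):
        batch = pd.DataFrame({"v1": column[start:start + batch_size]})
        outputs.append(vectorized_udf(batch))
    return pd.concat(outputs, ignore_index=True)


result = run_in_batches([1.0, 2.0, 3.0, 4.0, 5.0], batch_size=2)
```

Five rows with a batch size of two result in three UDF calls rather than five, and the arithmetic inside each call runs over the whole column at once.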
Row UDF vs Vectorized UDF
• 20x speedup (profiler overhead adjusted*)
• Ser/Deser overhead removed
• Fewer system calls
• Faster I/O
* Actual runtime for row UDF is 2s without profiling
Improving Group UDF
• Split-apply-combine
• Break a problem into smaller pieces
• Operate on each piece independently
• Put all pieces back together
• Common pattern supported in SQL, Spark, Pandas, R …
Introduce Group UDF
• Split: groupBy
• Apply: UDF (pd.DataFrame -> pd.DataFrame)
• Combine: Inherently done by Spark
Split-Apply-Combine (UDF)
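The three steps can be written out explicitly in pandas. This sketch uses the monthly weighted mean mentioned earlier as the apply function. The column names `month`, `value`, and `weight` and the data are assumptions, and in Spark the combine step would be handled by the engine rather than by `pd.concat`.

```python
import pandas as pd

# Hypothetical data: column names are illustrative only.
df = pd.DataFrame({
    "month": ["2017-01", "2017-01", "2017-02", "2017-02"],
    "value": [1.0, 3.0, 10.0, 20.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})


def weighted_mean(group: pd.DataFrame) -> pd.DataFrame:
    """Apply step: one group in, a one-row DataFrame out."""
    wm = (group["value"] * group["weight"]).sum() / group["weight"].sum()
    return pd.DataFrame({"month": [group["month"].iloc[0]], "weighted_mean": [wm]})


# Split: one DataFrame per key. Apply: call the function on each piece.
pieces = [weighted_mean(group) for _, group in df.groupby("month")]

# Combine: stitch the per-group results back together.
result = pd.concat(pieces, ignore_index=True)
```

The apply function only ever sees one group's rows as a `pd.DataFrame`, which is the same contract as the `pd.DataFrame -> pd.DataFrame` UDF described above.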
Introduce groupBy().apply()
Rows -> groupBy -> Groups -> (each group: pd.DataFrame -> pd.DataFrame) -> Groups -> Rows
• (values - values.mean()) / values.std()
Previous Example
Group UDF: Before and After
For updated API, see: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Before: After*:
Performance
Reference: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
• Available in the upcoming Apache Spark 2.3 release
• Try it with Databricks community version:
• https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Try It!
• Improving PySpark/Pandas interoperability (SPARK-22216)
• Working towards Arrow 1.0 release
• More Arrow integration
Future Roadmap
• dev@spark.apache.org
• dev@arrow.apache.org
Get involved
Bryan Cutler
Hyukjin Kwon
Jeff Reback
Leif Walsh
Li Jin
Liang-Chi Hsieh
Reynold Xin
Takuya Ueshin
Wenchen Fan
Wes McKinney
Xiao Li
Collaborators
Questions
