Apache Arrow and
Pandas UDF on Apache Spark
Takuya UESHIN
2018-12-08, Apache Arrow Tokyo Meetup 2018
About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin
Agenda
• Apache Spark and PySpark
• PySpark and Pandas
• Python UDF and Pandas UDF
• Pandas UDF and Apache Arrow
• Arrow IPC format and Converters
• Handling Communication
• Physical Operators
• Python worker
• Work In Progress
• Follow-up Events
Apache Spark and PySpark
“Apache Spark™ is a unified analytics engine for large-scale data
processing.”
https://spark.apache.org/
• The latest release:
2.4.0 (2018/11/02)
• PySpark is the Python API
• SparkR is the R API
PySpark and Pandas
“pandas is an open source, BSD-licensed library providing
high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.”
• https://pandas.pydata.org/
• The latest release: v0.23.4 Final (2018/08/03)
• PySpark supports Pandas >= "0.19.2"
PySpark and Pandas
PySpark can convert data between PySpark DataFrame and
Pandas DataFrame.
• pdf = df.toPandas()
• df = spark.createDataFrame(pdf)
We can use Arrow as an intermediate format by setting config:
“spark.sql.execution.arrow.enabled” to “true” (“false” by default).
Python UDF and Pandas UDF
• UDF: User Defined Function
• Python UDF
• Serialize/Deserialize data with Pickle
• Fetch a block of data, but invoke the UDF row by row
• Pandas UDF
• Serialize/Deserialize data with Arrow
• Fetch a block of data, and invoke the UDF block by block
• PandasUDFType: SCALAR, GROUPED_MAP, GROUPED_AGG
We don’t need any config, but the declaration is different.
Python UDF and Pandas UDF
from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

@udf('double')
def plus_one(v):
    return v + 1

@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1
Python UDF and Pandas UDF
• SCALAR
• A transformation: One or more Pandas Series -> One Pandas Series
• The length of the returned Pandas Series must be the same as that of
the input Pandas Series
• GROUPED_MAP
• A transformation: One Pandas DataFrame -> One Pandas DataFrame
• The length of the returned Pandas DataFrame can be arbitrary
• GROUPED_AGG
• A transformation: One or more Pandas Series -> One scalar
• The returned value type should be a primitive data type
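The three contracts above can be seen with plain pandas functions, no Spark required (the function names here are illustrative only):

```python
import pandas as pd

# SCALAR: one or more Series in -> one Series of the same length out
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

# GROUPED_MAP: one DataFrame in -> one DataFrame out (any length)
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# GROUPED_AGG: one or more Series in -> one scalar out
def mean_udf(v: pd.Series) -> float:
    return float(v.mean())
```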
Performance: Python UDF vs Pandas UDF
From a blog post: Introducing Pandas UDF for PySpark
• Plus One
• Cumulative Probability
• Subtract Mean
“Pandas UDFs perform much
better than Python UDFs,
ranging from 3x to over 100x.”
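The gap comes from per-row Python call overhead; a plain-pandas analogy (not the benchmark from the blog post) shows the two execution styles:

```python
import pandas as pd

s = pd.Series(range(1000), dtype="float64")

# One Python call per row, like a Python UDF:
row_by_row = s.apply(lambda v: v + 1)

# One vectorized call over the whole block, like a Pandas UDF:
block = s + 1
```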
Apache Arrow
“A cross-language development platform for in-memory data”
https://arrow.apache.org/
• The latest release: 0.11.0 (2018/10/08)
• Columnar In-Memory
• docs/memory_layout.html
PySpark supports Arrow >= "0.8.0"
• "0.10.0" is recommended
Apache Arrow and Pandas UDF
• Use Arrow to Serialize/Deserialize data
• Streaming format for Interprocess messaging / communication (IPC)
• ArrowWriter and ArrowColumnVector
• Communicate between the JVM and the Python worker via a socket
• ArrowPythonRunner
• worker.py
• Physical Operators for each PythonUDFType
• ArrowEvalPythonExec
• FlatMapGroupsInPandasExec
• AggregateInPandasExec
Overview of Pandas UDF execution
[Diagram: a PhysicalOperator feeds groups of rows to ArrowPythonRunner, whose ArrowWriter serializes them into Arrow RecordBatches; the Python worker's ArrowStreamPandasSerializer turns the RecordBatches into Pandas data, the UDF is invoked, and the results are serialized back as RecordBatches and read into ColumnarBatches of ArrowColumnVectors.]
Arrow IPC format and Converters
Encapsulated message format
• https://arrow.apache.org/docs/ipc.html
• Messages
• Schema, RecordBatch, DictionaryBatch, Tensor
• Formats
• Streaming format
– Schema + (DictionaryBatch + RecordBatch)+
• File format
– header + (Streaming format) + footer
Pandas UDFs use Streaming format.
Arrow Converters in Spark
in Java/Scala
• ArrowWriter [src]
• A wrapper for writing VectorSchemaRoot and ValueVectors
• ArrowColumnVector [src]
• A wrapper for reading ValueVectors, works with ColumnarBatch
in Python
• ArrowStreamPandasSerializer [src]
• A wrapper for RecordBatchReader and RecordBatchWriter
Handling Communication
ArrowPythonRunner [src]
• Handle the communication between the JVM and the Python worker
• Create or reuse a Python worker
• Open a Socket to communicate
• Write data to the socket with ArrowWriter in a separate thread
• Read data from the socket
• Return an iterator of ColumnarBatch of ArrowColumnVectors
Physical Operators
Create an RDD to execute the UDF.
• There are several operators, one for each PythonUDFType
• Group input data and pass to ArrowPythonRunner
• SCALAR: every configured number of rows
– “spark.sql.execution.arrow.maxRecordsPerBatch” (10,000 by default)
• GROUPED_XXX: every group
• Read the result iterator of ColumnarBatch
• Return the iterator of rows over ColumnarBatches
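The SCALAR grouping step can be sketched as a plain iterator chunker (illustrative only; the real operator works on Spark's internal rows):

```python
from itertools import islice
from typing import Iterable, Iterator, List

def group_into_batches(rows: Iterable, max_records: int) -> Iterator[List]:
    # Yield lists of at most max_records rows, mirroring
    # spark.sql.execution.arrow.maxRecordsPerBatch (10,000 by default).
    it = iter(rows)
    while True:
        chunk = list(islice(it, max_records))
        if not chunk:
            return
        yield chunk
```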
Python worker
worker.py [src]
• Open a Socket to communicate
• Set up a UDF execution for each PythonUDFType
• Create a map function
– prepare the arguments
– invoke the UDF
– check and return the result
• Execute the map function over the input iterator of Pandas
DataFrame
• Write back the results
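The per-UDF map function can be sketched as follows for the SCALAR case (the wrapper name is hypothetical; the real logic lives in worker.py):

```python
import pandas as pd

def wrap_scalar_udf(f):
    # Prepare the argument, invoke the UDF, and check the result,
    # as worker.py does for a SCALAR pandas UDF.
    def mapper(series: pd.Series) -> pd.Series:
        result = f(series)
        if len(result) != len(series):
            raise RuntimeError(
                "Result vector from pandas_udf was not the required length: "
                "expected %d, got %d" % (len(series), len(result)))
        return result
    return mapper
```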
Work In Progress
We can track issues related to Pandas UDF.
• [SPARK-22216] Improving PySpark/Pandas interoperability
• 37 subtasks in total
• 3 subtasks are in progress
• 4 subtasks are open
Work In Progress
• Window Pandas UDF
• [SPARK-24561] User-defined window functions with pandas udf
(bounded window)
• Performance Improvement of toPandas -> merged!
• [SPARK-25274] Improve toPandas with Arrow by sending out-of-order
record batches
• SparkR
• [SPARK-25981] Arrow optimization for conversion from R DataFrame
to Spark DataFrame
Follow-up Events
Spark Developers Meetup
• 2018/12/15 (Sat) 10:00-18:00
• @ Yahoo! LODGE
• https://passmarket.yahoo.co.jp/event/show/detail/01a98dzxfauj.html
Follow-up Events
Hadoop/Spark Conference Japan 2019
• 2019/03/14 (Thu)
• @ Oi-machi
• http://hadoop.apache.jp/
Follow-up Events
Spark+AI Summit 2019
• 2019/04/23 (Tue) - 04/25 (Thu)
• @ Moscone West Convention Center, San Francisco
• https://databricks.com/sparkaisummit/north-america
Thank you!
Appendix
How to contribute?
• See: Contributing to Spark
• Open an issue on JIRA
• Send a pull-request at GitHub
• Communicate with committers and reviewers
• Congratulations!
Thanks for your contributions!
Appendix
• PySpark Usage Guide for Pandas with Apache Arrow
• https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html
• Vectorized UDF: Scalable Analysis with Python and PySpark
• https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark
• Demo for Apache Arrow Tokyo Meetup 2018
• https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/142158605138935/3546232059139201/7497868276316206/latest.html
