Improving Python and Spark Performance
and Interoperability with Apache Arrow
Julien Le Dem
Principal Architect
Dremio
Li Jin
Software Engineer
Two Sigma Investments
© 2017 Dremio Corporation, Two Sigma Investments, LP
About Us
Julien Le Dem (@J_)
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet
Li Jin (@icexelloss)
• Software Engineer at Two Sigma Investments
• Building a Python-based analytics platform with PySpark
• Other open source projects:
– Flint: A Time Series Library on Spark
– Cook: A Fair Share Scheduler on Mesos
Agenda
• Current state and limitations of PySpark UDFs
• Apache Arrow overview
• Improvements realized
• Future roadmap
Current state and limitations of PySpark UDFs
Why do we need User Defined Functions?
• Some computation is more easily expressed with Python than with Spark built-in functions.
• Examples:
– weighted mean
– weighted correlation 
– exponential moving average
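To make two of the examples above concrete, here is a minimal plain-Python sketch of a weighted mean and an exponential moving average. The function names and the smoothing parameter `alpha` are illustrative assumptions, not taken from the talk, and the EMA uses one common convention (the first output equals the first input).

```python
def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w) for paired values and weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def exponential_moving_average(values, alpha):
    """Return the series s[0] = x[0], s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(x)  # seed the series with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

Both functions need several rows at once (and, for the moving average, an ordering), which is why they do not fit a row-at-a-time function.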
What is a PySpark UDF?
• A PySpark UDF is a user-defined function executed in the Python runtime.
• Two types:
– Row UDF: 
• lambda x: x + 1
• lambda date1, date2: (date1 - date2).years
– Group UDF (subject of this presentation):
• lambda values: np.mean(np.array(values))
Row UDF
• Operates on a row by row basis
– Similar to `map` operator
• Example:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# df is an existing DataFrame with a double column v1
df.withColumn(
    'v2',
    udf(lambda x: x + 1, DoubleType())(df.v1)
)
• Performance:
– 60x slower than built-in functions for a simple case
Group UDF
• UDF that operates on more than one row
– Similar to `groupBy` followed by `map` operator
• Example:
– Compute weighted mean by month
Group UDF
• Not supported out of the box:
– Need boilerplate code to pack/unpack multiple rows into a nested row
• Poor performance
– Groups are materialized and then converted to Python data structures
Example: Data Normalization
(values - values.mean()) / values.std()
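The expression above reads as pandas code. As a dependency-free sketch of the same computation, the following uses the standard `statistics` module; the function name `normalize` is an assumption for illustration. It uses the sample standard deviation (n - 1 denominator), which is also the default of pandas `Series.std()`.

```python
import statistics


def normalize(values):
    """Return (v - mean) / std for each value, using the sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # requires at least two values
    return [(v - mean) / std for v in values]
```

The mean and standard deviation depend on every row, so the function has to see the whole column (or the whole group) before it can emit a single output value.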
Example: Monthly Data Normalization
[Code listing not captured; the slide annotations mark its "useful bits" and "boilerplate".]
• Poor performance: 16x slower than the baseline groupBy().agg(collect_list())
Problems
• Packing / unpacking nested rows
• Inefficient data movement (Serialization / Deserialization)
• Scalar computation model: object boxing and interpreter overhead
Apache Arrow
Arrow: An open source standard
• Common need for an in-memory columnar format
• Building on the success of Parquet
• Top-level Apache project
• Standard from the start
– Developers from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPUs
• Embeddable
– In execution engines, storage layers, etc.
• Interoperable
High Performance Sharing & Interchange
Before:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: ['555-333-3333']
}]
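As a rough illustration of what "columnar" means for the records above, this plain-Python sketch transposes a list of records into one list per field. The helper name `to_columns` is hypothetical, and the sketch assumes every record has the same fields; it shows the idea only, not Arrow's actual memory layout.

```python
persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]


def to_columns(records):
    """Transpose a list of records into a dict holding one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(persons)
```

After the transposition, all values of one field sit next to each other, so a computation over `age` never touches the `name` or `phones` data.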
Record Batch Construction
Stream: Schema Negotiation -> Dictionary Batch -> Record Batch -> Record Batch -> Record Batch
Record batch layout:
data header (describes offsets into data)
name (bitmap)
name (offset)
name (data)
age (bitmap)
age (data)
phones (bitmap)
phones (list offset)
phones (offset)
phones (data)
Example record:
{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}
Each box (vector) is contiguous memory 
The entire record batch is contiguous on wire
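The bitmap, offset and data vectors for a string column such as `name` can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, offsets are kept as Python integers rather than fixed-width integers, and buffer padding and alignment are ignored. The validity bitmap sets bit `i % 8` of byte `i // 8` for each non-null value (least-significant bit first).

```python
def encode_string_column(values):
    """Encode optional strings as (validity bitmap, offsets, data bytes)."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))  # end of slot i, start of slot i + 1
    return bytes(bitmap), offsets, bytes(data)


def decode_string_column(bitmap, offsets, data):
    """Rebuild the list of optional strings from the three vectors."""
    out = []
    for i in range(len(offsets) - 1):
        if bitmap[i // 8] & (1 << (i % 8)):
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out
```

Each of the three results is a single contiguous buffer, so a reader can find value `i` with two offset lookups instead of parsing the preceding values.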
In-memory columnar format for speed
• Maximize CPU throughput
– Pipelining
– SIMD
– Cache locality
• Scatter/gather I/O
Results
• PySpark Integration: 53x speedup (IBM Spark work on SPARK-13534)
  http://s.apache.org/arrowresult1
• Streaming Arrow Performance: 7.75GB/s data movement
  http://s.apache.org/arrowresult2
• Arrow Parquet C++ Integration: 4GB/s reads
  http://s.apache.org/arrowresult3
• Pandas Integration: 9.71GB/s
  http://s.apache.org/arrowresult4
Arrow Releases
[Chart: changes and days per Arrow release; extracted values: 178, 195, 311, 85, 237, 131, 76, 17]
Improvements to PySpark with Arrow
How PySpark UDF works
[Diagram: Executor -> Batched Rows -> Python Worker (UDF: scalar -> scalar) -> Batched Rows -> Executor]
Current Issues with UDF
• Serialize / Deserialize in Python
• Scalar computation model (Python for loop)
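The difference between the scalar model and a batch-at-a-time model can be sketched in plain Python by counting how often the user function is invoked. All names here (`run_scalar`, `run_batched`, `CallCounter`) are hypothetical helpers for illustration and are not part of PySpark.

```python
def run_scalar(udf, batch):
    """Scalar model: the interpreter calls the UDF once per row."""
    return [udf(x) for x in batch]


def run_batched(batch_udf, batch):
    """Batch model: the UDF is called once and receives the whole column."""
    return batch_udf(batch)


class CallCounter:
    """Wrap a function and count how many times it is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


values = [float(i) for i in range(10_000)]
scalar_udf = CallCounter(lambda x: x + 1)
batch_udf = CallCounter(lambda column: [x + 1 for x in column])

scalar_result = run_scalar(scalar_udf, values)
batch_result = run_batched(batch_udf, values)
```

Both paths produce the same output, but the scalar path pays one Python function call per row. In this sketch the batch function still loops in Python internally; with a library such as NumPy or pandas, that inner loop would run in compiled code.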
Profile of lambda x: x+1 (actual runtime is 2s without profiling)
[Profiler screenshot not captured; annotations: 8 Mb/s, 91.8%]
Vectorize Row UDF
[Diagram: Executor <-> Python Worker; Rows -> RB, UDF: pd.DataFrame -> pd.DataFrame, RB -> Rows (RB = Arrow RecordBatch)]
Why pandas.DataFrame
• Fast, feature-rich, widely used by Python users
• Already exists in PySpark (toPandas)
• Compatible with popular Python libraries:
– NumPy, StatsModels, SciPy, scikit-learn…
• Zero copy to/from Arrow
Scalar vs Vectorized UDF
20x Speed Up
Actual Runtime is 2s without profiling
Scalar vs Vectorized UDF
Overhead removed
Scalar vs Vectorized UDF
Fewer system calls
Faster I/O
Scalar vs Vectorized UDF
4.5x Speed Up
Support Group UDF
• Split-apply-combine:
– Break a problem into smaller pieces
– Operate on each piece independently
– Put all pieces back together
• Common pattern supported in SQL, Spark, Pandas, R … 
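The three steps can be sketched in plain Python, here computing a weighted mean by month. The helper names (`split_apply_combine`, `group_weighted_mean`), the field names `month`, `v`, `w` and the sample rows are illustrative assumptions; this shows the pattern, not the Spark implementation.

```python
from collections import defaultdict


def split_apply_combine(rows, key, apply):
    """Group rows by key(row), run apply(group) per group, collect the results."""
    groups = defaultdict(list)
    for row in rows:  # split
        groups[key(row)].append(row)
    # apply each function to its group, then combine into one result mapping
    return {k: apply(group) for k, group in groups.items()}


def group_weighted_mean(group):
    """Weighted mean of field 'v' with weights 'w' over one group of rows."""
    total_weight = sum(r["w"] for r in group)
    return sum(r["v"] * r["w"] for r in group) / total_weight


rows = [
    {"month": "2017-01", "v": 1.0, "w": 1.0},
    {"month": "2017-01", "v": 3.0, "w": 3.0},
    {"month": "2017-02", "v": 10.0, "w": 2.0},
]
result = split_apply_combine(rows, key=lambda r: r["month"], apply=group_weighted_mean)
```

Because each group is processed independently, the apply step is the only part a user has to write; splitting and combining stay generic.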
Split-Apply-Combine (Current)
• Split: groupBy, window, …
• Apply: mean, stddev, collect_list, rank …
• Combine: Inherently done by Spark
Split-Apply-Combine (with Group UDF)
• Split: groupBy, window, …
• Apply: UDF
• Combine: Inherently done by Spark
Introduce groupBy().apply()
• UDF: pd.DataFrame -> pd.DataFrame
– Treat each group as a pandas DataFrame
– Apply UDF on each group
– Assemble as PySpark DataFrame
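The group-in, group-out contract can be emulated in plain Python, with a list of row dicts standing in for each pandas DataFrame. This is a sketch under that assumption: `group_apply` and `normalize_group` are hypothetical names, not the PySpark API, and the example applies the normalization formula per month.

```python
import statistics
from itertools import groupby
from operator import itemgetter


def group_apply(rows, key_field, udf):
    """Call udf(group_rows) once per group and concatenate the returned rows."""
    ordered = sorted(rows, key=itemgetter(key_field))  # stable sort keeps row order
    out = []
    for _, group in groupby(ordered, key=itemgetter(key_field)):
        out.extend(udf(list(group)))
    return out


def normalize_group(group):
    """Replace 'v' with (v - mean) / std, computed within this group only."""
    values = [r["v"] for r in group]
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std; needs at least two rows
    return [dict(r, v=(r["v"] - mean) / std) for r in group]


rows = [
    {"month": 1, "v": 1.0}, {"month": 2, "v": 10.0},
    {"month": 1, "v": 2.0}, {"month": 2, "v": 30.0},
    {"month": 1, "v": 3.0}, {"month": 2, "v": 20.0},
]
normalized = group_apply(rows, "month", normalize_group)
```

The user-written function only ever sees one group and returns rows of the same shape, so no packing or unpacking of nested rows is needed in user code.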
[Diagram: groupBy splits rows into groups; each group: pd.DataFrame -> pd.DataFrame]
Previous Example: Data Normalization
(values - values.mean()) / values.std()
Previous Example: Data Normalization
5x speedup
[Code comparison not captured; panels: "Current" and "Group UDF"]
Limitations
• Requires Spark Row <-> Arrow RecordBatch conversion
– Incompatible memory layout (row vs column)
• (groupBy) No local aggregation
– Difficult due to how PySpark works. See https://issues.apache.org/jira/browse/SPARK-10915
Future Roadmap
What’s Next (Arrow)
• Arrow RPC/REST
• Arrow IPC
• Apache {Spark, Drill, Kudu} to Arrow Integration
– Faster UDFs, Storage interfaces
What’s Next (PySpark UDF)
• Continue working on SPARK-20396
• Support Pandas UDF with more PySpark functions:
– groupBy().agg()
– window
Get Involved
• Watch SPARK-20396
• Join the Arrow community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– Follow @ApacheArrow
Thank you
• Bryan Cutler (IBM), Wes McKinney (Two Sigma Investments) for helping build this feature
• Apache Arrow community
• Spark Summit organizers
• Two Sigma and Dremio for supporting this work
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
