Uwe L. Korn
PyData Paris 14th June 2016
How Apache Arrow and Parquet
boost cross-language interop
About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize Replenishment and Pricing for the Retail
industry with Predictive Analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL
Agenda
The Problem
Arrow
Parquet
Outlook
Why is columnar better?
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
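The idea behind the figure can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the column names `ids`, `price` and `quantity` are made up, and the standard-library `array` module stands in for a real columnar buffer.

```python
from array import array

# Row-wise layout: one tuple per record, so the values of a single
# column are interleaved with the values of all other columns.
rows = [(1, 9.99, 3), (2, 4.50, 10), (3, 20.00, 1)]

# Columnar layout: one contiguous, typed buffer per column.
ids = array("q", (r[0] for r in rows))
price = array("d", (r[1] for r in rows))
quantity = array("q", (r[2] for r in rows))

# Row-wise scan: every record is visited to read a single field.
total_rowwise = sum(r[2] for r in rows)

# Columnar scan: only the one contiguous buffer is read.
total_columnar = sum(quantity)

print(total_rowwise, total_columnar)
```

Because a column's values sit next to each other in memory, a scan over one column stays cache-friendly and is a natural fit for SIMD instructions.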
Different Systems - Varying
Python Support
• Various levels of Python Support
• Built in Python
• Python API
• No Python at all
• Each tool/algorithm works on
columnar data
• Separate conversion routines for
each pair
• causes overhead
• there’s no one-size-fits-all solution
Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
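A rough way to see why pairwise conversion does not scale is to count the converters. The two helper functions below are hypothetical and only do the arithmetic: with a converter for every ordered pair of systems the count grows quadratically, while a single shared format needs at most one reader and one writer per system.

```python
def pairwise_converters(n_systems: int) -> int:
    """Directed converters needed when every system talks to every other one."""
    return n_systems * (n_systems - 1)


def shared_format_converters(n_systems: int) -> int:
    """Upper bound when each system only reads and writes one common format."""
    return 2 * n_systems


for n in (3, 5, 13):
    print(n, pairwise_converters(n), shared_format_converters(n))
```

If a system uses the shared format as its native in-memory representation, it needs no converter at all, so the second number is a worst case.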
Apache Arrow
• Specification for in-memory
columnar data layout
• No overhead for cross-system /
cross-language communication
• Designed for efficiency (exploit
SIMD, cache locality, ..)
• Supports nested data structures
Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
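To make "in-memory columnar data layout" concrete, here is a minimal sketch of how a nullable 32-bit integer column can be laid out as two buffers: a validity bitmap (one bit per slot, least-significant bit first, 1 meaning "value present") and a contiguous values buffer. The function names are hypothetical, and the sketch leaves out the buffer padding and alignment that the specification asks for.

```python
import struct


def build_int32_array(values):
    """Return (validity_bitmap, values_buffer) for a list of ints and None."""
    n = len(values)
    bitmap = bytearray((n + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first bit order, 1 = valid
    # Null slots still occupy space in the values buffer; 0 is a filler here.
    data = struct.pack(f"<{n}i", *(0 if v is None else v for v in values))
    return bytes(bitmap), data


def read_slot(bitmap, data, i):
    """Read slot i, returning None when its validity bit is not set."""
    if not ((bitmap[i // 8] >> (i % 8)) & 1):
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]


bitmap, data = build_int32_array([1, None, 3])
print(bitmap, data, [read_slot(bitmap, data, i) for i in range(3)])
```

Any process that knows this layout can read slot `i` with plain offset arithmetic, which is what allows buffers to be shared between systems without a conversion step.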
Apache Arrow - The Impact
• An example: Retrieve a dataset from an MPP database
and analyze it in Pandas
• Run a query in the DB
• Pass it in columnar form to the DB driver
• The ODBC layer transforms it into row-wise form
• Pandas makes it columnar again
• Ugly real-life solution: export as CSV, bypass ODBC
• In future: Use Arrow as interface between the DB and
Pandas
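The round trip described above amounts to transposing the data twice. The sketch below imitates it with plain Python containers; `to_rows` and `to_columns` are hypothetical stand-ins for the driver and the client library, not real ODBC or Pandas calls.

```python
# Columnar result as produced by the database engine (made-up data).
db_columns = {"id": [1, 2, 3], "price": [9.99, 4.50, 20.00]}


def to_rows(columns):
    """Driver step: transpose columns into row tuples."""
    names = list(columns)
    return names, list(zip(*(columns[n] for n in names)))


def to_columns(names, rows):
    """Client step: transpose row tuples back into columns."""
    return {n: list(col) for n, col in zip(names, zip(*rows))}


names, rows = to_rows(db_columns)
client_columns = to_columns(names, rows)
print(rows)
print(client_columns == db_columns)
```

Both transpositions touch every value and produce a full copy, yet the client ends up with exactly the layout the database started from. A shared columnar interface would make both steps unnecessary.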
Apache Arrow
• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java /
Python / .. code.
• Arrow structures / classes
• RPC (upcoming) & IPC (alpha) support
• Conversion code for Parquet, Pandas, ..
• Combined effort from developers of more than 13 major OSS
projects
• Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
Arrow in Action: Feather
• Language-agnostic file format for
binary data frame storage
• Read performance close to raw
disk I/O
• by Wes McKinney (Python) and
Hadley Wickham (R)
• Julia Support in progress
Arrow Arrays
Feather Metadata
(flatbuffers)
Apache Parquet
Apache Parquet
• Binary file format for nested columnar data
• Inspired by the Google Dremel paper
• space and query efficient
• multiple encodings
• predicate pushdown
• column-wise compression
• many tools use Parquet as the default input format
• very popular in the JVM/Hadoop-based world
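Two of the ideas behind "multiple encodings" are dictionary encoding and run-length encoding, both of which work well because a column holds values of one type that often repeat. The sketch below shows the principle only; it does not reproduce Parquet's actual byte format, and all function names are hypothetical.

```python
from itertools import groupby


def dictionary_encode(column):
    """Replace each value by an index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def decode(dictionary, runs):
    """Invert both encodings and return the original column."""
    return [dictionary[i] for i, count in runs for _ in range(count)]


column = ["DE", "DE", "DE", "FR", "FR", "DE"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
print(dictionary, indices, runs)
```

The encoded form stores each distinct string once and each run as a pair of small integers, which a general-purpose compressor can then shrink further column by column.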
The Basics
• 1 File, includes metadata
• Several row groups
• all with the same number of column chunks
• n pages per column chunk
• Benefits:
• pre-partitioned for fast distributed access
• statistics in the metadata for predicate pushdown
Blog post by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
File
Row Group
Column Chunk
Page
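The hierarchy above, together with the per-chunk statistics, is what makes predicate pushdown possible. The following sketch models it with hypothetical `RowGroup` and `ColumnChunk` classes and made-up data. A real reader takes the statistics from the file metadata; here they are simply computed when the toy row groups are built.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    values: list      # stands in for the encoded pages of the chunk
    min_value: int    # statistics kept in the metadata
    max_value: int


@dataclass
class RowGroup:
    columns: dict     # column name -> ColumnChunk


def make_row_group(**columns):
    """Build a row group and record min/max statistics for each column."""
    return RowGroup(
        {name: ColumnChunk(vals, min(vals), max(vals)) for name, vals in columns.items()}
    )


def scan_greater_than(row_groups, column, threshold):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        chunk = group.columns[column]
        if chunk.max_value <= threshold:
            continue  # statistics prove that nothing can match: skip it
        groups_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, groups_read


file_row_groups = [
    make_row_group(year=[2012, 2013], sales=[10, 20]),
    make_row_group(year=[2014, 2015], sales=[30, 40]),
    make_row_group(year=[2016, 2016], sales=[50, 60]),
]
matches, groups_read = scan_greater_than(file_row_groups, "year", 2015)
print(matches, groups_read)
```

Only one of the three row groups is decoded for the filter `year > 2015`; the other two are ruled out from their statistics alone.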
Using Parquet in Python
• You can use it already today with Python:
• sqlContext.read.parquet("..").toPandas()
• Needs to pass through Spark, very slow
• Native Python support on its way:
• Parquet I/O to Arrow
• Arrow provides NumPy conversion
State of Arrow & Parquet
Arrow
in-memory spec for columnar data
• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Parquet
columnar on-disk storage
• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Upcoming
• Parquet <-Arrow-> Pandas
• IPC on its way
• alpha implementation using memory mapped files
• JVM <-> native with shared reference counting
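The idea behind IPC over memory-mapped files can be sketched with the standard library alone. This is not Arrow's IPC format: the file holds nothing but the raw values of one 64-bit integer column, and the file name `column.bin` is made up. The point is that the consumer views the mapped bytes as typed values in place, without parsing or copying the file contents.

```python
import mmap
import os
import tempfile
from array import array

values = [1, 2, 3, 4]

# "Producer": write one column of native-endian int64 values to a file.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    array("q", values).tofile(f)

# "Consumer": map the file and view its bytes as int64 values in place.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as raw, raw.cast("q") as column:
            total = sum(column)         # reads straight from the mapping
            received = column.tolist()  # copies only for display purposes

print(received, total)
```

Two processes that map the same file see the same physical pages, so handing over a column costs no more than agreeing on the file and its layout.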
Get Involved!
• dev@arrow.apache.org & dev@parquet.apache.org
• https://apachearrowslackin.herokuapp.com/
• https://arrow.apache.org/
• https://parquet.apache.org/
• @ApacheArrow & @ApacheParquet
Questions ?!

