1© Cloudera, Inc. All rights reserved.
Improving data interoperability in Python and R
Wes McKinney @wesmckinn
NYC R Conference April 8, 2016
http://numfocus.org
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
In process:
Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
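To make "columnar in-memory" concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's actual format; the table, the column names, and the helper functions are hypothetical. It stores the same records row-wise and column-wise, and shows that an aggregate over one column only needs to read that column's contiguous buffer.

```python
from array import array

# Hypothetical table with three fields; the values are made up for illustration.
rows = [
    (1, 10.5, 3),
    (2, 20.0, 1),
    (3, 7.25, 4),
]

def total_price_rows(records):
    # Row-oriented: every record is visited to pick out field 1.
    return sum(record[1] for record in records)

# Column-oriented: each field lives in its own contiguous typed buffer.
columns = {
    "id": array("q", (r[0] for r in rows)),        # signed 64-bit integers
    "price": array("d", (r[1] for r in rows)),     # 64-bit floats
    "quantity": array("q", (r[2] for r in rows)),
}

def total_price_columns(cols):
    # Only the "price" buffer is read; the other columns are never touched.
    return sum(cols["price"])

row_total = total_price_rows(rows)
col_total = total_price_columns(columns)
price_bytes = columns["price"].itemsize * len(columns["price"])
```

Both functions return the same total; the difference is how much memory each one has to walk to get there.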
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
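The contrast between the two columns can be sketched inside a single Python process. This is only an analogy, assuming Python's buffer protocol as a stand-in for a shared memory format: `pickle` plays the role of serialization between systems, and `memoryview` plays the role of a consumer reading the producer's memory directly. The variable names are hypothetical.

```python
import pickle
from array import array

# A column of 64-bit floats owned by a hypothetical "producer" system.
producer_column = array("d", [1.0, 2.0, 3.0, 4.0])

# "Today": hand the data over by serializing and deserializing it.
wire_bytes = pickle.dumps(producer_column)
copied_column = pickle.loads(wire_bytes)   # a second, independent copy

# "With a shared format": the consumer reads the producer's buffer directly.
shared_view = memoryview(producer_column)  # no element data is copied

# A later write by the producer is visible through the view, not the copy.
producer_column[0] = 100.0

view_first = shared_view[0]
copy_first = copied_column[0]
```

The view sees the producer's update because it points at the same memory, while the unpickled copy still holds the old value.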
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
Real World Example: Feather File Format for Python and R
R:
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
Python:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
More on Feather
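[Diagram: a Feather File holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library reads and writes the file; Rcpp connects it to R data.frame, and Cython connects it to pandas DataFrame.]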
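The "arrays followed by metadata" layout can be illustrated with a toy file written with the standard library. This is a sketch under stated assumptions, not the actual Feather on-disk format: the JSON footer, the 4-byte footer size, and the function names `write_toy_file` and `read_toy_column` are all invented for illustration.

```python
import io
import json
import struct
from array import array

def write_toy_file(stream, columns):
    """Write column buffers back to back, then a metadata footer."""
    metadata = []
    for name, values in columns.items():
        offset = stream.tell()
        stream.write(values.tobytes())  # raw bytes in native byte order
        metadata.append({
            "name": name,
            "typecode": values.typecode,
            "offset": offset,
            "length": len(values),
        })
    footer = json.dumps(metadata).encode("utf-8")
    stream.write(footer)
    # Final 4 bytes: little-endian size of the footer, so a reader can find it.
    stream.write(struct.pack("<I", len(footer)))

def read_toy_column(stream, name):
    """Read a single column by consulting the footer first."""
    stream.seek(-4, io.SEEK_END)
    (footer_size,) = struct.unpack("<I", stream.read(4))
    stream.seek(-(4 + footer_size), io.SEEK_END)
    metadata = json.loads(stream.read(footer_size).decode("utf-8"))
    entry = next(m for m in metadata if m["name"] == name)
    column = array(entry["typecode"])
    stream.seek(entry["offset"])
    column.frombytes(stream.read(entry["length"] * column.itemsize))
    return column

buffer = io.BytesIO()
write_toy_file(buffer, {
    "x": array("q", [1, 2, 3]),
    "y": array("d", [0.5, 1.5, 2.5]),
})
y = read_toy_column(buffer, "y")
```

Because the metadata records where each array starts, a reader can load one column without scanning the others.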
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
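The idea of sharing algorithms over a common memory representation can be hinted at with Python's buffer protocol. In this sketch, which assumes packed native float64 values as the agreed layout, one hypothetical function `mean_float64` works unchanged on three different containers because they expose the same bytes.

```python
import struct
from array import array

def mean_float64(buffer):
    """Mean of any buffer that holds packed native float64 values."""
    # Reinterpret the underlying bytes as float64 without copying them.
    view = memoryview(buffer).cast("B").cast("d")
    return sum(view) / len(view)

# Three hypothetical "columns" held by different container types.
from_array = array("d", [1.0, 2.0, 3.0])
from_bytes = struct.pack("3d", 1.0, 2.0, 3.0)
from_bytearray = bytearray(from_bytes)

results = [mean_float64(c) for c in (from_array, from_bytes, from_bytearray)]
```

The algorithm is written once against the layout, not against any one container, which is the kind of reuse the slide argues for across languages.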
Get Involved in Arrow
• Join the community
• dev@arrow.apache.org
• Slack: https://apachearrowslackin.herokuapp.com/
• http://arrow.apache.org
• @ApacheArrow
Thank you
Wes McKinney @wesmckinn
Views are my own

