www.twosigma.com
Memory Interoperability for
Analytics and Machine Learning
March 26, 2017. All Rights Reserved.
Wes McKinney @wesmckinn
ScaledML @ Stanford
March 25, 2017
Me
• Currently: Software Architect at Two Sigma Investments
• Creator of Python pandas project
• PMC member for Apache Arrow and Apache Parquet
• Author of Python for Data Analysis
• Other Python projects: Ibis, Feather, statsmodels
Important Legal Information
The information presented here is offered for informational purposes only and should not be used for
any other purpose (including, without limitation, the making of investment decisions). Examples
provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing
herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest;
tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments,
LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any
time.
Some of the images, logos or other material used herein may be protected by copyright and/or
trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright
and/or trademark laws. Use of such image, copyright or trademark does not imply any association with
such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
This talk
• Benefits of interoperable data and metadata
• Challenges to sharing memory between runtime environments
• Apache Arrow: Purpose and C++ architecture
• Opportunities for collaboration
• Example application: pandas 2.0
Changing hardware landscape
• Intel has released its first production 3D XPoint SSD
• Reportedly 1000x faster than NAND and less expensive than RAM
• RAM and shared memory / mmap performance are converging
Changing software landscape
• Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)
• DIY open source architectures for machine learning in production
• Streaming / batch data processing pipelines
• Data cleaning and feature engineering
• Model fitting / scoring / serving
“Zero-copy” memory interfaces
• Enable computational tools to process a dataset without additional serialization or transfer to a different memory space
• Allow random access on a dataset that does not fit in RAM
• Another interpretation: reading a dataset is a metadata-only conversion
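The "metadata-only conversion" idea can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not code from the talk: a `bytearray` stands in for memory produced by another tool, and `memoryview.cast()` plays the role of attaching type metadata to it.

```python
import array

# Producer side: eight native ints laid out back to back in one byte buffer.
raw = bytearray(array.array("i", range(8)).tobytes())

# Consumer side: attach type metadata to the existing bytes.
# cast() reinterprets the buffer in place; no element is copied.
typed = memoryview(raw).cast("i")

# Slicing a memoryview is also zero-copy: `window` is a view onto `raw`.
window = typed[2:5]

# Writing through the view changes the original buffer, which shows
# that both names refer to the same memory.
window[0] = 99

roundtrip = array.array("i")
roundtrip.frombytes(bytes(raw))
print(list(roundtrip))  # [0, 1, 99, 3, 4, 5, 6, 7]
```

The only work done on the consumer side is deciding how to interpret bytes that already exist, which is why the cost does not grow with the size of the dataset.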
Challenges to zero-copy memory sharing
• Cross-language issues
  • Type metadata + logical types
  • Byte/bit-level memory layout
• Language-specific issues
  • In-memory data structures
  • Memory allocation and sharing constructs
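A small sketch of why both the byte-level layout and the logical type have to be agreed on before memory can be shared. The value and the "milliseconds since the Unix epoch" logical type are assumptions chosen for the example; they do not come from the talk.

```python
import struct
from datetime import datetime, timedelta, timezone

# Eight bytes written by a producer as a little-endian 64-bit integer.
raw = struct.pack("<q", 1_490_400_000_000)

# Byte-level layout: the same bytes read with the wrong byte order
# decode to a different number.
as_little_endian = struct.unpack("<q", raw)[0]
as_big_endian = struct.unpack(">q", raw)[0]

# Logical type: even with the right layout, the integer only becomes
# meaningful once both sides agree it is a timestamp in milliseconds
# since the Unix epoch (an assumption made for this example).
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_timestamp = epoch + timedelta(milliseconds=as_little_endian)
print(as_timestamp.isoformat())  # 2017-03-25T00:00:00+00:00
```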
What is pandas?
• Popular in-memory data manipulation tool for Python
• Focused on tabular datasets (“data frames”)
• Sprawling codebase spanning multiple areas
  • IO for many data formats
  • Array manipulations / data preparation
  • OLAP-style analytics
• Internals implemented using NumPy array objects
NumPy
• Tensor memory model ("ndarray") for numeric data
• Strided, homogeneously-typed, byte-addressable memory
• APL-inspired semantics
• Zero-copy construction from compatible memory layouts
• Computational tools support both strided and contiguous memory access
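The strided memory model can be illustrated without NumPy itself. The sketch below uses the standard library's `memoryview`, which exposes the same `shape` and `strides` concepts; it is a stand-in for the idea, not a description of NumPy's API.

```python
import array

flat = array.array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# View the same six doubles as a 2 x 3 grid. The cast goes through a
# byte view because memoryview.cast() needs one side to be a byte format.
grid = memoryview(flat).cast("B").cast("d", shape=[2, 3])
print(grid.shape, grid.strides)  # (2, 3) (24, 8)

# A strided 1-D view: every other element, still without copying.
every_other = memoryview(flat)[::2]
print(every_other.strides, every_other.contiguous)  # (16,) False

# Both views observe a write to the underlying memory.
flat[4] = 40.0
print(grid[1, 1], every_other[2])  # 40.0 40.0
```

The strides are byte steps per dimension, so a non-contiguous view is just different metadata over the same block.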
pandas: Technical debt + Architectural issues
• A tensor library like NumPy is an awkward fit for pandas use cases
• Multidimensionality + strided memory access complicate algorithms
• Lack of built-in missing value support
• Weak on native string, variable-length, or nested types
• pandas is at its core an “in-memory columnar” problem, similar to analytical SQL engines
Thesis: Tensors and Tables
• Two data structures are best suited for zero-copy sharing
  • Tensors: N-dimensional, homogeneously typed arrays
  • Tables: column-oriented, heterogeneously typed
• These data structures can be defined using common memory and metadata primitives
Observations
• A Tensor is semantically a multidimensional view of a 1D block of memory
• Writing computational code targeting arbitrary tensors is much more difficult than targeting 1D contiguous arrays
• Tensors of non-fixed-size types (e.g. strings) occur less frequently
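The first two observations can be made concrete with a short sketch. All function names here are hypothetical and chosen for the example; the point is only that a tensor element is found by index arithmetic over a 1D block, and that the general strided case needs that arithmetic for every element while the contiguous case does not.

```python
from itertools import product


def c_order_strides(shape, itemsize):
    """Byte strides for a row-major (C order) tensor of the given shape."""
    strides, step = [], itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def byte_offset(index, strides):
    """Position of one element inside the 1D block."""
    return sum(i * s for i, s in zip(index, strides))


def sum_contiguous(block):
    """1D contiguous case: a single pass over the block."""
    return sum(block)


def sum_strided(block, shape, strides, itemsize):
    """General case: every element needs index arithmetic."""
    total = 0
    for index in product(*(range(dim) for dim in shape)):
        total += block[byte_offset(index, strides) // itemsize]
    return total


shape, itemsize = (2, 3, 4), 8
strides = c_order_strides(shape, itemsize)
block = list(range(24))  # stands in for 24 eight-byte values

print(strides)                          # (96, 32, 8)
print(byte_offset((1, 2, 3), strides))  # 184
```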
Apache Arrow
• github.com/apache/arrow
• Collaboration among a broad set of OSS projects around language-agnostic shared data structures
• Initial focus
  • In-memory columnar tables
  • Canonical metadata
  • Interoperability between the JVM and native code (C/C++) ecosystems
High performance data interchange
[Figure: data interchange between systems, shown in two panels, “Today” and “With Arrow”. Source: Apache Arrow]
What does Apache Arrow give you?
• Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access
• Zero-copy messaging / IPC: language-agnostic metadata, batch/file-based and streaming binary formats
• Complex schema support: flat and nested data types
• Main implementations in C++ and Java, with integration tests
• Bindings / implementations for C, Python, Ruby, JavaScript in various stages of development
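A simplified sketch of how a columnar layout gives O(1) access even for variable-length values with missing entries. It follows the general pattern of a validity bitmap plus an offsets buffer plus a data buffer; the function names are hypothetical, and the code does not claim to reproduce Arrow's exact binary format (padding and alignment are ignored).

```python
import array


def build_string_column(values):
    """Encode a list of str-or-None as (validity bitmap, offsets, data)."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per value
    offsets = array.array("i", [0])               # len(values) + 1 entries
    data = bytearray()                            # all strings back to back
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        offsets.append(len(data))
    return validity, offsets, data


def get_value(column, i):
    """Constant-time lookup: one bit test and two offset reads."""
    validity, offsets, data = column
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")


column = build_string_column(["arrow", None, "", "pandas"])
print(list(column[1]))       # [0, 5, 5, 5, 11]
print(get_value(column, 3))  # pandas
```

Note that a missing value and an empty string are distinguishable: both have zero length in the offsets, but only the empty string has its validity bit set.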
Arrow in C++
• Reusable memory management and IO subsystem for native code applications
• Layered in multiple components
  • Memory management
  • Type metadata / schemas
  • Array / Table containers
  • IO interfaces
  • Zero-copy IPC / messaging
Arrow C++: Memory management
• arrow::Buffer
  • RAII-based memory lifetime with std::shared_ptr<Buffer>
  • arrow::MemoryMappedBuffer: for memory maps
• arrow::MemoryPool
  • Abstract memory allocator for tracking all allocations
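The two ideas on this slide, scoped buffer lifetime and an allocator that tracks every allocation, can be mimicked in a few lines of Python. `TrackingPool` and `PooledBuffer` are hypothetical names for this sketch and are not the Arrow C++ classes; a context manager stands in for RAII, since Python has no direct equivalent of `std::shared_ptr`.

```python
class TrackingPool:
    """Hands out buffers and keeps a running total of live bytes."""

    def __init__(self):
        self.bytes_allocated = 0
        self.peak_bytes = 0

    def allocate(self, size):
        self.bytes_allocated += size
        self.peak_bytes = max(self.peak_bytes, self.bytes_allocated)
        return PooledBuffer(self, size)


class PooledBuffer:
    """Owns `size` bytes and returns them to the pool exactly once."""

    def __init__(self, pool, size):
        self._pool = pool
        self.data = bytearray(size)
        self._live = True

    def release(self):
        if self._live:
            self._pool.bytes_allocated -= len(self.data)
            self._live = False

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.release()


pool = TrackingPool()
with pool.allocate(64) as first:
    with pool.allocate(32) as second:
        in_use_nested = pool.bytes_allocated  # both buffers are live
    in_use_outer = pool.bytes_allocated       # only `first` is live
in_use_after = pool.bytes_allocated           # everything released
```

Routing every allocation through one pool object is what makes it possible to report how much memory a computation is holding at any point.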
Arrow C++: Type metadata
• arrow::DataType
  • Base class for fixed size, variable size, and nested datatypes
• arrow::Field
  • Type + name + additional metadata
• arrow::Schema
  • Collection of fields
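The relationship between the three metadata classes can be sketched with Python dataclasses. The class names mirror the slide, but the fields and the `get_field` helper are assumptions for illustration rather than the Arrow C++ definitions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataType:
    """Base type; used directly here for a variable-size type."""
    name: str


@dataclass(frozen=True)
class FixedWidthType(DataType):
    bit_width: int


@dataclass(frozen=True)
class ListType(DataType):
    """Nested type: a list whose elements have `value_type`."""
    value_type: DataType


@dataclass(frozen=True)
class Field:
    """Type + name + additional metadata."""
    name: str
    type: DataType
    metadata: Tuple[Tuple[str, str], ...] = ()


@dataclass(frozen=True)
class Schema:
    """Collection of fields."""
    fields: Tuple[Field, ...]

    def get_field(self, name):
        for field in self.fields:
            if field.name == name:
                return field
        raise KeyError(name)


def make_schema():
    int32 = FixedWidthType("int32", 32)
    utf8 = DataType("utf8")
    return Schema((
        Field("id", int32),
        Field("name", utf8),
        Field("tags", ListType("list", utf8), (("source", "example"),)),
    ))


schema = make_schema()
same_schema = make_schema()
```

Because the metadata objects are plain values, two schemas built independently compare equal, which is the property two systems need in order to agree on how to interpret shared memory.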
Arrow C++: Array / Table containers
• arrow::Array
  • 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc.
  • Support for dictionary-encoded arrays
• arrow::RecordBatch
  • Collection of equal-length arrays
• arrow::Column
  • Logical table “column” as chunked array
• arrow::Table
  • Collection of columns
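A sketch of how record batches and chunked columns relate. The class names echo the slide, but the implementation is hypothetical: plain Python lists stand in for arrays, and the binary-search lookup is one reasonable way to index a chunked column, not necessarily the approach Arrow takes.

```python
import bisect
from itertools import accumulate


class RecordBatch:
    """Collection of equal-length arrays."""

    def __init__(self, names, arrays):
        lengths = {len(a) for a in arrays}
        if len(lengths) > 1:
            raise ValueError("all arrays in a batch must have equal length")
        self.columns = dict(zip(names, arrays))
        self.num_rows = lengths.pop() if lengths else 0


class ChunkedColumn:
    """Logical column made of several chunks that are never concatenated."""

    def __init__(self, name, chunks):
        self.name = name
        self.chunks = [c for c in chunks if len(c) > 0]
        # Running end position of each chunk, e.g. [3, 5] for sizes 3 and 2.
        self._ends = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        chunk_index = bisect.bisect_right(self._ends, i)
        start = self._ends[chunk_index - 1] if chunk_index else 0
        return self.chunks[chunk_index][i - start]


batches = [
    RecordBatch(["id", "price"], [[1, 2, 3], [9.5, 8.0, 7.25]]),
    RecordBatch(["id", "price"], [[4, 5], [6.0, 5.5]]),
]
price = ChunkedColumn("price", [b.columns["price"] for b in batches])
print(len(price), price[2], price[3])  # 5 7.25 6.0
```

Keeping the chunks separate means a table can be assembled from batches that arrive at different times without copying their data into one contiguous array.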
Arrow C++: IO interfaces
• arrow::{InputStream, OutputStream}
• arrow::RandomAccessFile
  • Abstract file interface
• arrow::MemoryMappedFile
  • Zero-copy reads to arrow::Buffer
• Specific implementations for OS files, HDFS, etc.
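What a zero-copy read from a memory-mapped file looks like can be shown with Python's `mmap` module. This is a sketch of the general technique, not of `arrow::MemoryMappedFile`; the helper names and the file layout (a bare run of native doubles) are assumptions for the example.

```python
import array
import mmap
import os
import tempfile


def write_doubles(path, values):
    """Write native doubles back to back, with no header."""
    with open(path, "wb") as f:
        f.write(array.array("d", values).tobytes())


def read_double_at(path, index):
    """Read one value through a memory map instead of loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Both views reference the mapped pages directly; the context
            # managers release them before the map itself is closed.
            with memoryview(mapped) as raw, raw.cast("d") as typed:
                return typed[index]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_doubles(path, [1.5, 2.5, 3.5, 4.5])
    third = read_double_at(path, 2)

print(third)  # 3.5
```

Only the pages that are actually touched need to be brought into memory, which is what makes random access on a dataset larger than RAM practical.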
Arrow C++: Messaging / IPC
• Metadata read/write using Google’s FlatBuffers library
  • Encapsulated Message type
• Write record batches, read with zero-copy
• arrow::{FileWriter, FileReader}
  • Random access / “batch” binary format
• arrow::{StreamWriter, StreamReader}
  • Streaming binary format
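The "write record batches, read with zero-copy" pattern can be sketched as a stream of messages, each made of a small metadata header followed by the raw body. This toy framing is an assumption for illustration: it uses JSON where Arrow uses FlatBuffers, and it is not Arrow's wire format.

```python
import array
import io
import json
import struct


def write_batch(stream, name, values):
    """Append one message: [metadata length][metadata][body]."""
    body = array.array("d", values).tobytes()  # native byte order
    metadata = json.dumps(
        {"name": name, "type": "float64", "body_size": len(body)}
    ).encode("utf-8")
    stream.write(struct.pack("<I", len(metadata)))
    stream.write(metadata)
    stream.write(body)


def read_batches(buffer):
    """Yield (metadata, values); values are views into `buffer`, not copies."""
    view = memoryview(buffer)
    pos = 0
    while pos < len(view):
        (meta_len,) = struct.unpack_from("<I", view, pos)
        pos += 4
        metadata = json.loads(bytes(view[pos:pos + meta_len]))
        pos += meta_len
        body = view[pos:pos + metadata["body_size"]].cast("d")
        pos += metadata["body_size"]
        yield metadata, body


sink = io.BytesIO()
write_batch(sink, "price", [9.5, 8.0])
write_batch(sink, "price", [7.25])
received = [(m["name"], b.tolist()) for m, b in read_batches(sink.getvalue())]
print(received)  # [('price', [9.5, 8.0]), ('price', [7.25])]
```

Only the small metadata header is parsed on the reading side; the body is exposed as a typed view over the bytes that were received.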
In development: arrow::Tensor
• Targeting interoperability with memory layouts as used in NumPy, TensorFlow, Torch, or other standard tensor-based frameworks
• data: arrow::Buffer
• shape: dimension sizes
• strides: memory ordering
• Zero-copy reads using Arrow’s shared memory tools
• Support Tensor math libraries for C++ like xtensor
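The data / shape / strides triple is enough to describe both row-major and column-major views of one buffer. The `TensorView` class below is a hypothetical sketch of that idea in Python and is not the `arrow::Tensor` interface.

```python
import array
import struct
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TensorView:
    data: memoryview          # raw bytes, shared with the producer
    shape: Tuple[int, ...]    # dimension sizes
    strides: Tuple[int, ...]  # byte step per dimension (memory ordering)
    fmt: str = "d"            # struct format of one element

    def __getitem__(self, index):
        offset = sum(i * s for i, s in zip(index, self.strides))
        return struct.unpack_from(self.fmt, self.data, offset)[0]

    def transposed(self):
        """Reverse the axes by reversing the metadata; the data is reused."""
        return TensorView(self.data, self.shape[::-1], self.strides[::-1], self.fmt)


values = array.array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
data = memoryview(values).cast("B")

row_major = TensorView(data, shape=(2, 3), strides=(24, 8))
col_major = row_major.transposed()

print(row_major[1, 2], col_major[2, 1])  # 5.0 5.0
```

Because a transpose changes only `shape` and `strides`, a consumer that understands the triple can read tensors written in either memory ordering without a conversion pass.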
Example use: Ray ML framework from Berkeley RISELab
• Shared memory-based object store
• Zero-copy tensor reads using Arrow libraries
Source: https://arxiv.org/abs/1703.03924
Example use: pandas 2.0
• In-planning rearchitecture of pandas’s internals
• libpandas: a largely Python-agnostic C++11 library
• Decoupling pandas data structures from NumPy tensors
• Support analytics targeting native Arrow memory
  • Multicore / parallel algorithms
  • Leverage latest SIMD intrinsics
• Lazy-loading DataFrames from primary input formats
  • CSV, JSON, HDF5, Apache Parquet
Other examples
• Spark integration (SPARK-13534)
• Weld integration (ARROW-649)
Thank you
• Building code and community around
  • IO subsystems
  • Metadata
  • Data structures and in-memory formats
