Apache Arrow: Leveling Up the Data Science Stack
Ursa Labs Mission
https://ursalabs.org
● Build cross-language, portable computational libraries for data science
● Grow Apache Arrow ecosystem
● Funding and employment for full-time open source developers
● Not-for-profit, funded by multiple corporations
Strategic Partnership Model
Life without Arrow: up to 80-90% of CPU cycles spent on de/serialization
Life with Arrow: no de/serialization
Arrow C++ Platform
● Multi-core Work Scheduler
● Core Data Platform
● Query Engine
● Datasets Framework
● Arrow Flight RPC
● Network
● Storage
Arrow Core
● Columnar format objects and utilities
● Memory management and generic IO
● Binary protocol / serialization functions
● Memory-mapping and zero-copy “parsing”
● Integration testing
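To make the memory-mapping and zero-copy “parsing” bullet concrete, here is a minimal sketch using only the Python standard library. It is not Arrow code: the file name `column.bin` and the single int32 column layout are assumptions made for illustration. The idea it shows is that when the on-disk bytes already match the in-memory layout, "parsing" reduces to mapping the file and reinterpreting the bytes as a typed view.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical column: four native-endian int32 values stored back to back.
values = [10, 20, 30, 40]
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack("4i", *values))

with open(path, "rb") as f:
    # Map the file read-only, then reinterpret the mapped bytes as int32.
    # No per-value decoding loop runs and the buffer is not copied;
    # a Python int is only created when an element is actually accessed.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped, \
            memoryview(mapped) as raw, raw.cast("i") as column:
        third = column[2]
        total = sum(column)
```

The nested `with` block releases the typed view before the mapping is closed, which the `mmap` module requires.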
C++ Datasets
● Fast read and write of multi-file datasets
● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”)
Diagram columns: File Formats (CSV), Storage Systems
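The "read only the relevant parts" idea can be sketched without any Arrow API. The example below is a simplified stand-in using the Python standard library: the `year=<value>/part.csv` directory layout and the helper names `write_partitioned` and `scan` are assumptions for illustration, not the Datasets interface. It shows two things: files whose partition cannot match the predicate are never opened, and the row-level filter is applied while reading rather than after loading everything.

```python
import csv
import os
import tempfile

def write_partitioned(root, rows):
    """Write rows into one CSV file per year, e.g. root/year=2018/part.csv."""
    by_year = {}
    for row in rows:
        by_year.setdefault(row["year"], []).append(row)
    for year, part in by_year.items():
        directory = os.path.join(root, f"year={year}")
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, "part.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["year", "arr_delay"])
            writer.writeheader()
            writer.writerows(part)

def scan(root, year, min_delay):
    """Return matching delays and the list of files that were actually opened."""
    opened, delays = [], []
    for name in sorted(os.listdir(root)):
        # Partition pruning: the directory name answers the year predicate,
        # so files for other years are skipped without being opened.
        if name != f"year={year}":
            continue
        path = os.path.join(root, name, "part.csv")
        opened.append(path)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # Row-level predicate applied during the read.
                if int(row["arr_delay"]) > min_delay:
                    delays.append(int(row["arr_delay"]))
    return delays, opened

root = tempfile.mkdtemp()
write_partitioned(root, [
    {"year": 2018, "arr_delay": 45},
    {"year": 2018, "arr_delay": 5},
    {"year": 2019, "arr_delay": 50},
    {"year": 2019, "arr_delay": 12},
])
delays, opened = scan(root, year=2019, min_delay=30)
```

Here only the 2019 file is touched, even though the 2018 file also contains a delay above the threshold.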
Arrow Flight RPC (Messaging)
● Efficient client-server dataset interchange
● Focused on gRPC (Google’s messaging framework), but may support other transports in the future
● It’s fast… really fast
○ Upwards of 3 GB/s server-to-client on localhost
Arrow for R
● Rcpp-based bindings
● https://github.com/apache/arrow/tree/master/r
● Goal: enable R package developers to leverage Arrow ecosystem for better performance and scalability
Arrow format vs. R data.frame
● Type-independent representation of NA values (bits vs. special values)
● Better computational efficiency for strings
● Naturally chunk-based (vs. large contiguous allocations)
● Supports a much wider variety of data types, including nested data (JSON-like)
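The "bits vs. special values" contrast in the first bullet can be sketched as follows. This is an illustration in plain Python, not Arrow library code, and the helper names `pack_validity` and `is_valid` are hypothetical. It follows the Arrow convention of a least-significant-bit-first validity bitmap in which a set bit means "value present", and contrasts it with R's integer NA, which is stored as a reserved value (the minimum 32-bit integer) inside the data itself.

```python
# R marks a missing integer with a reserved value inside the data buffer.
R_NA_INTEGER = -2**31

def pack_validity(values):
    """Build an LSB-first validity bitmap: bit i is 1 when values[i] is present."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def is_valid(bitmap, i):
    """Check slot i without looking at the data buffer or knowing its type."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)

ints = [7, None, 9]
strings = ["a", None, "c"]

# The same bitmap logic works for every type: no per-type sentinel is needed.
int_bitmap = pack_validity(ints)
str_bitmap = pack_validity(strings)

# Sentinel style for comparison: missingness is encoded in the values.
r_style_ints = [R_NA_INTEGER if v is None else v for v in ints]
```

Both columns produce the same one-byte bitmap, which is what makes the representation type-independent.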
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Can be a massive Arrow dataset
dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime
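One way to picture "verbs translated to a computation graph" is a table object that records each verb as a plan step and only executes when results are requested. The sketch below is a toy in plain Python under stated assumptions: the `LazyTable` class and its method names are hypothetical, the plan is a simple linear list rather than a real graph, and execution is single-threaded. It is not how the Arrow R bindings are implemented; it only shows the deferral that makes handing a whole query to another runtime possible.

```python
from collections import defaultdict
from statistics import mean

class LazyTable:
    """Records verbs as a plan; nothing runs until collect() is called."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = tuple(plan)

    def _then(self, step):
        # Each verb returns a new table with one more step in its plan.
        return LazyTable(self._rows, self.plan + (step,))

    def group_by(self, *keys):
        return self._then(("group_by", keys))

    def summarise_mean(self, **columns):
        # columns maps output name to input column, e.g. arr="arr_delay".
        return self._then(("summarise_mean", columns))

    def filter(self, predicate):
        return self._then(("filter", predicate))

    def collect(self):
        """Execute the recorded plan step by step and return the result rows."""
        rows, keys = list(self._rows), ()
        for verb, arg in self.plan:
            if verb == "group_by":
                keys = arg
            elif verb == "summarise_mean":
                groups = defaultdict(list)
                for row in rows:
                    groups[tuple(row[k] for k in keys)].append(row)
                rows = [
                    {**dict(zip(keys, key)),
                     **{out: mean(r[src] for r in members)
                        for out, src in arg.items()}}
                    for key, members in groups.items()
                ]
            elif verb == "filter":
                rows = [row for row in rows if arg(row)]
        return rows

flights = [
    {"month": 1, "day": 1, "arr_delay": 40, "dep_delay": 10},
    {"month": 1, "day": 1, "arr_delay": 50, "dep_delay": 20},
    {"month": 1, "day": 2, "arr_delay": 5, "dep_delay": 0},
]
query = (
    LazyTable(flights)
    .group_by("month", "day")
    .summarise_mean(arr="arr_delay", dep="dep_delay")
    .filter(lambda row: row["arr"] > 30 or row["dep"] > 30)
)
result = query.collect()
```

Because `query.plan` is just data until `collect()` runs, an engine is free to inspect, reorder, or parallelise the steps before touching any rows.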
R expressions can be JIT-compiled with LLVM
Keep up to date at
https://arrow.apache.org
https://ursalabs.org
https://wesmckinney.com
Thanks
