Apache Arrow: Leveling Up the Data Science Stack

https://ursalabs.org
Ursa Labs Mission
● Build cross-language, portable computational libraries for data science
● Grow Apache Arrow ecosystem
● Funding and employment for full-time open source developers
● Not-for-profit, funded by multiple corporations

Life without Arrow
● Up to 80-90% of CPU cycles spent on de/serialization
Life with Arrow
● No de/serialization

Arrow C++ Platform
● Multi-core Work Scheduler
● Core Data Platform
● Query Engine
● Datasets Framework
● Arrow Flight RPC
● Network
● Storage

Arrow Core
● Columnar format objects and utilities
● Memory management and generic IO
● Binary protocol / serialization functions
● Memory-mapping and zero-copy “parsing”
● Integration testing

C++ Datasets
● Fast read and write of multi-file datasets
● Read only the parts of the dataset relevant to your analysis (“predicate pushdown”)
Diagram labels: File Formats (CSV), Storage Systems
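A minimal sketch of what "read only the relevant parts" can look like from R. It assumes a recent release of the arrow R package plus dplyr (the API available when this talk was given may have differed), and the toy data frame, column names, and directory are hypothetical stand-ins for a large multi-file dataset.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for a large flights dataset.
flights_small <- data.frame(
  year = c(2012L, 2012L, 2013L, 2013L),
  month = c(1L, 2L, 1L, 2L),
  arr_delay = c(5, 42, 12, 31)
)

# Write one directory of Parquet files per year (a multi-file dataset).
dataset_dir <- file.path(tempdir(), "flights_dataset")
write_dataset(flights_small, dataset_dir, format = "parquet", partitioning = "year")

# Opening the dataset only discovers files and schema; no rows are loaded yet.
ds <- open_dataset(dataset_dir)

# The filter and column selection are handed to Arrow before any data is
# materialized in R, so files for other years do not need to be read.
late_2013 <- ds %>%
  filter(year == 2013, arr_delay > 30) %>%
  select(month, arr_delay) %>%
  collect()

print(late_2013)
```

Only the final collect() call pulls rows into an R data frame; everything before it describes the query.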

Arrow Flight RPC (Messaging)
● Efficient client-server dataset interchange
● Focused on gRPC (Google’s messaging framework), but may support other transports in future
● It’s fast… really fast
○ Upwards of 3 GB/s server-to-client on localhost

Arrow for R
● Rcpp-based bindings
● https://github.com/apache/arrow/tree/master/r
● Goal: enable R package developers to leverage Arrow ecosystem for better performance and scalability

Arrow format vs. R data.frame
● Type-independent representation of NA values (bits vs. special values)
● Better computational efficiency for strings
● Naturally chunk-based (vs. large contiguous allocations)
● Supports a much wider variety of data types, including nested data (JSON-like)
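To make "bits vs. special values" concrete, here is a small base R sketch. It is an illustration of the idea only, not Arrow's actual implementation: the variable names are hypothetical, and the bitmap is built by hand with packBits() using least-significant-bit-first ordering.

```r
# R marks a missing integer with a reserved sentinel: the smallest 32-bit
# integer is not a usable value, so computing it overflows to NA.
sentinel_is_na <- identical(
  suppressWarnings(-.Machine$integer.max - 1L),
  NA_integer_
)

x <- c(10L, NA, 30L, NA, 50L)

# Arrow-style layout: a plain values buffer plus a separate validity bitmap.
values <- ifelse(is.na(x), 0L, x)  # the slot content for a null is arbitrary
valid <- !is.na(x)

# Pack one validity bit per element, padded with FALSE to a whole byte.
padding <- rep(FALSE, (-length(valid)) %% 8)
bitmap <- packBits(c(valid, padding), type = "raw")

# Reading the bitmap back does not depend on the value type at all.
is_valid <- as.logical(rawToBits(bitmap))[seq_along(values)]
null_count <- sum(!is_valid)

print(bitmap)
print(null_count)
```

Because validity lives in its own buffer, the same bitmap logic applies to integers, doubles, strings, or nested types, which is what makes the representation type-independent.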

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Here, flights can be a massive Arrow dataset
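As a hedged illustration of running a pipeline of this shape against Arrow data rather than a data.frame, the sketch below assumes a recent arrow R package in which group_by(), summarise(), and filter() are supported on Arrow tables. The toy table and its values are hypothetical; which verbs are evaluated by Arrow depends on the installed version.

```r
library(arrow)
library(dplyr)

# Hypothetical toy table; a real use case would point at a large dataset.
flights_tbl <- Table$create(
  data.frame(
    month = c(1L, 1L, 2L, 2L),
    arr_delay = c(10, 20, 40, NA)
  )
)

# The verbs build up a query against the Arrow table; nothing is pulled
# into an R data frame until collect() is called.
result <- flights_tbl %>%
  group_by(month) %>%
  summarise(arr = mean(arr_delay, na.rm = TRUE)) %>%
  filter(arr > 30) %>%
  collect()

print(result)
```

The pipeline reads the same as ordinary dplyr code; only the input object and the closing collect() differ.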

flights %>%
summarise(
) %>%
dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime

R expressions can be JIT-compiled with LLVM

Thanks
Keep up to date at
https://arrow.apache.org
https://ursalabs.org
https://wesmckinney.com
