Building a Virtual Data Lake with Apache Arrow
Tomer Shiran, Co-Founder, Dremio
@tshiran
Analytics on modern data is incredibly hard: unprecedented complexity.
The demands for data are growing rapidly: reporting, new products, forecasting, threat detection, BI, machine learning, segmenting, fraud prevention.
Your analysts are hungry for data (SQL), but your data is everywhere and it’s not in the shape they need.
Today you engineer data flows and reshaping.
Data Staging: custom ETL, fragile transforms, slow moving.
Then you add a Data Warehouse on top of the staging area: $$$, high overhead, proprietary lock-in.
Then Cubes, BI Extracts & Aggregation Tables on top of the warehouse: data sprawl, governance issues, slow to update.
Lots of Copies…
How can we tackle this age-old problem?
• Direct access to data
• In-memory, GPU, …
• Columnar
• Distributed
Apache Arrow: Process & Move Data Fast
• Top-level Apache project as of Feb 2016
• Collaboration among many open source projects around shared needs
• Three components (a minimal sketch follows this list):
  • Language-independent columnar data structures, with implementations available for C++, Java, Python
  • Metadata for describing schemas/record batches
  • Protocol for moving data between processes without serialization overhead
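As an illustration, here is a minimal sketch of those three pieces using the Python implementation (pyarrow); the column names and values are invented for the example:

```python
import pyarrow as pa

# Language-independent columnar data structures: typed arrays
session_ids = pa.array([1331246351, 1331246352, 1331246353], type=pa.int64())
urls = pa.array(["/home", "/cart", "/checkout"], type=pa.string())

# Metadata: a schema describing the columns
schema = pa.schema([("session_id", pa.int64()), ("url", pa.string())])

# A record batch: equal-length columns grouped under that schema
batch = pa.RecordBatch.from_arrays([session_ids, urls], schema=schema)
print(batch.num_rows, batch.schema)
```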
High-Performance Data Interchange
Today:
• Each system has its own internal memory format
• 70-80% of CPU is wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication (sketched below)
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
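A small sketch of what cross-system interchange in the shared format looks like from Python: a pandas DataFrame becomes an Arrow table and back with one shared converter on each side, rather than a bespoke serializer per pair of systems. Column names are illustrative.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"session_id": [1, 2, 3], "clicks": [10, 4, 7]})

table = pa.Table.from_pandas(df)    # pandas -> Arrow via one shared converter
round_tripped = table.to_pandas()   # Arrow -> pandas, no custom deserialization
assert round_tripped.equals(df)
```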
Data is Organized in Record Batches
[Diagram: a schema followed by a sequence of record batches. The streaming format is the schema then the record batches in order; the file format additionally records the schema & file layout.]
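A minimal sketch of the streaming format with pyarrow, assuming a one-column schema with illustrative values: the writer emits the schema first, then each record batch.

```python
import pyarrow as pa

schema = pa.schema([("session_id", pa.int64())])
batches = [
    pa.record_batch([pa.array([i, i + 1], type=pa.int64())], schema=schema)
    for i in range(0, 6, 2)
]

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:  # schema goes first
    for batch in batches:
        writer.write_batch(batch)                # then the record batches

# The reader walks the same stream: schema, then batch after batch
reader = pa.ipc.open_stream(sink.getvalue())
for batch in reader:
    print(batch.num_rows)
```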
Each Record Batch is Columnar
[Diagram: SELECT * FROM clickstream WHERE session_id = 1331246351 evaluated against a traditional row-oriented memory buffer vs. an Arrow columnar memory buffer on an Intel CPU.]
Arrow leverages the data parallelism (SIMD) in modern Intel CPUs.
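One way to see the benefit: a contiguous Arrow column can be viewed as a NumPy array without copying, so the predicate above becomes a single vectorized (SIMD-friendly) pass over the column. A sketch, with randomly generated values for illustration:

```python
import numpy as np
import pyarrow as pa

# One million session IDs in a contiguous Arrow column (no nulls)
session_id = pa.array(np.random.randint(0, 2**31, size=1_000_000),
                      type=pa.int64())

ids = session_id.to_numpy()    # zero-copy view of the Arrow buffer
mask = ids == 1331246351       # one vectorized pass over the whole column
print(mask.sum(), "matching rows")
```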
Example: Spark to Pandas via Apache Arrow
Fast Import of Arrow in Pandas & R
Credit: Wes McKinney, Two Sigma
Fast Export of Arrow in Spark
• Legacy export from Spark to Pandas (toPandas) was extremely slow
• Row-by-row conversion from the Spark driver to Python memory
• SPARK-13534 introduced an Arrow-based implementation
• Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM)
• Set spark.sql.execution.arrow.enabled = true (a minimal sketch follows the results table)
Clock time:        12.5 s (legacy)  →  1.89 s with Arrow (6.6x faster)
Deserialization:   88% of the time  →  1% of the time
Peak memory usage: 8x dataset size  →  2x dataset size
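A minimal PySpark sketch of the Arrow-accelerated export, assuming Spark 2.3+ where the config key shipped as spark.sql.execution.arrow.enabled (Spark 3.x renamed it to spark.sql.execution.arrow.pyspark.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-export").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# With the flag on, columns move to the driver as Arrow record batches
# instead of being converted row by row.
df = spark.range(1_000_000).withColumnRenamed("id", "session_id")
pdf = df.toPandas()
print(len(pdf))
```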
Designing a Virtual Data Lake Powered by Apache Arrow
Arrow-based Execution and Integration
[Architecture diagram: Pandas, R, and BI clients on top; Arrow-based distributed execution with an in-memory columnar cache (Arrow) and a persistent columnar cache (Parquet); data sources (NoSQL, RDBMS, Hadoop, S3) underneath.]
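To make the layering concrete, here is a hypothetical sketch with pyarrow; the paths, cache structure, and names are invented for illustration and are not Dremio's API.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Persistent columnar cache: materialize an extract from a source as Parquet
os.makedirs("cache", exist_ok=True)
source_extract = pa.table({"session_id": [1, 2, 3],
                           "country": ["US", "DE", "FR"]})
pq.write_table(source_extract, "cache/clickstream.parquet")

# In-memory columnar cache: Arrow tables held in memory, ready to serve
arrow_cache = {"clickstream": pq.read_table("cache/clickstream.parquet")}

# Clients (pandas here; R and BI tools analogously) consume the shared format
df = arrow_cache["clickstream"].to_pandas()
print(df)
```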
Demo
Thank You
• Apache Arrow community
• Strata organizers
• Get involved
• Subscribe to the Arrow ASF lists
• Contribute to the Arrow project
• Want to learn more about Dremio?
• tshiran@dremio.com
Editor's Notes
  • #3: BI assumes a single relational database, but: data lives in non-relational technologies; data is fragmented across many systems; massive scale and velocity.
  • #4: Data is the business, and: era of impatient smartphone natives; rise of self-service BI; accelerating time to market. Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle: slow or non-responsive IT; “shadow analytics”; data governance risk; elusive data engineers; immature software; competing strategic initiatives.
  • #5: Here’s the problem everyone is trying to solve today. You have consumers of data with their favorite tools: BI products like Tableau, Power BI, and Qlik, as well as data science tools like Python, R, Spark, and SQL. Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud stores like S3. So how are you going to get the data to the people asking for it?
  • #6: Here’s how everyone tries to solve it: first you move the data out of the operational systems into a staging area. That might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile: when the sources change, the scripts have to change too.
  • #7: Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so…
  • #8: You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts. In the end you’re left with something like this picture. You may have more layers, and the technologies may be different, but you’re probably living with something like this. And nobody likes it: it’s expensive, the data movement is slow, and it’s hard to change. But worst of all, you’re left with a dynamic where every time a consumer wants a new piece of data, they open a ticket with IT, and IT begins an engineering project to build another set of pipelines over several weeks or months.