Building a Virtual Data Lake with Apache Arrow
Tomer Shiran, Co-Founder, Dremio
@tshiran
Analytics on modern data is incredibly hard: unprecedented complexity.
The demands for data are growing rapidly: reporting, new products, forecasting, threat detection, BI, machine learning, segmenting, fraud prevention.
Your analysts are hungry for data (SQL), but your data is everywhere and it’s not in the shape they need.
Today you engineer data flows and reshaping.
Data Staging: custom ETL, fragile transforms, slow moving.
Then you add a Data Warehouse on top of the staging area: $$$, high overhead, proprietary lock-in.
Then Cubes, BI Extracts & Aggregation Tables on top of the warehouse: data sprawl, governance issues, slow to update.
Lots of Copies…
How can we tackle this age-old problem?
• Direct access to data
• In-memory, GPU, …
• Columnar
• Distributed
Apache Arrow: Process & Move Data Fast
• Top-level Apache project as of Feb 2016
• Collaboration among many open source projects around shared needs
• Three components (a minimal sketch follows this list):
  • Language-independent columnar data structures, with implementations available for C++, Java, Python
  • Metadata for describing schemas/record batches
  • Protocol for moving data between processes without serialization overhead
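As an illustration, here is a minimal sketch of those three pieces using the Python implementation (pyarrow); the column names and values are invented for the example:

```python
import pyarrow as pa

# Language-independent columnar data structures: typed arrays
session_ids = pa.array([1331246351, 1331246352, 1331246353], type=pa.int64())
urls = pa.array(["/home", "/cart", "/checkout"], type=pa.string())

# Metadata: a schema describing the columns
schema = pa.schema([("session_id", pa.int64()), ("url", pa.string())])

# A record batch: equal-length columns grouped under that schema
batch = pa.RecordBatch.from_arrays([session_ids, urls], schema=schema)
print(batch.num_rows, batch.schema)
```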
High-Performance Data Interchange
Today:
• Each system has its own internal memory format
• 70-80% of CPU is wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication (sketched below)
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
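A small sketch of what cross-system interchange in the shared format looks like from Python: a pandas DataFrame becomes an Arrow table and back with one shared converter on each side, rather than a bespoke serializer per pair of systems. Column names are illustrative.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"session_id": [1, 2, 3], "clicks": [10, 4, 7]})

table = pa.Table.from_pandas(df)    # pandas -> Arrow via one shared converter
round_tripped = table.to_pandas()   # Arrow -> pandas, no custom deserialization
assert round_tripped.equals(df)
```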
Data is Organized in Record Batches
[Diagram: a schema followed by a sequence of record batches. The streaming format is the schema then the record batches in order; the file format additionally records the schema & file layout.]
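A minimal sketch of the streaming format with pyarrow, assuming a one-column schema with illustrative values: the writer emits the schema first, then each record batch.

```python
import pyarrow as pa

schema = pa.schema([("session_id", pa.int64())])
batches = [
    pa.record_batch([pa.array([i, i + 1], type=pa.int64())], schema=schema)
    for i in range(0, 6, 2)
]

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:  # schema goes first
    for batch in batches:
        writer.write_batch(batch)                # then the record batches

# The reader walks the same stream: schema, then batch after batch
reader = pa.ipc.open_stream(sink.getvalue())
for batch in reader:
    print(batch.num_rows)
```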
Each Record Batch is Columnar
[Diagram: SELECT * FROM clickstream WHERE session_id = 1331246351 evaluated against a traditional row-oriented memory buffer vs. an Arrow columnar memory buffer on an Intel CPU.]
Arrow leverages the data parallelism (SIMD) in modern Intel CPUs.
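One way to see the benefit: a contiguous Arrow column can be viewed as a NumPy array without copying, so the predicate above becomes a single vectorized (SIMD-friendly) pass over the column. A sketch, with randomly generated values for illustration:

```python
import numpy as np
import pyarrow as pa

# One million session IDs in a contiguous Arrow column (no nulls)
session_id = pa.array(np.random.randint(0, 2**31, size=1_000_000),
                      type=pa.int64())

ids = session_id.to_numpy()    # zero-copy view of the Arrow buffer
mask = ids == 1331246351       # one vectorized pass over the whole column
print(mask.sum(), "matching rows")
```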
Example: Spark to Pandas via Apache Arrow
Fast Import of Arrow in Pandas & R
Credit: Wes McKinney, Two Sigma
Fast Export of Arrow in Spark
• Legacy export from Spark to Pandas (toPandas) was extremely slow
• Row-by-row conversion from the Spark driver to Python memory
• SPARK-13534 introduced an Arrow-based implementation
• Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM)
• Set spark.sql.execution.arrow.enabled = true (a minimal sketch follows the results table)
Clock time:        12.5 s (legacy)  →  1.89 s with Arrow (6.6x faster)
Deserialization:   88% of the time  →  1% of the time
Peak memory usage: 8x dataset size  →  2x dataset size
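A minimal PySpark sketch of the Arrow-accelerated export, assuming Spark 2.3+ where the config key shipped as spark.sql.execution.arrow.enabled (Spark 3.x renamed it to spark.sql.execution.arrow.pyspark.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-export").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# With the flag on, columns move to the driver as Arrow record batches
# instead of being converted row by row.
df = spark.range(1_000_000).withColumnRenamed("id", "session_id")
pdf = df.toPandas()
print(len(pdf))
```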
Designing a Virtual Data Lake Powered by Apache Arrow
Arrow-based Execution and Integration
[Architecture diagram: Pandas, R, and BI clients on top; Arrow-based distributed execution with an in-memory columnar cache (Arrow) and a persistent columnar cache (Parquet); data sources (NoSQL, RDBMS, Hadoop, S3) underneath.]
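To make the layering concrete, here is a hypothetical sketch with pyarrow; the paths, cache structure, and names are invented for illustration and are not Dremio's API.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Persistent columnar cache: materialize an extract from a source as Parquet
os.makedirs("cache", exist_ok=True)
source_extract = pa.table({"session_id": [1, 2, 3],
                           "country": ["US", "DE", "FR"]})
pq.write_table(source_extract, "cache/clickstream.parquet")

# In-memory columnar cache: Arrow tables held in memory, ready to serve
arrow_cache = {"clickstream": pq.read_table("cache/clickstream.parquet")}

# Clients (pandas here; R and BI tools analogously) consume the shared format
df = arrow_cache["clickstream"].to_pandas()
print(df)
```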
Demo
Thank You
• Apache Arrow community
• Strata organizers
• Get involved
• Subscribe to the Arrow ASF lists
• Contribute to the Arrow project
• Want to learn more about Dremio?
• tshiran@dremio.com
Editor's Notes
  • #3: BI assumes a single relational database, but: data lives in non-relational technologies; data is fragmented across many systems; massive scale and velocity.
  • #4: Data is the business, and: era of impatient smartphone natives; rise of self-service BI; accelerating time to market. Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle: slow or non-responsive IT; “shadow analytics”; data governance risk; elusive data engineers; immature software; competing strategic initiatives.
  • #5: Here’s the problem everyone is trying to solve today. You have consumers of data with their favorite tools: BI products like Tableau, Power BI, and Qlik, as well as data science tools like Python, R, Spark, and SQL. Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud stores like S3. So how are you going to get the data to the people asking for it?
  • #6: Here’s how everyone tries to solve it: first you move the data out of the operational systems into a staging area. That might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile: when the sources change, the scripts have to change too.
  • #7: Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so…
  • #8: You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts. In the end you’re left with something like this picture. You may have more layers, and the technologies may be different, but you’re probably living with something like this. And nobody likes it: it’s expensive, the data movement is slow, and it’s hard to change. But worst of all, you’re left with a dynamic where every time a consumer wants a new piece of data, they open a ticket with IT, and IT begins an engineering project to build another set of pipelines over several weeks or months.