SlideShare a Scribd company logo
25127
The Data Lake Engine
Spark + AI Summit 2020
Data Science Across Data Sources with Apache Arrow
25127
Dremio is the Data Lake Engine CompanyTomer Shiran
Co-Founder & CPO, Dremio
tomer@dremio.com Powering the cloud data lakes of the world’s
leading companies across all industries
Creators of
Over $100M raised
Background
25127
Your Data Lake is Exploding, Yet Your Data Remains Inaccessible
But…
>100% YoY S3
Data Growth1
>50% of Data
Will Live on Cloud Data
Lake Storage by 20252
1) Estimate based on historical growth https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/
2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data
Data Lakes are becoming the
primary place that data lands
Consuming the data is
too slow & too difficult
SQL
Data Consumers
X X X
S3ADLS
S3ADLS
or or
25127
Data Movement is the Typical Workaround for Data Lake Storage
BI Users
SQL
Data Scientists
Data Lake
Storage ADLS S3
25127
Data Movement is the Typical Workaround for Data Lake Storage
BI Users
SQL
Data Scientists
1
Brittle & complex
ETL/ELT
Data Lake
Storage ADLS S3
25127
Data Movement is the Typical Workaround for Data Lake Storage
1
2
Brittle & complex
ETL/ELT
Data Lake
Storage
Proprietary & expensive
DW/Data Marts
BI Users
SQL
Data Scientists
ADLS S3
25127
Data Movement is the Typical Workaround for Data Lake Storage
Proliferating Cubes,
BI Extracts, &
Aggregation Tables
Proprietary & expensive
DW/Data Marts
+
+
+
+
+
+
+
+
+
1
2
3
Brittle & complex
ETL/ELT
D
ecreasingD
ataScope&
Flexibility
Data Lake
Storage
BI Users
SQL
Data Scientists
ADLS S3
25127
Proliferating Cubes,
BI Extracts, &
Aggregation Tables
Proprietary & expensive
DW/Data Marts
+
+
+
+
+
+
+
+
+
1
2
3
Brittle & complex
ETL/ELT
D
ecreasingD
ataScope&
Flexibility
BI Users
SQL
Data Scientists
Data Lake
Storage ADLS S3o
r
o
r
Query data lake storage directly with 4-100X performance
Powered by .
What is Apache Arrow?
Columnar In-
Memory
Representation
Many Language
Bindings
Broad Industry
Adoption
Row-based Column-based
10+ Downloads per Month
25127
Apache Arrow Gandiva Improves CPU Efficiency
✓ A standalone C++ library for efficient
evaluation of arbitrary SQL expressions on
Arrow vectors using runtime code-
generation in LLVM
✓ Expressions are compiled to LLVM bytecode
(IR), optimized & translated to machine code
✓ Gandiva enables vectorized execution with
Intel SIMD instructions
SQL expression
Vectorized
execution
kernel
Input Arrow
buffer
Output Arrow
buffer
Gandiva
compiler
Pre-compiled
functions (.bs)
OptimizeIRBuilder
25127
4.5x-90x Faster than Java-based Code Generation
Test Project time (secs)
with Java JIT
Project time (secs)
with Gandiva LLVM
Improvement
Sum 3.805 0.558 6.8x
Project 5 columns 8.681 1.689 5.13x
Project 10 columns 24.923 3.476 7.74x
CASE-10 4.308 0.925 4.66x
CASE-100 1361 15.187 89.6x
25127
Dremio’s Arrow-based Columnar Cloud Cache (C3) Accelerates I/O
✓ Columnar cloud cache (C3) automatically provides
NVMe-level I/O performance when reading from
S3/ADLS
✓ Arrow persistence enables granular caching as Arrow
buffers in local engine NVMe
✓ Bypass data deserialization and decompression
✓ Enables high-concurrency, low-latency BI workloads
on cloud data lake storage
…
Executor Executor Executor Executor
AWS S3
NVMe NVMeNVMe NVMe
C3 with Apache Arrow persistence
…
Executor Executor Executor
NVMe NVMe NVMe
C3 with Apache Arrow persistence
XL engine
M engine
25127
The Open Data Platform
Storage
Data
Compute
Client
Interactive SQL & BI Data science & batch Occasional SQL
Athena EMR
Batch processing
AWS
S3
ADLS HDFS
File formats: Text | JSON | Parquet | ORC
Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
We Need Fast, Industry-Standard Data Exchange
Storage
Data
Compute
Client
Interactive SQL & BI Data science & batch Occasional SQL
Athena EMR
AWS
S3
ADLS HDFS
File formats: Text | JSON | Parquet | ORC
Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
Batch processing
2
1
3
4
Arrow Flight is an Arrow-based RPC Interface
✓ High-performance wire protocol
✓ Parallel streams of Arrow buffers are transferred
✓ Delivers on the interoperability promise of Apache
Arrow
✓ Client-cluster and cluster-cluster communication
…
Arrow Flight dataframe
Arrow Flight Python Client
import pyarrow.flight as flt
c = flt.FlightClient.connect("localhost", 47470)
fd = flt.FlightDescriptor.for_command(sql)
fi = c.get_flight_info(fd)
ticket = fi.endpoints[0].ticket
df = c.do_get(ticket0).read_all()
Client-Cluster Communication
Cluster-Cluster Communication
Demo
Demo
25127
Q&AThe Data Lake Engine
25127
Dremio is the Data Lake Engine
Data
Lake
Storage
Data
Lake
Engine
BI Users
SQL
Data Scientists
ADLS S3or or
Optional
External
Sources
Data
Users
Accelerate
Business
100X BI query speed
4X Ad-hoc query speed
0 cubes, extracts, or
aggregation tables
Reduce
Cost & Risk&
10x lower AWS EC2 /
Azure VM spend for same
performance
0 lock-in, loss of control,
and duplication of data
Powered by
A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage

More Related Content

PDF
Apache Iceberg - A Table Format for Hige Analytic Datasets
PPTX
Apache Arrow Flight Overview
PDF
Apache Arrow Flight: A New Gold Standard for Data Transport
PPTX
Apache Flink and what it is used for
PDF
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
PPTX
Apache Arrow: In Theory, In Practice
PDF
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
PDF
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Arrow Flight Overview
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Flink and what it is used for
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
Apache Arrow: In Theory, In Practice
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

What's hot (20)

PDF
Realtime Analytics on AWS
PDF
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
PPTX
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
PPTX
Apache airflow
PPTX
File Format Benchmark - Avro, JSON, ORC & Parquet
PDF
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
PPTX
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
PDF
Intro to Delta Lake
PDF
A Thorough Comparison of Delta Lake, Iceberg and Hudi
PDF
Catalogs - Turning a Set of Parquet Files into a Data Set
PPTX
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
PDF
Hudi architecture, fundamentals and capabilities
PDF
Solving Enterprise Data Challenges with Apache Arrow
PDF
Parquet performance tuning: the missing guide
PPTX
The columnar roadmap: Apache Parquet and Apache Arrow
PDF
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
PDF
Iceberg: A modern table format for big data (Strata NY 2018)
PPTX
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
PDF
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
PPTX
Realtime Analytics on AWS
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Apache airflow
File Format Benchmark - Avro, JSON, ORC & Parquet
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Intro to Delta Lake
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Catalogs - Turning a Set of Parquet Files into a Data Set
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Hudi architecture, fundamentals and capabilities
Solving Enterprise Data Challenges with Apache Arrow
Parquet performance tuning: the missing guide
The columnar roadmap: Apache Parquet and Apache Arrow
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Iceberg: A modern table format for big data (Strata NY 2018)
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Ad

Similar to Data Science Across Data Sources with Apache Arrow (20)

PDF
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
PDF
Vue d'ensemble Dremio
PPTX
Apache Arrow - An Overview
PPTX
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
PDF
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
PDF
Strata London 2016: The future of column oriented data processing with Arrow ...
PPTX
Building a Virtual Data Lake with Apache Arrow
PDF
Ursa Labs and Apache Arrow in 2019
PDF
Apache Arrow
PDF
Apache Arrow: Present and Future @ ScaledML 2020
PPTX
Efficient Data Formats for Analytics with Parquet and Arrow
PPTX
Data Eng Conf NY Nov 2016 Parquet Arrow
PDF
High-speed Database Throughput Using Apache Arrow Flight SQL
PDF
Apache Arrow at DataEngConf Barcelona 2018
PPTX
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
PPTX
The columnar roadmap: Apache Parquet and Apache Arrow
PDF
Apache Arrow: Leveling Up the Analytics Stack
PPTX
How to boost your datamanagement with Dremio ?
PPTX
Mule soft mar 2017 Parquet Arrow
PPTX
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
Vue d'ensemble Dremio
Apache Arrow - An Overview
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
Strata London 2016: The future of column oriented data processing with Arrow ...
Building a Virtual Data Lake with Apache Arrow
Ursa Labs and Apache Arrow in 2019
Apache Arrow
Apache Arrow: Present and Future @ ScaledML 2020
Efficient Data Formats for Analytics with Parquet and Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
High-speed Database Throughput Using Apache Arrow Flight SQL
Apache Arrow at DataEngConf Barcelona 2018
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
The columnar roadmap: Apache Parquet and Apache Arrow
Apache Arrow: Leveling Up the Analytics Stack
How to boost your datamanagement with Dremio ?
Mule soft mar 2017 Parquet Arrow
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
Ad

More from Databricks (20)

PPTX
DW Migration Webinar-March 2022.pptx
PPTX
Data Lakehouse Symposium | Day 1 | Part 1
PPT
Data Lakehouse Symposium | Day 1 | Part 2
PPTX
Data Lakehouse Symposium | Day 2
PPTX
Data Lakehouse Symposium | Day 4
PDF
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
PDF
Democratizing Data Quality Through a Centralized Platform
PDF
Learn to Use Databricks for Data Science
PDF
Why APM Is Not the Same As ML Monitoring
PDF
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
PDF
Stage Level Scheduling Improving Big Data and AI Integration
PDF
Simplify Data Conversion from Spark to TensorFlow and PyTorch
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
PDF
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
PDF
Sawtooth Windows for Feature Aggregations
PDF
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
PDF
Re-imagine Data Monitoring with whylogs and Spark
PDF
Raven: End-to-end Optimization of ML Prediction Queries
PDF
Processing Large Datasets for ADAS Applications using Apache Spark
PDF
Massive Data Processing in Adobe Using Delta Lake
DW Migration Webinar-March 2022.pptx
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 4
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Democratizing Data Quality Through a Centralized Platform
Learn to Use Databricks for Data Science
Why APM Is Not the Same As ML Monitoring
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Stage Level Scheduling Improving Big Data and AI Integration
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Sawtooth Windows for Feature Aggregations
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Re-imagine Data Monitoring with whylogs and Spark
Raven: End-to-end Optimization of ML Prediction Queries
Processing Large Datasets for ADAS Applications using Apache Spark
Massive Data Processing in Adobe Using Delta Lake

Recently uploaded (20)

PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PDF
Launch Your Data Science Career in Kochi – 2025
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPTX
Major-Components-ofNKJNNKNKNKNKronment.pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Global journeys: estimating international migration
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Reliability_Chapter_ presentation 1221.5784
Acceptance and paychological effects of mandatory extra coach I classes.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
Launch Your Data Science Career in Kochi – 2025
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Introduction-to-Cloud-ComputingFinal.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Galatica Smart Energy Infrastructure Startup Pitch Deck
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Major-Components-ofNKJNNKNKNKNKronment.pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Global journeys: estimating international migration
.pdf is not working space design for the following data for the following dat...
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx

Data Science Across Data Sources with Apache Arrow

  • 1. 25127 The Data Lake Engine Spark + AI Summit 2020 Data Science Across Data Sources with Apache Arrow
  • 2. 25127 Dremio is the Data Lake Engine CompanyTomer Shiran Co-Founder & CPO, Dremio [email protected] Powering the cloud data lakes of the world’s leading companies across all industries Creators of Over $100M raised Background
  • 3. 25127 Your Data Lake is Exploding, Yet Your Data Remains Inaccessible But… >100% YoY S3 Data Growth1 >50% of Data Will Live on Cloud Data Lake Storage by 20252 1) Estimate based on historical growth https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/ 2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data Data Lakes are becoming the primary place that data lands Consuming the data is too slow & too difficult SQL Data Consumers X X X S3ADLS S3ADLS or or
  • 4. 25127 Data Movement is the Typical Workaround for Data Lake Storage BI Users SQL Data Scientists Data Lake Storage ADLS S3
  • 5. 25127 Data Movement is the Typical Workaround for Data Lake Storage BI Users SQL Data Scientists 1 Brittle & complex ETL/ELT Data Lake Storage ADLS S3
  • 6. 25127 Data Movement is the Typical Workaround for Data Lake Storage 1 2 Brittle & complex ETL/ELT Data Lake Storage Proprietary & expensive DW/Data Marts BI Users SQL Data Scientists ADLS S3
  • 7. 25127 Data Movement is the Typical Workaround for Data Lake Storage Proliferating Cubes, BI Extracts, & Aggregation Tables Proprietary & expensive DW/Data Marts + + + + + + + + + 1 2 3 Brittle & complex ETL/ELT D ecreasingD ataScope& Flexibility Data Lake Storage BI Users SQL Data Scientists ADLS S3
  • 8. 25127 Proliferating Cubes, BI Extracts, & Aggregation Tables Proprietary & expensive DW/Data Marts + + + + + + + + + 1 2 3 Brittle & complex ETL/ELT D ecreasingD ataScope& Flexibility BI Users SQL Data Scientists Data Lake Storage ADLS S3o r o r Query data lake storage directly with 4-100X performance Powered by .
  • 9. What is Apache Arrow? Columnar In- Memory Representation Many Language Bindings Broad Industry Adoption Row-based Column-based
  • 11. 25127 Apache Arrow Gandiva Improves CPU Efficiency ✓ A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code- generation in LLVM ✓ Expressions are compiled to LLVM bytecode (IR), optimized & translated to machine code ✓ Gandiva enables vectorized execution with Intel SIMD instructions SQL expression Vectorized execution kernel Input Arrow buffer Output Arrow buffer Gandiva compiler Pre-compiled functions (.bs) OptimizeIRBuilder
  • 12. 25127 4.5x-90x Faster than Java-based Code Generation Test Project time (secs) with Java JIT Project time (secs) with Gandiva LLVM Improvement Sum 3.805 0.558 6.8x Project 5 columns 8.681 1.689 5.13x Project 10 columns 24.923 3.476 7.74x CASE-10 4.308 0.925 4.66x CASE-100 1361 15.187 89.6x
  • 13. 25127 Dremio’s Arrow-based Columnar Cloud Cache (C3) Accelerates I/O ✓ Columnar cloud cache (C3) automatically provides NVMe-level I/O performance when reading from S3/ADLS ✓ Arrow persistence enables granular caching as Arrow buffers in local engine NVMe ✓ Bypass data deserialization and decompression ✓ Enables high-concurrency, low-latency BI workloads on cloud data lake storage … Executor Executor Executor Executor AWS S3 NVMe NVMeNVMe NVMe C3 with Apache Arrow persistence … Executor Executor Executor NVMe NVMe NVMe C3 with Apache Arrow persistence XL engine M engine
  • 14. 25127 The Open Data Platform Storage Data Compute Client Interactive SQL & BI Data science & batch Occasional SQL Athena EMR Batch processing AWS S3 ADLS HDFS File formats: Text | JSON | Parquet | ORC Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
  • 15. We Need Fast, Industry-Standard Data Exchange Storage Data Compute Client Interactive SQL & BI Data science & batch Occasional SQL Athena EMR AWS S3 ADLS HDFS File formats: Text | JSON | Parquet | ORC Table formats: Glue | Hive Metastore | Delta Lake | Iceberg Batch processing 2 1 3 4
  • 16. Arrow Flight is an Arrow-based RPC Interface ✓ High-performance wire protocol ✓ Parallel streams of Arrow buffers are transferred ✓ Delivers on the interoperability promise of Apache Arrow ✓ Client-cluster and cluster-cluster communication … Arrow Flight dataframe
  • 17. Arrow Flight Python Client import pyarrow.flight as flt c = flt.FlightClient.connect("localhost", 47470) fd = flt.FlightDescriptor.for_command(sql) fi = c.get_flight_info(fd) ticket = fi.endpoints[0].ticket df = c.do_get(ticket0).read_all()
  • 20. Demo
  • 21. Demo
  • 23. 25127 Dremio is the Data Lake Engine Data Lake Storage Data Lake Engine BI Users SQL Data Scientists ADLS S3or or Optional External Sources Data Users Accelerate Business 100X BI query speed 4X Ad-hoc query speed 0 cubes, extracts, or aggregation tables Reduce Cost & Risk& 10x lower AWS EC2 / Azure VM spend for same performance 0 lock-in, loss of control, and duplication of data Powered by A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage