Brought to you by
High-speed Database Throughput Using Apache Arrow Flight SQL
Kyle Porter
Architect at Dremio
James Duong
Architect at Dremio
Introduction to Arrow Flight
Introduction to Apache Arrow
■ A columnar, in-memory data format and supporting libraries
■ Supported in many languages including C++, Java, Python, Go
■ Data is strongly typed. Each row has the same schema.
■ Includes libraries for working with the format:
● Computation engine (Acero) utilizing SIMD operations for vectorized data analysis.
● Interprocess communication.
● Serialization / deserialization from file formats.
■ Fully open source with a permissive license.
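To make the columnar, strongly typed idea concrete, here is a minimal sketch that uses only Python's standard library rather than the Arrow libraries. The column names and values are made up for illustration, and `array.array` merely stands in for an Arrow typed column.

```python
from array import array

# Row-oriented: one Python tuple per row, values of different types side by side.
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Column-oriented: one contiguous, strongly typed buffer per column.
# 'q' = signed 64-bit integers, 'd' = 64-bit floats (stand-ins for Arrow types).
columns = {
    "id": array("q", (r[0] for r in rows)),
    "score": array("d", (r[1] for r in rows)),
}

# An aggregate over one column only touches that column's buffer.
total_score = sum(columns["score"])
print(total_score)
```

Every row shares the same schema because each column has exactly one type and all columns have the same length.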
Apache Arrow Adoption
■ Arrow powers dozens of open source & commercial technologies
■ 10+ programming languages supported
■ >70M downloads per month
Why is Arrow Flight Needed?
■ An open protocol that the community can support.
■ Designed for data in the modern world
● Older protocols are row-oriented and geared towards large numbers of columns and low numbers of rows.
● Arrow’s columnar format is oriented towards high compressibility and large numbers of rows.
■ Supports distributed computing as a client-side concept
● A data request can return multiple endpoints to a client.
● The client can retrieve from each endpoint in parallel.
Arrow Flight
■ Protocol for serialization-free transport of Arrow data
● This is particularly efficient if the client application will just work with Arrow data directly.
[Figure: two data paths between a column-based DATABASE and a column-based CLIENT.
Status Quo (JDBC/ODBC Connector): data is converted to row-based form for transport and converted back on the client, serializing/deserializing at each step.
Arrow Way (Arrow Flight Connector): data is sent, transported, and received in the Arrow format.]
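The difference between the two paths can be sketched with the standard library alone. This is an analogy under stated assumptions, not Arrow's actual wire handling: `wire_bytes` plays the role of a received column buffer, and `memoryview.cast` stands in for using that buffer without a per-value decode.

```python
import struct
from array import array

values = array("q", range(5))      # a column of 64-bit integers
wire_bytes = values.tobytes()      # what would travel over the network

# Status quo (row-style): decode each value into its own Python object.
decoded = [struct.unpack_from("=q", wire_bytes, i * 8)[0] for i in range(5)]

# Arrow-style sketch: reinterpret the received buffer in place.
# No bytes are copied; `view` is a typed window onto `wire_bytes`.
view = memoryview(wire_bytes).cast("q")

print(decoded, view.tolist())
```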
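Distributed Computing: Single Node with Arrow Flight
[Figure: a CLIENT and a single Coordinator / Executor node, each with its own CPU and memory, exchanging the following calls.]
1 - GetFlightInfo(<query>) (client to server)
2 - FlightInfo<Schema, Endpoints> (server to client)
3 - DoGet(<ticket>) (client to server)
Endpoint = {location, ticket}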
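The three-step exchange can be modeled with a toy in-memory server. This is a sketch only: `ToyServer`, the location string, and the query are hypothetical, and nothing here uses the real `pyarrow.flight` API or a network.

```python
import uuid


class ToyServer:
    """In-memory stand-in for a single Flight node (hypothetical)."""

    def __init__(self, location, tables):
        self.location = location
        self.tables = tables      # query string -> (schema, list of batches)
        self.pending = {}         # ticket -> batches awaiting DoGet

    def get_flight_info(self, query):
        schema, batches = self.tables[query]
        ticket = uuid.uuid4().hex  # opaque token the client hands back later
        self.pending[ticket] = batches
        return {
            "schema": schema,
            "endpoints": [{"location": self.location, "ticket": ticket}],
        }

    def do_get(self, ticket):
        # Stream the batches registered for this ticket.
        yield from self.pending.pop(ticket)


server = ToyServer("grpc://node1:32010", {"SELECT x FROM t": (["x"], [[1, 2], [3]])})

info = server.get_flight_info("SELECT x FROM t")    # steps 1 and 2
endpoint = info["endpoints"][0]                     # Endpoint = {location, ticket}
batches = list(server.do_get(endpoint["ticket"]))   # step 3
print(info["schema"], batches)
```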
Distributed Computing: Multiple Nodes with Arrow Flight
[Figure: the CLIENT issues DoGet(<ticket>) to Node 1, Node 2, ... Node N in parallel; the client and each node have their own CPU and memory. The GetFlightInfo call is omitted from the diagram.]
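Parallel retrieval from several endpoints might look like the following sketch. The node locations, tickets, and data are invented, and `do_get` is a local stand-in for the real RPC; only the fan-out pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions held by each node, keyed by (location, ticket).
NODE_DATA = {
    ("grpc://node1:32010", "t1"): [1, 2, 3],
    ("grpc://node2:32010", "t2"): [4, 5],
    ("grpc://node3:32010", "t3"): [6],
}


def do_get(endpoint):
    """Stand-in for a DoGet call against one endpoint."""
    return NODE_DATA[(endpoint["location"], endpoint["ticket"])]


# What the endpoint list of a FlightInfo might look like after GetFlightInfo.
endpoints = [{"location": loc, "ticket": t} for (loc, t) in NODE_DATA]

# One worker per endpoint; map() yields results in endpoint order.
with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
    partitions = list(pool.map(do_get, endpoints))

rows = [value for part in partitions for value in part]
print(rows)
```

Because each endpoint carries its own location and ticket, the same loop could just as well be split across separate processes or client machines.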
Arrow Flight as a Development Framework
■ Includes a fully-built client library
■ Includes a high-performance, scalable server
● Built on top of Google’s gRPC technology and compatible with existing tooling.
● Server implementation details such as thread pooling, asynchronous I/O, and request cancellation are already implemented.
■ Server deployment is a matter of implementing a few RPC request handlers.
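As a rough illustration of "implementing a few RPC request handlers", the sketch below splits a toy framework class, which owns dispatch, from a subclass that supplies only handlers. Both classes and the `on_` naming are hypothetical; the real Flight libraries provide their own base classes and handle gRPC, threading, and cancellation.

```python
class ToyFlightFramework:
    """Hypothetical framework base: owns dispatch so subclasses only write handlers."""

    def handle(self, rpc_name, payload):
        handler = getattr(self, "on_" + rpc_name, None)
        if handler is None:
            raise NotImplementedError(rpc_name)
        return handler(payload)


class MyServer(ToyFlightFramework):
    """A 'deployment': only the RPC handlers are implemented."""

    def on_get_flight_info(self, command):
        # Reuse the command bytes as the ticket to keep the toy small.
        return {"schema": ["greeting"], "endpoints": [{"ticket": command}]}

    def on_do_get(self, ticket):
        return [["hello from " + ticket.decode("utf-8")]]


server = MyServer()
info = server.handle("get_flight_info", b"demo")
data = server.handle("do_get", info["endpoints"][0]["ticket"])
print(data)
```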
Flight SQL Enhancements for Arrow Flight
Why Extend Arrow Flight?
■ Client sends a byte stream, server sends a result
● The content of the byte stream is opaque in the interface.
● It only has meaning for a particular server.
● Example - Dremio interprets the byte stream to be a UTF-8 encoded SQL query.
■ Catalog information is not part of Arrow Flight’s design
● There is no RPC call to describe how to build the byte stream the client sends.
● Generic tools cannot be built.
■ Arrow Flight is meant to serve any tabular data from any source.
■ ODBC/JDBC standardize query execution and catalog access, but have drawbacks.
■ Enter Arrow Flight SQL.
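The opacity problem can be shown in a few lines. In this sketch `sql_server` mimics the Dremio example of reading the bytes as UTF-8 SQL, while `json_server` is an invented second server with a different convention; neither is a real API.

```python
import json

command = "SELECT 1".encode("utf-8")   # the opaque bytes the client sends


def sql_server(cmd):
    """A server that treats the bytes as a UTF-8 encoded SQL query."""
    return ("sql", cmd.decode("utf-8"))


def json_server(cmd):
    """A different, hypothetical server that expects a JSON-encoded request."""
    return ("json", json.loads(cmd))


print(sql_server(command))     # ('sql', 'SELECT 1')
try:
    json_server(command)
except json.JSONDecodeError:
    print("same bytes, meaningless to this server")
```

Nothing in the interface tells a generic tool which of these conventions a given server follows, which is the gap Flight SQL sets out to close.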
What is Arrow Flight SQL?
■ Initiative to allow databases to use Arrow Flight as the transport protocol
● Leverage the performance of Arrow and Flight for database access.
■ Extended set of RPC calls to standardize a SQL interface on Flight
● Query execution
● Prepared statements
● Database catalog metadata
● SQL syntax capabilities
■ Generic client libraries
● A Flight SQL application can be used against any Flight SQL server without code changes.
● ODBC and JDBC clients provided on top.
Common Tool Workflow
[Figure: a CLIENT and a SERVER, each with its own CPU and memory, exchanging the following calls.]
Listing tables (GetTables, then DoGet):
1 - GetFlightInfo(GetTables)
2 - FlightInfo<Schema, Endpoints>
3 - DoGet(<ticket>)
4 - Arrow record batches
Retrieving query data (Execute, then DoGet):
5 - GetFlightInfo(StatementExecute)
6 - FlightInfo<Schema, Endpoints>
7 - DoGet(<ticket>)
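The workflow above can be sketched with typed commands in place of opaque bytes. Everything here is a toy: `GetTables` and `StatementExecute` are plain dataclasses standing in for the Flight SQL commands, `ToySqlServer` is hypothetical, and its "SQL" support is limited to `SELECT * FROM <table>`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GetTables:            # stands in for the catalog "get tables" command
    pass


@dataclass(frozen=True)
class StatementExecute:     # stands in for the statement-execution command
    query: str


class ToySqlServer:
    """Hypothetical in-memory Flight SQL server with one table."""

    def __init__(self):
        self.tables = {"orders": [(1, "widget"), (2, "gadget")]}
        self.results = {}

    def get_flight_info(self, command):
        ticket = f"ticket-{len(self.results)}"
        if isinstance(command, GetTables):
            self.results[ticket] = [(name,) for name in self.tables]
        elif isinstance(command, StatementExecute):
            # Toy "SQL": the table name is the last word of the query.
            table = command.query.rsplit(" ", 1)[-1]
            self.results[ticket] = self.tables[table]
        else:
            raise ValueError("unknown command")
        return {"endpoints": [{"ticket": ticket}]}

    def do_get(self, ticket):
        return self.results[ticket]


def run(server, command):
    """GetFlightInfo followed by DoGet on every endpoint."""
    info = server.get_flight_info(command)
    return [row for ep in info["endpoints"] for row in server.do_get(ep["ticket"])]


server = ToySqlServer()
tables = run(server, GetTables())                                       # steps 1-4
rows = run(server, StatementExecute(f"SELECT * FROM {tables[0][0]}"))   # steps 5-7
print(tables, rows)
```

Because the client code depends only on the command types, the same `run` helper would work against any server that understands them.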
Flight SQL vs. Legacy
Legacy (ODBC / JDBC)
■ Each database vendor must implement, maintain, and distribute a driver.
■ Each database vendor must implement their entire server.
■ Implementation details may be closed source.
■ Protocol is usually proprietary.
Flight SQL
■ Single client that works against any Flight SQL server.
■ Server implementation is part of Flight. Only RPC handlers need to be implemented.
■ Flight and Arrow components are open and the community is actively improving them.
■ Protocol is open and integrates with gRPC and Arrow tooling.
Flight SQL Status
■ Initial version released with Arrow 7.0.0
● Includes support for C++ and Java clients and servers
■ Enhancements to column and data type metadata have been accepted into more recent versions of Arrow.
■ Support for transactions and query cancellation has been accepted.
■ Open for contributions
● Support for additional languages (Python, Go, C#, etc.).
● More features such as small result enhancements.
Flight SQL Status
■ JDBC Driver
● Connect legacy JDBC applications to databases with the Flight SQL protocol with no code changes.
● Examples: DBeaver, DbVisualizer
● Merged into Apache/master. To be released in Arrow 10.0.0.
■ ODBC Driver
● Released by Dremio.
● Connect ODBC applications such as Tableau, pyodbc, and Power BI to Flight SQL-enabled databases.
Performance
Practical Example: pyodbc vs. PyArrow
● PyArrow is columnar
■ Consume columnar data returned using Arrow Flight without deserialization costs.
● pyodbc is row-oriented
■ All data values must be converted to scalars to expose to the Python application.
■ This process incurs significant deserialization costs.
Practical Example: pyodbc vs. PyArrow
● Comparison: 500,000 rows queried from a remote server. (No parallelism).
■ pyodbc: 8.00s. PyArrow: 0.900s.
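Working the quoted figures through gives the speedup and the effective row rates. The snippet uses only the three numbers stated above; the variable names are illustrative.

```python
rows = 500_000
pyodbc_seconds = 8.00
pyarrow_seconds = 0.900

speedup = pyodbc_seconds / pyarrow_seconds
pyodbc_rate = rows / pyodbc_seconds
pyarrow_rate = rows / pyarrow_seconds

print(f"speedup: {speedup:.1f}x")                 # speedup: 8.9x
print(f"pyodbc:  {pyodbc_rate:,.0f} rows/s")      # pyodbc:  62,500 rows/s
print(f"PyArrow: {pyarrow_rate:,.0f} rows/s")     # PyArrow: 555,556 rows/s
```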
Query Execution: pyodbc vs. PyArrow
pyodbc (ODBC)
    # Assumes an open pyodbc `connection` and a query string `sql`.
    cursor = connection.cursor()
    cursor.execute(sql)
    data = cursor.fetchall()  # list of row objects
PyArrow (Arrow Flight SQL)
    from pyarrow import flight

    # Assumes a connected flight.FlightClient `client`, auth `headers`, and `sql`.
    options = flight.FlightCallOptions(headers=headers)
    descriptor = flight.FlightDescriptor.for_command(sql)
    flight_info = client.get_flight_info(descriptor, options)
    reader = client.do_get(flight_info.endpoints[0].ticket, options)
    data = reader.read_all()  # whole result as a pyarrow.Table; read_chunk() returns one record batch at a time
■ ODBC requires all data to be retrieved from a single entry point (the cursor in the above example).
■ Arrow Flight lets the server expose multiple endpoints that host separate partitions of the data. Data can be retrieved in parallel and even from separate processes or client nodes.
Arrow Client Design Tips
■ Minimize copying of data.
■ Avoid manual calculations on data.
● Prefer library calls using the Compute library to analyze data (for example, arithmetic or aggregation on Arrow data).
● Arrow libraries use SIMD instructions for high-performance calculations!
■ Arrow provides fast file serialization to JSON, CSV, Parquet, ORC, and uncompressed Arrow files. Avoid serializing Arrow data by hand.
References
■ Arrow Flight SQL Announcement: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
■ Arrow Flight SQL ODBC Driver: https://github.com/dremio/flightsql-odbc and https://github.com/dremio/warpdrive
■ Arrow Flight SQL JDBC Driver: https://github.com/apache/arrow/tree/master/java/flight/flight-sql-jdbc-driver
■ Arrow Flight SQL JDBC Driver Improvements: https://issues.apache.org/jira/browse/ARROW-17729
Brought to you by
Kyle Porter
kporter@dremio.com
James Duong
jduong@dremio.com