Using LLVM to accelerate processing of data in Apache Arrow

© 2018 Dremio Corporation @DremioHQ
Using LLVM to accelerate processing of data in
Apache Arrow
DataWorks Summit, San Jose
June 21, 2018
Siddharth Teotia
1

Who?
Siddharth Teotia
@siddcoder
loonytek
Quora
• Software Engineer @ Dremio
• Committer - Apache Arrow
• Formerly at Oracle (Database Engine team)
2

Agenda
• Introduction to Apache Arrow
• Arrow in Practice: Introduction to Dremio
• Why Runtime Code Generation in Databases?
• Commonly used Runtime Code Generation Techniques
• Runtime Code Generation Requirements
• Introduction to LLVM
• LLVM in Dremio
3

Apache Arrow Project
• Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Designed to work with any programming language
3. Flexible data model that handles both flat and nested types
• Developers from 13+ major open source projects involved.
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
4

Arrow goals
• Columnar in-memory representation optimized for efficient use of processor
cache through data locality.
• Designed to take advantage of modern CPU characteristics by implementing
algorithms that leverage hardware acceleration.
• Interoperability for high speed exchange between data systems.
• Embeddable in execution engines, storage layers, etc.
• Well-documented and cross language compatible.
5

High Performance Interface for Data Exchange
• Each system has its own internal memory
format
• 70-80% CPU wasted on serialization and
deserialization
• Functionality duplication and unnecessary
conversions
• All systems utilize the same memory
format
• No overhead for cross-system
communication
• Projects can share functionality (eg:
Parquet-to-Arrow reader)
6

Focus on CPU Efficiency
Traditional Memory Buffer
( row format)
Arrow Memory Buffer
(columnar format)
• Maximize CPU throughput
– SIMD
– Cache Locality
• Vectorized operations.
• Constant value access
– With minimal structure
overhead
• Use efficient lightweight
compression schemes on a per
column basis.
8

Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, List, Union
9

Columnar Data
10

Real World Arrow: Sabot
• Dremio is an OSS Data-as-a-
Service Platform
• The core engine is “Sabot”
– Built entirely on top of Arrow
libraries, runs in JVM

Why Runtime Code Generation in Databases?
• In general, what would be the most optimal query execution plan?
– Hand-written query plan that does the required processing for exact same
data types and operators as required by the query.
– Such execution plan will only work for a particular query but will be the
fastest way to execute that query.
– We can implement extremely fast dedicated code to process _only_ the
kind of data touched by the query.
• However, query engines need to support broad functionality
– Several different data types, SQL operators etc.
– Interpreter based execution.
– Generic control blocks to understand arbitrary query specific runtime
information (field types etc which are not known during query compilation).
– Dynamic dispatch (aka virtual calls via function pointers in C++).
12

Why Runtime Code Generation in Databases? Cont’d
• Interpreted (non code-generated) execution is not very CPU efficient and hurts
query performance
– Generic code not tailored for specific query has excessive branching
– Cost of branch misprediction: Entire pipeline has to be flushed.
– Not the best way to implement code paths critical for performance on
modern pipelined architectures
• Most databases generate code at runtime (query execution time)
– When query execution is about to begin, we have all the information
available that can be used to generate query specific code.
– The code-generated function(s) are specific to the query since they are
based on information resolved at runtime.
– Optimized custom code for executing a particular query.
13

Commonly Used Runtime Code Generation Techniques
• Generate query specific Java classes at query runtime using predefined templates
– Use Janino to compile runtime generated classes in-memory to bytecode,
load and execute the bytecode in same JVM.
– Dremio uses this mechanism.
• Generate query specific C/C++ code at runtime, execv a compiler and load the
executable.
• Problems with existing code-generation mechanisms:
– Heavy object instantiation and dereferencing in generated Java code.
– Compiling and optimizing C/C++ code is known to be slow.
– Inefficient handling of complex and arbitrary SQL expressions.
– Limited opportunities for leveraging modern hardware capabilities
• SIMD vectorization, use of wider registers for handling decimals etc

Runtime Code Generation Requirements
• Efficient code-generation
– The method to generate query specific code at runtime should itself be very
efficient.
– The method should be able to leverage target hardware capabilities.
• Query specific optimized code
– The method should generate highly optimized code to improve query
execution performance.
• Handle arbitrary complex SQL expressions efficiently
– The method should be able to handle complex SQL expressions efficiently.

Introduction to LLVM
• A library providing compiler related modular tools for implementing JIT
compilation infrastructure.
• LLVM can be used to efficiently generate query specific optimized native
machine code at query runtime for performance critical operations.
• Potential for significant speedup in overall query execution time.
• Two high level steps:
– Generate IR (Intermediary Representation) code
– Compile and optimize IR to machine code targeting specific architecture
• IR is both source (language) and target (architecture) independent low-level
specification
• Custom optimization: separate passes to optimize the generated IR.
– Vectorizing loops, combining instructions etc.
• Full API support for all steps of compilation process

Introduction to LLVM Cont’d
IR (Intermediary Representation) is the core of LLVM for code generation:
• A low-level assembly language like specification used by LLVM for representing
code during compilation.
• Generating IR using IRBuilder
– Part of C++ API provided by LLVM.
– Programmatically assemble IR modules/functions instruction by instruction.
• Generating IR using Cross-compilation
– Clang C++ compiler as a frontend to LLVM.
– Compile C++ functions to corresponding IR code.

LLVM in Dremio
Goal: Use LLVM for efficient execution of SQL expressions in native code.
• Has the potential to significantly improve the performance of our execution
engine.
Welcome to Gandiva !!

Gandiva - Introduction
• A standalone C++ library for efficient evaluation of arbitrary SQL
expressions on Arrow vectors using runtime code-generation in
LLVM.
• Has no runtime or compile time dependencies on Dremio or any
other execution engine.
• Provides Java APIs that use the JNI bridge underneath to talk to
C++ code for code generation and expression evaluation
– Dremio’s execution engine leverages Gandiva Java APIs
• Expression support
– If/Else, CASE, ==, !=, <, >, etc
– Function expressions: +, -, /, *, %
– All fixed width scalar types
– More to come
• Boolean expressions, variable width data, complex
types etc.

Gandiva - Design
IR Generation

Gandiva - Design
Tree Based Expression Builder
• Define the operator, operands, output at each level in the tree

Gandiva - Design
High level usage of
main C++ modules

Gandiva - Design (Sample Usage)
// schema for input fields
auto fielda = field("a", int32()); auto fieldb = field("b", int32()); auto schema = arrow::schema({fielda, fieldb});
// output fields
auto field_result = field("res", int32());
// build expression
auto node_a = TreeExprBuilder::MakeField(fielda); auto node_b = TreeExprBuilder::MakeField(fieldb);
auto condition = TreeExprBuilder::MakeFunction("greater_than", {node_a, node_b}, boolean());
auto sum = TreeExprBuilder::MakeFunction("add", {node_a, node_b}, int32());
auto sub = TreeExprBuilder::MakeFunction("subtract", {node_a, node_b}, int32());
auto if_node = TreeExprBuilder::MakeIf(condition, sum, sub, int32());
auto expr = TreeExprBuilder::MakeExpression(if_node, field_result);
// Build a projector for the expressions
std::shared_ptr<Projector> projector; Status status = Projector::Make(schema, {expr}, pool_, &projector);
// Create an input Arrow record-batch with some sample data
// Evaluate expression on record batch arrow::ArrayVector outputs; status = projector->Evaluate(*in_batch, &outputs);
Expression:
if (a > b)
a + b
else
a - b

Gandiva - Design
• Suitable for expressions of
type
– input is null -> output null
• Evaluate vector’s data
buffer and validity buffer
independently
– Reduced branches.
– Better CPU efficiency
– Amenable to SIMD.
– Junk data is also
evaluated but it
doesn’t affect the
end result
Expression
Decomposition

Gandiva - Design
Expression
Decomposition

Gandiva - Design
Categories of Function Expressions
NULL_IF_NULL NULL_NEVER NULL_INTERNAL
● Always decomposable
● If input null -> output null
● Input validity is pushed to top
of tree to determine validity of
output
● Highly optimized execution
● Eg: +, -, *, / etc
● Majority of functions
● Output is never null
● No need to push
validity for final result
● Eg: isNumeric(expr),
isNull(expr),
isDate(expr)
● Actual evaluation done
using conditions
● Output can be null
● Eg: castStringToInt(x) + y + z
● Evaluate sub-tree and
generate a local bitmap
● Rest of the tree uses local
bitmap to continue with
decomposed evaluation

Gandiva - Design
Handling CASE Expressions
• Interpreting CASE as if-else-if statements loses
optimization opportunities
– Evaluation of same condition across multiple cases
– Evaluation of same validity across multiple cases
• Treat as switch case
• LLVM helps with removing redundant evaluation of validity
and and conditions across multiple cases
• A temporary bitmap is created and shared amongst all
expressions for computing validity of output
– Detect nested if-else and use a single bitmap
– Only the matching “if or else” updates bitmap
case
when cond1 then exp1
..
Else exp

Using Gandiva in Dremio

Performance
Java JIT runtime bytecode generation v/s Gandiva runtime code generation in LLVM
• Compare expression evaluation time of five simple expressions on JSON dataset
of 500 million rows
• Tests were run on Mac machine (2.7GHz quad-core Intel Core i7, 16GB RAM)
Project 5 columns
SELECT
sum(x + N2x + N3x),
sum(x * N2x - N3x),
sum(3 * x + 2 * N2x + N3x),
count(x >= N2x - N3x),
count(x + N2x = N3x)
FROM json.d500
Case - 10
SELECT count
(case
when x < 1000000 then x/1000000 + 0
when x < 2000000 then x/2000000 + 1
when x < 3000000 then x/3000000 + 2
when x < 4000000 then x/4000000 + 3
when x < 5000000 then x/5000000 + 4 ……………...
else 10 end)
FROM json.d500

Performance
Test Project time (secs)
with Java JIT
Project time (secs)
with Gandiva LLVM
Improvement
Sum 3.805 0.558 6.8x
Project 5 columns 8.681 1.689 5.13x
Project 10 columns 24.923 3.476 7.74x
CASE-10 4.308 0.925 4.66x
CASE-100 1361 15.187 89.6x

Get Involved
• Gandiva
– https://github.com/dremio/gandiva
• Arrow
– dev@arrow.apache.org
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ
• Dremio
– https://community.dremio.com/
– https://github.com/dremio/dremio-oss

Using LLVM to accelerate processing of data in Apache Arrow

Recommended

More Related Content

What's hot (20)

Similar to Using LLVM to accelerate processing of data in Apache Arrow (20)

More from DataWorks Summit (20)

Recently uploaded (20)

Using LLVM to accelerate processing of data in Apache Arrow