Ibis:
Seamless Transition From
Pandas to Spark
Spark + AI Summit
2020
1
This document is being distributed for informational and educational purposes only and is not an offer to sell or the
solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to
provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the
views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the
assumptions of the author(s) of the document and are subject to change without notice. The document may employ
data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information
and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities
other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright and/or trademark
laws. Use of such image, copyright or trademark does not imply any association with such organization (or
endorsement of such organization) by Two Sigma, nor vice versa.
Legal Disclaimer
2
● If you…
○ like pandas but want to analyze large datasets?
○ are interested in “distributed DataFrames” but don’t know which one to choose?
○ want your analysis code to run faster or scale better without making code changes?
Target Audience
3
● Modeling Tools @ Two Sigma
● Apache Spark, Pandas, Apache Arrow, Flint, Ibis
About Me: Li Jin
4
A common data science task...
5
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
6
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
7
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
8
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
9
● The software is not designed for large amounts of data
● Your machine doesn’t have enough RAM to hold all the data
● You are not utilizing all the CPUs
You are happy until...the code runs too slow
10
Try a few things...
11
● Use a bigger machine
● Pros:
○ Low human cost: no code change
● Cons:
○ Same software limits
○ Single threaded
○ Probably not fast enough
Try a few things...
12
● Use a generic way to distribute the code:
○ sparkContext.parallelize(range(2000, 2020)).map(compute_for_year).collect()
● Pros:
○ Medium human cost: small code change
○ Scalable
● Cons:
○ Works only for embarrassingly parallel problems
○ Boundary handling can be tricky
○ Distributed failures
Try a few things...
13
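A minimal sketch of this pattern on toy data (compute_for_year is named on the slide above; its body here is hypothetical and fabricates a tiny frame so the sketch runs standalone):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

def compute_for_year(year):
    # Hypothetical per-year computation; in practice this would load and
    # process that year's data.
    return pd.DataFrame({'year': [year], 'value': [year * 2]})

# One independent task per year, exactly as in the one-liner above
results = sc.parallelize(range(2000, 2020)).map(compute_for_year).collect()

Each task is independent, which is why this approach only helps for embarrassingly parallel problems.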
Try a few things...
● Use a distributed dataframe library
○ Spark
○ Dask
○ Koalas
○ ...
● Pros:
○ Scalable
● Cons:
○ High human cost: learn another API
○ Not obvious which one to use
○ Distributed failures
14
Take a step back...
15
Take a step back...
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
16
The problem
The problem is not how we express the computation, but
how we execute it.
17
Separation of Concerns
From Wikipedia:
“In computer science, separation of concerns (SoC) is a design
principle for separating a computer program into distinct sections
such that each section addresses a separate concern.”
18
Separation of expression and execution
Can we separate “how we express the computation”
(expression) and “how we execute it” (execution)?
19
Separation of expression and execution
● SQL is a way to express the computation independent of the
execution.
● Can we have something like SQL, but for Python Data Analysis?
20
Outline
● Ibis: A high level introduction
● Ibis: expression language
● Ibis: backend execution
● PySpark backend for Ibis
● Conclusion
21
Ibis: A high level
introduction
22
Ibis: Python Data Analysis Framework
● Open source
● Started in 2015 by Wes McKinney
● Worked on by top pandas committers:
○ Wes McKinney
○ Phillip Cloud
○ Jeff Reback
23
Ibis components
● ibis language
○ The API that is used to express the computation with ibis expressions
● ibis backends
○ Modules that translate ibis expressions to something that can be
executed by different computation engines
■ ibis.pandas
■ ibis.pyspark
■ ibis.bigquery
■ ibis.impala
■ ibis.omniscidb
■ ...
24
Ibis language
● Table API
○ Projection
○ Filtering
○ Join
○ Groupby
○ Sort
○ Window
○ Aggregation
○ …
○ AsofJoin
○ UDFs
● Ibis expressions (intermediate representation)
25
Ibis language
● Table API
○ table = table.mutate(v3=table['v1'] + table['v2'])
● Ibis expressions
26
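A minimal sketch of what such an expression looks like on its own, assuming an unbound table built with ibis.table (later slides get the table from a backend client instead):

import ibis

# Schema matching the example table used later in the deck
table = ibis.table(
    [('key', 'string'), ('v1', 'int64'), ('v2', 'int64')],
    name='foo',
)
expr = table.mutate(v3=table['v1'] + table['v2'])
print(type(expr))  # an ibis table expression; no data has been touched yet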
Ibis backends
● Ibis expressions -> Backend specific expressions
● table.mutate(v3=table['v1'] + table['v2'])
○ Pandas: df = df.assign(v3=df['v1'] + df['v2'])
○ PySpark: df = df.withColumn('v3', df['v1'] + df['v2'])
27
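A minimal end-to-end sketch of the pandas backend executing the same expression (the table name and toy data are made up):

import ibis
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'v1': [1, 2], 'v2': [3, 4]})
client = ibis.pandas.connect({'t': df})   # pandas backend over in-memory frames
table = client.table('t')
expr = table.mutate(v3=table['v1'] + table['v2'])
print(expr.execute())   # executed by the pandas backend, returns a pandas DataFrame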
Ibis: expression
language
28
Recall our earlier example in pandas
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
29
Basic column selection and arithmetic expression
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
30
Basic column selection and arithmetic expression
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
31
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
32
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
33
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
34
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
35
Final translation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
36
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis tables are expressions, not dataframes
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
]) 37
So far, table and the result of the
transformation on the left are
expressions, not actual dataframes.
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis tables are expressions, not dataframes
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
]) 38
We need to execute these expressions
on real data in a backend.
Ibis: backend
execution
39
Initialize backend-specific Ibis client
PySpark
pyspark_client = ibis.pyspark.connect(pyspark_session)
Pandas
pandas_client = ibis.pandas.connect()
Impala
impala_client = ibis.impala.connect(host=host, port=port)
...
40
Ibis expression
my_table = client.table('foo')
Access table in Ibis
41
Ibis expression
my_table = pyspark_client.table('foo')
Access table in Ibis
42
Ibis expression
my_table = pyspark_client.table('foo')
# my_table is an ibis table expression
print(my_table)
PySparkTable[table]
name: foo
schema:
key : string
v1 : int64
v2 : int64
Access table in Ibis
43
Ibis expression
my_table = pyspark_client.table('foo')
# execute materializes the table expression into a pandas DataFrame
df = my_table.execute()
df
Access table in Ibis
44
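Putting the last few slides together, a minimal sketch against a local Spark session (the toy data is made up, and it assumes the PySpark backend resolves tables registered as temp views on the session):

from pyspark.sql import SparkSession
import ibis

spark = SparkSession.builder.master('local[*]').getOrCreate()
spark.createDataFrame(
    [('a', 1, 10), ('a', 2, 20), ('b', 3, 30)],
    ['key', 'v1', 'v2'],
).createOrReplaceTempView('foo')

pyspark_client = ibis.pyspark.connect(spark)
my_table = pyspark_client.table('foo')   # an ibis table expression
df = my_table.execute()                  # materialized as a pandas DataFrame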
def transform(table: TableExpr) -> TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
Recall our table transformation in Ibis
45
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
Apply transform() on our Ibis table
46
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
Apply transform() on our Ibis table
47
my_table and result_table are
ibis table expressions, not dataframes.
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
result_table.execute()
Execute the result on our backend
48
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
result_table.execute()
Execute the result on our backend
49
pandas dataframe
Ibis: PySpark
backend
50
Translate Ibis expression into PySpark expressions
51
Ibis expression tree
Basic column selection and arithmetic expression
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
52
Basic column selection and arithmetic expression
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
53
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Basic column selection and arithmetic expression
54
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
Basic column selection and arithmetic expression
55
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
t is a PySparkTranslator
Basic column selection and arithmetic expression
56
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
t is a PySparkTranslator
It has a translate() method that
evaluates an Ibis expression into a
PySpark object.
Basic column selection and arithmetic expression
57
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
expr is the ibis expression to
translate
Basic column selection and arithmetic expression
58
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
scope is a dict that caches results
of previously translated Ibis
expressions.
Basic column selection and arithmetic expression
59
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
We use the PySparkTranslator to
evaluate the left and right operands
(which are themselves ibis
expressions) into PySpark columns.
Basic column selection and arithmetic expression
60
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
left and right are
PySpark columns
PySpark column division
Basic column selection and arithmetic expression
61
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
Basic column selection and arithmetic expression
62
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left + right
PySpark column addition
Basic column selection and arithmetic expression
63
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left + right
Basic column selection and arithmetic expression
64
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
op = expr.op()
column_name = op.name
pyspark_df = t.translate(op.table, scope)
return pyspark_df[column_name]
Basic column selection and arithmetic expression
65
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
op = expr.op()
column_name = op.name
pyspark_df = t.translate(op.table, scope)
return pyspark_df[column_name]
Basic column selection and arithmetic expression
66
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
op = expr.op()
column_name = op.name
pyspark_df = t.translate(op.table, scope)
return pyspark_df[column_name]
PySpark column selection
Basic column selection and arithmetic expression
67
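The same pattern extends to other operations. A hypothetical illustration following the shape of the rules above (ops.Multiply is a real ibis operation, but this rule is only a sketch; the actual PySpark backend ships its own, and @compiles / ops are the same names used on the preceding slides):

@compiles(ops.Multiply)
def compile_multiply(t, expr, scope, **kwargs):
    # Translate both operands into PySpark columns, then combine them
    # with the corresponding PySpark operator, as in compile_divide above.
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left * right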
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
68
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
69
PySpark translation
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
70
PySpark translation
from pyspark.sql.window import Window
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
71
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
F.mean(df['feature']).over(w)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
72
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
df = df.withColumn(
'feature2', F.mean(df['feature']).over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
73
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
df = df.withColumn(
'feature2', F.mean(df['feature']).over(w)
)
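A minimal sketch running the translated PySpark code on toy data (data is made up; note that preceding=2 in ibis corresponds to rowsBetween(-2, currentRow), i.e. a trailing window of 3 rows, matching rolling(3, min_periods=1) in the pandas version):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [('a', 1, 10), ('a', 2, 20), ('a', 3, 30), ('b', 1, 40)],
    ['key', 'v1', 'v2'],
)
df = df.withColumn('feature', (df['v1'] + df['v2']) / 2)
w = (
    Window.partitionBy('key')
    .orderBy('v1')
    .rowsBetween(-2, Window.currentRow)
)
df = df.withColumn('feature2', F.mean(df['feature']).over(w))
df.show()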
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
74
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
75
PySpark translation
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
76
PySpark translation
# PySpark column expressions
F.min(df['feature2'])
F.max(df['feature2'])
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
77
PySpark translation
df.groupby('key').agg(
F.min(df['feature2']),
F.max(df['feature2'])
)
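A minimal sketch of this grouped aggregation on toy data (column values are made up):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [('a', 10.0), ('a', 20.0), ('b', 30.0)],
    ['key', 'feature2'],
)
df.groupby('key').agg(
    F.min(df['feature2']),
    F.max(df['feature2']),
).show()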
More interesting examples
Ibis expression
table['v1'].rank().over(window)
78
PySpark translation
More interesting examples
Ibis expression
table['v1'].rank().over(window)
79
PySpark translation
import pyspark.sql.functions as F
F.rank().over(window).astype('long') - F.lit(1)
More interesting examples
Ibis expression
table['v1'].rank().over(window)
80
PySpark translation
import pyspark.sql.functions as F
F.rank().over(window).astype('long') - F.lit(1)
Subtle differences: Ibis rank() is zero-based and typed int64, while PySpark’s F.rank() is one-based, hence the cast to long and the off-by-one adjustment
More interesting examples
Ibis expression
table['boolean_col'].not_any()
81
PySpark translation
More interesting examples
Ibis expression
table['boolean_col'].not_any()
82
PySpark translation
~F.max(df['boolean_col'])
More interesting examples
Ibis expression
table['boolean_col'].not_any()
83
PySpark translation
~F.max(df['boolean_col'])
No direct translation: not_any() (“no value in the group is true”) is expressed as the negation of max() over the boolean column
Conclusion
84
Conclusion
● Separate expression and execution
● Don’t limit yourself to what you can use today; think about what you can
use in the future
○ Arrow Dataset
○ Modin
○ cudf / dask-cudf
○ ...
85
Thanks!
ice.xelloss@gmail.com (@icexelloss)
86