Ibis:
Seamless Transition From
Pandas to Spark
Spark + AI Summit
2020
1
This document is being distributed for informational and educational purposes only and is not an offer to sell or the
solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to
provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the
views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the
assumptions of the author(s) of the document and are subject to change without notice. The document may employ
data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information
and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities
other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright and/or trademark
laws. Use of such image, copyright or trademark does not imply any association with such organization (or
endorsement of such organization) by Two Sigma, nor vice versa.
Legal Disclaimer
2
● If you…
○ like pandas but want to analyze large datasets?
○ are interested in “distributed DataFrames” but don’t know which one to choose?
○ want your analysis code to run faster or scale better without making code changes?
Target Audience
3
● Modeling Tools @ Two Sigma
● Apache Spark, Pandas, Apache Arrow, Flint, Ibis
About Me: Li Jin
4
A common data science task...
5
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
6
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
7
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
8
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
9
● The software is not designed for large amounts of data
● Your machine doesn’t have enough RAM to hold all the data
● You are not utilizing all the CPUs
You are happy until...the code runs too slow
10
Try a few things...
11
● Use a bigger machine
● Pros:
○ Low human cost: no code change
● Cons:
○ Same software limits
○ Single threaded
○ Probably not fast enough
Try a few things...
12
● Use a generic way to distribute the code:
○ sparkContext.parallelize(range(2000, 2020)).map(compute_for_year).collect()
● Pros:
○ Medium human cost: small code change
○ Scalable
● Cons:
○ Works only for embarrassingly parallel problems
○ Boundary handling can be tricky
○ Distributed failures
Try a few things...
13
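A minimal sketch of this pattern on toy data (compute_for_year is named on the slide above; its body here is hypothetical and fabricates a tiny frame so the sketch runs standalone):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

def compute_for_year(year):
    # Hypothetical per-year computation; in practice this would load and
    # process that year's data.
    return pd.DataFrame({'year': [year], 'value': [year * 2]})

# One independent task per year, exactly as in the one-liner above
results = sc.parallelize(range(2000, 2020)).map(compute_for_year).collect()

Each task is independent, which is why this approach only helps for embarrassingly parallel problems.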
Try a few things...
● Use a distributed dataframe library
○ Spark
○ Dask
○ Koalas
○ ...
● Pros:
○ Scalable
● Cons:
○ High human cost: learn another API
○ Not obvious which one to use
○ Distributed failures
14
Take a step back...
15
Take a step back...
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
16
The problem
The problem is not how we express the computation, but
how we execute it.
17
Separation of Concerns
From Wikipedia:
“In computer science, separation of concerns (SoC) is a design
principle for separating a computer program into distinct sections
such that each section addresses a separate concern.”
18
Separation of expression and execution
Can we separate “how we express the computation”
(expression) and “how we execute it” (execution)?
19
Separation of expression and execution
● SQL is a way to express the computation independent of the
execution.
● Can we have something like SQL, but for Python Data Analysis?
20
Outline
● Ibis: A high level introduction
● Ibis: expression language
● Ibis: backend execution
● PySpark backend for Ibis
● Conclusion
21
Ibis: A high level
introduction
22
Ibis: Python Data Analysis Framework
● Open source
● Started in 2015 by Wes McKinney
● Worked on by top pandas committers:
○ Wes McKinney
○ Phillip Cloud
○ Jeff Reback
23
Ibis components
● ibis language
○ The API that is used to express the computation with ibis expressions
● ibis backends
○ Modules that translate ibis expressions to something that can be
executed by different computation engines
■ ibis.pandas
■ ibis.pyspark
■ ibis.bigquery
■ ibis.impala
■ ibis.omniscidb
■ ...
24
Ibis language
● Table API
○ Projection
○ Filtering
○ Join
○ Groupby
○ Sort
○ Window
○ Aggregation
○ …
○ AsofJoin
○ UDFs
● Ibis expressions (intermediate representation)
25
Ibis language
● Table API
○ table = table.mutate(v3=table['v1'] + table['v2'])
● Ibis expressions
26
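A minimal sketch of what such an expression looks like on its own, assuming an unbound table built with ibis.table (later slides get the table from a backend client instead):

import ibis

# Schema matching the example table used later in the deck
table = ibis.table(
    [('key', 'string'), ('v1', 'int64'), ('v2', 'int64')],
    name='foo',
)
expr = table.mutate(v3=table['v1'] + table['v2'])
print(type(expr))  # an ibis table expression; no data has been touched yet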
Ibis backends
● Ibis expressions -> Backend specific expressions
● table.mutate(v3=table['v1'] + table['v2'])
○ Pandas: df = df.assign(v3=df['v1'] + df['v2'])
○ PySpark: df = df.withColumn('v3', df['v1'] + df['v2'])
27
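A minimal end-to-end sketch of the pandas backend executing the same expression (the table name and toy data are made up):

import ibis
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b'], 'v1': [1, 2], 'v2': [3, 4]})
client = ibis.pandas.connect({'t': df})   # pandas backend over in-memory frames
table = client.table('t')
expr = table.mutate(v3=table['v1'] + table['v2'])
print(expr.execute())   # executed by the pandas backend, returns a pandas DataFrame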
Ibis: expression
language
28
Recall our earlier example in pandas
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
29
Basic column selection and arithmetic expression
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
30
Basic column selection and arithmetic expression
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
31
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
32
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
33
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
34
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
35
Final translation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
36
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis tables are expressions, not dataframes
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
]) 37
So far, table and the result of the
transformation on the left are
expressions, not actual dataframes.
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis tables are expressions, not dataframes
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
]) 38
We need to execute these expressions
on real data in a backend.
Ibis: backend
execution
39
Initialize backend-specific Ibis client
PySpark
pyspark_client = ibis.pyspark.connect(pyspark_session)
Pandas
pandas_client = ibis.pandas.connect()
Impala
impala_client = ibis.impala.connect(host=host, port=port)
...
40
Ibis expression
my_table = client.table('foo')
Access table in Ibis
41
Ibis expression
my_table = pyspark_client.table('foo')
Access table in Ibis
42
Ibis expression
my_table = pyspark_client.table('foo')
# my_table is an ibis table expression
print(my_table)
PySparkTable[table]
name: foo
schema:
key : string
v1 : int64
v2 : int64
Access table in Ibis
43
Ibis expression
my_table = pyspark_client.table('foo')
# execute materializes the table expression into a pandas DataFrame
df = my_table.execute()
df
Access table in Ibis
44
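Putting the last few slides together, a minimal sketch against a local Spark session (the toy data is made up, and it assumes the PySpark backend resolves tables registered as temp views on the session):

from pyspark.sql import SparkSession
import ibis

spark = SparkSession.builder.master('local[*]').getOrCreate()
spark.createDataFrame(
    [('a', 1, 10), ('a', 2, 20), ('b', 3, 30)],
    ['key', 'v1', 'v2'],
).createOrReplaceTempView('foo')

pyspark_client = ibis.pyspark.connect(spark)
my_table = pyspark_client.table('foo')   # an ibis table expression
df = my_table.execute()                  # materialized as a pandas DataFrame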
def transform(table: TableExpr) -> TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
Recall our table transformation in Ibis
45
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
Apply transform() on our Ibis table
46
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
Apply transform() on our Ibis table
47
my_table and result_table are
ibis table expressions, not dataframes.
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
result_table.execute()
Execute the result on our backend
48
def transform(table) -> ibis.expr.types.TableExpr:
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table = table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
return table
result_table = transform(my_table)
result_table.execute()
Execute the result on our backend
49
pandas dataframe
Ibis: PySpark
backend
50
Translate Ibis expression into PySpark expressions
51
Ibis expression tree
Basic column selection and arithmetic expression
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
52
Basic column selection and arithmetic expression
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
53
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Basic column selection and arithmetic expression
54
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
Basic column selection and arithmetic expression
55
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
t is a PySparkTranslator
Basic column selection and arithmetic expression
56
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
t is a PySparkTranslator
It has a translate() method that
evaluates an Ibis expression into a
PySpark object.
Basic column selection and arithmetic expression
57
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
expr is the ibis expression to
translate
Basic column selection and arithmetic expression
58
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
scope is a dict that caches results
of previously translated Ibis
expressions.
Basic column selection and arithmetic expression
59
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
We use the PySparkTranslator to
evaluate the left and right operands
(which are themselves ibis
expressions) into PySpark columns.
Basic column selection and arithmetic expression
60
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
left and right are
PySpark columns
PySpark column division
Basic column selection and arithmetic expression
61
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left / right
Basic column selection and arithmetic expression
62
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left + right
PySpark column addition
Basic column selection and arithmetic expression
63
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
op = expr.op()
left = t.translate(op.left, scope)
right = t.translate(op.right, scope)
return left + right
Basic column selection and arithmetic expression
64
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
op = expr.op()
column_name = op.name
pyspark_df = t.translate(op.table, scope)
return pyspark_df[column_name]
Basic column selection and arithmetic expression
65
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
op = expr.op()
column_name = op.name
pyspark_df = t.translate(op.table, scope)
return pyspark_df[column_name]
Basic column selection and arithmetic expression
66
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
op = expr.op()
column_name = op.name
pyspark_df = t.translate(op.table, scope)
return pyspark_df[column_name]
PySpark column selection
Basic column selection and arithmetic expression
67
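The same pattern extends to other operations. A hypothetical illustration following the shape of the rules above (ops.Multiply is a real ibis operation, but this rule is only a sketch; the actual PySpark backend ships its own, and @compiles / ops are the same names used on the preceding slides):

@compiles(ops.Multiply)
def compile_multiply(t, expr, scope, **kwargs):
    # Translate both operands into PySpark columns, then combine them
    # with the corresponding PySpark operator, as in compile_divide above.
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left * right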
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
68
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
69
PySpark translation
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
70
PySpark translation
from pyspark.sql.window import Window
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
71
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
F.mean(df['feature']).over(w)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
72
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
df = df.withColumn(
'feature2', F.mean(df['feature']).over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
73
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
df = df.withColumn(
'feature2', F.mean(df['feature']).over(w)
)
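A minimal sketch running the translated PySpark code on toy data (data is made up; note that preceding=2 in ibis corresponds to rowsBetween(-2, currentRow), i.e. a trailing window of 3 rows, matching rolling(3, min_periods=1) in the pandas version):

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [('a', 1, 10), ('a', 2, 20), ('a', 3, 30), ('b', 1, 40)],
    ['key', 'v1', 'v2'],
)
df = df.withColumn('feature', (df['v1'] + df['v2']) / 2)
w = (
    Window.partitionBy('key')
    .orderBy('v1')
    .rowsBetween(-2, Window.currentRow)
)
df = df.withColumn('feature2', F.mean(df['feature']).over(w))
df.show()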
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
74
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
75
PySpark translation
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
76
PySpark translation
# PySpark column expressions
F.min(df['feature2'])
F.max(df['feature2'])
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
77
PySpark translation
df.groupby('key').agg(
F.min(df['feature2']),
F.max(df['feature2'])
)
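A minimal sketch of this grouped aggregation on toy data (column values are made up):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [('a', 10.0), ('a', 20.0), ('b', 30.0)],
    ['key', 'feature2'],
)
df.groupby('key').agg(
    F.min(df['feature2']),
    F.max(df['feature2']),
).show()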
More interesting examples
Ibis expression
table['v1'].rank().over(window)
78
PySpark translation
More interesting examples
Ibis expression
table['v1'].rank().over(window)
79
PySpark translation
import pyspark.sql.functions as F
F.rank().over(window).astype('long') - F.lit(1)
More interesting examples
Ibis expression
table['v1'].rank().over(window)
80
PySpark translation
import pyspark.sql.functions as F
F.rank().over(window).astype('long') - F.lit(1)
Subtle differences: Ibis rank() is zero-based and typed int64, while PySpark’s F.rank() is one-based, hence the cast to long and the off-by-one adjustment
More interesting examples
Ibis expression
table['boolean_col'].not_any()
81
PySpark translation
More interesting examples
Ibis expression
table['boolean_col'].not_any()
82
PySpark translation
~F.max(df['boolean_col'])
More interesting examples
Ibis expression
table['boolean_col'].not_any()
83
PySpark translation
~F.max(df['boolean_col'])
No direct translation: not_any() (“no value in the group is true”) is expressed as the negation of max() over the boolean column
Conclusion
84
Conclusion
● Separate expression and execution
● Don’t limit yourself to what you can use today; think about what you can
use in the future
○ Arrow Dataset
○ Modin
○ cudf / dask-cudf
○ ...
85
Thanks!
ice.xelloss@gmail.com (@icexelloss)
86