Pandas UDF and Python Type Hint in Apache Spark 3.0
Hyukjin Kwon
Databricks Software Engineer
Hyukjin Kwon
▪ Apache Spark PMC / Committer
▪ Major Koalas contributor
▪ Databricks Software Engineer
▪ @HyukjinKwon on GitHub
Agenda
▪ Pandas UDFs
▪ Python Type Hints
▪ Proliferation of Pandas UDF Types
▪ New Pandas APIs with Python Type Hints
▪ Pandas UDFs
▪ Pandas Function APIs
Pandas UDFs
▪ Apache Arrow, used to exchange data between the JVM and the Python driver/executors with near-zero (de)serialization cost
▪ Vectorization
▪ Rich APIs in Pandas and NumPy
Pandas UDFs
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    # `v` is a pandas Series
    return v.add(1)  # outputs a pandas Series

spark.range(10).select(pandas_plus_one("id")).show()
Scalar Pandas UDF example that adds one
Diagram: values in a Spark DataFrame column are passed to the Pandas UDF as a pandas Series.
Pandas UDFs
Diagram: each Spark executor runs a Python worker.
Pandas UDFs
Diagram: a Spark DataFrame is split into partitions across the executors.
Pandas UDFs
Diagram: each partition is converted into an Arrow batch with near-zero (de)serialization.
Pandas UDFs
Diagram: each Arrow batch is handed to the Python worker as a pandas Series; the UDF body (`return v.add(1)`) runs once per batch, giving vectorized execution.
Performance
Chart: Pandas UDF vs regular UDF.
Python Type Hints
def greeting(name):
    return 'Hello ' + name
Typical Python code

def greeting(name: str) -> str:
    return 'Hello ' + name
Python code with type hints
Python Type Hints
▪ PEP 484
▪ Standard syntax for type annotations in Python 3
▪ Optional
▪ Static analysis
▪ IDEs can automatically detect and report type mismatches
▪ Static analysis tools such as mypy
▪ Easier to refactor code
▪ Runtime type checking and code generation
▪ Infer the types of the code to run
▪ Runtime type checking (see the sketch below)
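Type hints are also available at runtime, which is what enables runtime type checking and type-driven code generation. A minimal sketch using the standard typing.get_type_hints helper (not from the original slides):

from typing import get_type_hints

def greeting(name: str) -> str:
    return 'Hello ' + name

# Annotations are stored on the function object and can be inspected
# at runtime, e.g. to validate arguments or choose a code path.
print(get_type_hints(greeting))  # {'name': <class 'str'>, 'return': <class 'str'>}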
IDE Support
def merge(
    self,
    right: "DataFrame",
    how: str = "inner",
    ...
Python type hint support in IDE
Static Analysis and Documentation
databricks/koalas/frame.py: note: In member "join" of class "DataFrame":
databricks/koalas/frame.py:7546: error: Argument "how" to "merge" of "DataFrame" has incompatible type "int"; expected "str"
Found 1 error in 1 file (checked 65 source files)
mypy static analysis
Auto-documentation
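A self-contained sketch of the kind of call mypy is flagging above, assuming a merge signature like the one on the IDE Support slide (a hypothetical reconstruction, not from the slides):

class DataFrame:
    def merge(self, right: "DataFrame", how: str = "inner") -> "DataFrame":
        return self  # signature stub; real logic omitted

df = DataFrame()
# Runs fine at runtime (hints are not enforced), but mypy reports:
# error: Argument "how" to "merge" of "DataFrame" has incompatible type "int"; expected "str"
df.merge(DataFrame(), how=1)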
Python Type Hints
▪ Early but still growing
▪ Arguably still premature
▪ Type hinting APIs are still changing and under active development
▪ Started being used in production
▪ Type hinting is encouraged and increasingly used in production
▪ PySpark type hints support: pyspark-stubs
▪ Third-party, optional PySpark type hinting support
Proliferation of Pandas UDF Types
Pandas UDFs in Apache Spark 2.4
▪ Scalar Pandas UDF
▪ Transforms Pandas Series to Pandas Series and returns a Spark Column
▪ Input and output have the same length
▪ Grouped Map Pandas UDF
▪ Splits each group into a Pandas DataFrame, applies a function on each, and combines the results as a Spark DataFrame
▪ The function takes a Pandas DataFrame and returns a Pandas DataFrame
▪ Grouped Aggregate Pandas UDF
▪ Splits each group into a Pandas Series, applies a function on each, and combines the results as a Spark Column
▪ The function takes a Pandas Series and returns a single aggregated scalar value
Pandas UDFs proposed in Apache Spark 3.0
▪ Scalar Iterator Pandas UDF
▪ Transforms an iterator of Pandas Series to an iterator of Pandas Series and returns a Spark Column
▪ Map Pandas UDF
▪ Transforms an iterator of Pandas DataFrames to an iterator of Pandas DataFrames within a Spark DataFrame
▪ Cogrouped Map Pandas UDF
▪ Splits each cogroup into a Pandas DataFrame, applies a function on each, and combines the results as a Spark DataFrame
▪ The function takes and returns a Pandas DataFrame
Complexity and Confusion
@pandas_udf("long", PandasUDFType.SCALAR)
def pandas_plus_one(v):
return v + 1
spark.range(3).select(pandas_plus_one("id").alias("id")).show()
@pandas_udf("long", PandasUDFType.SCALAR_ITER)
def pandas_plus_one(vv):
return map(lambda v: v + 1, vv)
spark.range(3).select(pandas_plus_one("id").alias("id")).show()
@pandas_udf("id long", PandasUDFType.GROUPED_MAP)
def pandas_plus_one(v):
return v + 1
spark.range(3).groupby("id").apply(pandas_plus_one).show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
Same output
Adds one
Complexity and Confusion
@pandas_udf("long", PandasUDFType.SCALAR)
def pandas_plus_one(v):
# `v` is a pandas Series
return v + 1 # outputs a pandas Series
spark.range(3).select(pandas_plus_one("id").alias("id")).show()
@pandas_udf("long", PandasUDFType.SCALAR_ITER)
def pandas_plus_one(vv):
# `vv` is an iterator of pandas Series.
# outputs an iterator of pandas Series.
return map(lambda v: v + 1, vv)
spark.range(3).select(pandas_plus_one("id").alias("id")).show()
@pandas_udf("id long", PandasUDFType.GROUPED_MAP)
def pandas_plus_one(v):
# `v` is a pandas DataFrame
return v + 1 # outputs a pandas DataFrame
spark.range(3).groupby("id").apply(pandas_plus_one).show()
▪ What types are expected in the function?
▪ How does each UDF work?
▪ Why should I specify the UDF type?
Adds one
Complexity and Confusion
@pandas_udf("long", PandasUDFType.SCALAR)
def pandas_plus_one(v):
return v + 1
df = spark.range(3)
df.select(pandas_plus_one("id") + cos("id")).show()
@pandas_udf("id long", PandasUDFType.GROUPED_MAP)
def pandas_plus_one(v):
return v + 1
df = spark.range(3)
df.groupby("id").apply(pandas_plus_one("id") + col(“id")).show()
Adds one and cosine
Adds one and cosine(?)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...", line 70, in apply
  ...
ValueError: Invalid udf: the udf argument must be a pandas_udf of type GROUPED_MAP.
+-------------------------------+
|(pandas_plus_one(id) + COS(id))|
+-------------------------------+
| 2.0|
| 2.5403023058681398|
| 2.5838531634528574|
+-------------------------------+
Complexity and Confusion
@pandas_udf("long", PandasUDFType.SCALAR)
def pandas_plus_one(v):
return v + 1
df = spark.range(3)
df.select(pandas_plus_one("id") + cos("id")).show()
@pandas_udf("id long", PandasUDFType.GROUPED_MAP)
def pandas_plus_one(v):
return v + 1
df = spark.range(3)
# `pandas_plus_one` can _only_ be used with `groupby(...).apply(...)`
df.groupby("id").apply(pandas_plus_one("id") + col("id")).show()
Adds one and cosine
Adds one and cosine(?)
▪ Expression
▪ Query execution plan
New Pandas APIs with Python Type Hints
Python Type Hints
@pandas_udf("long")
def pandas_plus_one(v: pd.Series) -> pd.Series:
return v + 1
spark.range(3).select(pandas_plus_one("id").alias("id")).show()
@pandas_udf("long")
def pandas_plus_one(vv: Iterator[pd.Series]) -> Iterator[pd.Series]:
return map(lambda v: v + 1, vv)
spark.range(3).select(pandas_plus_one("id").alias("id")).show()
@pandas_udf("id long")
def pandas_plus_one(v: pd.DataFrame) -> pd.DataFrame:
return v + 1
spark.range(3).groupby("id").apply(pandas_plus_one).show()
▪ Self-descriptive
▪ Describes what the pandas UDF is supposed to take and return
▪ Shows the relationship between input and output
▪ Static analysis
▪ The IDE detects if non-pandas instances are used mistakenly
▪ Other tools such as mypy can be integrated for better code quality in pandas UDFs
▪ Auto-documentation
▪ Type hints in the pandas UDF automatically document the input and output
API Separation
▪ Pandas UDFs
▪ Works as a function, internally an expression
▪ Consistent with Scala UDFs and regular Python UDFs
▪ Returns a regular PySpark Column
▪ Pandas Function APIs
▪ Works as an API on a DataFrame, internally a query plan
▪ Consistent with APIs such as map, mapGroups, etc.
@pandas_udf("long")
def pandas_plus_one(v: pd.Series) -> pd.Series:
return v + 1
df = spark.range(3)
df.select(pandas_plus_one("id") + cos("id")).show()
def pandas_plus_one(v: pd.DataFrame) -> pd.DataFrame:
return v + 1
df = spark.range(3)
df.groupby("id").applyInPandas(pandas_plus_one).show()
New Pandas UDFs
▪ Series to Series
▪ A Pandas UDF
▪ pandas.Series, ... -> pandas.Series
▪ Length of each input series and the output series should be the same
▪ StructType in input and output is represented via pandas.DataFrame (see the sketch after this slide)
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def pandas_plus_one(s: pd.Series) -> pd.Series:
    return s + 1
spark.range(10).select(pandas_plus_one("id")).show()
New Style

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1
spark.range(10).select(pandas_plus_one("id")).show()
Old Style (Scalar Pandas UDF)
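As a short illustration of the StructType bullet above (an adapted sketch, not from the original slides): a struct output is returned as a pandas DataFrame whose columns match the struct fields.

import pandas as pd
from pyspark.sql.functions import pandas_udf

# The "first string, last string" struct is returned as a pandas
# DataFrame with matching column names.
@pandas_udf("first string, last string")
def split_name(s: pd.Series) -> pd.DataFrame:
    parts = s.str.split(" ", n=1, expand=True)
    return pd.DataFrame({"first": parts[0], "last": parts[1]})

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(split_name("name")).show()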
New Pandas UDFs
▪ Iterator of Series to Iterator of Series
▪ A Pandas UDF
▪ Iterator[pd.Series] -> Iterator[pd.Series]
▪ Length of the whole input iterator and output iterator should be the same
▪ StructType in input and output is represented via pandas.DataFrame
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('long')
def pandas_plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    return map(lambda s: s + 1, iterator)
spark.range(10).select(pandas_plus_one("id")).show()
New Style

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
    return map(lambda s: s + 1, iterator)
spark.range(10).select(pandas_plus_one("id")).show()
Old Style (Scalar Iterator Pandas UDF)
New Pandas UDFs
▪ Iterator of Multiple Series to Iterator of Series
▪ A Pandas UDF
▪ Iterator[Tuple[pandas.Series, ...]] -> Iterator[pandas.Series]
▪ Length of the whole input iterator and output iterator should be the same
▪ StructType in input and output is represented via pandas.DataFrame
from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")
def multiply_two(
        iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    return (a * b for a, b in iterator)
spark.range(10).select(multiply_two("id", "id")).show()
New Style

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def multiply_two(iterator):
    return (a * b for a, b in iterator)
spark.range(10).select(multiply_two("id", "id")).show()
Old Style (Scalar Iterator Pandas UDF)
New Pandas UDFs
▪ Iterator of Series to Iterator of Series
▪ Iterator of Multiple Series to Iterator of Series
▪ Useful when an expensive state must be computed once and shared across batches
▪ Prefetch the data within the iterator
@pandas_udf("long")
def calculate(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
# Do some expensive initialization with a state
state = very_expensive_initialization()
for x in iterator:
# Use that state for the whole iterator.
yield calculate_with_state(x, state)
df.select(calculate("value")).show()
Initializing an expensive state
@pandas_udf("long")
def calculate(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
# Pre-fetch the iterator
threading.Thread(consume, args=(iterator, queue))
for s in queue:
yield func(s)
df.select(calculate("value")).show()
Pre-fetching input iterator
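The pre-fetching snippet above is a sketch with consume, queue, and func left undefined. A self-contained version might look like the following, where the bounded buffer, sentinel, and plus-one computation are assumptions standing in for real logic:

from typing import Iterator
import queue
import threading

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")
def calculate(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    buffer: queue.Queue = queue.Queue(maxsize=2)  # bounded prefetch buffer
    sentinel = object()  # marks the end of the input

    def consume() -> None:
        # Pull batches from Spark ahead of time while batches are processed.
        for series in iterator:
            buffer.put(series)
        buffer.put(sentinel)

    threading.Thread(target=consume, daemon=True).start()
    while True:
        series = buffer.get()
        if series is sentinel:
            break
        yield series + 1  # stand-in for the real per-batch computation

df = spark.range(10).withColumnRenamed("id", "value")
df.select(calculate("value")).show()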
New Pandas UDFs
▪ Series to Scalar
▪ A Pandas UDF
▪ pandas.Series, ... -> Any (any scalar value)
▪ Should output a scalar value: a Python primitive type such as int, or a NumPy data type such as numpy.int64; Any should ideally be a specific scalar type accordingly
▪ StructType in input is represented via pandas.DataFrame
▪ Typically assumes an aggregation
import pandas as pd
from pyspark.sql.functions import pandas_udf

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("double")
def pandas_mean(v: pd.Series) -> float:
    return v.mean()
df.select(pandas_mean(df['v'])).show()
New Style
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def pandas_mean(v):
    return v.mean()
df.select(pandas_mean(df['v'])).show()
Old Style (Grouped Aggregate Pandas UDF)
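A brief usage note, not on the original slide: a Series-to-Scalar Pandas UDF also works as a grouped aggregation or over a window, e.g.:

from pyspark.sql import Window

# Aggregate per group, or per window partition.
df.groupby("id").agg(pandas_mean(df['v'])).show()
df.withColumn("mean_v", pandas_mean(df['v']).over(Window.partitionBy("id"))).show()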
Pandas Function APIs: Grouped Map
▪ Grouped Map
▪ A Pandas Function API that applies a function on each group
▪ Python type hints are currently optional in Spark 3.0
▪ Length of output can be arbitrary
▪ StructType is unsupported
import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    v = pdf.v
    return pdf.assign(v=v - v.mean())
df.groupby("id").applyInPandas(subtract_mean, df.schema).show()
New Style
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    v = pdf.v
    return pdf.assign(v=v - v.mean())
df.groupby("id").apply(subtract_mean).show()
Old Style (Grouped Map Pandas UDF)
Pandas Function APIs: Map
▪ Map
▪ A Pandas Function API that applies a function on the whole Spark DataFrame
▪ Similar characteristics to the iterator Pandas UDFs
▪ Python type hints are currently optional in Spark 3.0
▪ Length of output can be arbitrary
▪ StructType is unsupported
from typing import Iterator
import pandas as pd

df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf[pdf.id == 1]
df.mapInPandas(pandas_filter, df.schema).show()
Pandas Function APIs: Co-grouped Map
▪ Co-grouped Map
▪ A Pandas Function API that applies a function on each co-group
▪ Requires two grouped Spark DataFrames
▪ Python type hints are currently optional in Spark 3.0
▪ Length of output can be arbitrary
▪ StructType is unsupported
import pandas as pd

df1 = spark.createDataFrame(
    [(1201, 1, 1.0), (1201, 2, 2.0), (1202, 1, 3.0), (1202, 2, 4.0)],
    ("time", "id", "v1"))
df2 = spark.createDataFrame(
    [(1201, 1, "x"), (1201, 2, "y")], ("time", "id", "v2"))

def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge_asof(left, right, on="time", by="id")

df1.groupby("id").cogroup(
    df2.groupby("id")
).applyInPandas(asof_join, "time int, id int, v1 double, v2 string").show()
Re-cap
▪ Pandas APIs leverage Python type hints for static analysis, auto-documentation, and self-descriptive UDFs
▪ Old Pandas UDFs are now separated into Pandas UDFs and Pandas Function APIs
▪ New APIs
▪ Iterator support in Pandas UDFs
▪ Cogrouped-map and map Pandas Function APIs
Questions?