Performant data processing
with PySpark, SparkR and
DataFrame API
Ryuji Tamagawa from Osaka
Many Thanks to Holden Karau,
for the discussion we had about this talk.
Agenda
Who am I ?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I ?
Software engineer working for
Sky, from architecture design to
troubleshooting in the field
Translator working with O’Reilly
Japan
‘Learning Spark’ is the 27th book
Awarded the Rakuten Tech Award
Silver 2010 for translating
‘Hadoop: The Definitive Guide’
A bed for 6 cats
[Slides: book covers — works of 2015 (available Jan 2016?), and past works]
Motivation for
today’s talk
I want to deal with my ‘Big’ data,
WITH PYTHON !!
Apache Spark
Apache Spark
You may already
have heard a lot
Fast, distributed
data processing
framework with
high-level APIs
Written in Scala,
run in JVM
[Diagram: where Spark sits in the Hadoop stack — OS; HDFS; YARN; MapReduce, HBase, Hive etc.; Impala etc. (in-memory SQL engines); Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it’s fast
No need to write temporary data to storage at every step
No need to launch a new JVM process at every step
[Diagram: MapReduce vs Spark — in MapReduce, every map/reduce step invokes a new JVM and does HDFS I/O; in Spark, executor JVMs are invoked once, functions f1–f7 access in-memory RDDs, and storage I/O happens only for the initial read (f1), explicit persists (f4), and shuffles (f5)]
Apache Spark
and
non-JVM languages
Spark supports
non-JVM languages
Shells: PySpark for Python users, SparkR for R users
GUI environments: Jupyter, RStudio
You can write application code in
these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems
with those languages
Data processing performance
in those languages may be
several times slower than in
JVM languages
The reason lies in the architecture:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you
have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more
CPU cores to make up the
performance gap
DataFrame APIs
come to the rescue !
DataFrame
Tabular data with a schema, built on top of RDDs
Successor of SchemaRDD (since 1.4)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
Do it within JVM
When you call
DataFrame APIs from
non-JVM Languages,
data will not be
transferred between JVM
and the language
runtime
Obviously, the
performance is almost
the same as with JVM
languages
Only code goes
through
DataFrame APIs compared to
RDD APIs by examples
[Diagram: RDD API — the driver sends the Python lambda (lambda items: items[0] == ‘abc’) to the executor; rows of the cached DataFrame are transferred from the JVM to the Python runtime to evaluate it, and the result is transferred back]
DataFrame APIs compared to
RDD APIs by examples (cont.)
[Diagram: DataFrame API — the driver sends only the expression filter(df["_1"] == "abc") to the executor; the cached DataFrame and the result stay inside the JVM, and only code is transferred]
Watch out for UDFs
You can write UDFs
in Python
You can use
lambdas in Python,
too
Once you use them,
data flows between
the two worlds
slen = udf(
    lambda s: len(s),
    IntegerType())
df.select(
    slen(df.name)
).collect()
Make it small first,
then use UDFs
Filter or sample your
‘big’ data with
DataFrame APIs
Then use UDFs
The SQL optimizer does
not take UDFs into
account when making
plans (so far)
[Diagram: ‘BIG’ data in a DataFrame → filter with ‘native’ DataFrame APIs → ‘small’ data in a DataFrame → any operation with UDFs]
Make it small first,
then use UDFs
Filter or sample your
‘big’ data with
DataFrame APIs
Then use UDFs
The SQL optimizer does
not take UDFs into
account when making
plans (so far)
slen = udf(
    lambda s: len(s),
    IntegerType())
sqc.sql(
    'select …
    from df
    where fname like "tama%"
    and slen(name)'
).collect()
processed first!
Ingesting Data
It’s slow to deal with files like CSVs from a non-JVM driver
So convert raw data to ‘DataFrame-native’ formats like Parquet first
Then such files can be processed directly by JVM processes (executors), even when
using non-JVM languages
[Diagram: the non-JVM driver on the driver machine loads local data through Py4J into a DataFrame, then writes it to HDFS as Parquet]
Ingesting Data (cont.)
[Diagram: once the data is in Parquet on HDFS, only code flows from the non-JVM driver through Py4J to the executors; the executors (JVM) read the Parquet files directly]
Appendix: Parquet
Parquet: a general-purpose file
format for analytic workloads
Columnar storage: reduces I/O
significantly
High compression rate
Projection pushdown
Today’s workloads are CPU-intensive:
Parquet offers very fast reads designed
with CPU internals in mind
