Performant data processing
with PySpark, SparkR and
DataFrame API
Ryuji Tamagawa from Osaka
Many Thanks to Holden Karau,
for the discussion we had about this talk.
Agenda
Who am I ?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I ?
Software engineer working for
Sky, from architecture design to
troubleshooting in the field
Translator working with O’Reilly
Japan
‘Learning Spark’ is the 27th book
Awarded the Rakuten Tech Award
Silver 2010 for translating
‘Hadoop: The Definitive Guide’
A bed for 6 cats
[Slides: book covers — works of 2015 (available Jan 2016?), and past works]
Motivation for
today’s talk
I want to deal with my ‘Big’ data,
WITH PYTHON !!
Apache Spark
Apache Spark
You may already
have heard a lot
Fast, distributed
data processing
framework with
high-level APIs
Written in Scala,
run in JVM
[Diagram: where Spark sits in the Hadoop stack — OS; HDFS; YARN; MapReduce, HBase, Hive etc.; Impala etc. (in-memory SQL engines); Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it’s fast
No need to write temporary data to storage at every step
No need to launch a new JVM process at every step
[Diagram: MapReduce vs Spark — in MapReduce, every map/reduce step invokes a new JVM and does HDFS I/O; in Spark, executor JVMs are invoked once, functions f1–f7 access in-memory RDDs, and storage I/O happens only for the initial read (f1), explicit persists (f4), and shuffles (f5)]
Apache Spark
and
non-JVM languages
Spark supports
non-JVM languages
Shells: PySpark for Python users, SparkR for R users
GUI environments: Jupyter, RStudio
You can write application code in
these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems
with those languages
Data processing performance
in those languages may be
several times slower than in
JVM languages
The reason lies in the architecture:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you
have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more
CPU cores to make up the
performance gap
DataFrame APIs
come to the rescue !
DataFrame
Tabular data with a schema, built on top of RDDs
Successor of SchemaRDD (since 1.4)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
Do it within JVM
When you call
DataFrame APIs from
non-JVM Languages,
data will not be
transferred between JVM
and the language
runtime
Obviously, the
performance is almost
the same as with JVM
languages
Only code goes
through
DataFrame APIs compared to
RDD APIs by examples
[Diagram: RDD API — the driver sends the Python lambda (lambda items: items[0] == ‘abc’) to the executor; rows of the cached DataFrame are transferred from the JVM to the Python runtime to evaluate it, and the result is transferred back]
DataFrame APIs compared to
RDD APIs by examples (cont.)
[Diagram: DataFrame API — the driver sends only the expression filter(df["_1"] == "abc") to the executor; the cached DataFrame and the result stay inside the JVM, and only code is transferred]
Watch out for UDFs
You can write UDFs
in Python
You can use
lambdas in Python,
too
Once you use them,
data flows between
the two worlds
slen = udf(
    lambda s: len(s),
    IntegerType())
df.select(
    slen(df.name)
).collect()
Make it small first,
then use UDFs
Filter or sample your
‘big’ data with
DataFrame APIs
Then use UDFs
The SQL optimizer does
not take UDFs into
account when making
plans (so far)
[Diagram: ‘BIG’ data in a DataFrame → filter with ‘native’ DataFrame APIs → ‘small’ data in a DataFrame → any operation with UDFs]
Make it small first,
then use UDFs
Filter or sample your
‘big’ data with
DataFrame APIs
Then use UDFs
The SQL optimizer does
not take UDFs into
account when making
plans (so far)
slen = udf(
    lambda s: len(s),
    IntegerType())
sqc.sql(
    'select …
    from df
    where fname like "tama%"
    and slen(name)'
).collect()
processed first!
Ingesting Data
It’s slow to deal with files like CSVs from a non-JVM driver
So convert raw data to ‘DataFrame-native’ formats like Parquet first
Then such files can be processed directly by JVM processes (executors), even when
using non-JVM languages
[Diagram: the non-JVM driver on the driver machine loads local data through Py4J into a DataFrame, then writes it to HDFS as Parquet]
Ingesting Data (cont.)
[Diagram: once the data is in Parquet on HDFS, only code flows from the non-JVM driver through Py4J to the executors; the executors (JVM) read the Parquet files directly]
Appendix: Parquet
Parquet: a general-purpose file
format for analytic workloads
Columnar storage: reduces I/O
significantly
High compression rate
Projection pushdown
Today’s workloads are CPU-intensive:
Parquet offers very fast reads designed
with CPU internals in mind
