Performant data processing with PySpark, SparkR and DataFrame API
Ryuji Tamagawa from Osaka
Many thanks to Holden Karau for the discussion we had about this talk.
Agenda
Who am I?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I?
Software engineer working for Sky, from architecture design to troubleshooting in the field
Translator working with O'Reilly Japan
'Learning Spark' is the 27th book I have translated
Received the Rakuten tech award (Silver, 2010) for translating 'Hadoop: The Definitive Guide'
A bed for 6 cats
Works of 2015
Available Jan. 2016?
Past works
Motivation for today's talk
I want to deal with my 'Big' data, WITH PYTHON!!
Apache Spark
Apache Spark
You may already have heard a lot about it
Fast, distributed data processing framework with high-level APIs
Written in Scala, runs on the JVM
[Diagram: ecosystem stack showing OS, HDFS, HBase, YARN, MapReduce, Hive etc., Impala etc. (in-memory SQL engines), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it's fast
No need to write temporary data to storage at every step
No need to launch a JVM process at every step
[Diagram: MapReduce vs. Spark. MapReduce: every map and reduce step involves a JVM invocation and I/O against HDFS. Spark: a single executor (JVM) invocation runs f1 (read data to RDD) through f7 against RDDs held in memory; I/O happens when reading from HDFS, at f4 (persist to storage), and at f5 (does shuffle).]
Apache Spark and non-JVM languages
Spark supports non-JVM languages
Shells: PySpark for Python users, SparkR for R users
GUI environments: Jupyter, RStudio
You can write application code in these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems with those languages
Data processing in those languages may be several times slower than in JVM languages
The reason lies in the architecture: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
The choices you have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more CPU cores to make up the performance gap
DataFrame APIs come to the rescue!
DataFrame
Tabular data with a schema, built on top of RDDs
Successor of SchemaRDD (DataFrames arrived in Spark 1.3; SparkR has had them since 1.4)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime
As a result, performance is almost the same as with JVM languages
Only code goes through
DataFrame APIs compared to RDD APIs by Examples
[Diagram: RDD API. The cached DataFrame in the executor JVM is transferred to Python, where lambda items: items[0] == 'abc' is evaluated; the resulting DataFrame is transferred back and then on to the driver.]
[Diagram: DataFrame API. filter(df["_1"] == "abc") is evaluated inside the executor JVM against the cached DataFrame; only the resulting DataFrame is transferred to the driver.]
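The difference between the two diagrams can be sketched without Spark at all. The following is a plain-Python toy model, not Spark's actual transport: `pickle` stands in for whatever serialization happens between the JVM and the Python runtime, and the function names `rdd_style_filter` and `dataframe_style_filter` are made up for this illustration. It only compares how many bytes would have to cross the boundary in each style.

```python
import pickle

# Toy rows standing in for a cached two-column DataFrame (columns "_1", "_2").
rows = [("abc", i) if i % 10 == 0 else ("xyz", i) for i in range(1000)]


def rdd_style_filter(rows, predicate):
    """Model of the RDD path: rows go to Python, matching rows come back."""
    bytes_moved = 0
    kept = []
    for row in rows:
        payload = pickle.dumps(row)        # data side -> Python
        bytes_moved += len(payload)
        item = pickle.loads(payload)
        if predicate(item):
            reply = pickle.dumps(item)     # Python -> data side
            bytes_moved += len(reply)
            kept.append(pickle.loads(reply))
    return kept, bytes_moved


def dataframe_style_filter(rows, column_index, value):
    """Model of the DataFrame path: only a description of the filter is sent."""
    expression = ("==", column_index, value)
    bytes_moved = len(pickle.dumps(expression))  # code only
    # The comparison is evaluated where the data already lives.
    kept = [row for row in rows if row[column_index] == value]
    return kept, bytes_moved


rdd_rows, rdd_bytes = rdd_style_filter(rows, lambda items: items[0] == "abc")
df_rows, df_bytes = dataframe_style_filter(rows, 0, "abc")
print(rdd_rows == df_rows, rdd_bytes > df_bytes)
```

Both paths return the same rows; the lambda-based path moves every row across the boundary, while the expression-based path moves a constant-size description regardless of how many rows there are.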
Watch out for UDFs
You can write UDFs in Python
You can use lambdas in Python, too
Once you use them, data flows between the two worlds
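from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# Each row's name is shipped to a Python worker to evaluate the lambda
slen = udf(lambda s: len(s), IntegerType())
df.select(slen(df.name)).collect()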
Make it small first, then use UDFs
Filter or sample your 'big' data with DataFrame APIs
Then use UDFs
The SQL optimizer does not take UDF cost into account when making plans (so far)
[Diagram: 'BIG' data in DataFrame -> filtering with 'native APIs' -> 'Small' data in DataFrame -> whatever operation with UDFs]
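from pyspark.sql.types import IntegerType
# Register the UDF by name so that SQL can call it; sqc is a SQLContext
sqc.registerFunction("slen", lambda s: len(s), IntegerType())
df.registerTempTable("df")
sqc.sql("""
select…
from df
where fname like "tama%"
and slen(name)
""").collect()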
processed first !
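The "make it small first" advice can be illustrated with a plain-Python sketch that needs no Spark. The sample rows and the call counter are invented for this illustration; `slen` and the "tama" prefix filter echo the slide. Each call to `slen` stands for one value crossing from the JVM into Python, so the count is a rough proxy for the transfer cost.

```python
# Sample rows, made up for this sketch.
rows = [
    {"fname": "tamagawa", "name": "Ryuji"},
    {"fname": "tamaki", "name": "Ken"},
    {"fname": "sato", "name": "Hanako"},
    {"fname": "suzuki", "name": "Ichiro"},
]

udf_calls = {"count": 0}


def slen(s):
    """Stand-in for a Python UDF; counts how often it is invoked."""
    udf_calls["count"] += 1
    return len(s)


def is_tama(row):
    """Stand-in for a native filter such as: fname like "tama%"."""
    return row["fname"].startswith("tama")


def udf_first(rows):
    """Apply the UDF to every row, then filter."""
    udf_calls["count"] = 0
    with_len = [(row, slen(row["name"])) for row in rows]
    result = [(row["name"], n) for row, n in with_len if is_tama(row)]
    return result, udf_calls["count"]


def filter_first(rows):
    """Filter with the cheap native predicate, then apply the UDF."""
    udf_calls["count"] = 0
    small = [row for row in rows if is_tama(row)]
    result = [(row["name"], slen(row["name"])) for row in small]
    return result, udf_calls["count"]


slow_result, slow_calls = udf_first(rows)
fast_result, fast_calls = filter_first(rows)
print(slow_result == fast_result, slow_calls, fast_calls)
```

The two orderings give the same answer, but the filter-first version invokes the UDF only for the rows that survive the filter. Since the optimizer may not reorder this for you, doing the filtering explicitly in a separate step is the safer choice.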
Ingesting Data
Handling files such as CSVs through a non-JVM driver is slow
So first convert raw data to a 'DataFrame-native' format such as Parquet
Such files can be processed directly by JVM processes (executors), even when you use non-JVM languages
[Diagram: local data on the driver machine, the driver talking to the JVM through Py4J, the DataFrame held by the executors, and HDFS (Parquet)]
[Diagram: the same setup with the data already in HDFS (Parquet): only code passes from the driver through Py4J to the JVM and the executors]
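A one-time conversion step might look like the sketch below. It assumes a `SparkSession` (Spark 2.0 or later, which is newer than the versions this talk was written against) is passed in as `spark`; the function names and both paths are placeholders invented for this example. On the Spark 1.x releases of the talk's era, reading CSV required a separate package, so treat this as an illustration of the idea rather than the original setup.

```python
def convert_csv_to_parquet(spark, csv_path, parquet_path):
    """Read a CSV file with Spark's own reader and rewrite it as Parquet.

    `spark` is assumed to be a SparkSession; both paths are chosen by the caller.
    """
    # The parsing runs in the JVM, not in the Python driver.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)


def load_parquet(spark, parquet_path):
    """Later jobs read the Parquet copy; the Python side only sends the request."""
    return spark.read.parquet(parquet_path)
```

With PySpark installed, a session can be created with `SparkSession.builder.getOrCreate()` from `pyspark.sql` and passed to both functions; after the conversion, day-to-day jobs only need `load_parquet`.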
Appendix: Parquet
Parquet: general-purpose file format for analytic workloads
Columnar storage: reduces I/O significantly
High compression rate
Projection pushdown
Today, workloads are becoming CPU-intensive: Parquet offers very fast reads and is aware of CPU internals
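Why columnar storage and projection pushdown reduce I/O can be shown with a small plain-Python sketch. This is a toy model of the two layouts, not the Parquet file format itself, and the records and field names are made up. It counts how many stored values must be read to answer a query that needs only one column.

```python
# Three records, invented for this sketch.
records = [
    {"name": "alice", "city": "Osaka", "age": 31},
    {"name": "bob", "city": "Tokyo", "age": 45},
    {"name": "carol", "city": "Kyoto", "age": 27},
]

# Row-oriented layout (CSV-like): the values of one record sit together.
row_store = [list(r.values()) for r in records]

# Column-oriented layout (Parquet-like): the values of one column sit together.
column_store = {key: [r[key] for r in records] for key in records[0]}


def ages_from_rows(row_store, age_index=2):
    """Scan every record; all fields of each record are read."""
    values_read = sum(len(row) for row in row_store)
    return [row[age_index] for row in row_store], values_read


def ages_from_columns(column_store):
    """Projection pushdown: read only the requested column."""
    ages = column_store["age"]
    return list(ages), len(ages)


row_ages, row_reads = ages_from_rows(row_store)
col_ages, col_reads = ages_from_columns(column_store)
print(row_ages == col_ages, row_reads, col_reads)
```

The answers match, but the row layout touches every field of every record while the column layout touches only the one column asked for. Values of the same type stored side by side also tend to compress better, which fits the "high compression rate" point above.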

