PySpark
@
▸ facebook : Ryuji Tamagawa
▸ Twitter : tamagawa_ryuji
▸ FB
pydata.tokyo
▸ Twitter
20170927 PyData Tokyo: The Very Basics of Distributed Processing for Data Science People, and the Key Points of PySpark
8 11
Wes McKinney's blog
▸ http://qiita.com/tamagawa-ryuji
▸ CPU
▸ PyData.Tokyo
PySpark
▸ Spark Hadoop
▸ PySpark
▸ Spark/Hadoop PyData
PySpark
▸ SSD
▸ CPU
Parquet
S3
CPU
https://www.slideshare.net/kumagi/ss-78765920/4
▸ groupby
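One way to picture how a groupby can be spread over several machines is the sketch below. It is an illustration only, not code from the talk: plain Python lists stand in for the workers, and the names `split_into_partitions`, `partial_sum`, and `merge` are hypothetical. Each partition is aggregated on its own, and the small partial results are then merged.

```python
from collections import defaultdict

def split_into_partitions(records, n):
    """Deal records round-robin into n partitions (stand-ins for n workers)."""
    partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        partitions[i % n].append(record)
    return partitions

def partial_sum(partition):
    """Per-worker step: aggregate only the records this worker holds."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)

def merge(partials):
    """Combine step: merge the per-worker results into the final answer."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
partitions = split_into_partitions(records, 3)
result = merge(partial_sum(p) for p in partitions)
print(result)
```

Only the partial sums cross partition boundaries here, which is why this kind of aggregation distributes well. An operation that needs all rows of a key in one place would move far more data.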
N
▸ N
N
▸ …
…
▸ CPU/
▸ CPU/
▸ 1
Hadoop Spark
▸ n /n
▸ Amazon EMR
▸ Microsoft Azure HDInsight
▸ Cloudera Altus
▸ Databricks Community Edition Spark
▸ PyData + Jupyter PySpark
Spark Hadoop
Hadoop0.x Spark
[Figure: the stack in three columns, labelled Hadoop 0.x, Hadoop 1.x, and Hadoop 2.x + Spark.
Hadoop 0.x: OS, HDFS, MapReduce.
Hadoop 1.x: OS, HDFS, MapReduce, HBase, Hive etc.
Hadoop 2.x + Spark: OS, HDFS, YARN, MapReduce, HBase, Hive etc., Impala (SQL), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL). Further Spark boxes with the same components are drawn for YARN, for Mesos, and on their own; a Windows box also appears.]
Spark Hadoop
Hadoop Spark
[Figure: MapReduce vs. Spark. Labels on the MapReduce side: map (JVM), reduce (JVM), HDFS. Labels on the Spark side: functions f1 to f7, RDD, Executor JVM, HDFS.]
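The contrast in the MapReduce vs. Spark figure can be sketched in plain Python. This is a toy model under stated assumptions, not Hadoop or Spark code: a temporary directory stands in for HDFS, and `f1` to `f3` are made-up functions. The first runner writes every intermediate result to storage and reads it back, as a chain of MapReduce jobs does. The second chains the functions and keeps the data in memory.

```python
import json
import os
import tempfile

def f1(x): return x + 1
def f2(x): return x * 2
def f3(x): return x - 3

def run_with_intermediate_files(values, steps, workdir):
    """MapReduce-like: each step reads its input from storage and writes its output back."""
    path = os.path.join(workdir, "step0.json")
    with open(path, "w") as fh:
        json.dump(values, fh)
    for i, step in enumerate(steps, start=1):
        with open(path) as fh:
            data = json.load(fh)
        path = os.path.join(workdir, f"step{i}.json")
        with open(path, "w") as fh:
            json.dump([step(x) for x in data], fh)
    with open(path) as fh:
        return json.load(fh)

def run_in_memory(values, steps):
    """Spark-like: steps are chained lazily; nothing is materialized until the end."""
    stream = iter(values)
    for step in steps:
        stream = map(step, stream)
    return list(stream)

steps = [f1, f2, f3]
with tempfile.TemporaryDirectory() as workdir:
    on_disk = run_with_intermediate_files([1, 2, 3], steps, workdir)
in_memory = run_in_memory([1, 2, 3], steps)
print(on_disk, in_memory)
```

Both runners compute the same result. The difference is the number of round trips to storage, which grows with the number of steps in the first version and stays constant in the second.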
Spark Hadoop
Spark
▸ Hadoop MapReduce
▸ Spark API MapReduce API
▸ Hadoop
PySpark
(Py)Spark
▸ / Spark
▸ PyData
▸ Spark
▸ Spark Hadoop
PyData
PySpark
Spark 1.2
PySpark …
(Py)Spark
PySpark
RDD API DataFrame API
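▸ RDD (Resilient Distributed Dataset): Spark's basic distributed collection, holding Java objects
▸ DataFrame: built on top of RDD; a tabular structure like R's data.frame
▸ From Python, the RDD API is slower than from Scala / Java, while the DataFrame API performs about the same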
PySpark
DataFrame API
▸ RDD → DataFrame / Dataset
▸ MLlib → ML
▸ GraphX → GraphFrame
▸ Spark Streaming → Structured Streaming
[Figure: PySpark with the RDD API. A Driver (JVM), Storage, and three Worker nodes, each with an Executor (JVM) and a Python VM.]
[Figure: PySpark with the DataFrame API. Same components: a Driver (JVM), Storage, and three Worker nodes, each with an Executor (JVM) and a Python VM.]
PySpark
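▸ RDD API: data moves back and forth between the Executor JVM and the Python VM
▸ DataFrame API: processing stays inside the JVM
▸ Python UDFs still run in the Python VM
▸ UDFs can also be written in Scala or Java
▸ Spark 2.x: prefer the DataFrame API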
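The cost behind the RDD API bullet above can be pictured with a small stand-in. This is an analogy rather than PySpark internals: it assumes rows are pickled on their way to a Python process and pickled again on the way back, and the names `python_worker` and `run_via_python_vm` are hypothetical.

```python
import pickle

def python_worker(payload, func):
    """Stand-in for the Python VM: unpickle a batch, apply func, pickle the results."""
    rows = pickle.loads(payload)
    return pickle.dumps([func(row) for row in rows])

def run_via_python_vm(rows, func):
    """Stand-in for an executor handing a batch to the Python VM and reading it back."""
    outbound = pickle.dumps(rows)            # executor -> Python VM
    inbound = python_worker(outbound, func)  # work done on the Python side
    return pickle.loads(inbound), len(outbound) + len(inbound)

rows = [(i, i * 0.5) for i in range(1000)]
doubled, bytes_moved = run_via_python_vm(rows, lambda row: (row[0], row[1] * 2))
print(doubled[:3], bytes_moved)
```

Every row is serialized twice for a single function call. When the whole computation stays in the JVM, as with built-in DataFrame operations, this round trip does not occur.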

Spark PyData
▸ Spark
▸ Python PyData
▸ Parquet
▸ Apache Arrow
Spark PyData
▸ CSV JSON
▸ Parquet: Spark DataFrame API; Python: fastparquet, pyarrow
▸ Performance comparison of different file formats and storage engines in the Hadoop ecosystem
Spark PyData
Parquet


https://parquet.apache.org/documentation/latest/


zip CSV
I/O
[Figure: Parquet file layout. The file is a sequence of row blocks. Inside each row block the values are stored column by column: COLUMN #0 ROW #0 ... ROW #N, then COLUMN #1 ROW #0 ... ROW #N, and so on up to COLUMN #M ROW #N.]
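The row-block layout in the figure can be mimicked with plain Python structures. This is a minimal sketch of the idea only, not the Parquet format itself (no encoding, compression, or metadata); `to_row_blocks`, `read_column`, and the sample data are made up for illustration.

```python
def to_row_blocks(rows, columns, block_size):
    """Split rows into blocks, then store each block column by column."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        blocks.append({col: [row[i] for row in chunk]
                       for i, col in enumerate(columns)})
    return blocks

def read_column(blocks, column):
    """Read a single column; the other columns are never touched."""
    values = []
    for block in blocks:
        values.extend(block[column])
    return values

columns = ["id", "city", "price"]
rows = [(1, "Tokyo", 120), (2, "Osaka", 80), (3, "Sapporo", 95),
        (4, "Tokyo", 60), (5, "Kyoto", 70)]
blocks = to_row_blocks(rows, columns, block_size=2)
prices = read_column(blocks, "price")
print(len(blocks), prices)
```

A query that needs only `price` touches one list per block. Values of one column also sit next to each other, which tends to help compression compared with a row-oriented file such as CSV.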
Spark PyData
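Spark
# Assumes an existing SparkSession `spark`; csvFilename, theSchema and filename are defined elsewhere.
# Read the CSV with an explicit schema and reduce the result to 20 partitions.
df = spark.read.csv(csvFilename, header=True, schema=theSchema).coalesce(20)
# Parquet is Spark's default output format; compress it with snappy.
df.write.save(filename, compression='snappy')
fastparquet
import pandas as pd
from fastparquet import write
# Load the CSV into pandas and write it as uncompressed Parquet.
pdf = pd.read_csv(csvFilename)
write(filename, pdf, compression='UNCOMPRESSED')
pyarrow
import pyarrow as pa
import pyarrow.parquet as pq
# Convert the pandas DataFrame `pdf` to an Arrow table and write it as gzip-compressed Parquet.
arrow_table = pa.Table.from_pandas(pdf)
pq.write_table(arrow_table, filename, compression='GZIP')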
Spark PyData
▸ pandas CSV Spark
Spark pandas
…
▸ Spark - pandas
▸ pandas → Spark …
▸ Apache Arrow
Spark PyData
Apache Arrow
▸ Apache Arrow
▸ PyData / OSS
▸ /
https://arrow.apache.org
Spark PyData
Wes blog
▸ pandas Apache Arrow
▸ Blog
▸ PyData Blog


Wes OK
▸ Apache Arrow and the "10 Things I Hate About pandas"

https://qiita.com/tamagawa-ryuji/items/3d8fc52406706ae0c144
PySpark Python Spark