Python / pandas
Sky
20161004 Python and pandas as a Data Processing Platform (Tokyo)
•
• Python 2000
(**)
• db tech showcase MongoDB
•
• FB: Ryuji Tamagawa
• Twitter : tamagawa_ryuji
2015-2016
• Python / pandas
• Python / pandas
• Python
•
• Python
• NumPy, SciPy, matplotlib, pandas
• Python
• Python IPython, Jupyter notebook, Spyder, Visual Studio
• Python / pandas
• Python
• pandas
• Spark - PySpark DataFrame API
• matplotlib
Part 1 : Python
Python
•
• Google Guido
Google Google
1
•
NumPy, SciPy, matplotlib →
pandas
•
•
-2000
Linux
-2010 Web Trac
Google
Python
•
•
•
•
•
• ’Batteries included’
Python
• 2.x 3.x 32bit 64bit
64bit
• 2.x
• 3.x
3
• 2.x
3.x
• Ruby?
• R?
• Java?
• Scala?
Python
• The standard Python implementation is 'CPython'; alternatives include PyPy (JIT), Jython (JVM), and IronPython (.NET)
• CPython
• CPython 2
• C
• processing
PySpark
Python
• Python Linux Mac OS
Python Python
Mac
• Python pip 3.x 2.7.9 2.x
Python pip Linux Python
pip yum apt
• Python Anaconda Python
conda
• python 2016

http://qiita.com/y__sama/items/5b62d31cb7e6ed50f02c
NumPy, SciPy, matplotlib, pandas
•
• NumPy SciPy
• pandas
pandas pandas NumPy
• Anaconda Python
Python
•
scikit-learn http://scikit-learn.org/stable/
Python
• TensorFlow 

Python
Python


IPython
Jupyter, …
IDE
Spyder, Rodeo
Visual Studio, PyCharm,
PyDev
• IPython
•
•
• Anaconda


• Jupyter Notebook
• Python
• IPython Notebook
Python
• Apache Zeppelin http://zeppelin.apache.org
IDE
• R RStudio
• IDE
•
• 2 Spyder Rodeo
•
Spyder
•
• Visual Studio
• Eclipse PyDev
• PyCharm
•
Part 2 :
Python / pandas
Python / pandas
• pandas
• /
etc…
•
Spark
• pandas
processing
•
• 64bit Python +
GB
• Python 1
1 CPU
GIL
• processing
Jenkins
CPU/
Jenkins
1 1.2 1000000
'abc' ' '
[1, 2, 3, 'foo', 'bar', 'foo']
(1, 2, 3, 'foo', 'bar', 'foo')
{'k1': 'value1', 'k2': 'value2'}
set([1, 2, 3, 'foo', 'bar'])
•
•
• split
s = 'foo, bar, baz'
items = s.split(', ')  # split on comma + space so items carry no leading blank
print(items[0])
print(items[-1])
print(items[0][-2:])
• ,
• lambda map, reduce, filter
sList = ['foo', 'bar', 'baz']
# a comprehension and map build the same list of lengths
lList = [len(s) for s in sList]
lList = list(map(lambda s: len(s), sList))
lDict = {s: len(s) for s in sList}
# equivalent explicit loops
lList = []
for s in sList:
    lList.append(len(s))
lDict = {}
for s in sList:
    lDict[s] = len(s)
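The bullet above names map, reduce, and filter, but the snippet only exercises map. A minimal sketch of all three follows, assuming Python 3 (where reduce lives in functools); the word list is made-up sample data, not taken from the slides.

```python
from functools import reduce  # a builtin on 2.x, moved to functools on 3.x

words = ['foo', 'bar', 'baz', 'quux']  # hypothetical sample data

# filter: keep only the words that start with 'b'
b_words = list(filter(lambda w: w.startswith('b'), words))

# map: word lengths (list() is needed on 3.x, where map returns an iterator)
lengths = list(map(len, words))

# reduce: fold the lengths into a single total, starting from 0
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(b_words)
print(lengths)
print(total)
```

For simple cases like these, a comprehension or the built-in sum() is usually easier to read; the functional forms are shown only because the slide lists them.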
pandas
• pandas
•
matplotlib / seaborn
• NumPy
SciPy
Python
• pandas + matplotlib
OK pandas NumPy
NumPy /
SciPy
https://openbook4.me/projects/183
pandas
• pandas
DataFrame
• R
• RDB
2
• index Series Columns
Columns
Series Series SeriesIndex
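The relationship between a DataFrame, its index, its columns, and the Series it contains can be sketched as follows. The data reuses the city/value/color shape of this deck's example table; treating the city names as the index is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the city names are assumed to be the row index
df = pd.DataFrame(
    {'value': [43, 42, 40], 'color': ['red', 'pink', 'green']},
    index=['sapporo', 'osaka', 'matsumoto'],
    columns=['value', 'color'],  # fix the column order explicitly
)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels

values = df['value']        # selecting one column yields a Series
print(type(values).__name__)

# the Series carries the same index as the DataFrame it came from
print(values.index.equals(df.index))

# label-based lookup of a single cell
print(df.loc['osaka', 'color'])
```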
IDE /
• IDE
• jupyter notebook
• http://sinhrks.hatenablog.com/entry/2015/01/28/073327
0 1
import pandas as pd
df['nValue'] = df['value'] / sum(df['value'])
id         value  color
sapporo    43     red
osaka      42     pink
matsumoto  40     green

id         value  color  nValue
sapporo    43     red    0.344
osaka      42     pink   0.336
matsumoto  40     green  0.32
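The snippet above assumes that df already exists. A self-contained sketch that builds the same table and reproduces the nValue column is shown here; using "id" as the index name and switching to Series.sum() are choices made for this illustration, not something the slides state.

```python
import pandas as pd

# Rebuild the table from the slide; 'id' is assumed to be the index
df = pd.DataFrame(
    {'value': [43, 42, 40], 'color': ['red', 'pink', 'green']},
    index=pd.Index(['sapporo', 'osaka', 'matsumoto'], name='id'),
    columns=['value', 'color'],
)

# Series.sum() is the vectorized counterpart of the built-in sum();
# the division is applied element-wise to the whole column
df['nValue'] = df['value'] / df['value'].sum()

print(df)
```

The normalized column sums to 1 (43 + 42 + 40 = 125, so the shares are 43/125, 42/125, and 40/125).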
Python
pandas I/O
• CSV JSON RDB Excel
• column
• RDB
•
import pandas as pd
df = pd.read_csv(<filename>)
df = pd.read_json(<filename>)
df.to_csv(<filename>)
df.to_excel(<filename>)
# writers are DataFrame methods; the clipboard works as a destination too
df.to_clipboard()
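A runnable sketch of the read/write round trip, using in-memory buffers in place of the `<filename>` placeholders so it needs no files on disk. The CSV text is sample data in the shape of the earlier table, and Python 3 is assumed.

```python
import io
import pandas as pd

csv_text = "id,value,color\nsapporo,43,red\nosaka,42,pink\nmatsumoto,40,green\n"

# Readers are module-level functions and return a DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Writers are DataFrame methods; without a path, to_csv and to_json
# return the serialized text as a string
csv_out = df.to_csv(index=False)
json_out = df.to_json(orient='records')

# Reading the JSON back yields an equivalent DataFrame
df_back = pd.read_json(io.StringIO(json_out))
print(df_back)
```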
pandas.read_csv
• pandas CSV
•
•
• usecols :
• nrows :
• na_values : na
• parse_dates infer_datetime_format:
• chunksize :
• compression : zip CSV
pandas.read_csv(filepath_or_buffer, sep=',',
delimiter=None, header='infer', names=None,
index_col=None, usecols=None, squeeze=False,
prefix=None, mangle_dupe_cols=True, dtype=None,
engine=None, converters=None, true_values=None,
false_values=None, skipinitialspace=False,
skiprows=None, skipfooter=None, nrows=None,
na_values=None, keep_default_na=True,
na_filter=True, verbose=False,
skip_blank_lines=True, parse_dates=False,
infer_datetime_format=False,
keep_date_col=False, date_parser=None,
dayfirst=False, iterator=False, chunksize=None,
compression='infer', thousands=None, decimal='.',
lineterminator=None, quotechar='"', quoting=0,
escapechar=None, comment=None,
encoding=None, dialect=None, tupleize_cols=False,
error_bad_lines=True, warn_bad_lines=True,
skip_footer=0, doublequote=True,
delim_whitespace=False, as_recarray=False,
compact_ints=False, use_unsigned=False,
low_memory=True, buffer_lines=None,
memory_map=False, float_precision=None)
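Several of the options listed above (usecols, nrows, na_values, parse_dates, chunksize) can be tried on a small in-memory CSV. This is a sketch under stated assumptions: the data and the "-" missing-value marker are invented for illustration, and Python 3 is assumed.

```python
import io
import pandas as pd

# Hypothetical CSV; '-' marks a missing measurement
csv_text = """date,city,value,memo
2016-10-01,sapporo,43,x
2016-10-02,osaka,-,y
2016-10-03,matsumoto,40,z
2016-10-04,tokyo,41,w
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['date', 'city', 'value'],  # load only these columns
    nrows=3,                            # stop after three data rows
    na_values=['-'],                    # extra strings to treat as missing
    parse_dates=['date'],               # convert this column to datetime
)
print(df.dtypes)

# chunksize makes read_csv return an iterator of DataFrames instead of
# one large frame, which keeps memory use bounded for big files
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), na_values=['-'], chunksize=2):
    total += chunk['value'].sum()  # sum() skips NaN by default
print(total)
```

Restricting columns with usecols and streaming with chunksize are usually the first things to try when a CSV does not fit comfortably in memory.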
Spark - PySpark DataFrame API
•
Python
• Spark PySpark
findspark
Spark
• Python Spark API
DataFrame API
• Spark pandas
Spark
PySpark
[Figure: a driver and four Spark nodes]
•
•
Apache Arrow
• Python / R
:
feather
• pandas 2.0, parquet for Python
Python / pandas
Questions ?
