WHY PYTHON IS BETTER FOR DATA SCIENCE
ÍCARO MEDEIROS
São Paulo Big Data Meetup

São Paulo - SP, 25/11/2015
DATA SCIENTISTS SHOULD DO…
http://berkeleysciencereview.com/article/first-rule-data-science/
WHY PYTHON?
▸ General purpose

▸ Smooth learning curve

▸ REPL (IPython!)

▸ Programmer productivity

▸ Popular and mature

▸ Glue language (high level API, low level C/Fortran bindings)

▸ Science ecosystem (growing!)
PYTHON IS POPULAR: IT MEANS WIDESPREAD KNOWLEDGE AND MANY TOOLS
http://githut.info/
pypl.github.io/PYPL.html
AVOID THE TWO LANGUAGE PROBLEM
PYTHON CAN BE USED IN THE WHOLE DATA SCIENCE WORKFLOW
https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=22
ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD…
AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES
Jeff Hammerbacher
http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/
OPTIONS FOR PROCESSING PIPELINE
Airflow
https://github.com/airbnb/airflow
Luigi
https://github.com/spotify/luigi
AIRFLOW EXAMPLE
https://github.com/airbnb/airflow
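The screenshot from this slide is not reproduced here. As a rough illustration of what a multistage pipeline definition expresses (tasks plus the dependencies between them), the sketch below uses only the standard library's `graphlib` (Python 3.9+). It is not Airflow's or Luigi's API, and the task names `extract`, `clean`, `train` and `report` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each one only records that it ran.
log = []


def make_task(name):
    def task():
        log.append(name)
    return task


tasks = {name: make_task(name) for name in ("extract", "clean", "train", "report")}

# Each key depends on every task in its value set.
dependencies = {
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}


def run_pipeline(tasks, dependencies):
    """Run every task only after all of its upstream tasks have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


order = run_pipeline(tasks, dependencies)
print(order)
```

A real scheduler adds retries, scheduling and monitoring on top of this ordering step, which is where tools like Airflow and Luigi come in.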
REGRESSION ANALYSIS IN PYTHON: EASY
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
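A minimal sketch of an ordinary least squares fit, assuming NumPy is installed. It uses `numpy.linalg.lstsq` rather than the statsmodels API shown in the linked notebook, and the data is synthetic and made up for illustration. The helper name `fit_ols` is hypothetical.

```python
import numpy as np


def fit_ols(x, y):
    """Return (intercept, slope) minimizing the squared error of y ~ a + b*x."""
    # Design matrix: a column of ones for the intercept, then the regressor.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]


# Synthetic, noise-free data generated from y = 1 + 2x.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x

intercept, slope = fit_ols(x, y)
print(intercept, slope)
```

statsmodels adds standard errors, confidence intervals and a summary table on top of the coefficient estimates; this sketch only recovers the coefficients.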
PYTHON <3 BIG DATA
▸ mrjob: map reduce in python
https://pythonhosted.org/mrjob
▸ snakebite: pure python HDFS client
https://github.com/spotify/snakebite
▸ Spark: fast and general engine for large-scale data processing
http://spark.apache.org
…
OH, BUT SCALA/JAVA IS FASTER. PYTHON IS 2 * FASTER: [WRITING, RUNNING]
DataFrame operations are optimized and compiled into JVM bytecode
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
RDD AVERAGE: EXAMPLE FROM ‘LEARNING SPARK'
SO CONCISE
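The code image from this slide did not survive. The sketch below is not the book's listing: it is a plain-Python imitation of the pattern behind an `aggregate`-style average, where each partition is folded into a `(sum, count)` pair and the pairs are then merged. The local `aggregate` helper and the sample partitions are hypothetical stand-ins, so no Spark installation is assumed.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Hypothetical data, already split into partitions.
partitions = [[1, 2, 3], [4, 5], [6]]

total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # add one value to (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge two (sum, count) pairs
)
average = total / count
print(average)
```

PySpark's `RDD.aggregate(zeroValue, seqOp, combOp)` takes the same three arguments, so the two lambdas would carry over unchanged.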
COMMUNICATE RESULTS WITH IPYTHON / JUPYTER
Language agnostic :)
DEMO TIME
MATPLOTLIB / SEABORN / PLOT.LY / BOKEH: SUCH VISUALIZATION!!
PYTHON FITS ALL!
PYTHON FOR SCIENCE IS GROWING
SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR THE PYTHON COMMUNITY
 #  module      imports  imports/numpy
 1  sys         2437939  5.85
 2  os          2009086  4.82
 3  re          1303009  3.12
 4  numpy        416981  1.00
 5  warnings     371345  0.89
 6  subprocess   344934  0.83
 7  django       282097  0.68
 8  math         281987  0.68
11  matplotlib   146913  0.35
13  pylab         77817  0.19
14  scipy         69092  0.17
22  pandas        18928  0.05
24  theano         5482  0.01
6/25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE
https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
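The last column of the table is each module's import count divided by numpy's. A small sketch of that arithmetic, using the counts listed on the slide (the helper name `relative_to` is hypothetical):

```python
# Import counts as listed on the slide.
imports = {
    "sys": 2437939,
    "os": 2009086,
    "re": 1303009,
    "numpy": 416981,
    "warnings": 371345,
    "subprocess": 344934,
    "django": 282097,
    "math": 281987,
    "matplotlib": 146913,
    "pylab": 77817,
    "scipy": 69092,
    "pandas": 18928,
    "theano": 5482,
}


def relative_to(counts, baseline):
    """Express every count as a ratio of the baseline module's count (2 decimals)."""
    base = counts[baseline]
    return {module: round(n / base, 2) for module, n in counts.items()}


ratios = relative_to(imports, "numpy")
for module, ratio in ratios.items():
    print(f"{module:<12}{ratio:.2f}")
```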
SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION
https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)),
                  np.dot(H, beta) - r))

# With the @ operator
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

$S = (H\beta - r)^T (H V H^T)^{-1} (H\beta - r)$
PEP 0465: PROPOSED FEB/14. AVAILABLE SINCE PYTHON 3.5 (SEP/15)
2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON
AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES
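The slide's snippet leaves `H`, `beta`, `r` and `V` undefined. A runnable sketch, assuming NumPy and Python 3.5+, that fills them with made-up random matrices (the shapes and the seed are arbitrary choices for illustration) and checks that both spellings give the same statistic:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)

# Made-up shapes: 2 linear restrictions on 3 coefficients.
H = rng.normal(size=(2, 3))
beta = rng.normal(size=(3, 1))
r = rng.normal(size=(2, 1))

# Build a symmetric positive definite V so that H V H^T is invertible.
A = rng.normal(size=(3, 3))
V = A @ A.T + np.eye(3)

# Using the dot function:
S_dot = np.dot((np.dot(H, beta) - r).T,
               np.dot(inv(np.dot(np.dot(H, V), H.T)),
                      np.dot(H, beta) - r))

# With the @ operator:
S_at = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

print(np.allclose(S_dot, S_at))
```

Both results are 1x1 arrays; the `@` version reads left to right in the same order as the formula.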
SCIENCE STACK IS GETTING BETTER EACH DAY
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8
SCIENCE STACK IS ALWAYS EVOLVING…
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=29
CONDA: AUTOMATING ENVIRONMENTS
https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=60
THE STACK IS STILL GETTING NEW MEMBERS…
http://www.tensorflow.org/
TAKEAWAY MESSAGE
TRY PYTHON. IT WILL BE A ONE-WAY TRIP!
slides
icaromedeiros.com.br
slideshare.net/icaromedeiros
@icaromedeiros