Princeton Research
Data Management
Workshop 2020
Co-sponsored by the Center for Digital Humanities, the Center for Statistics and Machine Learning, the Office of
the Dean for Research, and Data-Driven Social Science Initiative
Organized by Princeton University Library’s Princeton Research
Data Service, Princeton Institute for Computational Science and
Engineering, and OIT Research Computing
Day Two:
Break-out Session:
Python, Numpy, Pandas
Python, Numpy, and Pandas
Henry Schreiner, PICSciE/PHY
henryfs@princeton.edu
2020 Research Data Management Workshop
Python for data science
● Second most popular
language on GitHub
● General purpose
● Only Data Science
language in top 10
● Over 200K PyPI
packages, 1.6 million
releases
Python for data science
● Another metric (PYPL, Google-based) has it #1
● Data Science languages shown below
● Python fastest growing
● R peaked around 2017
● Others also in decline
● Note the log scale!
Timeline
● 1994: Python 1.0 released
● 1995: First array package: Numeric
● 2003: Matplotlib
● 2005: Numeric and numarray merged into Numpy
● 2008: Pandas introduced
● 2012: The Anaconda python distribution
Timeline
● 2012: Numba JIT compiler
● 2014: IPython becomes Jupyter project & notebook
● 2016: LIGO's discovery: Jupyter Notebook + Python
● 2017: Google releases TensorFlow 1.0 (Python)
● Now: Most major machine learning libraries are primarily or
exclusively used via Python
Why Python?
What makes Python
special?
● Great interactivity
● General purpose
● Weaknesses filled
by libraries and
services
Python: the language
● Simple
● Easy to
learn
● Flexible and
powerful
● Object
Oriented
def square(x):
    return x**2

print(square(4))
# Prints 16
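The "Object Oriented" bullet can be illustrated with a minimal sketch. The `Vector2D` class and its methods are hypothetical names chosen for illustration; they do not appear in the original slides.

```python
from dataclasses import dataclass


@dataclass
class Vector2D:
    """Hypothetical example class: a 2D vector with a few methods."""

    x: float
    y: float

    def __add__(self, other):
        # Operator overloading: v1 + v2 calls this method
        return Vector2D(self.x + other.x, self.y + other.y)

    def norm_squared(self):
        # Squared length of the vector
        return self.x**2 + self.y**2


v = Vector2D(1, 2) + Vector2D(2, 2)
print(v)  # Vector2D(x=3, y=4)
print(v.norm_squared())  # 25
```

The `dataclass` decorator generates the constructor and the printed representation, which keeps the class short.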
IPython
● Adds interactive features to
Python
○ Timing chunks of code
○ Shell-like features
○ Fancy display system
%cd my_dir
%%timeit
run_long()
! ./program
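The magics above only work inside IPython or Jupyter. As a rough plain-Python counterpart of `%%timeit`, the standard library's `timeit` module can be used. This is a sketch under assumptions: the body of `run_long` here is a made-up stand-in, and the repeat counts are arbitrary.

```python
import timeit


def run_long():
    """Hypothetical stand-in for an expensive function."""
    return sum(i * i for i in range(10_000))


# Run the function 100 times per repeat and keep the best of 5 repeats,
# which is similar in spirit to what %%timeit reports.
times = timeit.repeat(run_long, number=100, repeat=5)
best_per_call = min(times) / 100
print(f"best of 5: {best_per_call * 1e6:.1f} microseconds per call")
```

Taking the minimum rather than the mean reduces the influence of other processes running on the machine.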
Jupyter Notebooks
● Cell-based HTML
document
● Supports many
kernels (IPython was
first and is the most
popular)
● Interleave
documentation, code,
and output
Jupyter Lab
● Holds multiple
views of
○ Notebooks
○ Output
○ Editors
○ Terminals
Jupyter Hub
● Multiuser notebook or lab instances
● Available at mybinder.org or through Princeton Research
Computing
Example: Runge-Kutta static notebook, runnable mybinder
Libraries
PyPI
● The core service for
Python libraries
● Uses pip to install
● Environment
management separate
Anaconda
● Can package Python
and complex libraries
● Uses conda to install
● Environment manager
too (reproducible)
● conda-forge is
community effort
Numpy
● Adds an array type
● Fast computations
array-at-a-time
● Python and Numpy now
define a standard protocol
for arrays
● A library that replaces
languages like ADL
import numpy as np
v = np.array([1, 2, 3])
print(v**2)
# Prints [1 4 9]
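To make "array-at-a-time" concrete, the sketch below computes the same result with an element-by-element loop and with a single array expression. The variable names and values are illustrative only.

```python
import numpy as np

values = np.arange(1, 6)  # the integers 1 through 5

# Element-at-a-time: a Python loop over each value
looped = np.array([x**2 + 1 for x in values])

# Array-at-a-time: one expression applied to the whole array
vectorized = values**2 + 1

print(vectorized.tolist())  # [2, 5, 10, 17, 26]
print(np.array_equal(looped, vectorized))  # True

# Boolean masks select elements without an explicit loop
odd_values = values[values % 2 == 1]
print(odd_values.tolist())  # [1, 3, 5]
```

On large arrays the vectorized form is typically much faster, because the loop runs in compiled code inside Numpy rather than in the Python interpreter.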
Pandas
● Tabular data
○ A library that replaces tools like R and Excel
○ Designed with interactivity in mind
● Other libraries mimic Pandas’ API
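A minimal sketch of working with tabular data in Pandas follows. The column names and measurements are made up for illustration.

```python
import pandas as pd

# Made-up measurements, for illustration only
df = pd.DataFrame(
    {
        "sample": ["a", "a", "b", "b"],
        "temperature": [20.0, 22.0, 30.0, 34.0],
    }
)

# Add a derived column, computed for all rows at once
df["temperature_f"] = df["temperature"] * 9 / 5 + 32

# Split-apply-combine: mean temperature per sample
means = df.groupby("sample")["temperature"].mean()
print(means["a"], means["b"])  # 21.0 32.0
```

In a notebook, evaluating `df` or `means` as the last line of a cell renders them as formatted tables, which is part of what makes Pandas convenient for interactive work.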
Numba
● Adds a full JIT (just-in-time) compiler to Python
● Compiles normal Python functions to machine code using LLVM
● Supports a growing subset of Python and Numpy
● Can be as fast as any compiled language
● Supports parallel computation, GPUs, and more
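The points above can be sketched with Numba's `njit` decorator. The function name and data are illustrative, and the `try`/`except` fallback is an assumption added so the sketch still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, fall back to a
    # do-nothing decorator so the example still runs in pure Python.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # A plain Python loop: slow in the interpreter, fast once compiled
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.arange(1000, dtype=np.float64)
result = sum_of_squares(data)  # with Numba, the first call compiles the function
print(result == np.sum(data**2))
```

The first call includes compilation time, so timing comparisons should use a second or later call.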
Other libraries of note
● CuPy: CUDA with a numpy interface
● TensorFlow/PyTorch: Machine learning libraries
● Matplotlib: The plotting library for Python
● PyQt/PySide: Bindings to Qt Graphical User Interface
● pybind11: Easy C++ bindings
Summary
● Python is wildly popular, simple to learn, and well
supported
● Python has an impressive collection of tools
○ Interactivity: IPython, Jupyter
○ Package delivery: PyPI (pip), Conda
○ Libraries: Numpy, Pandas, and many more
Demo
● The second half is devoted to a Pandas demo session
