SlideShare a Scribd company logo
8
Most read
13
Most read
22
Most read
Financial data analysis in Python with pandas

                  Wes McKinney
                   @wesmckinn


                    10/17/2011




@wesmckinn ()     Data analysis with pandas   10/17/2011   1 / 22
My background




   3 years as a quant hacker at AQR, now consultant / entrepreneur
   Math and statistics background with the zest of computer science
   Active in scientific Python community
   My blog: http://blog.wesmckinney.com
   Twitter: @wesmckinn




    @wesmckinn ()          Data analysis with pandas      10/17/2011   2 / 22
Bare essentials for financial research



    Fast time series functionality
         Easy data alignment
         Date/time handling
         Moving window statistics
         Resamping / frequency conversion
    Fast data access (SQL databases, flat files, etc.)
    Data visualization (plotting)
    Statistical models
         Linear regression
         Time series models: ARMA, VAR, ...




     @wesmckinn ()            Data analysis with pandas   10/17/2011   3 / 22
Would be nice to have




   Portfolio and risk analytics, backtesting
        Easy enough to write yourself, though most people do a bad job of it
   Portfolio optimization
        Most financial firms use a 3rd party library anyway
   Derivative pricing
        Can use QuantLib in most languages




    @wesmckinn ()            Data analysis with pandas          10/17/2011   4 / 22
What are financial firms using?



   HFT: a C++ and hardware arms race, a different topic
   Research
        Mainstream: R, MATLAB, Python, ...
        Econometrics: Stata, eViews, RATS, etc.
        Non-programmatic environments: ClariFI, Palantir, ...
   Production
        Popular: Java, C#, C++
        Less popular, but growing: Python
        Fringe: Functional languages (Ocaml, Haskell, F#)




    @wesmckinn ()            Data analysis with pandas          10/17/2011   5 / 22
What are financial firms using?



   Many hybrid languages environments (e.g. Java/R, C++/R,
   C++/MATLAB, Python/C++)
        Which is the main implementation language?
        If main language is Java/C++, result is lower productivity and higher
        cost to prototyping new functionality
   Trends
        Banks and hedge funds are realizing that Java-based production
        systems can be replaced with 20% as much Python code (or less)
        MATLAB is being increasingly ditched in favor of Python. R and
        Python use for research generally growing




    @wesmckinn ()            Data analysis with pandas          10/17/2011   6 / 22
Python language



   Simple, expressive syntax
   Designed for readability, like “runnable pseudocode”
   Easy-to-use, powerful built-in types and data structures:
        Lists and tuples (fixed-size, immutable lists)
        Dicts (hash maps / associative arrays) and sets
   Everything’s an object, including functions
   “There should be one, and preferably only one way to do it”
   “Batteries included”: great general purpose standard library




    @wesmckinn ()            Data analysis with pandas         10/17/2011   7 / 22
A simple example: quicksort


Pseudocode from Wikipedia:

function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))




     @wesmckinn ()           Data analysis with pandas   10/17/2011   8 / 22
A simple example: quicksort

First try Python implementation:
def qsort ( array ):
    if len ( array ) < 2:
        return array

    less , greater = [] , []

    pivot , rest = array [0] , array [1:]

    for x in rest :
        if x < pivot :
             less . append ( x )
        else :
             greater . append ( x )

    return qsort ( less ) + [ pivot ] + qsort ( greater )



      @wesmckinn ()         Data analysis with pandas       10/17/2011   9 / 22
A simple example: quicksort



Use list comprehensions:
def qsort ( array ):
    if len ( array ) < 2:
        return array

     pivot , rest = array [0] , array [1:]
     less = [ x for x in rest if x < pivot ]
     greater = [ x for x in rest if x >= pivot ]

     return qsort ( less ) + [ pivot ] + qsort ( greater )




      @wesmckinn ()         Data analysis with pandas        10/17/2011   10 / 22
A simple example: quicksort




Heck, fit it onto one line!
qs = lambda r : ( r if len ( r ) < 2
                  else ( qs ([ x for x in r [1:] if x < r [0]])
                         + [ r [0]]
                         + qs ([ x for x in r [1:] if x >= r [0]])))

Though that’s starting to look like Lisp code...




      @wesmckinn ()           Data analysis with pandas   10/17/2011   11 / 22
A simple example: quicksort


A quicksort using NumPy arrays
def qsort ( array ):
    if len ( array ) < 2:
        return array
    pivot , rest = array [0] , array [1:]
    less = rest [ rest < pivot ]
    greater = rest [ rest >= pivot ]
    return np . r_ [ qsort ( less ) , [ pivot ] , qsort ( greater )]

Of course no need for this when you can just do:

sorted_array = np.sort(array)




      @wesmckinn ()           Data analysis with pandas       10/17/2011   12 / 22
Python: drunk with power
This comic has way too much airtime but:




     @wesmckinn ()          Data analysis with pandas   10/17/2011   13 / 22
Staples of Python for science: MINS




   (M) matplotlib: plotting and data visualization
   (I) IPython: rich interactive computing and development environment
   (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random
   number generation, etc.
   (S) SciPy: optimization, probability distributions, signal processing,
   ODEs, sparse matrices, ...




     @wesmckinn ()          Data analysis with pandas        10/17/2011   14 / 22
Why did Python become popular in science?



   NumPy traces its roots to 1995
   Extremely easy to integrate C/C++/Fortran code
   Access fast low level algorithms in a high level, interpreted language
   The language itself
        “It fits in your head”
        “It [Python] doesn’t get in my way” - Robert Kern
   Python is good at all the things other scientific programming
   languages are not good at (e.g. networking, string processing, OOP)
   Liberal BSD license: can use Python for commercial applications




    @wesmckinn ()            Data analysis with pandas       10/17/2011   15 / 22
Some exciting stuff in the last few years


    Cython
         “Augmented” Python language with type declarations, for generating
         compiled extensions
         C-like speedups with Python-like development time
    IPython: enhanced interactive Python interpreter
         The best research and software development env for Python
         An integrated parallel / distributed computing backend
         GUI console with inline plotting and a rich HTML notebook (more on
         this later)
    PyCUDA / PyOpenCL: GPU computing in Python
         Transformed Python overnight into one of the best languages for doing
         GPU computing




     @wesmckinn ()            Data analysis with pandas        10/17/2011   16 / 22
Where has Python historically been weak?



   Rich data structures for data analysis and statistics
         NumPy arrays, while powerful, feel distinctly “lower level” if you’re
         used to R’s data.frame
         pandas has filled this gap over the last 2 years
   Statistics libraries
         Nowhere near the depth of R’s CRAN repository
         statsmodels provides tested implementations a lot of standard
         regression and time series models
         Turns out that most financial data analysis requires only fairly
         elementary statistical models




     @wesmckinn ()             Data analysis with pandas           10/17/2011    17 / 22
pandas library



    Began building at AQR in 2008, open-sourced late 2009
    Why
         R / MATLAB, while good for research / data analysis, are not suitable
         implementation languages for large-scale production systems
                (I personally don’t care for them for data analysis)
         Existing data structures for time series in R / MATLAB were too
         limited / not flexible enough my needs
    Core idea: indexed data structures capable of storing heterogeneous
    data
    Etymology: panel data structures




     @wesmckinn ()                Data analysis with pandas            10/17/2011   18 / 22
pandas in a nutshell



    A clean axis indexing design to support fast data alignment, lookups,
    hierarchical indexing, and more
    High-performance data structures
         Series/TimeSeries: 1D labeled vector
         DataFrame: 2D spreadsheet-like structure
         Panel: 3D labeled array, collection of DataFrames
    SQL-like functionality: GroupBy, joining/merging, etc.
    Missing data handling
    Time series functionality




     @wesmckinn ()              Data analysis with pandas    10/17/2011   19 / 22
pandas design philosophy



   “Think outside the matrix”: stop thinking about shape and start
   thinking about indexes
   Indexing and data alignment are essential
   Fault-tolerance: save you from common blunders caused by coding
   errors (specifically misaligned data)
   Lift the best features of other data analysis environments (R,
   MATLAB, Stata, etc.) and make them better, faster
   Performance and usability equally important




     @wesmckinn ()          Data analysis with pandas       10/17/2011   20 / 22
The pandas killer feature: indexing



    Each axis has an index
    Automatic alignment between differently-indexed objects: makes it
    nearly impossible to accidentally combine misaligned data
    Hierarchical indexing provides an intuitive way of structuring and
    working with higher-dimensional data
    Natural way of expressing “group by” and join-type operations
    Better integrated and more flexible indexing than anything available
    in R or MATLAB




     @wesmckinn ()           Data analysis with pandas       10/17/2011   21 / 22
Tutorial time




                     To the IPython console!




     @wesmckinn ()          Data analysis with pandas   10/17/2011   22 / 22
Ad

Recommended

Knowledge Representation & Reasoning AI UNIT 3
Knowledge Representation & Reasoning AI UNIT 3
Dr. SURBHI SAROHA
 
Artificial Intelligence Notes Unit 3
Artificial Intelligence Notes Unit 3
DigiGurukul
 
07 approximate inference in bn
07 approximate inference in bn
IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing
 
AI Propositional logic
AI Propositional logic
Dr. SURBHI SAROHA
 
Turing Machine
Turing Machine
Rajendran
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
Roberto Pereira Silveira
 
First order logic
First order logic
Megha Sharma
 
Knowledge based agents
Knowledge based agents
Dr. C.V. Suresh Babu
 
Time space trade off
Time space trade off
anisha talwar
 
Predicate logic_2(Artificial Intelligence)
Predicate logic_2(Artificial Intelligence)
SHUBHAM KUMAR GUPTA
 
Measuring algorithm performance
Measuring algorithm performance
HabitamuAsimare
 
Data Structures and Algorithm Analysis
Data Structures and Algorithm Analysis
Mary Margarat
 
Complexity analysis in Algorithms
Complexity analysis in Algorithms
Daffodil International University
 
Virtual memory ppt
Virtual memory ppt
Punjab College Of Technical Education
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer Architecture
InteX Research Lab
 
Finite Automata: Deterministic And Non-deterministic Finite Automaton (DFA)
Finite Automata: Deterministic And Non-deterministic Finite Automaton (DFA)
Mohammad Ilyas Malik
 
Prefix, Infix and Post-fix Notations
Prefix, Infix and Post-fix Notations
Afaq Mansoor Khan
 
Asymptotic notation
Asymptotic notation
Dr Shashikant Athawale
 
Syntax analysis
Syntax analysis
Akshaya Arunan
 
Asymptotic notations
Asymptotic notations
Ehtisham Ali
 
Recurrences
Recurrences
Dr Sandeep Kumar Poonia
 
Design and analysis of algorithms
Design and analysis of algorithms
Dr Geetha Mohan
 
Closure properties
Closure properties
Dr. ABHISHEK K PANDEY
 
Asymptotic notations
Asymptotic notations
Nikhil Sharma
 
Predicate Logic
Predicate Logic
giki67
 
UNIT III NON LINEAR DATA STRUCTURES – TREES
UNIT III NON LINEAR DATA STRUCTURES – TREES
Kathirvel Ayyaswamy
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
Sheetal Jain
 
Sorting
Sorting
Ashim Lamichhane
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 

More Related Content

What's hot (20)

Time space trade off
Time space trade off
anisha talwar
 
Predicate logic_2(Artificial Intelligence)
Predicate logic_2(Artificial Intelligence)
SHUBHAM KUMAR GUPTA
 
Measuring algorithm performance
Measuring algorithm performance
HabitamuAsimare
 
Data Structures and Algorithm Analysis
Data Structures and Algorithm Analysis
Mary Margarat
 
Complexity analysis in Algorithms
Complexity analysis in Algorithms
Daffodil International University
 
Virtual memory ppt
Virtual memory ppt
Punjab College Of Technical Education
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer Architecture
InteX Research Lab
 
Finite Automata: Deterministic And Non-deterministic Finite Automaton (DFA)
Finite Automata: Deterministic And Non-deterministic Finite Automaton (DFA)
Mohammad Ilyas Malik
 
Prefix, Infix and Post-fix Notations
Prefix, Infix and Post-fix Notations
Afaq Mansoor Khan
 
Asymptotic notation
Asymptotic notation
Dr Shashikant Athawale
 
Syntax analysis
Syntax analysis
Akshaya Arunan
 
Asymptotic notations
Asymptotic notations
Ehtisham Ali
 
Recurrences
Recurrences
Dr Sandeep Kumar Poonia
 
Design and analysis of algorithms
Design and analysis of algorithms
Dr Geetha Mohan
 
Closure properties
Closure properties
Dr. ABHISHEK K PANDEY
 
Asymptotic notations
Asymptotic notations
Nikhil Sharma
 
Predicate Logic
Predicate Logic
giki67
 
UNIT III NON LINEAR DATA STRUCTURES – TREES
UNIT III NON LINEAR DATA STRUCTURES – TREES
Kathirvel Ayyaswamy
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
Sheetal Jain
 
Sorting
Sorting
Ashim Lamichhane
 
Time space trade off
Time space trade off
anisha talwar
 
Predicate logic_2(Artificial Intelligence)
Predicate logic_2(Artificial Intelligence)
SHUBHAM KUMAR GUPTA
 
Measuring algorithm performance
Measuring algorithm performance
HabitamuAsimare
 
Data Structures and Algorithm Analysis
Data Structures and Algorithm Analysis
Mary Margarat
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer Architecture
InteX Research Lab
 
Finite Automata: Deterministic And Non-deterministic Finite Automaton (DFA)
Finite Automata: Deterministic And Non-deterministic Finite Automaton (DFA)
Mohammad Ilyas Malik
 
Prefix, Infix and Post-fix Notations
Prefix, Infix and Post-fix Notations
Afaq Mansoor Khan
 
Asymptotic notations
Asymptotic notations
Ehtisham Ali
 
Design and analysis of algorithms
Design and analysis of algorithms
Dr Geetha Mohan
 
Asymptotic notations
Asymptotic notations
Nikhil Sharma
 
Predicate Logic
Predicate Logic
giki67
 
UNIT III NON LINEAR DATA STRUCTURES – TREES
UNIT III NON LINEAR DATA STRUCTURES – TREES
Kathirvel Ayyaswamy
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
Sheetal Jain
 

Similar to Python for Financial Data Analysis with pandas (20)

pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
ytuiuyuyuyryuuytryuryruyrjgjhgfnyfpug.pdf
ytuiuyuyuyryuuytryuryruyrjgjhgfnyfpug.pdf
chandruyck42
 
A look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
DataLab Community
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Migrating from matlab to python
Migrating from matlab to python
ActiveState
 
DATA ANALYSIS AND VISUALISATION using python
DATA ANALYSIS AND VISUALISATION using python
ChiragNahata2
 
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
Ogunsina1
 
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python (3).pptx
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python (3).pptx
smartashammari
 
Dc python meetup
Dc python meetup
Jeffrey Clark
 
To understand the importance of Python libraries in data analysis.
To understand the importance of Python libraries in data analysis.
GurpinderSingh98
 
Python for Data Analysis_ Data Wrangling with Pandas, Numpy, and Ipython ( PD...
Python for Data Analysis_ Data Wrangling with Pandas, Numpy, and Ipython ( PD...
R.K.College of engg & Tech
 
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Blue Sea
 
python-pandas-For-Data-Analysis-Manipulate.pptx
python-pandas-For-Data-Analysis-Manipulate.pptx
PLOKESH8
 
Unit 3_Numpy_VP.pptx
Unit 3_Numpy_VP.pptx
vishnupriyapm4
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
hkabir55
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
PremaGanesh1
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
ytuiuyuyuyryuuytryuryruyrjgjhgfnyfpug.pdf
ytuiuyuyuyryuuytryuryruyrjgjhgfnyfpug.pdf
chandruyck42
 
A look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
DataLab Community
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Migrating from matlab to python
Migrating from matlab to python
ActiveState
 
DATA ANALYSIS AND VISUALISATION using python
DATA ANALYSIS AND VISUALISATION using python
ChiragNahata2
 
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
Ogunsina1
 
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python (3).pptx
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python (3).pptx
smartashammari
 
To understand the importance of Python libraries in data analysis.
To understand the importance of Python libraries in data analysis.
GurpinderSingh98
 
Python for Data Analysis_ Data Wrangling with Pandas, Numpy, and Ipython ( PD...
Python for Data Analysis_ Data Wrangling with Pandas, Numpy, and Ipython ( PD...
R.K.College of engg & Tech
 
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Blue Sea
 
python-pandas-For-Data-Analysis-Manipulate.pptx
python-pandas-For-Data-Analysis-Manipulate.pptx
PLOKESH8
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
hkabir55
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
PremaGanesh1
 
Ad

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Ad

Recently uploaded (20)

ReSTIR [DI]: Spatiotemporal reservoir resampling for real-time ray tracing ...
ReSTIR [DI]: Spatiotemporal reservoir resampling for real-time ray tracing ...
revolcs10
 
MuleSoft for AgentForce : Topic Center and API Catalog
MuleSoft for AgentForce : Topic Center and API Catalog
shyamraj55
 
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
 
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
 
Python Conference Singapore - 19 Jun 2025
Python Conference Singapore - 19 Jun 2025
ninefyi
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
Cracking the Code - Unveiling Synergies Between Open Source Security and AI.pdf
Cracking the Code - Unveiling Synergies Between Open Source Security and AI.pdf
Priyanka Aash
 
From Manual to Auto Searching- FME in the Driver's Seat
From Manual to Auto Searching- FME in the Driver's Seat
Safe Software
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
GenAI Opportunities and Challenges - Where 370 Enterprises Are Focusing Now.pdf
GenAI Opportunities and Challenges - Where 370 Enterprises Are Focusing Now.pdf
Priyanka Aash
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 
ReSTIR [DI]: Spatiotemporal reservoir resampling for real-time ray tracing ...
ReSTIR [DI]: Spatiotemporal reservoir resampling for real-time ray tracing ...
revolcs10
 
MuleSoft for AgentForce : Topic Center and API Catalog
MuleSoft for AgentForce : Topic Center and API Catalog
shyamraj55
 
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
 
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
 
Python Conference Singapore - 19 Jun 2025
Python Conference Singapore - 19 Jun 2025
ninefyi
 
Mastering AI Workflows with FME by Mark Döring
Mastering AI Workflows with FME by Mark Döring
Safe Software
 
"Scaling in space and time with Temporal", Andriy Lupa.pdf
"Scaling in space and time with Temporal", Andriy Lupa.pdf
Fwdays
 
UserCon Belgium: Honey, VMware increased my bill
UserCon Belgium: Honey, VMware increased my bill
stijn40
 
PyCon SG 25 - Firecracker Made Easy with Python.pdf
PyCon SG 25 - Firecracker Made Easy with Python.pdf
Muhammad Yuga Nugraha
 
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
EIS-Webinar-Engineering-Retail-Infrastructure-06-16-2025.pdf
Earley Information Science
 
Curietech AI in action - Accelerate MuleSoft development
Curietech AI in action - Accelerate MuleSoft development
shyamraj55
 
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik - Passionate Tech Enthusiast
Raman Bhaumik
 
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
 
"Database isolation: how we deal with hundreds of direct connections to the d...
"Database isolation: how we deal with hundreds of direct connections to the d...
Fwdays
 
Cracking the Code - Unveiling Synergies Between Open Source Security and AI.pdf
Cracking the Code - Unveiling Synergies Between Open Source Security and AI.pdf
Priyanka Aash
 
From Manual to Auto Searching- FME in the Driver's Seat
From Manual to Auto Searching- FME in the Driver's Seat
Safe Software
 
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Enhance GitHub Copilot using MCP - Enterprise version.pdf
Nilesh Gule
 
Connecting Data and Intelligence: The Role of FME in Machine Learning
Connecting Data and Intelligence: The Role of FME in Machine Learning
Safe Software
 
GenAI Opportunities and Challenges - Where 370 Enterprises Are Focusing Now.pdf
GenAI Opportunities and Challenges - Where 370 Enterprises Are Focusing Now.pdf
Priyanka Aash
 
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
 

Python for Financial Data Analysis with pandas

  • 1. Financial data analysis in Python with pandas Wes McKinney @wesmckinn 10/17/2011 @wesmckinn () Data analysis with pandas 10/17/2011 1 / 22
  • 2. My background 3 years as a quant hacker at AQR, now consultant / entrepreneur Math and statistics background with the zest of computer science Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn @wesmckinn () Data analysis with pandas 10/17/2011 2 / 22
  • 3. Bare essentials for financial research Fast time series functionality Easy data alignment Date/time handling Moving window statistics Resamping / frequency conversion Fast data access (SQL databases, flat files, etc.) Data visualization (plotting) Statistical models Linear regression Time series models: ARMA, VAR, ... @wesmckinn () Data analysis with pandas 10/17/2011 3 / 22
  • 4. Would be nice to have Portfolio and risk analytics, backtesting Easy enough to write yourself, though most people do a bad job of it Portfolio optimization Most financial firms use a 3rd party library anyway Derivative pricing Can use QuantLib in most languages @wesmckinn () Data analysis with pandas 10/17/2011 4 / 22
  • 5. What are financial firms using? HFT: a C++ and hardware arms race, a different topic Research Mainstream: R, MATLAB, Python, ... Econometrics: Stata, eViews, RATS, etc. Non-programmatic environments: ClariFI, Palantir, ... Production Popular: Java, C#, C++ Less popular, but growing: Python Fringe: Functional languages (Ocaml, Haskell, F#) @wesmckinn () Data analysis with pandas 10/17/2011 5 / 22
  • 6. What are financial firms using? Many hybrid languages environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++) Which is the main implementation language? If main language is Java/C++, result is lower productivity and higher cost to prototyping new functionality Trends Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less) MATLAB is being increasingly ditched in favor of Python. R and Python use for research generally growing @wesmckinn () Data analysis with pandas 10/17/2011 6 / 22
  • 7. Python language Simple, expressive syntax Designed for readability, like “runnable pseudocode” Easy-to-use, powerful built-in types and data structures: Lists and tuples (fixed-size, immutable lists) Dicts (hash maps / associative arrays) and sets Everything’s an object, including functions “There should be one, and preferably only one way to do it” “Batteries included”: great general purpose standard library @wesmckinn () Data analysis with pandas 10/17/2011 7 / 22
  • 8. A simple example: quicksort Pseudocode from Wikipedia: function qsort(array) if length(array) < 2 return array var list less, greater select and remove a pivot value pivot from array for each x in array if x < pivot then append x to less else append x to greater return concat(qsort(less), pivot, qsort(greater)) @wesmckinn () Data analysis with pandas 10/17/2011 8 / 22
  • 9. A simple example: quicksort First try Python implementation: def qsort ( array ): if len ( array ) < 2: return array less , greater = [] , [] pivot , rest = array [0] , array [1:] for x in rest : if x < pivot : less . append ( x ) else : greater . append ( x ) return qsort ( less ) + [ pivot ] + qsort ( greater ) @wesmckinn () Data analysis with pandas 10/17/2011 9 / 22
  • 10. A simple example: quicksort Use list comprehensions: def qsort ( array ): if len ( array ) < 2: return array pivot , rest = array [0] , array [1:] less = [ x for x in rest if x < pivot ] greater = [ x for x in rest if x >= pivot ] return qsort ( less ) + [ pivot ] + qsort ( greater ) @wesmckinn () Data analysis with pandas 10/17/2011 10 / 22
  • 11. A simple example: quicksort Heck, fit it onto one line! qs = lambda r : ( r if len ( r ) < 2 else ( qs ([ x for x in r [1:] if x < r [0]]) + [ r [0]] + qs ([ x for x in r [1:] if x >= r [0]]))) Though that’s starting to look like Lisp code... @wesmckinn () Data analysis with pandas 10/17/2011 11 / 22
  • 12. A simple example: quicksort A quicksort using NumPy arrays def qsort ( array ): if len ( array ) < 2: return array pivot , rest = array [0] , array [1:] less = rest [ rest < pivot ] greater = rest [ rest >= pivot ] return np . r_ [ qsort ( less ) , [ pivot ] , qsort ( greater )] Of course no need for this when you can just do: sorted_array = np.sort(array) @wesmckinn () Data analysis with pandas 10/17/2011 12 / 22
  • 13. Python: drunk with power This comic has way too much airtime but: @wesmckinn () Data analysis with pandas 10/17/2011 13 / 22
  • 14. Staples of Python for science: MINS (M) matplotlib: plotting and data visualization (I) IPython: rich interactive computing and development environment (N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc. (S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ... @wesmckinn () Data analysis with pandas 10/17/2011 14 / 22
  • 15. Why did Python become popular in science? NumPy traces its roots to 1995 Extremely easy to integrate C/C++/Fortran code Access fast low level algorithms in a high level, interpreted language The language itself “It fits in your head” “It [Python] doesn’t get in my way” - Robert Kern Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP) Liberal BSD license: can use Python for commercial applications @wesmckinn () Data analysis with pandas 10/17/2011 15 / 22
  • 16. Some exciting stuff in the last few years Cython “Augmented” Python language with type declarations, for generating compiled extensions C-like speedups with Python-like development time IPython: enhanced interactive Python interpreter The best research and software development env for Python An integrated parallel / distributed computing backend GUI console with inline plotting and a rich HTML notebook (more on this later) PyCUDA / PyOpenCL: GPU computing in Python Transformed Python overnight into one of the best languages for doing GPU computing @wesmckinn () Data analysis with pandas 10/17/2011 16 / 22
  • 17. Where has Python historically been weak? Rich data structures for data analysis and statistics NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame pandas has filled this gap over the last 2 years Statistics libraries Nowhere near the depth of R’s CRAN repository statsmodels provides tested implementations a lot of standard regression and time series models Turns out that most financial data analysis requires only fairly elementary statistical models @wesmckinn () Data analysis with pandas 10/17/2011 17 / 22
  • 18. pandas library Began building at AQR in 2008, open-sourced late 2009 Why R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems (I personally don’t care for them for data analysis) Existing data structures for time series in R / MATLAB were too limited / not flexible enough my needs Core idea: indexed data structures capable of storing heterogeneous data Etymology: panel data structures @wesmckinn () Data analysis with pandas 10/17/2011 18 / 22
  • 19. pandas in a nutshell A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more High-performance data structures Series/TimeSeries: 1D labeled vector DataFrame: 2D spreadsheet-like structure Panel: 3D labeled array, collection of DataFrames SQL-like functionality: GroupBy, joining/merging, etc. Missing data handling Time series functionality @wesmckinn () Data analysis with pandas 10/17/2011 19 / 22
  • 20. pandas design philosophy “Think outside the matrix”: stop thinking about shape and start thinking about indexes Indexing and data alignment are essential Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data) Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster Performance and usability equally important @wesmckinn () Data analysis with pandas 10/17/2011 20 / 22
  • 21. The pandas killer feature: indexing Each axis has an index Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data Natural way of expressing “group by” and join-type operations Better integrated and more flexible indexing than anything available in R or MATLAB @wesmckinn () Data analysis with pandas 10/17/2011 21 / 22
  • 22. Tutorial time To the IPython console! @wesmckinn () Data analysis with pandas 10/17/2011 22 / 22