Skip to content

Grouped timeseries operations fail on multi-indexed DataFrames #10063

Closed
@neiseman

Description

@neiseman

This might be a potential bug: doing grouped timeseries operations fails silently on a multi-indexed DataFrame.

    import pandas as pd
    import pandas.io.data as web

    # Get some market data
    df = web.DataReader(['AAPL', 'GOOG'], 'yahoo', pd.Timestamp('2013'), pd.Timestamp('2014')).to_frame()
    df.index.names = ('dt', 'symbol')

    In [21]: df.head()
    Out[21]: 
                            Open       High        Low      Close     Volume  \
    dt         symbol                                                          
    2013-01-02 AAPL    553.82001  555.00000  541.62994  549.03003  140129500   
    2013-01-03 AAPL    547.88000  549.67004  541.00000  542.10004   88241300   
    2013-01-04 AAPL    536.96997  538.63000  525.82996  527.00000  148583400   
    2013-01-07 AAPL    522.00000  529.30005  515.20001  523.90002  121039100   
    2013-01-08 AAPL    529.21002  531.89001  521.25000  525.31000  114676800   

                       Adj Close  
    dt         symbol             
    2013-01-02 AAPL     74.63931  
    2013-01-03 AAPL     73.69719  
    2013-01-04 AAPL     71.64438  
    2013-01-07 AAPL     71.22294  
    2013-01-08 AAPL     71.41463 

Let's say we want to resample this to monthly data. This fails and returns an empty DataFrame:

    df_M = df.groupby(level='symbol').resample('M', how='mean')
    In [23]: df_M
    Out[23]: 
    Empty DataFrame
    Columns: []
    Index: []

This, however, works, but requires a seemingly-unnecessary re-indexing:

    df_M = df.reset_index().set_index('dt').groupby('symbol').resample('M', how='mean')
    In [26]: df_M.head()
    Out[26]: 
                       Adj Close       Close        High         Low        Open  \
    symbol dt                                                                      
    AAPL   2013-01-31  67.677750  497.822382  504.407623  492.969997  500.083329   
           2013-02-28  62.388477  456.808942  463.231056  452.106325  458.503692   
           2013-03-31  60.417287  441.841000  446.803495  437.337996  442.011512   
           2013-04-30  57.398619  419.765001  425.553183  414.722271  419.766820   
           2013-05-31  61.340151  446.452734  451.658190  441.495455  446.400919   

                             Volume  
    symbol dt                        
    AAPL   2013-01-31  1.562312e+08  
           2013-02-28  1.229478e+08  
           2013-03-31  1.147110e+08  
           2013-04-30  1.245851e+08  
           2013-05-31  1.073583e+08  

The fact that you need to do the reset_index().set_index('dt') and then groupby('symbol') instead of groupby(level='symbol') seems to defeat the purpose of a multi-index! What gives?

I also realize that data like this is perhaps more well-suited to a Panel than a DataFrame, but when dealing with very large amounts of (often sparse) data, the 3D Panel structure presents performance and memory issues compared to the flat DataFrame.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions