Closed
Description
This might be a potential bug: doing grouped timeseries operations fails silently on a multi-indexed DataFrame.
import pandas as pd
import pandas.io.data as web
# Get some market data
df = web.DataReader(['AAPL', 'GOOG'], 'yahoo', pd.Timestamp('2013'), pd.Timestamp('2014')).to_frame()
df.index.names = ('dt', 'symbol')
In [21]: df.head()
Out[21]:
Open High Low Close Volume \
dt symbol
2013-01-02 AAPL 553.82001 555.00000 541.62994 549.03003 140129500
2013-01-03 AAPL 547.88000 549.67004 541.00000 542.10004 88241300
2013-01-04 AAPL 536.96997 538.63000 525.82996 527.00000 148583400
2013-01-07 AAPL 522.00000 529.30005 515.20001 523.90002 121039100
2013-01-08 AAPL 529.21002 531.89001 521.25000 525.31000 114676800
Adj Close
dt symbol
2013-01-02 AAPL 74.63931
2013-01-03 AAPL 73.69719
2013-01-04 AAPL 71.64438
2013-01-07 AAPL 71.22294
2013-01-08 AAPL 71.41463
Let's say we want to resample this to monthly data. This fails and returns an empty DataFrame:
df_M = df.groupby(level='symbol').resample('M', how='mean')
In [23]: df_M
Out[23]:
Empty DataFrame
Columns: []
Index: []
This, however, works, but requires a seemingly-unnecessary re-indexing:
df_M = df.reset_index().set_index('dt').groupby('symbol').resample('M', how='mean')
In [26]: df_M.head()
Out[26]:
Adj Close Close High Low Open \
symbol dt
AAPL 2013-01-31 67.677750 497.822382 504.407623 492.969997 500.083329
2013-02-28 62.388477 456.808942 463.231056 452.106325 458.503692
2013-03-31 60.417287 441.841000 446.803495 437.337996 442.011512
2013-04-30 57.398619 419.765001 425.553183 414.722271 419.766820
2013-05-31 61.340151 446.452734 451.658190 441.495455 446.400919
Volume
symbol dt
AAPL 2013-01-31 1.562312e+08
2013-02-28 1.229478e+08
2013-03-31 1.147110e+08
2013-04-30 1.245851e+08
2013-05-31 1.073583e+08
The fact that you need to do the reset_index().set_index('dt')
and then groupby('symbol')
instead of groupby(level='symbol')
seems to defeat the purpose of a multi-index! What gives?
I also realize that data like this is perhaps more well-suited to a Panel than a DataFrame, but when dealing with very large amounts of (often sparse) data, the 3D Panel structure presents performance and memory issues compared to the flat DataFrame.