Closed
related #739
Have a look at this example:
import pandas as pd
import numpy as np
from StringIO import StringIO
print "Pandas version %s\n\n" % pd.__version__
data1 = """idx,metric
0,2.1
1,2.5
2,3"""
data2 = """idx,metric
0,2.7
1,2.2
2,2.8"""
df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
concatenated = pd.concat([df1, df2], ignore_index=True)
merged = concatenated.groupby("idx").agg([np.mean, np.std])
print merged
print merged.sort('metric')
and its output:
$ python test.py
Pandas version 0.11.0
     metric
       mean       std
idx
0      2.40  0.424264
1      2.35  0.212132
2      2.90  0.141421
Traceback (most recent call last):
  File "test.py", line 22, in <module>
    print merged.sort('metric')
  File "/***/Python-2.7.3/lib/python2.7/site-packages/pandas/core/frame.py", line 3098, in sort
    inplace=inplace)
  File "/***/Python-2.7.3/lib/python2.7/site-packages/pandas/core/frame.py", line 3153, in sort_index
    % str(by))
ValueError: Cannot sort by duplicate column metric
The problem here is not that there is a duplicate column metric, as the error message states. The problem is that the columns still have two levels, so 'metric' matches both the mean and the std sub-columns. The solution in this case is to use
merged.sort([('metric', 'mean')])
to sort by the mean of the metric. It took me quite a while to figure this out. First, the error message should be clearer in this case. Second, I could not find the solution in the docs, only in a thread on StackOverflow. The error message above looks like the result of an over-generalized condition around https://github.com/pydata/pandas/blob/v0.12.0rc1/pandas/core/frame.py#L3269
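For anyone reading this on a newer setup, here is a rough sketch of the same workaround. It assumes Python 3 and a recent pandas, where DataFrame.sort no longer exists and sort_values takes its place. These are my assumptions and not the versions from the report above. The aggregation functions are given as strings, which I believe is equivalent to passing np.mean and np.std here.

```python
from io import StringIO

import pandas as pd

data1 = """idx,metric
0,2.1
1,2.5
2,3"""
data2 = """idx,metric
0,2.7
1,2.2
2,2.8"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
concatenated = pd.concat([df1, df2], ignore_index=True)

# Aggregating with a list of functions yields two-level column labels:
# ('metric', 'mean') and ('metric', 'std').
merged = concatenated.groupby("idx").agg(["mean", "std"])
print(merged.columns.tolist())

# 'metric' alone selects both sub-columns, so it is ambiguous as a sort key.
print(merged["metric"].shape)

# A full tuple names exactly one column, so the sort key is unambiguous.
# sort_values is an assumption about the installed pandas version.
by_mean = merged.sort_values(("metric", "mean"))
print(by_mean)
```

The key point is the same as in the original workaround: the sort key has to be the complete column label, one entry per column level.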