Description
The Issue
When using DataFrame.sort_by(kind='mergesort')
, sorting is supposed to be stable. Unfortunately, that is not the case when ascending=False
.
# Create a DataFrame where the first column is already in descending order.
In [96]: df = pd.DataFrame([[2, 'first'], [2, 'second'], [1, 'a'], [1, 'b']], columns=['sort_col', 'order'])
In [97]: df
Out[97]:
sort_col order
0 2 first
1 2 second
2 1 a
3 1 b
[4 rows x 2 columns]
# Look at the 'order' column. Clearly not stable.
In [98]: df.sort_index(by='sort_col', kind='mergesort', ascending=False)
Out[98]:
sort_col order
1 2 second
0 2 first
3 1 b
2 1 a
[4 rows x 2 columns]
More Info
Inside the sort_by()
source code, argsort()
is called on the sorted column. Then, ascending
is checked and if it is False, the indexes are simply reversed. Here is the relevant code snippet in pandas/core/frame.py
:
k = self[by].values
...
indexer = k.argsort(kind=kind)
...
if not ascending:
indexer = indexer[::-1]
Since numpy always sorts in ascending order, this actually guarantees sorting is always unstable! Check this out:
In [110]: df['sort_col'].values.argsort(kind='mergesort')[::-1]
Out[110]: array([1, 0, 3, 2])
Clearly, simply reversing the indices doesn't work. We need to sort a reversed k
, reverse the indices, and then subtract the indices from the highest index so they correspond to the original k
:
In [112]: 3 - df['sort_col'].values[::-1].argsort(kind='mergesort')[::-1]
Out[112]: array([0, 1, 2, 3])
The workaround in my code is to stable sort descending is reverse the DataFrame, sort ascending, and reverse again.
What is the best way to fix this? This is probably an easy fix, but I've never contributed to pandas, so I need to set up my fork and make sure I can run tests before working on a pull request.
My Versions
Here are my versions:
In [75]: show_versions()
INSTALLED VERSIONS
------------------
Python: 3.3.3.final.0
OS: Linux
Release: 3.12.9-2-ARCH
Processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.13.0
Cython: 0.20
Numpy: 1.8.0
Scipy: 0.13.0
statsmodels: Not installed
patsy: Not installed
scikits.timeseries: Not installed
dateutil: 2.2
pytz: 2013.9
bottleneck: Not installed
PyTables: Not Installed
numexpr: Not Installed
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: Not installed
xlsxwriter: Not installed
sqlalchemy: Not installed
lxml: Not installed
bs4: Not installed
html5lib: Not installed
bigquery: Not installed
apiclient: Not installed