Skip to content

BUG: Mergesort is unstable when ascending=False #6399

Closed
@jdreaver

Description

@jdreaver

The Issue

When using DataFrame.sort_by(kind='mergesort'), sorting is supposed to be stable. Unfortunately, that is not the case when ascending=False.

# Create a DataFrame where the first column is already in descending order.
In [96]: df = pd.DataFrame([[2, 'first'], [2, 'second'], [1, 'a'], [1, 'b']], columns=['sort_col', 'order'])

In [97]: df
Out[97]: 
   sort_col   order
0         2   first
1         2  second
2         1       a
3         1       b

[4 rows x 2 columns]

# Look at the 'order' column. Clearly not stable.
In [98]: df.sort_index(by='sort_col', kind='mergesort', ascending=False)
Out[98]: 
   sort_col   order
1         2  second
0         2   first
3         1       b
2         1       a

[4 rows x 2 columns]

More Info

Inside the sort_by() source code, argsort() is called on the sorted column. Then, ascending is checked and if it is False, the indexes are simply reversed. Here is the relevant code snippet in pandas/core/frame.py:

k = self[by].values
...
indexer = k.argsort(kind=kind)
...
if not ascending:
    indexer = indexer[::-1]

Since numpy always sorts in ascending order, this actually guarantees sorting is always unstable! Check this out:

In [110]: df['sort_col'].values.argsort(kind='mergesort')[::-1]
Out[110]: array([1, 0, 3, 2])

Clearly, simply reversing the indices doesn't work. We need to sort a reversed k, reverse the indices, and then subtract the indices from the highest index so they correspond to the original k:

In [112]: 3 - df['sort_col'].values[::-1].argsort(kind='mergesort')[::-1]
Out[112]: array([0, 1, 2, 3])

The workaround in my code is to stable sort descending is reverse the DataFrame, sort ascending, and reverse again.

What is the best way to fix this? This is probably an easy fix, but I've never contributed to pandas, so I need to set up my fork and make sure I can run tests before working on a pull request.

My Versions

Here are my versions:

In [75]: show_versions()

INSTALLED VERSIONS
------------------
Python: 3.3.3.final.0
OS: Linux
Release: 3.12.9-2-ARCH
Processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.0
Cython: 0.20
Numpy: 1.8.0
Scipy: 0.13.0
statsmodels: Not installed
    patsy: Not installed
scikits.timeseries: Not installed
dateutil: 2.2
pytz: 2013.9
bottleneck: Not installed
PyTables: Not Installed
    numexpr: Not Installed
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: Not installed
xlsxwriter: Not installed
sqlalchemy: Not installed
lxml: Not installed
bs4: Not installed
html5lib: Not installed
bigquery: Not installed
apiclient: Not installed

Metadata

Metadata

Assignees

No one assigned

    Labels

    AlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffBugNumeric OperationsArithmetic, Comparison, and Logical operations

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions