DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.0 #11668

Closed
@bijanhoule

DataFrame.duplicated() and DataFrame.drop_duplicates() are flagging rows as duplicates when they are in fact distinct.

This is the smallest dataset I could construct that reproduces the problem, but I've seen it on DataFrames of any size:

>>> import pandas as pd

>>> json_data = '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+*_RWII","3":"2RA`4kRWII","4":"2R=K*_RWII"},' \
...             '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
>>> df = pd.read_json(json_data)
>>> print(df)

         Col1          Col2
0  S2#OaGwWII  141105144406
1  )A9$rw3W_I  141107294517
2  2Ra+*_RWII  141106133624
3  2RA`4kRWII  141108219194
4  2R=K*_RWII  141106133614

>>> df.duplicated(keep=False)

0    False
1    False
2     True
3    False
4     True
dtype: bool

The result also seems to depend on row and column order: shuffling or sampling the rows, or reordering the columns, changes which rows get flagged. For example, swapping the two columns makes the false positives disappear:

>>> df[['Col2', 'Col1']].duplicated(keep=False)
0    False
1    False
2    False
3    False
4    False
dtype: bool
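
One way to confirm that the five rows really are distinct, without relying on `duplicated()` itself, is to count whole-row tuples in plain Python and compare. The sketch below is only an illustration: `reference_duplicated` is a hypothetical helper (not part of pandas), and the frame is built directly from the values shown above rather than through `read_json`.

```python
import pandas as pd

# Same five rows as in the report, constructed directly.
df = pd.DataFrame({
    "Col1": ["S2#OaGwWII", ")A9$rw3W_I", "2Ra+*_RWII", "2RA`4kRWII", "2R=K*_RWII"],
    "Col2": [141105144406, 141107294517, 141106133624, 141108219194, 141106133614],
})


def reference_duplicated(frame):
    """Hypothetical helper: flag rows whose full tuple of values occurs
    more than once, mirroring the semantics of duplicated(keep=False)."""
    rows = list(frame.itertuples(index=False, name=None))
    counts = {}
    for row in rows:
        counts[row] = counts.get(row, 0) + 1
    return pd.Series([counts[row] > 1 for row in rows], index=frame.index)


expected = reference_duplicated(df)   # pure-Python reference result
actual = df.duplicated(keep=False)    # result from the installed pandas

# Rows where pandas disagrees with the reference; empty when they agree.
mismatches = df[expected != actual]
print(mismatches)
```

Since no two rows share the same `(Col1, Col2)` pair, the reference result is `False` for every row, so any row that shows up in `mismatches` is one that the installed pandas flagged incorrectly.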

I only see this behavior on 0.17.0, while 0.16.2 is fine. More details about each environment are below:

pandas 0.17.0 / python 3.4.3 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-400.1.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.17.0
nose: 1.3.4
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.6.None
psycopg2: 2.6 (dt dec pq3 ext)

pandas 0.17.0 / python 3.5 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.7.None
psycopg2: None

pandas 0.16.2 / python 3.5 (passing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
