Concatenating two series of categoricals results in data corruption without warning #19096

Closed
@ediphy-azorab

Code Sample, a copy-pastable example if possible

I'm sadly unable to share the underlying data, and have not yet been able to produce a minimised reproduction.

In [202]: s1 = df1.symbol

In [203]: s2 = df2.symbol

In [204]: s1.dtype
Out[204]: CategoricalDtype(categories=['RE00012ME6MA', 'RE00002YE6MA', 'RE00018ME6MA', 'RE00012YE6MA', 'RE00013YE6MA', 'RE00010YE6MA', 'RE00014YE6MA', 'RE00015YE6MA', 'RE00016YE6MA', 'RE00017YE6MA', 'RE00018YE6MA'
, 'RE00019YE6MA', 'RE00020YE6MA', 'RE00025YE6MA', 'RE00011YE6MA', 'RE00003YE6MA', 'RE00005YE6MA', 'RE00009YE6MA', 'RE00004YE6MA', 'RE00008YE6MA', 'RE00006YE6MA', 'RE00007YE6MA', 'RE00030YE6MA'], ordered=False)

In [205]: s1.shape
Out[205]: (2084,)

In [206]: s2.dtype
Out[206]: CategoricalDtype(categories=['RE00030YE6MA', 'RE00008YE6MA', 'RE00016YE6MA', 'RE00015YE6MA', 'RE00018YE6MA', 'RE00017YE6MA', 'RE00020YE6MA', 'RE00006YE6MA', 'RE00005YE6MA', 'RE00004YE6MA', 'RE00014YE6MA'
, 'RE00025YE6MA', 'RE00003YE6MA', 'RE00013YE6MA', 'RE00002YE6MA', 'RE00009YE6MA', 'RE00018ME6MA', 'RE00011YE6MA', 'RE00019YE6MA', 'RE00010YE6MA', 'RE00007YE6MA', 'RE00012YE6MA', 'RE00012ME6MA'], ordered=False)

In [207]: s2.shape
Out[207]: (1030,)

In [208]: pd.concat([s1, s2]).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')])
Out[208]:
0        True
1        True
2        True
3        True
4        True
        ...
1025    False
1026    False
1027    False
1028    False
1029    False
Name: symbol, Length: 3114, dtype: bool

In [209]: pd.concat([s1, s2], ignore_index=True).astype('object') == pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True)
Out[209]:
0        True
1        True
2        True
3        True
4        True
        ...
3109    False
3110    False
3111    False
3112    False
3113    False
Name: symbol, Length: 3114, dtype: bool

In [210]: pd.concat([s1.astype('object'), s2.astype('object')], ignore_index=True).iloc[-5:]
Out[210]:
3109    RE00012ME6MA
3110    RE00012ME6MA
3111    RE00005YE6MA
3112    RE00015YE6MA
3113    RE00015YE6MA
Name: symbol, dtype: object

In [211]: pd.concat([s1, s2], ignore_index=True).astype('object').iloc[-5:]
Out[211]:
3109    RE00030YE6MA
3110    RE00030YE6MA
3111    RE00016YE6MA
3112    RE00012YE6MA
3113    RE00012YE6MA
Name: symbol, dtype: object

Problem description

The row values from the second series change after concatenation, without any warning (compare Out[210] with Out[211]). This seems to be extremely surprising behaviour!
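
Since the real data cannot be shared, the following is only a sketch of what a minimised check might look like. The three symbols are taken from the dtypes above, but the row values and the reversed category order are made up for illustration, and it is an assumption that such a small input triggers the same behaviour as the original 2084-row and 1030-row series.

```python
import pandas as pd

# Hypothetical stand-ins for df1.symbol and df2.symbol: same category
# values, listed in a different order (assumed, not the real data).
cats_a = ["RE00012ME6MA", "RE00002YE6MA", "RE00030YE6MA"]
cats_b = list(reversed(cats_a))

s1 = pd.Series(
    pd.Categorical(["RE00012ME6MA", "RE00030YE6MA"], categories=cats_a),
    name="symbol",
)
s2 = pd.Series(
    pd.Categorical(["RE00012ME6MA", "RE00002YE6MA"], categories=cats_b),
    name="symbol",
)

# Concatenate as categoricals, then convert to object ...
via_categorical = pd.concat([s1, s2], ignore_index=True).astype("object")
# ... versus convert to object first, then concatenate.
via_object = pd.concat(
    [s1.astype("object"), s2.astype("object")], ignore_index=True
)

# The two routes should agree row by row.
values_preserved = via_categorical.tolist() == via_object.tolist()
print(values_preserved)
```

If the sketch does reproduce the report, `values_preserved` would be `False` on an affected version, with the mismatches confined to the rows that came from `s2`.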

Expected Output

Concatenating two series whose categoricals have the same category values in different orders should not change the row values.
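
One way to make that expectation concrete is to align the category order explicitly before concatenating. This is a sketch with made-up rows, not a confirmed workaround for the reported versions: `Series.cat.set_categories` re-expresses the second series in the first series' category order, which changes the underlying codes but should leave the values alone.

```python
import pandas as pd

# Hypothetical example data; only the symbol names come from the report.
cats = ["RE00012ME6MA", "RE00002YE6MA", "RE00030YE6MA"]
s1 = pd.Series(["RE00012ME6MA", "RE00030YE6MA"], name="symbol").astype(
    pd.CategoricalDtype(cats)
)
s2 = pd.Series(["RE00012ME6MA", "RE00002YE6MA"], name="symbol").astype(
    pd.CategoricalDtype(cats[::-1])
)

# Re-express s2 using s1's category order; values are unchanged.
s2_aligned = s2.cat.set_categories(s1.cat.categories)

# Both inputs now share an identical dtype, including category order.
combined = pd.concat([s1, s2_aligned], ignore_index=True)
print(combined.tolist())
```

After alignment both series carry the same categories in the same order, so the concatenation no longer depends on how differing category orders are reconciled.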

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 10.0.0.subpip_fix
setuptools: 36.5.0
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0

Labels: Bug; Categorical (Categorical Data Type); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
