groupby with categorical type returns all combinations  #17594

Closed
@bear24rw

Code Sample, a copy-pastable example if possible

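import pandas as pd
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
# Grouping on plain object/int columns: only observed combinations are returned
print(df.groupby(['a', 'b']).mean().reset_index())
df['a'] = df['a'].astype('category')
# Same groupby after converting 'a' to category: unobserved combinations appear
print(df.groupby(['a', 'b']).mean().reset_index())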

Returns two different results:

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b    c
0  x  0  7.0
1  x  1  8.0
2  y  0  9.0
3  y  1  NaN

Problem description

Performing a groupby on a column with categorical dtype returns every combination of the groupby columns, including combinations that never occur in the data (such as `('y', 1)` above). This is a problem in my actual application, because it produces a massive DataFrame that is mostly filled with NaNs. I would also prefer not to move away from the category dtype, since it provides necessary memory savings.
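
For readers on newer pandas: the `observed` keyword of `groupby` (added in pandas 0.23.0, so not available in the 0.20.3 reported here) restricts the result to category combinations that actually occur. A minimal sketch, assuming a pandas version that supports `observed`:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')

# observed=True (assumes pandas >= 0.23) keeps only the (a, b) pairs
# present in the data, so ('y', 1) is not materialized.
result = df.groupby(['a', 'b'], observed=True).mean().reset_index()
print(result)
```

The column `a` stays categorical, so the memory savings are kept.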

Expected Output

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9
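
On versions without such an option, one possible workaround (a sketch, not an official recommendation) is to group on a temporary object-dtype copy of the key while leaving the stored column categorical. The temporary copy costs memory only for the duration of the groupby. Selecting `'c'` explicitly is a choice made here so that the categorical column itself is not aggregated:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')

# Temporary object-dtype view of the key; df['a'] itself stays categorical.
key_a = df['a'].astype(object)

# Grouping on a non-categorical key yields only the observed combinations.
result = df.groupby([key_a, 'b'])['c'].mean().reset_index()
print(result)
```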

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
