
BUG: groupby with categorical drops empty groups when aggregating over a series #8870

Closed
@aimboden

Description

  • Series groupby excluding NaN groups with Categorical (DataFrame DOES include)
  • sorting via a returned Interval-like-Index (string based)

Hello,

When grouping a DataFrame over more than one column including a categorical, the empty groups are kept in the aggregation result. A test for this behaviour was introduced in #8138.

However, when aggregating over only one column of the DataFrame (a Series groupby), the empty groups are dropped. This seems inconsistent to me; I suspect it is an edge case that was overlooked at the time.

import numpy as np
import pandas as pd

d = {'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
     'baz': ['d', 'c', 'd', 'c']}
df = pd.DataFrame(d)
# Bin 'foo' into four intervals: (0, 5], (5, 10], (10, 15], (15, 20]
cat = pd.cut(df['foo'], np.linspace(0, 20, 5))
df['range'] = cat
groups = df.groupby(['range', 'baz'], as_index=True, sort=True)

# Expected result, fixed as part of #8138
fixed = groups.agg('mean')

# Inconsistent behaviour with series
inconsistent = groups['foo'].agg('mean')

# Expected result
expected = fixed['foo']
fixed
              foo  bar
range    baz
(0, 5]   c      1   40
         d      4   30
(10, 15] c    NaN  NaN
         d    NaN  NaN
(15, 20] c    NaN  NaN
         d    NaN  NaN
(5, 10]  c      8   20
         d     10   10
inconsistent
range    baz
(0, 5]   c      1
         d      4
(5, 10]  c      8
         d     10
expected
range    baz
(0, 5]   c      1
         d      4
(10, 15] c    NaN
         d    NaN
(15, 20] c    NaN
         d    NaN
(5, 10]  c      8
         d     10
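
To make the expectation concrete, here is a minimal sketch of a consistency check between the DataFrame and Series aggregations. It assumes a pandas version whose groupby accepts the `observed` keyword, which is not part of the original report; the names `frame_result`, `series_result` and `observed_only` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
                   'baz': ['d', 'c', 'd', 'c']})
df['range'] = pd.cut(df['foo'], np.linspace(0, 20, 5))

# Assumption: `observed` exists in this pandas version.
# observed=False asks for every category, including bins with no rows.
groups = df.groupby(['range', 'baz'], sort=True, observed=False)
frame_result = groups.agg('mean')
series_result = groups['foo'].agg('mean')

# observed=True keeps only the (range, baz) combinations present in the data,
# which has the same shape as the "inconsistent" output above.
observed_only = df.groupby(['range', 'baz'], observed=True)['foo'].agg('mean')

# The Series result should have as many rows as the DataFrame result.
print(len(frame_result), len(series_result), len(observed_only))
```

If the two aggregations are consistent, the first two lengths match and the empty bins show up as NaN in `series_result`.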

Note the strange ordering of the categorical index: (10, 15] and (15, 20] come before (5, 10]. I would expect sort=True to sort by categorical level, not in lexical order.
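
The difference between the two orderings can be sketched as follows. This is only an illustration with the interval labels written out as plain strings, not the code path groupby actually takes; `levels`, `lexical` and `by_level` are made-up names.

```python
import pandas as pd

# Interval labels as printed in the output above, listed in bin order
levels = ['(0, 5]', '(5, 10]', '(10, 15]', '(15, 20]']

# Plain strings sort character by character, so '(1...' sorts before '(5...'
lexical = sorted(levels)

# An ordered Categorical sorts by the position of each level instead
as_cat = pd.Series(pd.Categorical(levels[::-1], categories=levels, ordered=True))
by_level = as_cat.sort_values().tolist()

print(lexical)   # ['(0, 5]', '(10, 15]', '(15, 20]', '(5, 10]']
print(by_level)  # ['(0, 5]', '(5, 10]', '(10, 15]', '(15, 20]']
```

The first list reproduces the order seen in the grouped index; the second is the order I would expect from sort=True.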

Also note that using as_index=False fails due to #8869.
