Description
- Series groupby excluding NaN groups with Categorical (DataFrame DOES include)
- sorting via a returned Interval-like-Index (string based)
Hello,
When grouping a DataFrame over more than one column including a categorical, the empty groups are kept in the aggregation result. A test for this behaviour was introduced in #8138.
However, when aggregating only one column of the grouped DataFrame (that is, a Series groupby), the empty groups are dropped. This seems inconsistent to me, and I guess it is an edge case that wasn't considered at the time.
import numpy as np
import pandas as pd

d = {'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
     'baz': ['d', 'c', 'd', 'c']}
df = pd.DataFrame(d)
# Bin 'foo' into four intervals: (0, 5], (5, 10], (10, 15], (15, 20]
cat = pd.cut(df['foo'], np.linspace(0, 20, 5))
df['range'] = cat
groups = df.groupby(['range', 'baz'], as_index=True, sort=True)
# Expected result, fixed as part of #8138
fixed = groups.agg('mean')
# Inconsistent behaviour with series
inconsistent = groups['foo'].agg('mean')
# Expected result
expected = fixed['foo']
fixed
range | baz | foo | bar |
---|---|---|---|
(0, 5] | c | 1 | 40 |
(0, 5] | d | 4 | 30 |
(10, 15] | c | NaN | NaN |
(10, 15] | d | NaN | NaN |
(15, 20] | c | NaN | NaN |
(15, 20] | d | NaN | NaN |
(5, 10] | c | 8 | 20 |
(5, 10] | d | 10 | 10 |
inconsistent
range | baz | foo |
---|---|---|
(0, 5] | c | 1 |
(0, 5] | d | 4 |
(5, 10] | c | 8 |
(5, 10] | d | 10 |
expected
range | baz | foo |
---|---|---|
(0, 5] | c | 1 |
(0, 5] | d | 4 |
(10, 15] | c | NaN |
(10, 15] | d | NaN |
(15, 20] | c | NaN |
(15, 20] | d | NaN |
(5, 10] | c | 8 |
(5, 10] | d | 10 |
Note the strange ordering of the categorical index. I would expect `sort=True` to sort by categorical level and not by lexical order.
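To make the ordering point concrete, here is a small sketch that does not use groupby. It compares a lexical sort of the interval labels as plain strings with a sort of the same labels held in a categorical. The variable names are mine, and the labels are typed in by hand to mirror the bins above, so treat it as an illustration, not as what groupby does internally.

```python
import pandas as pd

# Interval labels written as plain strings, in the intended (numeric) order.
labels = ["(0, 5]", "(5, 10]", "(10, 15]", "(15, 20]"]

# Sorting plain strings compares them character by character.
as_strings = pd.Series(labels)
lexical = as_strings.sort_values().tolist()

# Sorting a categorical follows the order of its categories instead.
as_categories = pd.Series(pd.Categorical(labels, categories=labels, ordered=True))
by_category = as_categories.sort_values().tolist()

print(lexical)      # string order, as seen in the index above
print(by_category)  # category order, which is what I would expect
```

The lexical result puts `(5, 10]` last, which matches the index order in the tables above, while the categorical sort keeps the bins in their numeric order.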
Also note that using `as_index=False` fails due to #8869.
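For readers on a newer pandas: as far as I know, `groupby` has accepted an `observed` keyword since pandas 0.23, which makes the choice between keeping and dropping empty categorical groups explicit. It did not exist when this issue was written. The sketch below reuses the data from the report and passes `observed` explicitly so it does not depend on the version's default. The names `all_groups` and `observed_only` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "foo": [10, 8, 4, 1],
    "bar": [10, 20, 30, 40],
    "baz": ["d", "c", "d", "c"],
})
# Same binning as in the report: (0, 5], (5, 10], (10, 15], (15, 20]
df["range"] = pd.cut(df["foo"], np.linspace(0, 20, 5))

# observed=False keeps every category of 'range', including empty bins,
# so the Series result has one row per (range, baz) combination.
all_groups = df.groupby(["range", "baz"], observed=False)["foo"].mean()

# observed=True keeps only the combinations that actually occur in the data.
observed_only = df.groupby(["range", "baz"], observed=True)["foo"].mean()

print(all_groups)
print(observed_only)
```

With `observed=False` the Series result should have the same eight rows as `expected` above, with NaN for the empty bins. With `observed=True` it should have only the four populated groups, like `inconsistent`.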