BUG: filter (with dropna=False) when there are no groups fulfilling the condition #12768

Closed
@sebov

Description

For a DataFrame, I want to keep the rows that belong to groups fulfilling a specific condition and replace all other rows with NaN. I used a combination of 'groupby' and 'filter' (with dropna=False). In the special case where no group fulfils the condition, an exception occurs:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-ffb9adbc134a> in <module>()
----> 1 pd.DataFrame({'a': [1,1,2], 'b':[1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in filter(self, func, dropna, *args, **kwargs)
   3570                                 type(res).__name__)
   3571 
-> 3572         return self._apply_filter(indices, dropna)
   3573 
   3574 

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in _apply_filter(self, indices, dropna)
    831             mask = np.empty(len(self._selected_obj.index), dtype=bool)
    832             mask.fill(False)
--> 833             mask[indices.astype(int)] = True
    834             # mask fails to broadcast when passed to where; broadcast manually.
    835             mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T

AttributeError: 'list' object has no attribute 'astype'

The problem is in the _apply_filter method of the _GroupBy class (core/groupby.py). The line "mask[indices.astype(int)] = True" raises because in this case indices is the plain list [], and a list has no astype method. Shouldn't the len(indices) == 0 branch assign "indices = np.array([])" instead of "indices = []"?

    def _apply_filter(self, indices, dropna):
        if len(indices) == 0:
            indices = []
        else:
            indices = np.sort(np.concatenate(indices))
        if dropna:
            filtered = self._selected_obj.take(indices, axis=self.axis)
        else:
            mask = np.empty(len(self._selected_obj.index), dtype=bool)
            mask.fill(False)
            mask[indices.astype(int)] = True
            # mask fails to broadcast when passed to where; broadcast manually.
            mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
            filtered = self._selected_obj.where(mask)  # Fill with NaNs.
        return filtered
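
To illustrate the suggested change, here is a minimal standalone sketch of the same logic with the empty case returning an ndarray. The function name apply_filter_sketch and its obj parameter are hypothetical (they stand in for the method and self._selected_obj), the axis handling is left out, and this is not the actual pandas patch, only my guess at what a fix could look like.

```python
import numpy as np
import pandas as pd


def apply_filter_sketch(obj, indices, dropna):
    """Hypothetical standalone variant of _apply_filter for a DataFrame."""
    if len(indices) == 0:
        # Use an empty integer ndarray instead of a list so that
        # .astype(int) below is still available.
        indices = np.array([], dtype="int64")
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        return obj.take(indices)
    # Start with an all-False row mask and mark the rows to keep.
    mask = np.zeros(len(obj.index), dtype=bool)
    mask[indices.astype(int)] = True
    # Broadcast the row mask across all columns manually.
    mask = np.tile(mask, list(obj.shape[1:]) + [1]).T
    return obj.where(mask)  # Rows outside the mask become NaN.


df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})
# No group passes the condition, so the list of index arrays is empty.
print(apply_filter_sketch(df, [], dropna=False))
```

With an empty list of indices and dropna=False, the sketch should return a frame of the original shape filled with NaN rather than raising.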

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> pd.DataFrame({'a': [1,1,2], 'b': [1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

Expected Output

    a   b
0 NaN NaN
1 NaN NaN
2 NaN NaN

output of pd.show_versions()

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 1.5.6
setuptools: 12.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: 0.6.6.None
psycopg2: None
jinja2: 2.8
boto: None
