Description
For a DataFrame, I want to preserve the rows that belong to groups fulfilling a specific condition and replace all other rows with NaN. I used a combination of 'groupby' and 'filter' (with dropna=False). In the special case where no group fulfils the condition, an exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-11-ffb9adbc134a> in <module>()
----> 1 pd.DataFrame({'a': [1,1,2], 'b':[1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)
....../local/lib/python2.7/site-packages/pandas/core/groupby.py in filter(self, func, dropna, *args, **kwargs)
3570 type(res).__name__)
3571
-> 3572 return self._apply_filter(indices, dropna)
3573
3574
....../local/lib/python2.7/site-packages/pandas/core/groupby.py in _apply_filter(self, indices, dropna)
831 mask = np.empty(len(self._selected_obj.index), dtype=bool)
832 mask.fill(False)
--> 833 mask[indices.astype(int)] = True
834 # mask fails to broadcast when passed to where; broadcast manually.
835 mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
AttributeError: 'list' object has no attribute 'astype'
I traced the problem to the _apply_filter method of the _GroupBy class (core/groupby.py). The line "mask[indices.astype(int)] = True" raises because in my case indices is an empty list ([]), which has no astype method. Shouldn't the branch for len(indices) == 0 assign "indices = np.array([])" instead of "indices = []"?
def _apply_filter(self, indices, dropna):
    if len(indices) == 0:
        indices = []
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        filtered = self._selected_obj.take(indices, axis=self.axis)
    else:
        mask = np.empty(len(self._selected_obj.index), dtype=bool)
        mask.fill(False)
        mask[indices.astype(int)] = True
        # mask fails to broadcast when passed to where; broadcast manually.
        mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
        filtered = self._selected_obj.where(mask)  # Fill with NaNs.
    return filtered
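The suggested change can be tried outside pandas. The following is only a minimal sketch, not the library's actual code: apply_filter_sketch is a hypothetical standalone function that mirrors the logic quoted above, takes the object as an argument instead of reading self._selected_obj, and assumes axis 0. For the empty case it uses an empty integer ndarray (rather than a plain np.array([]), which would be float) so that both the take and the mask branches accept it.

```python
import numpy as np
import pandas as pd


def apply_filter_sketch(obj, indices, dropna):
    """Hypothetical standalone copy of the quoted logic with the proposed fix."""
    if len(indices) == 0:
        # Proposed change: an empty ndarray instead of an empty list,
        # so that .astype(int) below is available.
        indices = np.array([], dtype="int64")
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        return obj.take(indices, axis=0)
    mask = np.zeros(len(obj.index), dtype=bool)
    mask[indices.astype(int)] = True
    # Broadcast the row mask manually to the full shape of obj.
    mask = np.tile(mask, list(obj.shape[1:]) + [1]).T
    return obj.where(mask)  # Rows outside the mask become NaN.


df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})

# No group passed the filter: every row is replaced with NaN.
result = apply_filter_sketch(df, [], dropna=False)
print(result)

# One group (rows 0 and 1) passed the filter: only row 2 is replaced.
partial = apply_filter_sketch(df, [np.array([0, 1])], dropna=False)
print(partial)
```

With the empty list replaced by an empty array, the no-match case goes through the same masking path as the ordinary case and no special handling is needed further down.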
Code Sample, a copy-pastable example if possible
>>> import pandas as pd
>>> pd.DataFrame({'a': [1,1,2], 'b': [1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)
Expected Output
    a   b
0 NaN NaN
1 NaN NaN
2 NaN NaN
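Until this is fixed, a possible workaround is to build the row mask by hand instead of calling filter with dropna=False. This is only a sketch that I have not verified on 0.18.0: it assumes that groupby(...).transform('sum') returns a Series aligned with the original index, and the names keep and mask are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})

# Per-row boolean: does the row's group satisfy the condition?
keep = df.groupby('a')['b'].transform('sum') > 5

# Expand the row mask to the full shape of df, then blank out the rest.
mask = np.repeat(np.asarray(keep)[:, None], df.shape[1], axis=1)
result = df.where(mask)
print(result)
```

Because the mask is computed per row, the case where no group matches is not special: the mask is simply all False.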
output of pd.show_versions()
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.0
nose: 1.3.7
pip: 1.5.6
setuptools: 12.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: 0.6.6.None
psycopg2: None
jinja2: 2.8
boto: None