Description
Hi, I am the maintainer of tsfresh; we calculate features from time series and rely on pandas internally.
Since we open sourced tsfresh, we have had numerous reports of it crashing on big datasets (100 GB+), but we were never able to pin the problem down. I tried to reproduce it myself, but I currently do not have access to a machine with enough memory.
Recently, we found the place: it seems to crash at
[x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
df is a dataframe that looks like
==== ====== =========
id   kind   val
==== ====== =========
1    a      -0.21761
1    a      -0.613667
1    a      -2.07339
2    b      -0.576254
2    b      -1.21924
==== ====== =========
and it should get converted into
[(1, 'a', pd.Series([-0.217610, -0.613667, -2.073386])),
 (2, 'b', pd.Series([-0.576254, -1.219238]))]
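For context, a minimal, self-contained version of this transformation looks roughly as follows (a toy dataframe stands in for the real 100 GB+ input; column_id, column_kind and column_value are tsfresh's configurable column names):

import pandas as pd

column_id, column_kind, column_value = "id", "kind", "val"

df = pd.DataFrame({
    column_id:    [1, 1, 1, 2, 2],
    column_kind:  ["a", "a", "a", "b", "b"],
    column_value: [-0.217610, -0.613667, -2.073386, -0.576254, -1.219238],
})

# Each group key x is a tuple (id, kind); appending the value series y gives
# tuples of the form (id, kind, Series of values).
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]

On small inputs like this it works as expected; the crash only shows up on the huge datasets described above.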
An example stack trace of the crash is
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
splitter = self._get_splitter(data, axis=axis)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
comp_ids, _, ngroups = self.group_info
File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.get
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
comp_ids, obs_group_ids = self._get_compressed_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
self._make_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
self.grouper, sort=self.sort)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
table = hash_klass(size_hint or len(values))
File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.init
OverflowError: value too large to convert to int
See also the discussion in blue-yonder/tsfresh#418
So, I assume that we hit some kind of threshold. Any idea how to make the groupby more robust for bigger datasets?
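One workaround idea, which I have not been able to test at this scale and which assumes the dataframe is sorted by id so that all rows of one id are contiguous, would be to run the groupby on positional slices of the frame instead of on the whole thing, so that each internal hash table stays well below whatever integer limit is overflowing here. The slice boundaries are snapped to id boundaries so that no group is split across slices (n_slices is just a made-up tuning knob):

import numpy as np
import pandas as pd

def group_in_slices(df, column_id, column_kind, column_value, n_slices=64):
    # More slices -> smaller per-call hash tables inside groupby.
    ids = df[column_id].values

    # Positional cut points, snapped forward to the next change of id so that
    # a slice never cuts through the rows of one id (assumes df is sorted by id).
    cuts = [0]
    for pos in np.linspace(0, len(df), n_slices + 1, dtype=np.int64)[1:-1]:
        cuts.append(int(np.searchsorted(ids, ids[pos], side="right")))
    cuts.append(len(df))

    chunks = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        part = df.iloc[start:end]
        if len(part) == 0:
            continue
        chunks.extend(
            x + (y,) for x, y in part.groupby([column_id, column_kind])[column_value]
        )
    return chunks

# usage: data_in_chunks = group_in_slices(df, column_id, column_kind, column_value)

That is obviously just a band-aid on our side, though; it would be much nicer if the groupby itself did not hit this limit.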
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None