Description
Hi, I am the maintainer of tsfresh; we calculate features from time series and rely on pandas internally.
Since we open sourced tsfresh, we have had numerous reports of it crashing on big datasets (100 GB+), but we were never able to pin the problem down. I tried to reproduce it myself, but I currently do not have access to a machine with enough memory.
Recently, we found the place: it seems to crash at
[x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
df is a dataframe that looks like
==== ====== =========
id   kind   val
==== ====== =========
1    a      -0.21761
1    a      -0.613667
1    a      -2.07339
2    b      -0.576254
2    b      -1.21924
==== ====== =========
and it should get converted into
[(1, 'a', pd.Series([-0.217610, -0.613667, -2.073386])),
 (2, 'b', pd.Series([-0.576254, -1.219238]))]
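For context, a minimal, self-contained version of this transformation looks roughly as follows (a toy dataframe stands in for the real 100 GB+ input; column_id, column_kind and column_value are tsfresh's configurable column names):

import pandas as pd

column_id, column_kind, column_value = "id", "kind", "val"

df = pd.DataFrame({
    column_id:    [1, 1, 1, 2, 2],
    column_kind:  ["a", "a", "a", "b", "b"],
    column_value: [-0.217610, -0.613667, -2.073386, -0.576254, -1.219238],
})

# Each group key x is a tuple (id, kind); appending the value series y gives
# tuples of the form (id, kind, Series of values).
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]

On small inputs like this it works as expected; the crash only shows up on the huge datasets described above.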
An example stack trace of the crash is
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
splitter = self._get_splitter(data, axis=axis)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
comp_ids, _, ngroups = self.group_info
File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.get
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
comp_ids, obs_group_ids = self._get_compressed_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
self._make_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
self.grouper, sort=self.sort)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
table = hash_klass(size_hint or len(values))
File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.init
OverflowError: value too large to convert to int
See also the discussion in blue-yonder/tsfresh#418
So, I assume that we hit some kind of threshold. Any idea how to make the groupby more robust for bigger datasets?
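One workaround idea, which I have not been able to test at this scale and which assumes the dataframe is sorted by id so that all rows of one id are contiguous, would be to run the groupby on positional slices of the frame instead of on the whole thing, so that each internal hash table stays well below whatever integer limit is overflowing here. The slice boundaries are snapped to id boundaries so that no group is split across slices (n_slices is just a made-up tuning knob):

import numpy as np
import pandas as pd

def group_in_slices(df, column_id, column_kind, column_value, n_slices=64):
    # More slices -> smaller per-call hash tables inside groupby.
    ids = df[column_id].values

    # Positional cut points, snapped forward to the next change of id so that
    # a slice never cuts through the rows of one id (assumes df is sorted by id).
    cuts = [0]
    for pos in np.linspace(0, len(df), n_slices + 1, dtype=np.int64)[1:-1]:
        cuts.append(int(np.searchsorted(ids, ids[pos], side="right")))
    cuts.append(len(df))

    chunks = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        part = df.iloc[start:end]
        if len(part) == 0:
            continue
        chunks.extend(
            x + (y,) for x, y in part.groupby([column_id, column_kind])[column_value]
        )
    return chunks

# usage: data_in_chunks = group_in_slices(df, column_id, column_kind, column_value)

That is obviously just a band-aid on our side, though; it would be much nicer if the groupby itself did not hit this limit.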
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None