Skip to content

BUG: Infinite recursion on IntervalIndex constructor due to integer overflow #25485

Closed
@kingsykes

Description

@kingsykes

Code Sample:

import numpy as np
import pandas as pd

# construct IntervalIndex from timestamps
# --> Length of index is > 100, the default leaf_size
# --> index is heavily skewed towards Timstamp.max
idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365)

# type conversion from IntervalIndex._engine
left = idx._maybe_convert_i8(idx.left)
right = idx._maybe_convert_i8(idx.right)

# pivot point calculation from IntervalNode constructor, when n_elements > leaf_size
pivot = np.median(left + right) / 2

print(pd.Timestamp(pivot))
# >>> 1848-02-11 00:06:21.572612096
# This is below the minimum left-side value, which should not be

print(pd.Timestamp(np.median(idx.mid.asi8)))
# >>> 2140-05-21 23:53:38.427387904
# This is the correct midpoint
# NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition

# proposed alternate logic for midpoint
alternate = np.median(left/2 + right/2)

print(pd.Timestamp(alternate)
# >>> 2140-05-21 23:53:38.427387904
# Matches the correct midpoint, and should never over/under flow

Description

Current behavior causes a hard crash in the python process under certain conditions:

  • Length of the index is > 100, the default leaf_size of the IntervalTree
  • The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)

Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.

By emulating the IntervalNode.classify_intervals method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.

As an additional note, I also encountered an OverflowError when accessing the mid attribute of the IntervalIndex with my own data, but this does not occur in the code sample above. I traced it back through the length attribute, where left bounds are subtracted from right bounds.

idx.right - idx.left
# >>> Traceback (most recent call last):
# >>>   File "<input>", line 1, in <module>
# >>>   File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__
# >>>     result = self._data.__sub__(maybe_unwrap_index(other))
# >>>   File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__
# >>>     result = self._sub_datetime_arraylike(other)
# >>>   File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike
# >>>     arr_mask=self._isnan)
# >>>   File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr
# >>>     raise OverflowError("Overflow in int64 addition")
# >>> OverflowError: Overflow in int64 addition

Expected Output

Changing the pivot calculation from
self.pivot = np.median(left + right) / 2
to
self.pivot = np.median(left/2 + right/2)
should protect against over and underflows.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
------------------
pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    IntervalInterval data typeNumeric OperationsArithmetic, Comparison, and Logical operations

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions