Description
Code Sample:
import numpy as np
import pandas as pd
# construct IntervalIndex from timestamps
# --> Length of index is > 100, the default leaf_size
# --> index is heavily skewed towards Timstamp.max
idx = pd.IntervalIndex.from_arrays(pd.date_range('2018-01-01', '2018-12-31', freq='D'), [pd.Timestamp.max]*365)
# type conversion from IntervalIndex._engine
left = idx._maybe_convert_i8(idx.left)
right = idx._maybe_convert_i8(idx.right)
# pivot point calculation from IntervalNode constructor, when n_elements > leaf_size
pivot = np.median(left + right) / 2
print(pd.Timestamp(pivot))
# >>> 1848-02-11 00:06:21.572612096
# This is below the minimum left-side value, which should not be
print(pd.Timestamp(np.median(idx.mid.asi8)))
# >>> 2140-05-21 23:53:38.427387904
# This is the correct midpoint
# NB: with my own data, accessing `mid` throws OverflowError: Overflow in int64 addition
# proposed alternate logic for midpoint
alternate = np.median(left/2 + right/2)
print(pd.Timestamp(alternate)
# >>> 2140-05-21 23:53:38.427387904
# Matches the correct midpoint, and should never over/under flow
Description
Current behavior causes a hard crash in the python process under certain conditions:
- Length of the index is > 100, the default leaf_size of the IntervalTree
- The intervals are heavily skewed towards the upper bound of the type (for signed types I believe this can also be the lower bound, but I haven't verified it)
Having an index length greater than the leaf_size parameter, causes the the IndexNode constructor to create child nodes. Currently the default leaf_size is 100 and is not exposed to the user API. Instantiating an IntervalTree directly and setting the leaf_size > index length does not lead to a crash.
By emulating the IntervalNode.classify_intervals
method in the REPL, I found that the entire index was assigned to the right-hand child node, because the pivot was calculated on an array of overflowed integers, and that the constructor would never stop recursing.
As an additional note, I also encountered an OverflowError
when accessing the mid
attribute of the IntervalIndex
with my own data, but this does not occur in the code sample above. I traced it back through the length
attribute, where left bounds are subtracted from right bounds.
idx.right - idx.left
# >>> Traceback (most recent call last):
# >>> File "<input>", line 1, in <module>
# >>> File "~\pandas\core\indexes\datetimelike.py", line 501, in __sub__
# >>> result = self._data.__sub__(maybe_unwrap_index(other))
# >>> File "~\pandas\core\arrays\datetimelike.py", line 1275, in __sub__
# >>> result = self._sub_datetime_arraylike(other)
# >>> File "~\pandas\core\arrays\datetimes.py", line 724, in _sub_datetime_arraylike
# >>> arr_mask=self._isnan)
# >>> File "~\pandas\core\algorithms.py", line 938, in checked_add_with_arr
# >>> raise OverflowError("Overflow in int64 addition")
# >>> OverflowError: Overflow in int64 addition
Expected Output
Changing the pivot calculation from
self.pivot = np.median(left + right) / 2
to
self.pivot = np.median(left/2 + right/2)
should protect against over and underflows.
Output of pd.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 3.5.6.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None ------------------ pandas: 0.24.1 pytest: None pip: 10.0.1 setuptools: 40.2.0 Cython: None numpy: 1.16.1 scipy: None pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml.etree: None bs4: None html5lib: None sqlalchemy: 1.1.13 pymysql: None psycopg2: None jinja2: None s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None