Description
I've recently started using pandas (impressed so far!) and found that plotting large data (from around 100k samples) is quite slow. I traced the bottleneck to the _dt_to_float_ordinal helper function called by DatetimeConverter (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L144).
More specifically, this function uses matplotlib's date2num, which converts arrays and iterables with a slow list comprehension. Since pandas seems to natively store datetimes as nanoseconds since the epoch in an int64 array, it would be much faster to use matplotlib's vectorized epoch2num instead. In a test case with 1 million points, epoch2num is about 100 times faster than date2num:
from pandas import date_range, DataFrame
from numpy import int64, arange
from matplotlib import pyplot, dates
import time
n = int(1e6)  # date_range expects an integer number of periods
df = DataFrame(arange(n), index=date_range('20130101', periods=n, freq='S'))
start = time.time()
pyplot.plot(df.index, df)
print('date2num took {0:g}s'.format(time.time() - start))
pyplot.show()
# monkey patch
import pandas.tseries.converter
def _my_dt_to_float_ordinal(dt):
    try:
        # datetime64[ns] values viewed as int64 are nanoseconds since the
        # Unix epoch, so one vectorized division yields the epoch seconds
        # that epoch2num expects
        base = dates.epoch2num(dt.astype(int64) / 1.0E9)
    except AttributeError:
        # fall back to the slow path for inputs without .astype
        base = dates.date2num(dt)
    return base
pandas.tseries.converter._dt_to_float_ordinal = _my_dt_to_float_ordinal
start = time.time()
pyplot.plot(df.index, df)
print('epoch2num took {0:g}s'.format(time.time() - start))
pyplot.show()
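For reference, the conversion the patched fast path relies on can be checked in plain numpy (a sketch, not pandas or matplotlib code): datetime64[ns] values viewed as int64 are nanoseconds since the Unix epoch, so a whole array converts to epoch seconds, and then float days, with two vectorized divisions instead of date2num's per-element list comprehension.

```python
import numpy as np

# datetime64[ns] stores nanoseconds since 1970-01-01 as int64, so the
# whole array converts without any per-element Python loop.
dt = np.array(['2013-01-01', '2013-01-02'], dtype='datetime64[ns]')
seconds = dt.view('int64') / 1.0e9  # epoch seconds, the input epoch2num expects
days = seconds / 86400.0            # float days since 1970-01-01
print(days)  # -> [15706. 15707.]
```

As a sanity check, 2013-01-01 is 43 years (with 11 leap days) after 1970-01-01, i.e. 43 * 365 + 11 = 15706 days, matching the first value.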
Unfortunately, I am not familiar enough with pandas internals to know whether date2num is used here intentionally, nor to implement a proper patch myself that works in all cases.
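In case it helps whoever does write the patch, the fast path could be isolated into a small helper. Below is a minimal sketch under my own assumptions: the helper name is hypothetical (not an existing pandas function), and handling of timezone-aware or scalar inputs is left out.

```python
import numpy as np

def _dt_to_epoch_seconds(dt):
    """Hypothetical helper: coerce a datetime-like input (list,
    DatetimeIndex, or datetime64 array) to datetime64[ns] and convert
    it to epoch seconds in one vectorized step.  The result is what
    matplotlib.dates.epoch2num expects, so a patched
    _dt_to_float_ordinal could return
    dates.epoch2num(_dt_to_epoch_seconds(dt)) on this fast path and
    keep dates.date2num as the fallback for everything else."""
    values = np.asarray(dt, dtype='datetime64[ns]')
    return values.view('int64') / 1.0e9

seconds = _dt_to_epoch_seconds(['1970-01-01T00:00:01', '1970-01-01T00:01:00'])
print(seconds)  # -> [ 1. 60.]
```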