Speed up DatetimeConverter for plotting #6636

Closed
@agijsberts

Description

I've recently started using pandas (impressed so far!) and found that plotting large datasets (from around 100k samples) is quite slow. I traced the bottleneck to the _dt_to_float_ordinal helper function called by DatetimeConverter (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L144).

More specifically, this function uses matplotlib's date2num, which converts arrays and iterables using a slow list comprehension. Since pandas seems to natively store datetimes as nanoseconds since the epoch in an int64 array, it would be much faster to use matplotlib's vectorized epoch2num instead. In a test case with 1 million points, using epoch2num is about 100 times faster than date2num:

from pandas import date_range, DataFrame
from numpy import int64, arange
from matplotlib import pyplot, dates
import time

n = 10**6  # integer sample count (1e6 is a float)

df = DataFrame(arange(n), index = date_range('20130101', periods=n, freq='S'))

start = time.time()
pyplot.plot(df.index, df)
print('date2num took {0:g}s'.format(time.time() - start))
pyplot.show()

# monkey patch
import pandas.tseries.converter
def _my_dt_to_float_ordinal(dt):
    try:
        base = dates.epoch2num(dt.astype(int64) / 1.0E9)
    except AttributeError:
        base = dates.date2num(dt)
    return base
pandas.tseries.converter._dt_to_float_ordinal = _my_dt_to_float_ordinal

start = time.time()
pyplot.plot(df.index, df)
print('epoch2num took {0:g}s'.format(time.time() - start))
pyplot.show()

Unfortunately, I am not familiar enough with pandas to know whether date2num is used intentionally or to implement a proper patch myself that works in all cases.
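
For reference, the arithmetic behind the faster path can be sketched with NumPy alone. This is a minimal sketch, not the pandas or matplotlib implementation: the function name ns_to_float_days and the constants are hypothetical, and the offset of 719163 days assumes matplotlib's classic date numbering, in which the Unix epoch maps to the proleptic Gregorian ordinal of 1970-01-01. Matplotlib versions that use a different date epoch would need a different offset.

```python
import numpy as np

# Assumption: classic matplotlib date numbers, where 1970-01-01 maps to
# 719163.0 (the proleptic Gregorian ordinal of that date).
CLASSIC_EPOCH_OFFSET_DAYS = 719163.0
NS_PER_DAY = 86400 * 10**9


def ns_to_float_days(values_ns, offset_days=CLASSIC_EPOCH_OFFSET_DAYS):
    """Convert int64 nanoseconds since the Unix epoch to float day numbers."""
    values_ns = np.asarray(values_ns, dtype=np.int64)
    # One vectorized division and addition, no per-element Python calls.
    return values_ns / NS_PER_DAY + offset_days


# Two sample timestamps stored the way the issue describes: int64 nanoseconds.
stamps = np.array(
    ["1970-01-01T00:00:00", "2013-01-01T12:00:00"], dtype="datetime64[ns]"
)
days = ns_to_float_days(stamps.astype(np.int64))
print(days)
```

Because the whole array is converted in a single arithmetic expression, the cost per element stays in compiled NumPy code, which is the property the timing comparison above relies on.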
