Skip to content

PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse #5490

Closed
@lexual

Description

@lexual

I had a series containing strings like these:

"November 1, 2013"

Series length was about 500,000

A)
running pd.to_datetime(s) takes just over a minute.
B)
running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!

My suggestion is a way to make case A (where user doesn't specify the format type) take about as long as case B (user does specify).

Basically it looks like the code is always using date_util parser for case A.

My suggestion is based upon the idea that it's highly likely that the date strings are all in a consistent format (it's highly unlikely in this case that they would be in 500K separate formats!).

In a nutshell:

  • figure out the date format of the first entry.
  • try to use that against the entire series, using the speedy code in tslib.array_strptime
  • if that works, we've saved heaps of time, if not fall back to the current slower behaviour of using dateutil parse each time.

Here's some pseudo-code::

datestr1 = s[0]
# I'm assuming dateutil has something like this, that can tell you what the format is for a given date string.
date_format = figure_out_datetime_format(datestr1)

try:
    # use the super speed code that pandas uses when you tell it what the format is.
    dt_series = tslib.array_strptime(s, format=datestr1, *, ...)
except:
    # date strings aren't consistent after all. Let's do it the old slow way.
    dt_series = tslib.array_to_datetime(s, format=None)

return dt_series

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions