PERF: allow even more flexible ISO 8601 datetime parsing #11899

Closed
@femtotrader

Hello,

I noticed a huge speed difference in to_datetime depending on whether format is given.

I wonder whether there is room for improvement here.

In [1]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 3.31 s, sys: 481 ms, total: 3.79 s
Wall time: 4.13 s

In [2]: df
Out[274]:
          Symbol                   Date      Bid      Ask
0        AUD/USD  20140101 21:55:34.404  0.88796  0.88922
1        AUD/USD  20140101 21:55:34.444  0.88805  0.88914
2        AUD/USD  20140101 21:55:34.475  0.88809  0.88910
3        AUD/USD  20140101 21:55:48.962  0.88811  0.88908
4        AUD/USD  20140101 21:56:38.293  0.88808  0.88887
...          ...                    ...      ...      ...
1947101  AUD/USD  20140131 21:59:48.048  0.87525  0.87589
1947102  AUD/USD  20140131 21:59:54.599  0.87527  0.87589
1947103  AUD/USD  20140131 21:59:56.927  0.87531  0.87588
1947104  AUD/USD  20140131 21:59:59.365  0.87531  0.87574
1947105  AUD/USD  20140131 22:00:00.038  0.87531  0.87574

[1947106 rows x 4 columns]

In [3]: %time pd.to_datetime(df['Date'])
CPU times: user 11min 44s, sys: 19.4 s, total: 12min 4s
Wall time: 12min 46s
Out[3]:
0         2014-01-01 21:55:34.404
1         2014-01-01 21:55:34.444
2         2014-01-01 21:55:34.475
3         2014-01-01 21:55:48.962
4         2014-01-01 21:56:38.293
                    ...
1947101   2014-01-31 21:59:48.048
1947102   2014-01-31 21:59:54.599
1947103   2014-01-31 21:59:56.927
1947104   2014-01-31 21:59:59.365
1947105   2014-01-31 22:00:00.038
Name: Date, dtype: datetime64[ns]

In [4]: fmt='%Y%m%d %H:%M:%S.%f'

In [5]: %time pd.to_datetime(df['Date'], format=fmt)
CPU times: user 37.3 s, sys: 1.31 s, total: 38.6 s
Wall time: 40 s

In [6]: timedelta(minutes=12, seconds=46) / timedelta(seconds=40)
Out[6]: 19.15

That is a factor of 19.15 between the two calls!
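The comparison above can be reproduced on a small scale. The following is only a minimal sketch: it uses three of the timestamp strings shown in the output above as a stand-in for the full CSV, and the names `dates`, `inferred`, and `explicit` are illustrative. It checks that both code paths give the same values; actual timings depend on the pandas version and on the data size, so none are claimed here.

```python
import pandas as pd

# Small stand-in for df['Date']; the strings are copied from the output above.
dates = pd.Series([
    "20140101 21:55:34.404",
    "20140101 21:55:34.444",
    "20140131 22:00:00.038",
])

# Path 1: no format given, so pandas has to work out how to parse each string.
# Depending on the pandas version, this may emit a warning about format inference.
inferred = pd.to_datetime(dates)

# Path 2: explicit format, same string as in the report.
fmt = "%Y%m%d %H:%M:%S.%f"
explicit = pd.to_datetime(dates, format=fmt)

# Both paths should agree on the parsed values.
same_values = list(inferred) == list(explicit)
print(same_values)
```

To compare speed, the two to_datetime calls can be wrapped in %time or %timeit on a larger Series, as done in the session above.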

Sample data can be found here:
https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing

See also pydata/pandas-datareader#153


Labels: Datetime (Datetime data dtype), Performance (Memory or execution speed performance)
