Hello,
I noticed a huge difference in the execution speed of to_datetime
depending on whether format is given or not.
I wonder whether there is room for improvement here.
In [1]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 3.31 s, sys: 481 ms, total: 3.79 s
Wall time: 4.13 s
In [2]: df
Out[274]:
Symbol Date Bid Ask
0 AUD/USD 20140101 21:55:34.404 0.88796 0.88922
1 AUD/USD 20140101 21:55:34.444 0.88805 0.88914
2 AUD/USD 20140101 21:55:34.475 0.88809 0.88910
3 AUD/USD 20140101 21:55:48.962 0.88811 0.88908
4 AUD/USD 20140101 21:56:38.293 0.88808 0.88887
... ... ... ... ...
1947101 AUD/USD 20140131 21:59:48.048 0.87525 0.87589
1947102 AUD/USD 20140131 21:59:54.599 0.87527 0.87589
1947103 AUD/USD 20140131 21:59:56.927 0.87531 0.87588
1947104 AUD/USD 20140131 21:59:59.365 0.87531 0.87574
1947105 AUD/USD 20140131 22:00:00.038 0.87531 0.87574
[1947106 rows x 4 columns]
In [3]: %time pd.to_datetime(df['Date'])
CPU times: user 11min 44s, sys: 19.4 s, total: 12min 4s
Wall time: 12min 46s
Out[3]:
0 2014-01-01 21:55:34.404
1 2014-01-01 21:55:34.444
2 2014-01-01 21:55:34.475
3 2014-01-01 21:55:48.962
4 2014-01-01 21:56:38.293
...
1947101 2014-01-31 21:59:48.048
1947102 2014-01-31 21:59:54.599
1947103 2014-01-31 21:59:56.927
1947104 2014-01-31 21:59:59.365
1947105 2014-01-31 22:00:00.038
Name: Date, dtype: datetime64[ns]
In [4]: fmt='%Y%m%d %H:%M:%S.%f'
In [5]: %time pd.to_datetime(df['Date'], format=fmt)
CPU times: user 37.3 s, sys: 1.31 s, total: 38.6 s
Wall time: 40 s
In [6]: timedelta(minutes=12, seconds=46) / timedelta(seconds=40)
Out[6]: 19.15
That is a factor of 19.15 in favour of passing format explicitly!
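For anyone who wants to reproduce the comparison without downloading the file, here is a minimal sketch on synthetic data. The row count, the one-second spacing, and the variable names (`dates`, `parsed_auto`, `parsed_fmt`) are assumptions made for illustration; only the format string and the first timestamp come from the session above. The timings depend heavily on the pandas version, and newer releases may show a much smaller gap, so no expected numbers are given.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the 'Date' column (assumption: 2000 rows, 1 s apart).
n_rows = 2000
base = pd.Timestamp("2014-01-01 21:55:34.404")
stamps = base + pd.to_timedelta(np.arange(n_rows), unit="s")

# Render as 'YYYYMMDD HH:MM:SS.mmm'; %f emits 6 digits, so drop the last 3.
dates = pd.Series(stamps.strftime("%Y%m%d %H:%M:%S.%f").str[:-3], name="Date")

fmt = "%Y%m%d %H:%M:%S.%f"

# Parse without a format: pandas has to work out the layout itself.
start = time.perf_counter()
parsed_auto = pd.to_datetime(dates)
elapsed_auto = time.perf_counter() - start

# Parse with the explicit format.
start = time.perf_counter()
parsed_fmt = pd.to_datetime(dates, format=fmt)
elapsed_fmt = time.perf_counter() - start

print(f"without format: {elapsed_auto:.4f} s")
print(f"with format:    {elapsed_fmt:.4f} s")
print(f"ratio:          {elapsed_auto / elapsed_fmt:.2f}")
```

Both calls should return the same values; only the time taken to get there differs.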
Sample data can be found here:
https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing
See also pydata/pandas-datareader#153