Description
Large DataFrame with dates is slower in 0.24.0
import pandas as pd
import numpy as np
d = '2018-11-29'
dt = '2018-11-26 11:18:27.0'
N = 10000000
df = pd.DataFrame({'dt': [np.datetime64(dt)] * N, 'd': [np.datetime64(d)] * N, 'r': [np.random.uniform()] * N})
Problem description
Using pandas < 0.24.0 dumping a large dataset (10M rows, 40 columns) which contains text, dates and floats/ints takes about 12 minutes but using version 0.24.0 the same code takes about 2hours and 40 minutes.
We understand the code base to create the csv has change a bit. I didn't notice the speed degradation when the columns are not dates. I've not checked with datetime objects.
We've tried to condense the problem to the few lines above. Below the results I get from %prun in each version.
%prun
for pandas 0.24.0
:
121342 function calls (120733 primitive calls) in 153.774 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 153.774 153.774 {built-in method builtins.exec}
1 0.000 0.000 153.774 153.774 <string>:1(<module>)
1 0.000 0.000 153.774 153.774 generic.py:2873(to_csv)
1 0.000 0.000 153.773 153.773 csvs.py:130(save)
1 0.002 0.002 153.684 153.684 csvs.py:272(_save)
301 0.278 0.001 153.681 0.511 csvs.py:290(_save_chunk)
301 0.009 0.000 119.472 0.397 blocks.py:2200(to_native_types)
301 0.002 0.000 101.820 0.338 format.py:1300(_get_format_datetime64_from_values)
301 96.018 0.319 101.818 0.338 format.py:1249(_is_dates_only)
301 20.040 0.067 20.040 0.067 {pandas._libs.writers.write_csv_rows}
301 17.525 0.058 17.580 0.058 {pandas._libs.tslib.format_array_from_datetime}
602 13.766 0.023 13.766 0.023 {method 'astype' of 'numpy.ndarray' objects}
301 0.005 0.000 8.095 0.027 blocks.py:1982(to_native_types)
1204 5.798 0.005 5.798 0.005 {method 'reduce' of 'numpy.ufunc' objects}
302 0.003 0.000 5.796 0.019 base.py:1023(to_native_types)
302 0.005 0.000 5.777 0.019 base.py:1049(_format_native_types)
301 0.002 0.000 5.749 0.019 {method 'sum' of 'numpy.ndarray' objects}
301 0.001 0.000 5.747 0.019 _methods.py:34(_sum)
603 0.001 0.000 0.089 0.000 missing.py:25(isna)
603 0.003 0.000 0.088 0.000 missing.py:105(_isna_new)
[snip]
cProfile of my script with 0.24.0
(I took the liberty of removing paths/username information)
stats: 3434284 function calls (3423766 primitive calls) in 8642.079 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 8642.079 8642.079 ../lib/python3.7/site-packages/pandas/core/generic.py:2873(to_csv)
1 0.000 0.000 8642.077 8642.077 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:130(save)
1 0.042 0.042 8642.002 8642.002../lib/python3.7/site-packages/pandas/io/formats/csvs.py:272(_save)
5242 3.588 0.001 8641.950 1.649../lib/python3.7/site-packages/pandas/io/formats/csvs.py:290(_save_chunk)
5242 0.165 0.000 8294.922 1.582 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:2200(to_native_types)
5242 0.044 0.000 8203.096 1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1300(_get_format_datetime64_from_values)
5242 7757.167 1.480 8203.052 1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1249(_is_dates_only)
20968 445.184 0.021 445.184 0.021 {method 'reduce' of 'numpy.ufunc' objects}
5242 0.041 0.000 444.994 0.085 {method 'sum' of 'numpy.ndarray' objects}
5242 0.023 0.000 444.953 0.085 ../lib/python3.7/site-packages/numpy/core/_methods.py:34(_sum)
5242 212.677 0.041 212.677 0.041 {pandas._libs.writers.write_csv_rows}
15726 122.216 0.008 122.216 0.008 {method 'astype' of 'numpy.ndarray' objects}
5242 90.999 0.017 91.300 0.017 {pandas._libs.tslib.format_array_from_datetime}
5242 0.570 0.000 79.782 0.015 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:1982(to_native_types)
10484 0.200 0.000 43.947 0.004 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:730(to_native_types)
5243 0.031 0.000 7.008 0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:1023(to_native_types)
...the rest was removed...
%prun
for pandas 0.23.4
:
71818 function calls (71517 primitive calls) in 52.371 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 52.371 52.371 {built-in method builtins.exec}
1 0.000 0.000 52.371 52.371 <string>:1(<module>)
1 0.000 0.000 52.371 52.371 frame.py:1653(to_csv)
1 0.000 0.000 52.370 52.370 csvs.py:123(save)
1 0.002 0.002 51.546 51.546 csvs.py:270(_save)
301 0.245 0.001 51.543 0.171 csvs.py:288(_save_chunk)
301 19.846 0.066 19.846 0.066 {pandas._libs.writers.write_csv_rows}
301 0.007 0.000 18.031 0.060 internals.py:2743(to_native_types)
301 17.654 0.059 17.698 0.059 {pandas._libs.tslib.format_array_from_datetime}
602 13.309 0.022 13.309 0.022 {method 'astype' of 'numpy.ndarray' objects}
301 0.006 0.000 7.480 0.025 internals.py:2052(to_native_types)
302 0.002 0.000 5.940 0.020 base.py:2408(to_native_types)
302 0.004 0.000 5.925 0.020 base.py:2434(_format_native_types)
2 0.823 0.411 0.823 0.411 {method 'close' of '_io.TextIOWrapper' objects}
301 0.001 0.000 0.293 0.001 format.py:1291(_get_format_datetime64_from_values)
301 0.249 0.001 0.292 0.001 format.py:1243(_is_dates_only)
603 0.001 0.000 0.081 0.000 missing.py:32(isna)
603 0.003 0.000 0.080 0.000 missing.py:112(_isna_new)
603 0.033 0.000 0.072 0.000 missing.py:189(_isna_ndarraylike)
1204 0.067 0.000 0.067 0.000 {method 'reduce' of 'numpy.ufunc' objects}
903 0.001 0.000 0.044 0.000 _methods.py:42(_any)
302 0.026 0.000 0.026 0.000 {method 'ravel' of 'numpy.ndarray' objects}
301 0.001 0.000 0.024 0.000 {method 'sum' of 'numpy.ndarray' objects}
cProfile of my script with 0.23.4
stats: 2170825 function calls (2165580 primitive calls) in 442.783 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 442.783 442.783 ../lib/python3.7/site-packages/pandas/core/frame.py:1653(to_csv)
1 0.000 0.000 442.782 442.782 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:123(save)
1 0.041 0.041 442.781 442.781 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:270(_save)
5242 2.815 0.001 442.730 0.084 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:288(_save_chunk)
5242 216.105 0.041 216.105 0.041 {pandas._libs.writers.write_csv_rows}
15726 125.188 0.008 125.188 0.008 {method 'astype' of 'numpy.ndarray' objects}
5242 0.121 0.000 90.872 0.017 ../lib/python3.7/site-packages/pandas/core/internals.py:2743(to_native_types)
5242 88.329 0.017 88.590 0.017 {pandas._libs.tslib.format_array_from_datetime}
5242 0.606 0.000 79.149 0.015 ../lib/python3.7/site-packages/pandas/core/internals.py:2052(to_native_types)
10484 0.188 0.000 46.683 0.004 ../lib/python3.7/site-packages/pandas/core/internals.py:755(to_native_types)
5243 0.026 0.000 7.077 0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2408(to_native_types)
5243 0.039 0.000 6.497 0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2434(_format_native_types)
What is Expected
No degradation in speed.
Output of pd.show_versions()
for 0.24.0
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Output of pd.show_versions()
for 0.23.4
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None