Skip to content

Performance degradation when dumping large dataframe with np.datetime to csv #25708

Closed
@alep

Description

@alep

Large DataFrame with dates is slower in 0.24.0

import pandas as pd
import numpy as np

d = '2018-11-29'
dt = '2018-11-26 11:18:27.0'
N = 10000000
df = pd.DataFrame({'dt': [np.datetime64(dt)] * N, 'd': [np.datetime64(d)] * N, 'r': [np.random.uniform()] * N})

Problem description

Using pandas < 0.24.0 dumping a large dataset (10M rows, 40 columns) which contains text, dates and floats/ints takes about 12 minutes but using version 0.24.0 the same code takes about 2hours and 40 minutes.

We understand the code base to create the csv has change a bit. I didn't notice the speed degradation when the columns are not dates. I've not checked with datetime objects.

We've tried to condense the problem to the few lines above. Below the results I get from %prun in each version.

%prun for pandas 0.24.0:

        121342 function calls (120733 primitive calls) in 153.774 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  153.774  153.774 {built-in method builtins.exec}
        1    0.000    0.000  153.774  153.774 <string>:1(<module>)
        1    0.000    0.000  153.774  153.774 generic.py:2873(to_csv)
        1    0.000    0.000  153.773  153.773 csvs.py:130(save)
        1    0.002    0.002  153.684  153.684 csvs.py:272(_save)
      301    0.278    0.001  153.681    0.511 csvs.py:290(_save_chunk)
      301    0.009    0.000  119.472    0.397 blocks.py:2200(to_native_types)
      301    0.002    0.000  101.820    0.338 format.py:1300(_get_format_datetime64_from_values)
      301   96.018    0.319  101.818    0.338 format.py:1249(_is_dates_only)
      301   20.040    0.067   20.040    0.067 {pandas._libs.writers.write_csv_rows}
      301   17.525    0.058   17.580    0.058 {pandas._libs.tslib.format_array_from_datetime}
      602   13.766    0.023   13.766    0.023 {method 'astype' of 'numpy.ndarray' objects}
      301    0.005    0.000    8.095    0.027 blocks.py:1982(to_native_types)
     1204    5.798    0.005    5.798    0.005 {method 'reduce' of 'numpy.ufunc' objects}
      302    0.003    0.000    5.796    0.019 base.py:1023(to_native_types)
      302    0.005    0.000    5.777    0.019 base.py:1049(_format_native_types)
      301    0.002    0.000    5.749    0.019 {method 'sum' of 'numpy.ndarray' objects}
      301    0.001    0.000    5.747    0.019 _methods.py:34(_sum)
      603    0.001    0.000    0.089    0.000 missing.py:25(isna)
      603    0.003    0.000    0.088    0.000 missing.py:105(_isna_new)
      [snip]

cProfile of my script with 0.24.0

(I took the liberty of removing paths/username information)

stats:         3434284 function calls (3423766 primitive calls) in 8642.079 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 8642.079 8642.079 ../lib/python3.7/site-packages/pandas/core/generic.py:2873(to_csv)
        1    0.000    0.000 8642.077 8642.077 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:130(save)
        1    0.042    0.042 8642.002 8642.002../lib/python3.7/site-packages/pandas/io/formats/csvs.py:272(_save)
     5242    3.588    0.001 8641.950    1.649../lib/python3.7/site-packages/pandas/io/formats/csvs.py:290(_save_chunk)
     5242    0.165    0.000 8294.922    1.582 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:2200(to_native_types)
     5242    0.044    0.000 8203.096    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1300(_get_format_datetime64_from_values)
     5242 7757.167    1.480 8203.052    1.565 ../lib/python3.7/site-packages/pandas/io/formats/format.py:1249(_is_dates_only)
    20968  445.184    0.021  445.184    0.021 {method 'reduce' of 'numpy.ufunc' objects}
     5242    0.041    0.000  444.994    0.085 {method 'sum' of 'numpy.ndarray' objects}
     5242    0.023    0.000  444.953    0.085 ../lib/python3.7/site-packages/numpy/core/_methods.py:34(_sum)
     5242  212.677    0.041  212.677    0.041 {pandas._libs.writers.write_csv_rows}
    15726  122.216    0.008  122.216    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242   90.999    0.017   91.300    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.570    0.000   79.782    0.015 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:1982(to_native_types)
    10484    0.200    0.000   43.947    0.004 ../lib/python3.7/site-packages/pandas/core/internals/blocks.py:730(to_native_types)
     5243    0.031    0.000    7.008    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:1023(to_native_types)
    ...the rest was removed...

%prun for pandas 0.23.4:

        71818 function calls (71517 primitive calls) in 52.371 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   52.371   52.371 {built-in method builtins.exec}
        1    0.000    0.000   52.371   52.371 <string>:1(<module>)
        1    0.000    0.000   52.371   52.371 frame.py:1653(to_csv)
        1    0.000    0.000   52.370   52.370 csvs.py:123(save)
        1    0.002    0.002   51.546   51.546 csvs.py:270(_save)
      301    0.245    0.001   51.543    0.171 csvs.py:288(_save_chunk)
      301   19.846    0.066   19.846    0.066 {pandas._libs.writers.write_csv_rows}
      301    0.007    0.000   18.031    0.060 internals.py:2743(to_native_types)
      301   17.654    0.059   17.698    0.059 {pandas._libs.tslib.format_array_from_datetime}
      602   13.309    0.022   13.309    0.022 {method 'astype' of 'numpy.ndarray' objects}
      301    0.006    0.000    7.480    0.025 internals.py:2052(to_native_types)
      302    0.002    0.000    5.940    0.020 base.py:2408(to_native_types)
      302    0.004    0.000    5.925    0.020 base.py:2434(_format_native_types)
        2    0.823    0.411    0.823    0.411 {method 'close' of '_io.TextIOWrapper' objects}
      301    0.001    0.000    0.293    0.001 format.py:1291(_get_format_datetime64_from_values)
      301    0.249    0.001    0.292    0.001 format.py:1243(_is_dates_only)
      603    0.001    0.000    0.081    0.000 missing.py:32(isna)
      603    0.003    0.000    0.080    0.000 missing.py:112(_isna_new)
      603    0.033    0.000    0.072    0.000 missing.py:189(_isna_ndarraylike)
     1204    0.067    0.000    0.067    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      903    0.001    0.000    0.044    0.000 _methods.py:42(_any)
      302    0.026    0.000    0.026    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      301    0.001    0.000    0.024    0.000 {method 'sum' of 'numpy.ndarray' objects}

cProfile of my script with 0.23.4

stats:         2170825 function calls (2165580 primitive calls) in 442.783 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  442.783  442.783 ../lib/python3.7/site-packages/pandas/core/frame.py:1653(to_csv)
        1    0.000    0.000  442.782  442.782 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:123(save)
        1    0.041    0.041  442.781  442.781 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:270(_save)
     5242    2.815    0.001  442.730    0.084 ../lib/python3.7/site-packages/pandas/io/formats/csvs.py:288(_save_chunk)
     5242  216.105    0.041  216.105    0.041 {pandas._libs.writers.write_csv_rows}
    15726  125.188    0.008  125.188    0.008 {method 'astype' of 'numpy.ndarray' objects}
     5242    0.121    0.000   90.872    0.017 ../lib/python3.7/site-packages/pandas/core/internals.py:2743(to_native_types)
     5242   88.329    0.017   88.590    0.017 {pandas._libs.tslib.format_array_from_datetime}
     5242    0.606    0.000   79.149    0.015 ../lib/python3.7/site-packages/pandas/core/internals.py:2052(to_native_types)
    10484    0.188    0.000   46.683    0.004 ../lib/python3.7/site-packages/pandas/core/internals.py:755(to_native_types)
     5243    0.026    0.000    7.077    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2408(to_native_types)
     5243    0.039    0.000    6.497    0.001 ../lib/python3.7/site-packages/pandas/core/indexes/base.py:2434(_format_native_types)

What is Expected

No degradation in speed.

Output of pd.show_versions() for 0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Output of pd.show_versions() for 0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.72-68.55.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    DatetimeDatetime data dtypeIO CSVread_csv, to_csvPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions