#### Description

- [x] I have checked that this issue has not already been reported. There are some historic "to_csv is slow" reports, but none mention the specific behaviour I've seen with regard to indices.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
#### Code Sample, a copy-pastable example

```python
import cProfile
import pstats
import time
from io import BytesIO, StringIO

import numpy as np
import pandas as pd

NUM_INDEX_COLS = 3
NUM_COLS = 50
NUM_ROWS = 1000000

# Generate a large-ish dummy dataframe
data = {}
idx_cols = []
for idx_id in range(NUM_INDEX_COLS):
    idx_name = f"Index_{idx_id}"
    idx_cols.append(idx_name)
    if idx_id == 0:
        data[idx_name] = np.random.randint(0, NUM_ROWS, NUM_ROWS)
    else:
        data[idx_name] = np.full(NUM_ROWS, 1, dtype=np.int64)
for col_id in range(NUM_COLS):
    col_name = f"Column_{col_id}"
    data[col_name] = np.random.uniform(0, 100000.0, NUM_ROWS)
source_df = pd.DataFrame(data)
source_df_indexed = source_df.set_index(idx_cols)

# Default index - fast
df1 = source_df.head(10000)
# MultiIndex from idx_cols, then take head - slow
df2 = source_df_indexed.head(10000)
# Take head, then set MultiIndex from idx_cols - fast
df3 = source_df.head(10000).set_index(idx_cols)
# Deep copy of df2 - slow (docs say that the Index is not copied, so not surprising)
df4 = df2.copy(deep=True)

# Will run each df twice, and take timings from the second run.
dataframes = [df1, df1, df2, df2, df3, df3, df4, df4]

#### WRITE TO CSV WITH EXISTING INDEX
#
# df1: 0.53s, with no index set
# df2: 2.35s, with index set before calling .head()
# df3: 0.54s, with index set after calling .head()
# df4: 2.12s, deep copy of df2
#
for i, df in enumerate(dataframes):
    out = StringIO()
    t1 = time.time()
    df.to_csv(out)
    t2 = time.time()
    if (i % 2) == 1:
        print(t2 - t1)

#### WRITE TO CSV WITH INCLUDED CALL TO RESET INDEX
#
# df1: 0.55s, with no index set
# df2: 0.53s, with index set before calling .head()
# df3: 0.54s, with index set after calling .head()
# df4: 0.54s, deep copy of df2
for i, df in enumerate(dataframes):
    out = StringIO()
    t1 = time.time()
    df.reset_index().to_csv(out)
    t2 = time.time()
    if (i % 2) == 1:
        print(t2 - t1)
```
#### Problem description

The speed of `to_csv` varies significantly depending on whether an Index was set, and when/how that Index was set. This is not the case with e.g. `to_parquet`. The original case where I noticed this speed difference is not so trivial, but it involves a MultiIndexed dataframe that has been manipulated before being output to CSV, taking approximately 50 minutes to output a ~30GB file. (Yes, CSV is slow and horrible for many other reasons even at the best of times, and we avoid it as much as possible.)
In the example above, I create a million-row dataframe `source_df`, with three integer columns (two have the value `1` throughout), and 50 float columns. I then create four subsets of the first 10000 rows, doing different things with the index:

- df1: Just the source dataframe, no index modification (so it'll have the default RangeIndex)
- df2: Set the three integer columns as a MultiIndex, and then take `.head(10000)`
- df3: Take `.head(10000)`, and then set the three integer columns as a MultiIndex
- df4: Deep copy of df2. (The pandas documentation suggests that at least the index won't actually be copied.)
I then time how long it takes to write each out to CSV (using a `StringIO`, to avoid I/O speed concerns). df1 (no index) and df3 (index set after taking `.head`) are fast; df2 (index set before taking `.head`) and df4 (deep copy of df2) are much slower. It appears that an index that was sliced from a larger dataframe's index significantly slows down the writing.

I then do a separate test, whereby I reset the index of each dataframe immediately before writing. In this case, all four write out as fast as df1/df3 in the original test.
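If the slowdown really does come from `.head()` leaving the full million-row level arrays attached to the sliced MultiIndex (I haven't confirmed this is the mechanism), then trimming the levels with `MultiIndex.remove_unused_levels` may be a less drastic workaround than a full `reset_index`. A small sketch of what I mean:

```python
import numpy as np
import pandas as pd

# Small-scale illustration: after .head(), the MultiIndex still carries
# the parent's full level arrays, even though only a few codes are used.
df = pd.DataFrame({
    "Index_0": np.arange(1000),
    "Index_1": np.full(1000, 1),
    "Column_0": np.random.uniform(0, 1, 1000),
}).set_index(["Index_0", "Index_1"])

sub = df.head(10)
print(len(sub.index.levels[0]))  # 1000 - the full parent level survives

# remove_unused_levels() rebuilds the levels from only the codes in use
trimmed = sub.copy()
trimmed.index = trimmed.index.remove_unused_levels()
print(len(trimmed.index.levels[0]))  # 10
```

I haven't benchmarked whether writing `trimmed` to CSV recovers the df3 timings, but it would be a cheap experiment.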
I have also tested with `to_parquet`, by changing `to_csv` to `to_parquet` and providing a `BytesIO`. All four dataframes are output in 0.08s, regardless of index origin and whether the index is reset or not.
Surprisingly (for me), profiling indicates that `{method 'astype' of 'numpy.ndarray' objects}` is where the extra time originates. There aren't many extra calls - the calls just take much, much longer. If you're lucky, this might even be an upstream issue!

Summarised profiling output for df1, df2, and df3 in the don't-reset-index case is below.
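For reference, the summaries below were gathered with something along these lines (the helper name is just for illustration; it uses only the stdlib `cProfile`/`pstats` imported in the code sample above):

```python
import cProfile
import pstats
from io import StringIO


def profile_to_csv(df, top=20):
    """Profile a single df.to_csv call and print the top functions by tottime."""
    out = StringIO()
    profiler = cProfile.Profile()
    profiler.enable()
    df.to_csv(out)
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("tottime")
    stats.print_stats(top)
    return stats
```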
**df1 - Default index, fast**

```
1928 function calls (1910 primitive calls) in 0.530 seconds

   Ordered by: internal time
   List reduced from 126 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12    0.285    0.024    0.285    0.024 {method 'astype' of 'numpy.ndarray' objects}
        6    0.242    0.040    0.242    0.040 {pandas._libs.writers.write_csv_rows}
       19    0.001    0.000    0.001    0.000 missing.py:193(_isna_ndarraylike)
        6    0.000    0.000    0.530    0.088 csvs.py:330(_save_chunk)
    20/13    0.000    0.000    0.000    0.000 {built-in method numpy.array}
  384/382    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      278    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        6    0.000    0.000    0.282    0.047 blocks.py:1938(to_native_types)
        1    0.000    0.000    0.530    0.530 csvs.py:313(_save)
      200    0.000    0.000    0.000    0.000 generic.py:10(_check)
       12    0.000    0.000    0.000    0.000 blocks.py:292(getitem_block)
       19    0.000    0.000    0.001    0.000 missing.py:130(_isna)
        6    0.000    0.000    0.000    0.000 generic.py:3551(_slice)
       12    0.000    0.000    0.000    0.000 range.py:697(__getitem__)
        6    0.000    0.000    0.000    0.000 managers.py:757(get_slice)
       19    0.000    0.000    0.000    0.000 dtypes.py:903(is_dtype)
       12    0.000    0.000    0.000    0.000 range.py:153(_data)
        7    0.000    0.000    0.004    0.001 base.py:984(_format_native_types)
        6    0.000    0.000    0.000    0.000 {built-in method numpy.arange}
        6    0.000    0.000    0.001    0.000 blocks.py:656(to_native_types)
```
**df2 - MultiIndex set before head, slow**

```
10629 function calls (10518 primitive calls) in 2.135 seconds

   Ordered by: internal time
   List reduced from 182 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    1.630    0.082    1.630    0.082 {method 'astype' of 'numpy.ndarray' objects}
        5    0.247    0.049    0.247    0.049 {pandas._libs.writers.write_csv_rows}
   122/91    0.163    0.001    0.163    0.002 {built-in method numpy.array}
        6    0.042    0.007    1.588    0.265 base.py:954(to_native_types)
       15    0.027    0.002    0.027    0.002 {pandas._libs.lib.infer_dtype}
        1    0.016    0.016    2.135    2.135 csvs.py:313(_save)
       15    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
       21    0.001    0.000    0.002    0.000 missing.py:193(_isna_ndarraylike)
2323/2321    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
        5    0.000    0.000    0.001    0.000 {pandas._libs.lib.fast_zip}
    30/15    0.000    0.000    0.192    0.013 base.py:293(__new__)
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       16    0.000    0.000    1.350    0.084 base.py:984(_format_native_types)
        5    0.000    0.000    2.119    0.424 csvs.py:330(_save_chunk)
        5    0.000    0.000    1.546    0.309 multi.py:1201(_format_native_types)
      910    0.000    0.000    0.000    0.000 generic.py:10(_check)
       90    0.000    0.000    0.000    0.000 base.py:498(_shallow_copy)
      105    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
      171    0.000    0.000    0.000    0.000 common.py:1460(is_extension_array_dtype)
      217    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
```
**df3 - MultiIndex set after head, fast**

```
10629 function calls (10518 primitive calls) in 0.554 seconds

   Ordered by: internal time
   List reduced from 182 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.301    0.015    0.301    0.015 {method 'astype' of 'numpy.ndarray' objects}
        5    0.243    0.049    0.243    0.049 {pandas._libs.writers.write_csv_rows}
   122/91    0.002    0.000    0.002    0.000 {built-in method numpy.array}
       15    0.001    0.000    0.001    0.000 {pandas._libs.lib.infer_dtype}
       21    0.001    0.000    0.001    0.000 missing.py:193(_isna_ndarraylike)
2323/2321    0.000    0.000    0.001    0.000 {built-in method builtins.isinstance}
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        6    0.000    0.000    0.028    0.005 base.py:954(to_native_types)
    30/15    0.000    0.000    0.004    0.000 base.py:293(__new__)
        5    0.000    0.000    0.000    0.000 {pandas._libs.lib.fast_zip}
        1    0.000    0.000    0.554    0.554 csvs.py:313(_save)
        5    0.000    0.000    0.554    0.111 csvs.py:330(_save_chunk)
      910    0.000    0.000    0.000    0.000 generic.py:10(_check)
        5    0.000    0.000    0.027    0.005 multi.py:1201(_format_native_types)
       15    0.000    0.000    0.000    0.000 {pandas._libs.algos.take_1d_object_object}
      171    0.000    0.000    0.000    0.000 common.py:1460(is_extension_array_dtype)
       90    0.000    0.000    0.000    0.000 base.py:498(_shallow_copy)
      105    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
      171    0.000    0.000    0.000    0.000 base.py:413(find)
       15    0.000    0.000    0.001    0.000 algorithms.py:1616(take_nd)
```
The simplest (but silliest) suggestion would just be to call `.reset_index()` within `df.to_csv()` if `index=True` is set. However, I suspect this indicates something more fundamental that can be improved.
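Until the underlying cause is fixed, the measurements above suggest a caller-side workaround along these lines (the helper name `fast_to_csv` is hypothetical, purely for illustration):

```python
import pandas as pd


def fast_to_csv(df, path_or_buf=None, **kwargs):
    """Workaround sketch: materialise the index as ordinary columns before
    writing, so to_csv never formats the (possibly oversized) Index itself.

    The timings above show plain data columns are written at full speed,
    and that the CSV text is identical either way.
    """
    write_index = kwargs.pop("index", True)
    if write_index and not isinstance(df.index, pd.RangeIndex):
        # reset_index turns the index levels into columns; index=False then
        # skips the slow index-formatting path entirely
        return df.reset_index().to_csv(path_or_buf, index=False, **kwargs)
    return df.to_csv(path_or_buf, index=write_index, **kwargs)
```

Note this only matches `to_csv`'s default output for unnamed-free, fully-named index levels; a general fix inside pandas would of course be preferable.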
#### Expected Output

Speed of outputting to CSV does not vary significantly depending on how the index was created, as is the case when outputting to Parquet, etc.
#### Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit           : db08276
python           : 3.6.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1127.19.1.el7.x86_64
Version          : #1 SMP Thu Aug 20 14:39:03 CDT 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.1.3
numpy            : 1.18.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 45.2.0
Cython           : 0.29.21
pytest           : 5.4.3
hypothesis       : 5.16.0
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.5.2
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext)
jinja2           : 2.11.2
IPython          : None
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.6.2
fastparquet      : None
gcsfs            : None
matplotlib       : 3.1.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.2.1
sqlalchemy       : None
tables           : 3.5.2
tabulate         : 0.8.6
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : 0.51.0
```