Description
Problem description
The openpxyl engine for ExcelWriter's append mode allows us to write data into a new Worksheet and append it to an existing Workbook. However, when trying to append data into an existing Worksheet, it creates a new blank Worksheet (without any existing data) with the Worksheet's name suffixed numerically incremented.
Sample Code
import pandas as pd
from openpyxl import load_workbook
excel_filename = 'excel_file.xlsx'
# Create initial excel file
initial_data = pd.DataFrame({'col1': ['A', 'B'], 'col2': [1, 2]})
with pd.ExcelWriter(excel_filename, engine='openpyxl', mode='w') as writer:
initial_data.to_excel(writer, sheet_name="sheet", index=False)
# Append some new rows into an existing worksheet
new_data = pd.DataFrame({'col1': ['C', 'D'], 'col2': [3, 4]})
workbook = load_workbook(excel_filename)
with pd.ExcelWriter(excel_filename, engine='openpyxl', mode='a') as writer:
new_data.to_excel(writer,
sheet_name="sheet",
startrow=workbook['sheet'].max_row,
header=False,
index=False)
Expected Output
We would expect the Excel file to contain a Worksheet named "sheet", with the following data:
col1 | col2 |
---|---|
A | 1 |
B | 2 |
C | 3 |
D | 4 |
Actual Output
However, the Excel file contains two Worksheets, namely "sheet" and "sheet1", with the following data:
sheet
:
col1 | col2 |
---|---|
A | 1 |
B | 2 |
sheet1
:
C | 3 |
D | 4 |
Explanation
When the ExcelWriter
object is instantiated, it assumes a new Empty Workbook with no Worksheets (see ExcelWriter: line 701).
pandas/pandas/io/excel/_base.py
Line 701 in 171c716
So when we try to write data, it always creates a new blank Worksheet instead (see _OpenpyxlWriter: line 400-403).
pandas/pandas/io/excel/_openpyxl.py
Lines 400 to 403 in 171c716
In addition, the openpxyl library performs a check before writing to "avoid duplicate names" (see openpxyl: line 18), which numerically increment the suffix of the sheet name.
Existing solution
An existing solution is found in Stack Overflow (see here).
It is a simple fix with just one line to add all existing worksheets after the ExcelWriter
object is instantiated (see Sample Code above):
writer.sheets = dict((ws.title, ws) for ws in writer.book.worksheets)
It would be preferred if this is done in the backend, right after the workbook is loaded here:
pandas/pandas/io/excel/_openpyxl.py
Lines 23 to 27 in 171c716
It will improve usability of the DataFrame.to_excel
function.
However, some additional flags should be added to warn the user that existing data in existing worksheets may be overwritten (if the startrow
or startcol
parameter is stated wrongly by accident). Perhaps a new default parameter overwrite=false
should be created for DataFrame.to_excel
to only write data to empty regions (by performing bound checks for startrow
and startcol
).
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: 2.8.5
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.23.4
numpy: 1.16.4
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml.etree: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: 0.8.0
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None