Skip to content

BUG: some read_excel engines still load trailing blank cells #41167

Closed
@ahawryluk

Description

@ahawryluk
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

pd.read_excel('trailing_blanks.xlsx').shape

Test files available here.

trailing_blanks

I loaded this sample spreadsheet in all five filetypes to see what shape is returned:

filetype pandas 1.2.x master
xls (3, 3) (3, 3)
xlsx (11, 6) (3, 6)
xlsm (11, 6) (3, 6)
xlsb (11, 6) (11, 6)
ods (3, 3) (3, 3)

Expected Output

(3, 3)

Problem description

Spreadsheet files can contain cells with no values, such as the formatted blank cells shown below. In the worst cases, the cell used to contain a value or formatting but currently contains neither and is thus invisible to the user of the spreadsheet application. The trailing rows of NaNs in xls[x|m] files were recently resolved in #39547 by @rhshadrach but the additional columns still remain. In a case of the size pictured here, it's a minor annoyance, but I've come across spreadsheets in the wild where one more cells were unintentionally created on row 2**20 or column 2**14, which can create a severe performance issues. To fix the performance issue, the empty rows/columns must be trimmed before the data is passed to TextParser.

I suspect that these bug reports are related:
#40569
#40976

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 1e91282
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.3.0.dev0+1412.g1e912821c0
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.2
hypothesis : 6.1.1
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.4
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : 1.0.8
s3fs : 0.5.2
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0

Metadata

Metadata

Assignees

Labels

BugIO Excelread_excel, to_excelPerformanceMemory or execution speed performance

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions