Skip to content

ENH: Excel output in non-ascii encodings #5025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

ENH: Excel output in non-ascii encodings #5025

wants to merge 7 commits into from

Conversation

jtornero
Copy link

ENH/TST: Support for non-ascii encodings in DataFrame.to_excel

Closes #3710.

Notice: Despite (for my modest knowledge of python) it should be easier to just put encoding='ascii' in the declaration of to_excel in line 1352 of frame.py, I've declared it with encoding=None as jreback suggested, and then making it default to ascii later with the if clause.

ENH/TST: Support for non-ascii encodings in DataFrame.to_excel
@@ -1396,8 +1398,11 @@ def to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='',
"""
from pandas.io.excel import ExcelWriter
need_save = False
if encoding == None:
encoding = 'ascii'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks off

@jtratner
Copy link
Contributor

@jtornero please replace all tabs with 4 spaces. Otherwise we won't be able to incorporate this.

ENH: Excel output in non-ascii encodings
Replaced tabs to follow PEP8
@jtornero
Copy link
Author

Well, I've replaced the tabs by spaces and commited again. I got some complains from Travis IC, but I can't interpretate them very well, specially those about python2. . What should I do now? A new pull request?

Thank you very much

Jorge Tornero

@jtratner
Copy link
Contributor

you don't need a new pull request, let me take a look at Travis.

@jtratner
Copy link
Contributor

It's an issue with a new excel engine we added. Going to need the expert on this. @jmcnamara - how do you handle non-ascii encoding in xlsxwriter?

@jmcnamara
Copy link
Contributor

Xlsxwriter handles utf8 at the very least. I'll have have a look at the issue.

@jmcnamara
Copy link
Contributor

The xlsxwriter issue isn't with the encoding, it handles that correctly and passes the test. The issue is that the constructor doesn't handle arbitrary keywords like the new encoding keyword.

I should have spotted that and added some mapping of user keywords to keywords supported by xlsxwriter (an earlier version of my patch had that I think).

Anyway, for now @jtornero can patch excel.py as follows to remove keyword handling from XlsxWriter while I work on a cleaner solution.

$ git diff
diff --git a/pandas/io/excel.py b/pandas/io/excel.py
index 2c82261..8b49a40 100644
--- a/pandas/io/excel.py
+++ b/pandas/io/excel.py
@@ -668,7 +668,7 @@ class _XlsxWriter(ExcelWriter):

         super(_XlsxWriter, self).__init__(path, **engine_kwargs)

-        self.book = xlsxwriter.Workbook(path, **engine_kwargs)
+        self.book = xlsxwriter.Workbook(path)

     def save(self):
         """

@jmcnamara
Copy link
Contributor

@jreback

This patch is only required for xlwt since openpyxl, xlsxwriter and pyexcelerate (I'm getting around to it) don't need it for utf8 and don't handle other encodings anyway (afaik).

So it seems like overkill to add a new keyword just for this and opening the flood gates for any other future patches that want to add configuration to specific engines.

So, instead perhaps the new parameter should be something like engine_kwargs or even just **kwargs as a generic interface to passing options into the various engines. The engines can then have their own logic for handling or ignoring options.

Or perhaps that is the way it already works and I've missed it. I've only had a chance to have a quick look.

@jtratner
Copy link
Contributor

@jmcnamara okay, that's fine. Yes, the clear answer is to not pass encoding to other engines.

Easiest way to handle this is to add to the constructor: encoding=None and not pass it forwards for engines that don't need it. Simple. Plus add a warning that non-utf8 encodings aren't supported.

@jtratner
Copy link
Contributor

@jmcnamara @jtornero To be clear, to fix this all you need to do is change the __init__ method of each ExcelWriter to the following (I agree we should just change it to kwargs):

def __init__(self, io, encoding=None, **kwargs):

I'm not convinced we need to validate kwargs or anything like that (yet).

@jtratner
Copy link
Contributor

One additional note - before this PR, you could pass encoding to xlwt without issue right? So this is really just a testing thing. The other easy way to do this would be to have the xlwt engine set a default for encoding, and then set read_excel to do:

if encoding:
    kwargs['encoding'] = encoding

And that way it only gets passed to the constructor if necessary. I might prefer this second option because it means that constructors don't need to deal with extra keyword arguments.

@unutbu
Copy link
Contributor

unutbu commented Oct 1, 2013

@jtornero: The Travis test that failed is being run under Python3.2. That version of Python does not accept the u'...' syntax for strs (or what's called unicode in Python2). The u'...' syntax was added back in Python3, which is why the last Travis test passed.

To fix, add

from pandas.compat import u

to the top of test_excel.py, and replace u'...' with u('...').

@jtornero
Copy link
Author

jtornero commented Oct 2, 2013

Guess it's all done

@jtratner
Copy link
Contributor

jtratner commented Oct 2, 2013

okay I'll take look in the next few days. No major changes to Excel so should be fine.

@@ -544,7 +547,11 @@ def __init__(self, path, **engine_kwargs):

super(_XlwtWriter, self).__init__(path, **engine_kwargs)

self.book = xlwt.Workbook()
if 'encoding' in engine_kwargs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should have __init__ with encoding=None and then add

if encoding is None:
    encoding = 'ascii'
self.book = xlwt.Workbook(encoding=encoding)

@jreback
Copy link
Contributor

jreback commented Jan 3, 2014

@jtratner what's the state of this?

@jtornero
Copy link
Author

jtornero commented Jan 3, 2014

Well I did the PR and just waiting.

@jtratner
Copy link
Contributor

jtratner commented Jan 4, 2014

Please rebase this on current master and then we can make sure everything passes

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

pls rebase on master and can see where this is

@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@jtornero needs a rebase

@jtornero
Copy link
Author

@jreback @jtratner I'm so sorry but I am absolutely lost about this. I guess I have to clone my forked repo again, then rebase (don't know the steps clearly)? And then? I'm so sorry this is disappointing but my git skills are unfortunately too low!!

@jreback
Copy link
Contributor

jreback commented Mar 10, 2014

I rebased you to this commit: pls incorporate this and go from here

jreback@1478660

@jtornero
Copy link
Author

@jreback can you take a look at https://gist.github.com/jtornero/1324e77425715cdfe987

Best regards,

Jorge Tornero

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@jtornero just post here what you need

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

@jtornero
Copy link
Author

@jreback Well... what it is supposed that I have to expect and what it is supposed from me to do? I mean: when I rebase, I should run the tests again? or just make a new PR? I will clone my forked repo again, because the "original" one (where I did the work that origined my PR) is not available anymore for me. Does it matter?

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

in this case just confirm that the version I out up looks ok
if so I can just directly incorporate it

if not
then create a new branch
pull this commit in
then make changes and submit a new pr

@jtornero
Copy link
Author

So steps are

  1. clone my forked repo
  2. Because it is already rebased (you did it yesterday), run the tests
  3. It tests are ok, post here

If not:

  1. create branch
  2. pull this commit in???
  3. make whatever changes are neccessary to pass the tests and in that case, make PR

@jreback
Copy link
Contributor

jreback commented Mar 11, 2014

you don't need to fork again, just create a new branch

like this:

git checkout -b new_excel_branch upstream/master
git pull https://github.com/jreback/pandas.git excel_encoding

then you can work with the new branch

I am suggesting this as sort of an exercise as the fix is fine, but this is nice to know how to do

@jtornero
Copy link
Author

Well, I've cloned and installed and updated some libraries and run the tests. Some errors arise related to google and timeseries (see the nose output at https://gist.github.com/jtornero/9496276)

It is ok then?

@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

hmm those are a bit odd
what version of numpy do u have?

@jtornero
Copy link
Author

@ jreback Well I my first attempts complainted with lots of messages sort of "compiled against version 9 but your numpy version y 7" so I installed 1.8.0 with pip, so currently I'm using 1.8.0. I'll try to compile it from source and see what happens

@jtornero
Copy link
Author

@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

not sure what you are doing
tests pass on 0.13.1 for a properly installed version?
can u show print_versions?

@jtornero
Copy link
Author

@jtornero
Copy link
Author

So first thing is to install AT LEAST xlrd and xlwt (sorry, I thought this system already had it)

Will update all when installed

@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

you are running with an old version of master
you need to

git pull upstream/master

@jtornero
Copy link
Author

@jreback This is... what is this?!?!?

Well

Cloned my repo
Added upstream
now

git checkout -b new_excel_branch upstream/master

fatal: Cannot update paths and switch to branch 'new_excel_branch' at the same time.
Did you intend to checkout 'upstream/master' which can not be resolved as commit?

OMG!!! I think I need a shrink!!

@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

merged via 268ee80

thanks for the PR.

maybe next time will be a bit easier.

@jreback jreback closed this Mar 12, 2014
@jtornero
Copy link
Author

@jreback Thank you very much for all your patience. For, say so, very amateurs, it is reasonablely easy to try to improve things, but the big white git shark really annoys us (or, at least, me). You've suffered the bite, also... my apologies.

Best regards,

Jorge Tornero

@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

git can be a bit of a beast.....

try looking here for a sample workflow: https://github.com/pydata/pandas/wiki/Git-Workflows

give a try again!

always appreciate contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants