Bug report
Bug description:
There is a logical error in pickle.Pickler.save_str for protocol 0, such that it repeats pickling of a string object each time it is presented. The design clearly intends to re-use the first pickled representation, and the C implementation _pickle does that.
In an implementation that does not provide a compiled _pickle (PyPy may be one), this is inefficient, but not actually wrong. The intended behaviour occurs with a simple string:
>>> s = "hello"
>>> pickle._dumps((s,s), 0)
b'(Vhello\np0\ng0\ntp1\n.'
When read by loads(), this string says:
- stack "hello",
- save a copy in memory 0,
- stack the contents of memory 0,
- make a tuple from the stack,
- save a copy in memory 1.
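The steps above can be checked with the standard pickletools module. This is a minimal sketch that decodes the exact bytes shown above; it does not depend on which pickler produced them.

```python
import pickle
import pickletools

# The protocol 0 pickle from the example above.
data = b'(Vhello\np0\ng0\ntp1\n.'

# genops yields (opcode, argument, position) for each instruction.
opcodes = [op.name for op, arg, pos in pickletools.genops(data)]
print(opcodes)

# GET pushes the memoized object itself, so both tuple items
# are one and the same string object after loading.
result = pickle.loads(data)
print(result, result[0] is result[1])
```

PUT corresponds to "save a copy in memory" and GET to "stack the contents of memory" in the list above.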
The bug emerges when the pickled string needs pre-encoding:
>>> s = "hello\n"
>>> pickle._dumps((s,s), 0)
b'(Vhello\\u000a\np0\nVhello\\u000a\np1\ntp2\n.'
Here we see identical data stacked and saved (but not used). The problem is here:
Lines 860 to 866 in 42a86df
where the return from obj.replace may be a different object from obj. In CPython, that happens only if a replacement takes place, which is why the problem only appears in the second case above.
save_str is only called when the object has not already been memoized, but in the cases at issue, the string memoized is not the original object, and so when the original string object is presented again, save_str is called again.
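The mechanism can be illustrated with a toy model. This is a sketch only, not the code in pickle.py: ToyPickler, its rebind flag and the single replace call are hypothetical simplifications. It assumes, as in pickle.py, that the memo is keyed by id(obj) and keeps a reference to the memoized object.

```python
class ToyPickler:
    """Hypothetical stand-in for the memo logic; not pickle.Pickler."""

    def __init__(self, rebind):
        self.memo = {}        # id(obj) -> (index, obj); obj is kept alive
        self.ops = []         # opcode names emitted so far
        self.rebind = rebind  # True: memoize the result of replace()

    def save(self, obj):
        entry = self.memo.get(id(obj))
        if entry is not None:
            self.ops.append("GET")      # re-use the earlier representation
            return
        encoded = obj.replace("\n", "\\u000a")  # pre-encoding step
        self.ops.append("UNICODE")
        # Memoizing the encoded string loses the identity of the original.
        self.memoize(encoded if self.rebind else obj)

    def memoize(self, obj):
        assert id(obj) not in self.memo
        self.memo[id(obj)] = (len(self.memo), obj)
        self.ops.append("PUT")


s = "hello\n"

rebinding = ToyPickler(rebind=True)
rebinding.save(s)
rebinding.save(s)
print(rebinding.ops)   # the string is written twice

preserving = ToyPickler(rebind=False)
preserving.save(s)
preserving.save(s)
print(preserving.ops)  # the second presentation becomes a GET
```

When the memoized object is the encoded copy, the lookup for the original string misses every time, which matches the duplicated output in the second example.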
Depending upon the detailed behaviour of str.replace (in particular, if you decide to return an interned value when the result is, say, a Latin-1 character), an assertion may fail in memoize():
Lines 504 to 507 in 42a86df
AssertionError
in CPython.
This has probably gone unnoticed so long only because pickle.py is not tested. (At least, I think it isn't. #105250 and #53350 note this coverage problem.)
CPython versions tested on:
3.11
Operating systems tested on:
Windows