Difference between pickle.py and _pickle for certain strings #113028

Closed
@jeff5

Bug report

Bug description:

There is a logical error in pickle.Pickler.save_str for protocol 0, such that it repeats pickling of a string object each time it is presented. The design clearly intends to re-use the first pickled representation, and the C-implementation _pickle does that.

In an implementation that does not provide a compiled _pickle (PyPy may be one) this is inefficient, but not actually wrong. The intended behaviour occurs with a simple string:

>>> s = "hello"
>>> pickle._dumps((s,s), 0)
b'(Vhello\np0\ng0\ntp1\n.'

When read by loads(), this pickle says:

  1. stack "hello",
  2. save a copy in memory 0,
  3. stack the contents of memory 0,
  4. make a tuple from the stack,
  5. save a copy in memory 1.
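The same reading can be checked mechanically. The following is a minimal sketch that walks the opcodes of the pickle shown above with the standard `pickletools.genops`; the variable names `data` and `ops` are illustrative only.

```python
import pickletools

# The protocol 0 pickle of (s, s) for s = "hello", copied from the session above.
data = b'(Vhello\np0\ng0\ntp1\n.'

# genops yields (opcode, argument, position) for each instruction in the pickle.
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(data)]

for name, arg in ops:
    print(name, arg)
```

The leading MARK and trailing STOP frame the five steps listed above; PUT is "save a copy in memory" and GET is "stack the contents of memory".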

The bug emerges when the pickled string needs pre-encoding:

>>> s = "hello\n"
>>> pickle._dumps((s,s), 0)
b'(Vhello\\u000a\np0\nVhello\\u000a\np1\ntp2\n.'

Here we see identical data stacked and saved twice (in memory 0 and memory 1), with neither saved copy ever used. The problem is here:
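To illustrate "inefficient, but not actually wrong", here is a small sketch that loads the duplicated pickle next to a memo-reusing one. The bytes in `reusing` are written by hand for comparison (an assumption about what the intended output looks like, not output quoted from this report).

```python
import pickle

# Output of pickle._dumps((s, s), 0) for s = "hello\n", as reported above.
duplicated = b'(Vhello\\u000a\np0\nVhello\\u000a\np1\ntp2\n.'

# Hand-written counterpart that stacks memory 0 instead of repeating the string
# (assumed shape of the intended pickle, not taken from the report).
reusing = b'(Vhello\\u000a\np0\ng0\ntp1\n.'

from_duplicated = pickle.loads(duplicated)
from_reusing = pickle.loads(reusing)

# Both load to an equal tuple, so the duplicated form is still a valid pickle.
print(from_duplicated == from_reusing)

# With GET, both tuple items are the very same object taken from the memo.
print(from_reusing[0] is from_reusing[1])

# The duplicated form is simply longer.
print(len(duplicated), len(reusing))
```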

cpython/Lib/pickle.py

Lines 860 to 866 in 42a86df

obj = obj.replace("\\", "\\u005c")
obj = obj.replace("\0", "\\u0000")
obj = obj.replace("\n", "\\u000a")
obj = obj.replace("\r", "\\u000d")
obj = obj.replace("\x1a", "\\u001a")  # EOF on DOS
self.write(UNICODE + obj.encode('raw-unicode-escape') +
           b'\n')

where the value returned by obj.replace may be a different object from obj. In CPython, that happens only if a replacement actually takes place, which is why the problem only appears in the second case above.
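A quick way to see this identity behaviour is sketched below. Returning the original object when nothing is replaced is a CPython implementation detail, so the first result may differ on other implementations; the variable names are illustrative.

```python
plain = "hello"
needs_escape = "hello\n"

# No newline to replace: CPython hands back the original object.
unchanged = plain.replace("\n", "\\u000a")

# A replacement happens: the result is necessarily a new object.
escaped = needs_escape.replace("\n", "\\u000a")

print(unchanged is plain)       # implementation-dependent; True on CPython
print(escaped is needs_escape)  # False: the contents differ

# The memo is keyed by id(), so the escaped copy gets a different key.
print(id(escaped) == id(needs_escape))
```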

save_str is only called when the object has not already been memoized, but in the cases at issue, the string memoized is not the original object, and so when the original string object is presented again, save_str is called again.
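The mechanism can be reproduced in isolation. The sketch below is a toy model, not the code in pickle.py: `ToyPickler`, `escape` and the `memoize_original` flag are hypothetical names invented for this illustration. It keeps an id-keyed memo the way the description above says the pickler does, and shows the effect of memoizing the escaped copy versus the original string.

```python
def escape(s):
    # Same replacements as the quoted lines, returning a possibly new object.
    s = s.replace("\\", "\\u005c")
    s = s.replace("\0", "\\u0000")
    s = s.replace("\n", "\\u000a")
    s = s.replace("\r", "\\u000d")
    s = s.replace("\x1a", "\\u001a")
    return s


class ToyPickler:
    """Hypothetical stand-in for the memo logic only; not pickle.Pickler."""

    def __init__(self, memoize_original):
        self.memo = {}   # id(obj) -> (index, obj), keeping obj alive
        self.out = []    # emitted protocol 0 fragments
        self.memoize_original = memoize_original

    def save_str(self, obj):
        if id(obj) in self.memo:
            # Already seen: stack the saved copy instead of re-pickling.
            self.out.append(b"g%d\n" % self.memo[id(obj)][0])
            return
        escaped = escape(obj)
        self.out.append(b"V" + escaped.encode("raw-unicode-escape") + b"\n")
        # The behaviour reported here corresponds to memoizing `escaped`.
        target = obj if self.memoize_original else escaped
        idx = len(self.memo)
        self.out.append(b"p%d\n" % idx)
        self.memo[id(target)] = idx, target


s = "hello\n"

buggy = ToyPickler(memoize_original=False)
buggy.save_str(s)
buggy.save_str(s)

fixed = ToyPickler(memoize_original=True)
fixed.save_str(s)
fixed.save_str(s)

print(b"".join(buggy.out))
print(b"".join(fixed.out))
```

Writing the escaped text while memoizing the original object is one possible remedy under these assumptions; it makes the second presentation of `s` hit the memo.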

Depending upon the detailed behaviour of str.replace (in particular, if an implementation chooses to return an interned value when the result is, say, a single Latin-1 character), an assertion may fail in memoize():

cpython/Lib/pickle.py

Lines 504 to 507 in 42a86df

assert id(obj) not in self.memo
idx = len(self.memo)
self.write(self.put(idx))
self.memo[id(obj)] = idx, obj

I have not managed to trigger an AssertionError in CPython.

This has probably gone unnoticed for so long only because pickle.py is not tested. (At least, I think it isn't. #105250 and #53350 note this coverage problem.)

CPython versions tested on:

3.11

Operating systems tested on:

Windows
