
bpo-34454: fix crash in .fromisoformat() methods when given inputs with surrogate code points #8859


Conversation

taleinat
Contributor

@taleinat taleinat commented Aug 22, 2018

The current implementation crashes if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8.

It is important to note that the pure-Python implementation allows using a surrogate as the separator character when calling datetime.fromisoformat().
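To make the failure mode concrete, here is a minimal sketch (plain Python, not the C implementation) of why an eager UTF-8 encode cannot handle these inputs: a lone surrogate code point has no UTF-8 encoding, even though it is a legal separator for the pure-Python parser. The helper name `utf8_encodable` is illustrative, not from the patch:

```python
# Illustrative helper (not part of the patch): checks whether a string
# can survive the eager UTF-8 encode the C implementation performs.
def utf8_encodable(s: str) -> bool:
    try:
        s.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False

assert utf8_encodable('2018-08-22T13:13:00')           # plain ASCII: fine
assert not utf8_encodable('2018-08-22\ud80013:13:00')  # surrogate separator: unencodable
```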

As mentioned in a comment on the bpo issue, there's actually no need to encode the input as UTF-8.

This implementation uses PyUnicode_1BYTE_DATA() when possible. In other cases (which should be extremely rare!), it creates temporary intermediate Python objects. For date and time, a single temporary bytes object is created. For datetime, up to four temporary objects are created: two str objects (for sub-strings) and two bytes objects (for converting the sub-strings to ASCII). This increases the size and complexity of the code considerably, but I couldn't find a reasonable way to avoid creating these objects entirely.

https://bugs.python.org/issue34454

@taleinat taleinat added the type-bug An unexpected behavior, bug, or error label Aug 22, 2018
@taleinat
Contributor Author

@pganssle, your review and comments would be welcome!

This breaks if a surrogate Unicode code point is used as
the separator character in the input string.
@taleinat taleinat force-pushed the bpo-34454/fromisoformat_avoid_utf8_encoding branch from 978ccf3 to 6df3ca9 Compare August 22, 2018 13:13
@pganssle
Member

@taleinat Can we add some tests for the new failure behavior? Might want to coordinate with @izbyshev so as not to step on one another's toes about that, but I'd like to see tests for some of these unprintable characters, both as the separator (passing) and in the date, time and datetime strings (failing).

This also needs a news entry, but other than that it's essentially what I would have done (probably better, honestly - I didn't even know about %R).

Thanks for making this PR! I'm glad to have one item off my backlog. 😄
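A sketch of the kind of tests being requested, assuming the fixed behavior (the function names here are hypothetical, not the ones that eventually landed in CPython's test suite): a surrogate is acceptable as the separator, but must be rejected anywhere inside the date/time fields:

```python
from datetime import datetime

# Hypothetical test sketches, not CPython's actual test suite.
def test_surrogate_separator_roundtrips():
    dt = datetime(2018, 8, 22, 13, 13)
    for sep in ('\ud800', '\u8000', 'T', ' '):
        # Any single character is allowed as the separator.
        assert datetime.fromisoformat(dt.isoformat(sep=sep)) == dt

def test_surrogate_inside_fields_rejected():
    for bad in ('\ud8002018-08-22T13:13', '2018-08-22T13:13\ud800'):
        try:
            datetime.fromisoformat(bad)
        except ValueError:
            continue  # expected: surrogate inside a field is invalid
        raise AssertionError(f'{bad!r} unexpectedly parsed')

test_surrogate_separator_roundtrips()
test_surrogate_inside_fields_rejected()
```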

if (bytes == NULL) {
return NULL;
}
const char * p = PyBytes_AS_STRING(bytes);
Member

Actually I missed this. I think it may be better to use PyUnicode_AsUTF8 to avoid creating an unnecessary intermediate bytes object. If there are non-ASCII characters in the string, the parse step will fail.

Contributor Author

But then it will still fail on surrogate Unicode code points, which, after additional reading, it apparently should.

It is strange to me that there is a way to directly get a char * for UTF-8-encoded strings, but not for ASCII.

Indeed the overhead here is likely too high, but using PyUnicode_AsUTF8 here feels like a hack... maybe there's no better option though.

Member

I think it's fine. ASCII is a subset of UTF-8. The actual requirements just happen to exclude non-ASCII characters from valid ISO 8601 strings, so an encoding error conveniently serves as a proxy for "this is not a valid ISO 8601 string".

There's no need to separately validate that it is both valid ASCII and a valid ISO 8601 string, because we only care if it's a valid ISO 8601 string. We should just do whatever is the most readable and performant way to figure that out (which, IMO, is "convert to UTF8, then try to parse it").
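The "encoding error as proxy" argument can be checked directly in Python (a sketch of the reasoning, not code from the PR): a valid ISO string encodes identically under ASCII and UTF-8, and any non-ASCII character shows up as non-ASCII bytes that a byte-level parser rejects anyway:

```python
# Every character a valid ISO 8601 string may contain is ASCII,
# and ASCII is a subset of UTF-8, so the two encodings agree.
valid = '2018-08-22T13:13:00'
assert valid.encode('utf-8') == valid.encode('ascii')

# A non-ASCII character becomes a multi-byte UTF-8 sequence,
# i.e. bytes a strict ASCII-only parser would reject.
bad = '2018\u201308\u201322'  # en-dashes instead of hyphens
assert not bad.encode('utf-8').isascii()
```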

Contributor Author

It seems that for ASCII-only strings, which should account for nearly all actual inputs in real use cases, there is a fast shortcut available to get a char *. I'll work up an update using that.

@pganssle
Member

Also, similar changes to the UnicodeEncodeError handling need to be made in time_fromisoformat and date_fromisoformat.

@pganssle
Member

pganssle commented Aug 22, 2018

I think all the failures are because you create the temporary bytes object. Here's an annotated version of the function:

substr_bytes = PyUnicode_AsASCIIString(substr);     // substr_bytes refcount = 1
Py_DECREF(substr);
if (substr_bytes == NULL) {
    goto invalid_string_error;
}

// p is a pointer to memory owned by substr_bytes
p = PyBytes_AS_STRING(substr_bytes);
Py_DECREF(substr_bytes);                        // substr_bytes refcount = 0, deleted
if (p == NULL) {                                // p != NULL, points to freed memory
    return NULL;
}

// attempt to parse a freed location in memory
rv = parse_isoformat_date(p, &year, &month, &day);

Since the bytes object doesn't need to be created in the first place, I believe the problem will be solved by using PyUnicode_AsUTF8 on the substr, though I'm not sure how the memory management works for PyUnicode_AsUTF8, you may need to hold the reference to the PyUnicode object until after the parsing is done.

@izbyshev
Contributor

@taleinat If you're going to add tests as @pganssle suggested, you can't make them pass without checking the return value of PyUnicode_AsUTF8. I think that at this point it's not useful to keep my fix and your PR separate. Would it be better to simply add the fromisoformat tests from my PR to yours and close the issue with your PR? I can then file another issue for adding the remaining tests.

@pganssle
Member

Hm... using this profiling script:

from datetime import datetime
from datetime import timezone, timedelta
from timeit import default_timer as timer

N = 100000
tzi = timezone(timedelta(hours=4))
tzi = None
comps = (2014, 3, 25, 4, 17, 30, 204300, tzi)
dt = datetime(*comps)

s = timer()
for i in range(N):
    dt_c = datetime(2014, 3, 25, 4, 17, 30, 241300, tzinfo=tzi)
e = timer()

tt = 1000000000 * (e - s) / N
dtstr = dt.isoformat()

s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt2 = 1000000000 * (e - s) / N

print('datetime constructor: {:0.1f}ns'.format(tt))
print('fromisoformat:       {:0.1f}ns'.format(tt2))

It turns out that this PR, even when you drop the creation of the temporary PyBytes object, makes fromisoformat much slower (all timings are from non-optimized builds):

Current master:

datetime constructor: 1156.0ns
fromisoformat:       535.0ns

AsUTF8 version:

datetime constructor: 1240.5ns
fromisoformat:       952.3ns

AsASCII version (with refcount issue fixed):

datetime constructor: 1172.4ns
fromisoformat:       1114.3ns

I think the reason for the major slowdown is that PyUnicode_Substring creates a new PyUnicode object, and PyUnicode_AsUTF8 also creates a few intermediate PyObjects, all of which is expensive.

I do think it's important to fix the crashing bug, but I think we should be able to do it with less impact on performance.

For now I think we should go ahead with the current approach, but eventually I'd like to make some tweaks so that performance isn't hit quite so hard.

@taleinat
Contributor Author

@pganssle, please wait for my next update to this PR, which will be significantly faster.

@pganssle
Member

@taleinat I have prepared one that sacrifices essentially no speed in the case of UTF-8 strings, at the cost of a greater performance hit in the pathological case. I think it's fair to say that if you use non-UTF8 characters in your datetime string you deserve to take a bit of a performance hit. PR #8862

Feel free to use it as a base for your branch if you want. I've updated the profiling script:

from datetime import datetime
from datetime import timezone, timedelta
from timeit import default_timer as timer

N = 100000
tzi = timezone(timedelta(hours=4))
tzi = None
comps = (2014, 3, 25, 4, 17, 30, 204300, tzi)
dt = datetime(*comps)

s = timer()
for i in range(N):
    dt_c = datetime(2014, 3, 25, 4, 17, 30, 241300, tzinfo=tzi)
e = timer()

tt = 1000000000 * (e - s) / N
dtstr = dt.isoformat()

s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt2 = 1000000000 * (e - s) / N

dtstr = dt.isoformat(sep='\u8000')
s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt3 = 1000000000 * (e - s) / N

dtstr = dt.isoformat(sep='\ud800')
s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt4 = 1000000000 * (e - s) / N


print('datetime constructor:                {:0.1f}ns'.format(tt))
print('fromisoformat:                       {:0.1f}ns'.format(tt2))
print('fromisoformat (special characters):  {:0.1f}ns'.format(tt3))
print('fromisoformat (non-utf8):            {:0.1f}ns'.format(tt4))

And the new results give:

datetime constructor:                1280.6ns
fromisoformat:                       579.6ns
fromisoformat (special characters):  624.8ns
fromisoformat (non-utf8):            2804.3ns

Probably the non-utf8 performance could be improved, but I don't think it matters much or at all.

Now I think it's just a matter of adding tests and a news item.

@pganssle
Member

In #8862 I've also handled the date_fromisoformat and time_fromisoformat cases (again, without tests), feel free to cherry-pick that commit into your branch if you end up finding a faster/cleaner approach for the datetime part.

@taleinat
Contributor Author

@izbyshev, I agree with your suggestion to copy the appropriate tests into this PR. I'll let you know once I've done so, so that you can remove them from PR #8850.

@taleinat
Contributor Author

@pganssle, thanks for the great analysis of the cause of the failures and the timing measurements!

@taleinat
Contributor Author

@pganssle, the more I think about it, the more I think we shouldn't be encoding into UTF-8. We're doing so and relying on the fact that all non-ASCII code points will be encoded into one or more non-ASCII bytes, in turn triggering appropriate rejections by the low-level date/time parsing functions.

This is a hack. It is what led to this hidden bug; it could be hiding other bugs now, and it may cause further bugs as the code continues to change.

I have a good idea for a cleaner approach that could be even faster. I'll have an update ready shortly.

@pganssle
Member

pganssle commented Aug 22, 2018

We're doing so and relying on the fact that all non-ASCII code points will be encoded into one or more non-ASCII bytes, in turn triggering appropriate rejections by the low-level date/time parsing functions.

This is a hack. That is what led to this hidden bug, it could also be hiding other bugs now, while also causing other future bugs when the code continues to change.

It's simply not true that it's a hack. We're encoding it as UTF-8 and the C functions are designed to take UTF-8 strings (as const char *) and convert them to datetimes. The outer layer is designed to take Python Unicode objects and convert them to UTF-8, then pass them to the C functions. Since UTF-8 is a superset of ASCII, it is equally acceptable to convert them to ASCII.

What led to this bug was that I did not handle the error case of Python Unicode objects that cannot be encoded as UTF-8. Further complicating the matter is the fact that there are Python Unicode objects that cannot be encoded as UTF-8 that nonetheless satisfy the contract of the function (i.e. datetime.fromisoformat(dt.isoformat(*args, **kwargs)) == dt for all args and kwargs).

Admittedly, in my approach in #8862, if a non-UTF-8 string appears, the function I've used is lossy, which I did because it was expedient. It is possible, for example, that datetime.isoformat() could change to start emitting dates like '2018?09?04', or that str.encode() could start using - instead of ? in replace mode, but that seems very unlikely (and we can write tests to guard against this). That is a calculated risk and, frankly, the failure mode would be that strings which are invalid for technical reasons would succeed in creating the datetimes that any human looking at them would choose anyway. It's really not a bad failure mode, particularly since it's so unlikely.
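The replace-mode behavior described here can be sketched in a few lines (illustrative, not code from #8862): each unencodable code point degrades to a single b'?', while the round-trip contract for encodable inputs is untouched:

```python
from datetime import datetime, timedelta, timezone

# Replace-mode encoding turns each unencodable code point into b'?'.
s = '2018\ud80009\ud80004'
assert s.encode('utf-8', errors='replace') == b'2018?09?04'

# The contract being preserved: fromisoformat(dt.isoformat()) == dt,
# including aware datetimes with a UTC offset.
dt = datetime(2014, 3, 25, 4, 17, 30, 204300, tzinfo=timezone(timedelta(hours=4)))
assert datetime.fromisoformat(dt.isoformat()) == dt
```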

@pganssle
Member

That said, if you have a cleaner and faster way to do it, those are the criteria I would use for judging success in this case. I just think that there's no reason to discount a solution that encodes as UTF-8 if that ends up being faster and/or cleaner for the typical case.

@taleinat
Contributor Author

The approach of encoding to UTF-8 does not supply a way of matching the functionality of the pure-Python implementation.

I went ahead with my original approach here, adding an optimization for the common case of ASCII-only strings, using the PyUnicode_1BYTE_DATA() macro.

I've also added the relevant tests from @izbyshev's PR #8850, and a NEWS entry.

@taleinat taleinat changed the title bpo-34454: avoid internally encoding fromisoformat() input to UTF-8 bpo-34454: fix crash in .fromisoformat() methods when given inputs with surrogate code points Aug 22, 2018
@pganssle
Member

pganssle commented Aug 22, 2018

The approach of encoding to UTF-8 does not supply a way of matching the functionality of the pure-Python implementation.

Definitely inaccurate, considering I do just that in #8862.

I think a hybrid of the two approaches would be preferable, because this is still slower for the case of non-surrogate UTF strings:

datetime constructor:                1216.3ns
fromisoformat:                       549.8ns
fromisoformat (special characters):  1232.0ns
fromisoformat (non-utf8):            1237.5ns
fromisoformat (fail, non-utf8):      3015.8ns
fromisoformat (fail, utf8):          1756.4ns

Compare to the approach in #8862:

datetime constructor:                1167.0ns
fromisoformat:                       530.6ns
fromisoformat (special characters):  549.5ns
fromisoformat (non-utf8):            2433.2ns
fromisoformat (fail, non-utf8):      3354.4ns
fromisoformat (fail, utf8):          1690.3ns

If we care about improving performance in the case of non-utf8 separators, a hybrid approach is feasible (though a bit trickier).

@taleinat
Contributor Author

Please don't merge for now until the segfaults on VSTS: Linux-PR are investigated.

@taleinat
Contributor Author

@pganssle, I was confused by #8862 currently only doing so for datetime, but not for date and time.

The approach taken for datetime.fromisoformat() in #8862 is very similar to what is done here, but having a "fast path" for all UTF-8-encodable strings, rather than just those that would be kept in the "1BYTE" format by Python. Still, I prefer this approach, since it completely avoids UTF-8, and would still activate its "fast path" for practically all non-adversarial inputs in practice.

@pganssle
Member

@taleinat There are no valid date and time strings that contain non-UTF-8 characters, so there's no need to handle them other than throwing an error if they are found. The only weirdness about datetime is that the character in position 10 is allowed to be any Python unicode object.
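The "position 10" quirk follows from the fixed-width date prefix: isoformat() always emits YYYY-MM-DD (ten characters) before the separator, so the separator is simply whatever single character sits at index 10. A small sketch, assuming a Python with the fix applied:

```python
from datetime import datetime

dt = datetime(2018, 8, 22, 13, 13)
s = dt.isoformat(sep='*')  # any single character is allowed as sep
assert s[10] == '*'        # the separator always lands at index 10
assert datetime.fromisoformat(s) == dt
```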

The approach in #8862 is faster than this one for UTF-8 and the change to the existing algorithm is minimal.

The main advantage to your current approach is that it is faster for valid ISO strings containing surrogate characters, but it is slower for valid ISO strings containing UTF-8 characters. Given that "avoiding encoding as UTF-8" is not really a desirable goal, I see no reason not to use the #8862 approach.

@pganssle
Member

I have refactored #8862 and cherry-picked the tests and news entry from this PR into that one (with a few additional tests added), ensuring that the git history credits authorship appropriately. I now believe that PR can be reviewed for merge (not sure if it complies with PEP 7).

@taleinat
Contributor Author

Closed in favor of #8862.

@taleinat taleinat closed this Aug 23, 2018
@taleinat taleinat deleted the bpo-34454/fromisoformat_avoid_utf8_encoding branch August 24, 2018 11:17