bpo-34454: fix crash in .fromisoformat() methods when given inputs with surrogate code points #8859
Conversation
@pganssle, your review and comments would be welcome!
This breaks if a surrogate Unicode code point is used as the separator character in the input string.
Force-pushed from 978ccf3 to 6df3ca9
@taleinat Can we add some tests for the new failure behavior? Might want to coordinate with @izbyshev so as not to step on one another's toes about that, but I'd like to see tests for some of these unprintable characters both as the separator (passing) and in the date, time and datetime strings (failing). This also needs a news entry, but other than that it's essentially what I would have done (probably better, honestly - I didn't even know about

Thanks for making this PR! I'm glad to have one item off my backlog. 😄
Modules/_datetimemodule.c (outdated)

```c
if (bytes == NULL) {
    return NULL;
}
const char *p = PyBytes_AS_STRING(bytes);
```
Actually, I missed this. I think it may be better to use PyUnicode_AsUTF8 to avoid creating an unnecessary intermediate bytes object. If there are non-ASCII characters in the string, the parse step will fail.
But then it will still fail on surrogate Unicode code points - which, after additional reading, it apparently should.

It is strange to me that there is a way to directly get a char * for UTF-8-encoded strings, but not for ASCII. Indeed the overhead here is likely too high, but using PyUnicode_AsUTF8 here feels like a hack... maybe there's no better option though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine. ASCII is a subset of UTF-8. The actual requirements just happen to include no non-ASCII (and thus no non-UTF-8) characters in valid ISO 8601 strings, so this allows us to conveniently treat any encoding errors as a proxy for "this is not a valid ISO 8601 string".
There's no need to separately validate that the input is both valid ASCII and a valid ISO 8601 string, because we only care whether it's a valid ISO 8601 string. We should just do whatever is the most readable and performant way to figure that out (which, IMO, is "convert to UTF-8, then try to parse it").
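The "encoding errors as a proxy for invalid input" idea can be sketched at the Python level. This is a hypothetical model (the function name and use of `strptime` are mine, not the PR's C code): since every valid ISO 8601 string is pure ASCII, any string that fails to encode can be rejected up front.

```python
from datetime import datetime

def fromisoformat_model(dt_str):
    """Hypothetical Python model of the C-level strategy: treat any
    encoding failure as proof that the input is not valid ISO 8601."""
    try:
        # Valid ISO 8601 strings are pure ASCII, so this encode only
        # fails for inputs that could never parse anyway (including
        # lone surrogate code points, which strict UTF-8 rejects).
        dt_str.encode('utf-8')
    except UnicodeEncodeError:
        raise ValueError('Invalid isoformat string: {!r}'.format(dt_str))
    # Stand-in for the C parser; handles the common "YYYY-MM-DDTHH:MM:SS" form.
    return datetime.strptime(dt_str, '%Y-%m-%dT%H:%M:%S')

print(fromisoformat_model('2014-03-25T04:17:30'))
```

Note that this model also rejects a surrogate used as the separator, which is exactly the corner case the rest of this thread is about.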
It seems that for ASCII-only strings, which should account for nearly all actual inputs in real use cases, there is a fast shortcut available to get a char *. I'll work up an update using that.

Also, similar changes to the
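The two-tier idea behind that shortcut - a cheap check for the overwhelmingly common ASCII-only case, and a slower rejection path for everything else - can be modeled in Python. This is an illustrative sketch (the function is hypothetical, not the PR's C code); `str.isascii()` plays the role of the C-level compact-ASCII check:

```python
def get_ascii_bytes(s):
    """Hypothetical model of the fast path: ASCII-only strings are
    converted directly; anything else is rejected, since non-ASCII
    input can never be a valid ISO 8601 string."""
    if s.isascii():
        # Fast path: encoding pure-ASCII text cannot fail.
        return s.encode('ascii')
    # Slow path: reject without attempting to parse.
    raise ValueError('Invalid isoformat string: {!r}'.format(s))

print(get_ascii_bytes('2014-03-25'))
```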
I think all the failures are because you create the temporary `substr_bytes` object and then release it before using the pointer it owns:

```c
substr_bytes = PyUnicode_AsASCIIString(substr);  // substr_bytes refcount = 1
Py_DECREF(substr);
if (substr_bytes == NULL) {
    goto invalid_string_error;
}

// p is a pointer to memory owned by substr_bytes
p = PyBytes_AS_STRING(substr_bytes);
Py_DECREF(substr_bytes);  // substr_bytes refcount = 0, deleted

if (p == NULL) {  // p != NULL, points to freed memory
    return NULL;
}

// attempt to parse a freed location in memory
rv = parse_isoformat_date(p, &year, &month, &day);
```

Since the
@taleinat If you're going to add tests as @pganssle suggested, you can't make them pass without checking the return value of
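The tests @pganssle asked for boil down to two cases, which can be sketched at the Python level roughly like this (assuming the fixed behavior, where a surrogate is accepted only in the single separator position; the specific input strings are mine):

```python
from datetime import datetime

# A surrogate code point as the separator is accepted, matching the
# pure-Python implementation...
dt = datetime(2014, 3, 25, 4, 17, 30)
dtstr = dt.isoformat(sep='\ud800')
assert datetime.fromisoformat(dtstr) == dt

# ...but a surrogate anywhere inside the date or time portion is a
# clean ValueError, not a crash.
try:
    datetime.fromisoformat('2014-03\ud800-25T04:17:30')
except ValueError:
    print('rejected cleanly')
```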
Hm... using this profiling script:

```python
from datetime import datetime
from datetime import timezone, timedelta
from timeit import default_timer as timer

N = 100000
tzi = timezone(timedelta(hours=4))
tzi = None
comps = (2014, 3, 25, 4, 17, 30, 204300, tzi)
dt = datetime(*comps)

s = timer()
for i in range(N):
    dt_c = datetime(2014, 3, 25, 4, 17, 30, 241300, tzinfo=tzi)
e = timer()
tt = 1000000000 * (e - s) / N

dtstr = dt.isoformat()
s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt2 = 1000000000 * (e - s) / N

print('datetime constructor: {:0.1f}ns'.format(tt))
print('fromisoformat: {:0.1f}ns'.format(tt2))
```

It turns out that this PR, even when you drop the creation of the temporary

Current master:
AsUTF8 version:
AsASCII version (with refcount issue fixed):
I think the reason for the major slowdown is that

I do think it's important to fix the crashing bug, but I think we should be able to do it with less impact on performance. For now I think we should go ahead with the current approach, but eventually I'd like to make some tweaks so that performance isn't hit quite so hard.
@pganssle, please wait for my next update to this PR, which will be significantly faster.
@taleinat I have prepared one that sacrifices essentially no speed in the case of UTF-8 strings, at the cost of a greater performance hit in the pathological case. I think it's fair to say that if you use non-UTF-8 characters in your datetime string you deserve to take a bit of a performance hit. PR #8862 - feel free to use it as a base for your branch if you want.

I've updated the profiling script:

```python
from datetime import datetime
from datetime import timezone, timedelta
from timeit import default_timer as timer

N = 100000
tzi = timezone(timedelta(hours=4))
tzi = None
comps = (2014, 3, 25, 4, 17, 30, 204300, tzi)
dt = datetime(*comps)

s = timer()
for i in range(N):
    dt_c = datetime(2014, 3, 25, 4, 17, 30, 241300, tzinfo=tzi)
e = timer()
tt = 1000000000 * (e - s) / N

dtstr = dt.isoformat()
s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt2 = 1000000000 * (e - s) / N

dtstr = dt.isoformat(sep='\u8000')
s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt3 = 1000000000 * (e - s) / N

dtstr = dt.isoformat(sep='\ud800')
s = timer()
for i in range(N):
    dt_fi = datetime.fromisoformat(dtstr)
e = timer()
tt4 = 1000000000 * (e - s) / N

print('datetime constructor: {:0.1f}ns'.format(tt))
print('fromisoformat: {:0.1f}ns'.format(tt2))
print('fromisoformat (special characters): {:0.1f}ns'.format(tt3))
print('fromisoformat (non-utf8): {:0.1f}ns'.format(tt4))
```

And the new results give:
Probably the non-UTF-8 performance could be improved, but I don't think it matters much, or at all. Now I think it's just a matter of adding tests and a news entry.
In #8862 I've also handled the
@pganssle, thanks for the great analysis of the cause of the failures and the timing measurements!
@pganssle, the more I think about it, the more I think we shouldn't be encoding into UTF-8. We're relying on the fact that all non-ASCII code points will be encoded into one or more non-ASCII bytes, which in turn trigger appropriate rejections by the low-level date/time parsing functions. This is a hack: it is what led to this hidden bug, it could be hiding other bugs now, and it could cause further bugs as the code continues to change. I have a good idea for a cleaner approach that could be even faster. I'll have an update ready shortly.
It's simply not true that it's a hack. We're encoding it as UTF-8, and the C functions are designed to take UTF-8 strings (as

What led to this bug was that I did not handle the error case of Python Unicode objects that cannot be encoded as UTF-8. Further complicating the matter is the fact that there are Python Unicode objects that cannot be encoded as UTF-8 but that nonetheless satisfy the contract of the function (i.e.

Admittedly, in my approach in #8862, if a non-UTF-8 string appears, the function I've used is lossy, which I did because it was expedient to do so. It is possible, for example, that
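The "lossy" concern is easy to see from Python itself (an illustrative snippet of codec behavior, not the actual call used in #8862): strict UTF-8 encoding refuses surrogate code points outright, while a lossy error handler such as `'replace'` maps distinct surrogates to the same replacement byte, discarding which code point actually appeared.

```python
# Strict UTF-8 encoding cannot represent lone surrogate code points...
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as exc:
    print('strict encode failed:', exc.reason)

# ...while a lossy error handler maps distinct surrogates to the same
# replacement byte, losing the original code point.
assert '\ud800'.encode('utf-8', 'replace') == b'?'
assert '\ud801'.encode('utf-8', 'replace') == b'?'
```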
That said, if you have a cleaner and faster way to do it, those are the criteria I would use for judging success in this case. I just think there's no reason to discount a solution that encodes as UTF-8 if that ends up being faster and/or cleaner for the typical case.
The approach of encoding to UTF-8 does not supply a way of matching the functionality of the pure-Python implementation.

I went ahead with my original approach here, adding an optimization for the common case of ASCII-only strings, using the

I've also added the relevant tests from @izbyshev's PR #8850, and a NEWS entry.
Definitely inaccurate, considering I do just that in #8862. I think a hybrid of the two approaches would be preferable, because this is still slower for the case of non-surrogate UTF-8 strings:

Compare to the approach in #8862:

If we care about improving performance in the case of non-UTF-8 separators, a hybrid approach is feasible (though a bit trickier).
Please don't merge for now until the segfaults on VSTS: Linux-PR are investigated.
@pganssle, I was confused by #8862 currently only doing so for

The approach taken for
@taleinat There are no valid date and time strings that contain non-UTF-8 characters, so there's no need to handle them other than throwing an error when they are found. The only weirdness about

The approach in #8862 is faster than this one for UTF-8, and the change to the existing algorithm is minimal. The main advantage of your current approach is that it is faster for valid ISO strings containing surrogate characters, but it is slower for valid ISO strings containing UTF-8 characters. Given that "avoiding encoding as UTF-8" is not really a desirable goal, I see no reason not to use the #8862 approach.
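A hybrid along the lines discussed above could try the cheap UTF-8 encode first and fall back to a surrogate-tolerant path only when encoding fails. A rough Python-level sketch (the function, its name, and the `strptime`-based fallback are hypothetical; the real code is C):

```python
from datetime import datetime

def fromisoformat_hybrid(dt_str):
    """Hypothetical hybrid: fast UTF-8 path for ordinary input, with a
    fallback that tolerates a surrogate in the separator slot."""
    try:
        dt_str.encode('utf-8')  # fast common case: encodable input
        return datetime.fromisoformat(dt_str)
    except UnicodeEncodeError:
        # Slow path: split at the separator (position 10) and parse
        # the date and time halves separately, ignoring the separator.
        date_part, time_part = dt_str[:10], dt_str[11:]
        return datetime.combine(
            datetime.strptime(date_part, '%Y-%m-%d').date(),
            datetime.strptime(time_part, '%H:%M:%S').time(),
        )

print(fromisoformat_hybrid('2014-03-25T04:17:30'))
```

The trade-off matches the discussion: the common case pays only for one encode attempt, while surrogate-separator inputs pay for the exception and the extra slicing.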
I have re-factored #8862 and cherry-picked the tests and news entry from this PR into that one (with a few additional tests added), ensuring that the git history has the authorship appropriately credited. I now believe that PR can be reviewed for merge (not sure if it complies with PEP 7).
Closed in favor of #8862.
The current implementation crashes if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8. It is important to note that the pure-Python implementation allows using a surrogate as the separator character when calling `datetime.fromisoformat()`.

As mentioned in a comment on the bpo issue, there's actually no need to encode the input as UTF-8. This implementation uses `PyUnicode_1BYTE_DATA()` when possible. In other cases (which should be extremely rare!), it creates temporary intermediate Python objects. For `date` and `time`, a single temporary `bytes` object is created. For `datetime`, up to four temporary objects are created: two `str` objects (for sub-strings) and two `bytes` objects (for converting the sub-strings to ASCII). This increases the size and complexity of the code considerably, but I couldn't find a reasonable way to avoid creating these objects entirely.

https://bugs.python.org/issue34454
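The temporary-object dance described above can be modeled in Python (a hypothetical illustration of the described strategy, not the C code itself): split the `datetime` string at the separator, convert each half to ASCII bytes, and parse the halves independently.

```python
from datetime import datetime

def parse_with_temporaries(dt_str):
    """Hypothetical Python model of the rare slow path: two temporary
    sub-string objects plus two temporary ASCII bytes objects."""
    date_str, time_str = dt_str[:10], dt_str[11:]  # two temporary strs
    date_bytes = date_str.encode('ascii')          # temporary bytes
    time_bytes = time_str.encode('ascii')          # temporary bytes
    # The separator itself (dt_str[10]) is never inspected, which is
    # how a surrogate separator survives this path.
    return datetime.combine(
        datetime.strptime(date_bytes.decode(), '%Y-%m-%d').date(),
        datetime.strptime(time_bytes.decode(), '%H:%M:%S').time(),
    )

print(parse_with_temporaries('2014-03-25\ud80004:17:30'))
```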