Improve performance of WebUtility.Decode for non-escaped strings #7671
Conversation
- Cuts allocations down to 1/3 of the original
- Doubles performance (in time)
- We still allocate a little bit, as removing all allocations would harm the performance of strings that actually need escaping

Fixes #6542
@@ -610,6 +612,9 @@ private static bool StringRequiresHtmlDecoding(string s)
        // Internal struct to facilitate URL decoding -- keeps char buffer and byte buffer, allows appending of either chars or bytes
        private struct UrlDecoder
        {
            public bool _containsUnsafe;
            public bool _containsSpaces;
Why store these as fields on the UrlDecoder? Seems like they could just be a single local `needsDecoding`, and then instead of doing `return helper.GetString();`, you'd do `return needsDecoding ? helper.GetString() : value;`.
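The fast-path idea being suggested here can be sketched as follows. This is a minimal Python illustration of the concept only, not the actual corefx C# code: scan the input once, and if nothing needs decoding, return the original string instead of building a new one.

```python
def url_decode(value: str) -> str:
    """Illustrative sketch of the no-escaping fast path (not the corefx code).

    Handles only single-byte %XX escapes and '+'; error handling for
    malformed sequences and multi-byte UTF-8 escapes is omitted for brevity.
    """
    # Fast path: nothing to decode, so hand back the input itself
    # and allocate nothing.
    if '%' not in value and '+' not in value:
        return value

    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == '+':
            out.append(' ')          # '+' decodes to a space
            i += 1
        elif ch == '%' and i + 2 < len(value):
            out.append(chr(int(value[i + 1:i + 3], 16)))  # %XX escape
            i += 3
        else:
            out.append(ch)
            i += 1
    return ''.join(out)
```

Note that in the fast-path case the function returns the very same string instance it was given, which is where the allocation savings come from.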
I agree with moving them out of UrlDecoder.
I would like to keep the fields `containsUnsafe` and `containsSpaces` as separate variables, as I'm working on a PR that's going to heavily optimize `UrlEncode` and `UrlDecode` for strings that only need space encoding/decoding. Is that OK?
I'd prefer it if you kept them as a single local now. If a subsequent change needs them separated, then that change can do it. Otherwise, if for example that change never materializes or gets merged, we're left with debt.
Sure thing!
     _bufferSize = bufferSize;
     _encoding = encoding;

-    _charBuffer = new char[bufferSize];
+    _charBuffer = null; // char buffer created on demand
Is this really useful? Empty string is already special cased to avoid decoding. So isn't this only valuable in the case where all of the inputs are bytes rather than chars, and to enable that you're then doing an extra null check branch on every AddChar? And even then all of the bytes except for those at the end will end up getting dumped into the char[], forcing it to be allocated. Doesn't seem like a good tradeoff.
Well, I found that this halved allocations for the `UrlDecode("noescaping")` case, as it seems that ASCII chars are added to the decoder via `AddByte`, not `AddChar`. I found this behaviour weird, but lazily instantiating `_charBuffer` does help in the no-escaping case.
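The lazy-allocation pattern under discussion can be sketched like this. This is a hypothetical Python illustration of the trade-off, not the corefx C# struct: the buffer is only created on first use, at the cost of an extra null check on every append.

```python
class LazyCharBuffer:
    """Illustrative sketch (not the corefx UrlDecoder): the char buffer is
    created on demand, so inputs whose content arrives purely through the
    byte path never pay for the allocation."""

    def __init__(self, buffer_size: int):
        self._buffer_size = buffer_size
        self._char_buffer = None   # char buffer created on demand
        self._num_chars = 0

    def add_char(self, ch: str) -> None:
        # The cost of laziness: one extra branch on every add_char.
        if self._char_buffer is None:
            self._char_buffer = [''] * self._buffer_size
        self._char_buffer[self._num_chars] = ch
        self._num_chars += 1

    @property
    def allocated(self) -> bool:
        return self._char_buffer is not None
```

This matches the trade-off Stephen raises above: the branch runs on every append, so laziness only wins when the char path is rarely or never taken.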
Just a follow-up: it's due to this line of code.
I don't really understand why "// 7 bit have to go as bytes because of Unicode", but hey.
I see, ok. (Seems strange then that we're lazily allocating the byte[] array; doing so suggests we don't expect any ASCII chars in common input, but let's not change it.)
Understood - obviously I'm not planning to change any of that behaviour in this PR, but do you have a theory as to why we add ASCII chars as bytes, not chars?
Right, but they allocate the array only when needed.
No. In the code before this PR, the char[] is always allocated (unless the string is empty).
Right. Looking more, it appears that ASCII characters need to go through the encoder, but if a character is not ASCII it is stored as-is without encoding. So I believe what they are doing is collecting ASCII characters in the array, and every time it needs to be flushed, running the encoder just once on the array before the flush. So it is just an optimization to avoid calling the encoder for every ASCII character.
> so it is just an optimization to avoid calling the encoder for every ASCII character

But doesn't the implementation check whether there are any bytes in the buffer, and skip the flush if there are none? So that means each ASCII character wouldn't flush any bytes; only the first char after a run of ASCII chars would.
UrlDecoder collects the ASCII bytes until either `AddChar` or `GetString` gets called; at that point it flushes the bytes after running them through the encoder, then starts collecting newly encountered bytes again, and so on.
Imagine you have a string like `aaaaaNaaaaNaaaaNaaa`, where `a` is an ASCII character and `N` is not ASCII: the byte array gets filled 4 times and encoded 4 times. If you didn't collect the ASCII bytes in a byte array as the code does, you'd need to call the encoder once for every 'a' character in the string.
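The batching behaviour described above can be sketched as follows. This is a hypothetical Python illustration of the run-flushing idea (not the corefx UrlDecoder), using `é` as the stand-in for the non-ASCII `N` and counting how often the encoding step runs.

```python
def decode_runs(s: str):
    """Illustrative sketch: buffer consecutive ASCII bytes and run the
    encoding step once per run, instead of once per byte. Returns the
    reassembled string and the number of flushes performed."""
    out = []
    byte_run = bytearray()
    flushes = 0
    for ch in s:
        if ord(ch) < 128:
            byte_run.append(ord(ch))      # ASCII: just buffer the byte
        else:
            if byte_run:
                # one encoder call covers the whole pending run
                out.append(byte_run.decode('ascii'))
                byte_run.clear()
                flushes += 1
            out.append(ch)                # non-ASCII: stored as-is
    if byte_run:                          # final flush (GetString analogue)
        out.append(byte_run.decode('ascii'))
        flushes += 1
    return ''.join(out), flushes
```

For the `aaaaaNaaaaNaaaaNaaa` shape from the comment, this performs 4 flushes rather than one encoder call per `a`.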
Understood, thanks.
Test Innerloop CentOS7.1 Release Build and Test
@dotnet-bot test code coverage please
@stephentoub is this new/can non-Microsoft people use that code coverage feature?
nope
yup
That's cool, I've never noticed that before. Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)
It doesn't perform any comparison, but it archives the report so you can click through to view it. |
{
    // No decoding needed
    return value;
}
If it were me, I'd have written this like:
return needsDecoding ? helper.GetString() : value;
Thanks, I agree it's a good idea to add some more tests to cover all the branches I've changed. I also added some more tests for invalid percent encoding, lowercase hex, space escaping and non-ASCII chars.
LGTM
LGTM. |
@dotnet-bot Test Innerloop Windows_NT Debug Build and Test
@dotnet-bot Test Innerloop CentOS7.1 Release Build and Test
Glad to see this merged. Should I send over a similar PR to coreclr, or will this be automatically handled by the netfx-port-consider label?
Improve performance of WebUtility.Decode for non-escaped strings Commit migrated from dotnet/corefx@be2360b
Benchmark - escaping needed
Benchmark - no escaping needed
Benchmark Code
Click here
/cc @stephentoub @jamesqo @davidsh
Fixes #6542 together with #7546