Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Improve performance of WebUtility.Decode for non-escaped strings #7671

Merged
merged 4 commits into from
Apr 15, 2016
Merged

Improve performance of WebUtility.Decode for non-escaped strings #7671

merged 4 commits into from
Apr 15, 2016

Conversation

hughbe
Copy link

@hughbe hughbe commented Apr 12, 2016

  • Cuts allocations down to 1/3 of the original
  • Doubles performance (in time)
  • We still allocate a little bit, as removing all allocations would harm the performance of strings that actually need escaping

Benchmark - escaping needed

  • No performance regressions

benchmark escaping

Benchmark - no escaping needed

benchmark no escaping

Benchmark Code

Click here

static void Main(string[] args)
{
    // Escaping
    TimeAction("Old: ", () => Old.UrlDecode("%ABabc"));
    TimeAction("New: ", () => New.UrlDecode("%ABabc"));

    // No escaping
    TimeAction("Old: ", () => Old.UrlDecode("abc"));
    TimeAction("New: ", () => New.UrlDecode("abc"));

    Console.ReadLine();
}

public static void TimeAction(string prefix, Action action)
{
    var sw = new Stopwatch();
    for (int iter = 0; iter < 5; iter++)
    {
        int gen0 = GC.CollectionCount(0);
        sw.Restart();
        for (int i = 0; i < 10000000; i++)
        {
            action();
        }
        sw.Stop();
        Console.WriteLine($"{prefix}Time: {sw.Elapsed.TotalSeconds}\tGC0: {GC.CollectionCount(0) - gen0}");
    }
}

/cc @stephentoub @jamesqo @davidsh
Fixes #6542 together with #7546

- Cuts allocations down to 1/3 of the original
- Doubles performance (in time)

- We still allocate a little bit, as removing all allocations would harm
the performance of strings that actually need escaping

Fixes #6542
@hughbe hughbe changed the title Improve performance of WebUtility.Decode for non-escaped string Improve performance of WebUtility.Decode for non-escaped strings Apr 12, 2016
@@ -610,6 +612,9 @@ private static bool StringRequiresHtmlDecoding(string s)
// Internal struct to facilitate URL decoding -- keeps char buffer and byte buffer, allows appending of either chars or bytes
private struct UrlDecoder
{
public bool _containsUnsafe;
public bool _containsSpaces;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why store these as fields on the UrlDecoder? Seems like they could just be a single local needsDecoding, and then instead of doing return helper.GetString();, you'd do return needsDecoding ? helper.GetString() : value;.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with moving them out of UrlDecoder.
I would like to keep the fields containsUnsafe and containsSpaces as separate variables, as I'm working on a PR that's gonna heavily optimize UrlEncode and UrlDecode for strings that only need space encoding/decoding. Is that OK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer it if you kept them as a single local now. If a subsequent change needs them separated, then that change can do it. Otherwise, if for example that change never materializes or gets merged, we're left with debt.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing!

_bufferSize = bufferSize;
_encoding = encoding;

_charBuffer = new char[bufferSize];

_charBuffer = null; // char buffer created on demand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really useful? Empty string is already special cased to avoid decoding. So isn't this only valuable in the case where all of the inputs are bytes rather than chars, and to enable that you're then doing an extra null check branch on every AddChar? And even then all of the bytes except for those at the end will end up getting dumped into the char[], forcing it to be allocated. Doesn't seem like a good tradeoff.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I found that this halved allocations for the UrlDecode("noescaping") case, as it seems that ASCII chars are added to the decoder as AddByte not as AddChar. I found this behaviour weird, but lazilly instantiating _charBuffer does help in the no escaping case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a follow up: its due to this line of code.

I don't really understand why "// 7 bit have to go as bytes because of Unicode" but hey

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok. (Seems strange then that we're lazily allocating the byte[] array; doing so suggests we don't expect any ASCII chars in common input, but let's not change it.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood - obviously I'm not planning to change any of that behaviour in this PR, but do you have a theory as to why we add ascii chars as bytes not chars?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right but they allocate the array only when needed

No. In the code before this PR, the char[] is always allocated (unless the string is empty).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. looking more, it looks when having ascii characters, it needs to be encoded by the encoder. bu tif the character is not ascii will be stored as it is without encoding. so I believe what they are doing is collecting ascii charcaters in the array, and everytime need to flush it, will run the encoder just once on the array before the flush. so it is just optimizing to not calling the encoder with every ascii character.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it is just optimizing to not calling the encoder with every ascii character.

but doesn't the implementation check if there are any bytes in the buffer, and if there are none in the buffer it doesn't flush any bytes
So that means that each ascii character wouldn't flush any bytes, only the first in a string of ascii chars

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UrlDecoder is collecting the ascii bytes till either AddChar or GetString get called. at that time it will flush the bytes after encoding it. and then will start collecting the new encountered bytes again and so on.
imagine you have string like aaaaaNaaaaNaaaaNaaa
where a is ascii character and N is not ascii, this will have the byte array get filled 4 times and encoded 4 times. if you didn't collect the ascii in byte array as the code is doing, this mean you'll need to call the encoder the number of 'a' characters in the string.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood, thanks

@hughbe
Copy link
Author

hughbe commented Apr 12, 2016

Test Innerloop CentOS7.1 Release Build and Test
Test Innerloop Windows_NT Debug Build and Test

@stephentoub
Copy link
Member

@dotnet-bot test code coverage please

@hughbe
Copy link
Author

hughbe commented Apr 12, 2016

@stephentoub is this new/can non-Microsoft people use that code coverage feature

@stephentoub
Copy link
Member

is this new

nope

can non-Microsoft people use that code coverage feature

yup

@hughbe
Copy link
Author

hughbe commented Apr 12, 2016

nope

That's cool, I've never noticed that before. Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)

@stephentoub
Copy link
Member

Is the idea that it fails if there is an overall decrease in coverage or does it just publish a report? (I guess we'll find out!)

It doesn't perform any comparison, but it archives the report so you can click through to view it.

{
// No decoding needed
return value;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it were me, I'd have written this like:

return needsDecoding ? helper.GetString() : value;

@hughbe
Copy link
Author

hughbe commented Apr 13, 2016

Thanks, I agree its a good idea to add some more tests to cover all the branches I've changed. I also added some more tests for invalid percent encoding, lowercase hex, space escaping and non-ASCII chars.
This brings coverage up to 100% for UrlDecodeInternal and almost 100% for UrlDecoder - I think the missing coverage may be dead code (line 683)

@stephentoub
Copy link
Member

LGTM

@davidsh?

@davidsh
Copy link
Contributor

davidsh commented Apr 15, 2016

LGTM.

@davidsh
Copy link
Contributor

davidsh commented Apr 15, 2016

@davidsh
Copy link
Contributor

davidsh commented Apr 15, 2016

@dotnet-bot Test Innerloop Windows_NT Debug Build and Test

@davidsh
Copy link
Contributor

davidsh commented Apr 15, 2016

@dotnet-bot Test Innerloop CentOS7.1 Release Build and Test

@davidsh davidsh merged commit be2360b into dotnet:master Apr 15, 2016
@hughbe
Copy link
Author

hughbe commented Apr 16, 2016

Glad to see this merged. Should I send over a similar PR to coreclr, or will this be automatically handled by the netfx-port-consider label

@hughbe hughbe deleted the webutility-decode branch April 16, 2016 16:44
@karelz karelz added this to the 1.0.0-rtm milestone Dec 3, 2016
@karelz karelz added this to the 1.0.0-rtm milestone Dec 3, 2016
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Improve performance of WebUtility.Decode for non-escaped strings

Commit migrated from dotnet/corefx@be2360b
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WebUtility.UrlEncode/Decode methods allocate even when unneeded
8 participants