Perf improvements to Span and Memory #20771

GrabYourPitchforks · 2018-11-02T22:00:02Z

This improves the performance of MemoryExtensions.AsSpan<T>, Span<T>..ctor, and Span<T>.Slice (and the read-only and Memory<T> equivalents) when the caller is responsible for providing both a start index and an offset parameter. Currently this is implemented via two branches that the JIT can't always elide. With this PR, on 64-bit machines, this is implemented as a single branch and allows the JIT to perform constant-time folding where applicable, which offers performance gains in span-heavy parsing / formatting code.

As an example, I've provided metrics for Guid.Parse(string) : Guid and WebUtility.HtmlEncode(string) : string. Guid.Parse uses Span.Slice heavily, which is shown by the substantial increase in overall throughput between the baseline and this PR. WebUtility.HtmlEncode uses ValueStringBuilder under the covers, which uses Span.Slice internally but is a bit more conservative with those calls, so you'll see some throughput increase but nothing quite as substantial as Guid's code did.

Method	Toolchain	value	Mean	Error	StdDev	Ratio
Guid_Parse	baseline (27012-03)	2153e(...)c7c13 [36]	23.62 ms	0.3638 ms	0.3225 ms	1.00
Guid_Parse	experimental	2153e(...)c7c13 [36]	13.61 ms	0.1180 ms	0.1104 ms	0.58

HtmlEncode	baseline (27012-03)	&<>&<(...)<>&<> [48]	71.50 ms	0.9397 ms	0.8331 ms	1.00
HtmlEncode	experimental	&<>&<(...)<>&<> [48]	68.85 ms	1.0039 ms	0.9391 ms	0.96

- Improves perf of AsSpan, AsMemory, ctor, and Slice

jkotas

Nice trick

src/System.Private.CoreLib/shared/System/Memory.cs

ahsonkhan

Ratio: 0.58

Awesome!

GrabYourPitchforks · 2018-11-02T23:18:47Z

I opened a separate linked issue (https://github.com/dotnet/coreclr/issues/20776) to help the JIT optimize this even further. In my brief testing this JIT optimization would further improve Guid.Parse battery performance by approx. another 10%.

stephentoub

Sweet.

ahsonkhan · 2018-11-03T01:01:43Z

src/System.Private.CoreLib/shared/System/Span.Fast.cs

@@ -344,6 +345,12 @@ public Span<T> Slice(int start)
        public Span<T> Slice(int start, int length)
        {
 #if BIT64
+            // Since start and length are both 32-bit, their sum can be computed across a 64-bit domain


Perfect. Thanks!

benaadams · 2018-11-03T05:33:16Z

Is it worth making this a helper method somewhere? e.g.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
#if BIT64
    // Since offset and length are both 32-bit, their sum can be computed across a 64-bit domain
    // without loss of fidelity. The cast to uint before the cast to ulong ensures that the
    // extension from 32- to 64-bit is zero-extending rather than sign-extending. The end result
    // of this is that if either input is negative or if the input sum overflows past Int32.MaxValue,
    // that information is captured correctly in the comparison against the availableLength field.
    // We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.
    return ((ulong)(uint)offset + (ulong)(uint)length > (ulong)(uint)availableLength) ? false : true;
#else
    return ((uint)offset > (uint)availableLength || (uint)length > (uint)(availableLength - offset)) ? false : true;
#endif
}

GrabYourPitchforks · 2018-11-03T08:23:51Z

Perhaps. I spoke with Ahson regarding this, and I couldn't figure out how to move it into a separate helper method without getting the JIT to emit the typical "setb al / movzx eax, al" anti-pattern. If you can figure out a good way to coax it to do the right thing inside of a standalone method send a PR. :)

ahsonkhan · 2018-11-03T09:52:32Z

Is it worth making this a helper method somewhere?

Yes, that would definitely be helpful, but it may not be feasible as @GrabYourPitchforks stated.

and I couldn't figure out how to move it into a separate helper method without getting the JIT to emit the typical "setb al / movzx eax, al" anti-pattern.

Maybe writing it the way @benaadams suggested (i.e. return condition ? false : true rather than just return condition) could help work around the JIT issue. Could be worth a shot seeing what code the JIT emits in that case. @AndyAyersMS, if that exercise fails, do you have some other suggestions that could work while we wait for a resolution for issues like https://github.com/dotnet/coreclr/issues/914?

gfoidl · 2018-11-03T12:48:45Z

return condition ? false : true

emits a branch

G_M61453_IG01:

G_M61453_IG02:
       8BC7                 mov      eax, edi
       8BFE                 mov      edi, esi
       4803C7               add      rax, rdi
       8BFA                 mov      edi, edx
       483BC7               cmp      rax, rdi
       7706                 ja       SHORT G_M61453_IG04
       B801000000           mov      eax, 1

G_M61453_IG03:
       C3                   ret      

G_M61453_IG04:
       33C0                 xor      eax, eax

G_M61453_IG05:
       C3                   ret

When written as just return cond the "setb al / movzx eax, al" is generated.

G_M61452_IG01:

G_M61452_IG02:
       8BC7                 mov      eax, edi
       8BFE                 mov      edi, esi
       4803C7               add      rax, rdi
       8BFA                 mov      edi, edx
       483BC7               cmp      rax, rdi
       0F96C0               setbe    al
       0FB6C0               movzx    rax, al

G_M61452_IG03:
       C3                   ret

But when this methods gets inlined, this patterns vanishes. Trivial example:

internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
    return (ulong)(uint)offset + (ulong)(uint)length <= (ulong)(uint)availableLength;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Foo(int offset, int len, int availLen)
{
    if (!IsRangeInBounds(offset, len, availLen)) Throw();

    return availLen - (offset + len);

    void Throw() => throw new Exception();
}

generates

G_M7900_IG01:
       50                   push     rax

G_M7900_IG02:
       8BC7                 mov      eax, edi
       8BCE                 mov      ecx, esi
       4803C1               add      rax, rcx
       8BCA                 mov      ecx, edx
       483BC1               cmp      rax, rcx
       770B                 ja       SHORT G_M7900_IG05

G_M7900_IG03:
       03FE                 add      edi, esi
       8BC2                 mov      eax, edx
       2BC7                 sub      eax, edi

G_M7900_IG04:
       4883C408             add      rsp, 8
       C3                   ret      

G_M7900_IG05:
       E869F9FFFF           call     Program:<Foo>g__Throw|7_0()
       CC                   int3

jkotas · 2018-11-03T13:23:45Z

But when this methods gets inlined, this patterns vanishe

Does it vanish even for 32-bit variant of the method?

gfoidl · 2018-11-03T13:39:58Z

In the 32-bit variant this pattern ("setb al / movzx eax, al") doesn't show up, because of the explicit jumps from the comparison. For 32-bit I can just view the dasm in SharpLab (have only a setup for x64). Here the sbb + inc-part is interesting, but I can't judge the cost of this compared to the version with the ? false : true-hack.

gfoidl · 2018-11-03T13:47:51Z

When IsRangeInBounds is written without #if BIT64 (for use in "general" libraries) as

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
    if (IntPtr.Size == 8)
    {
        // ...
        return (ulong)(uint)offset + (ulong)(uint)length <= (ulong)(uint)availableLength;
    }
    else
    {
        return ((uint)offset > (uint)availableLength || (uint)length > (uint)(availableLength - offset)) ? false : true;
    }
}

"setb al / movzx eax, al" does not vanish. So if this methods is added, I propose to make it public, in order that other places can benefit from this trick.

mikedn · 2018-11-05T06:35:01Z

// We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.

@GrabYourPitchforks I'm curious how big the overhead is. Did you measure it?

* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Signed-off-by: dotnet-bot <[email protected]>

GrabYourPitchforks · 2018-11-06T18:54:46Z

@mikedn I did not measure it. TBH it was just an assumption on my part and not something based on empirical measurements.

@ahsonkhan @jkotas @gfoidl @benaadams I experimented a bit more, and it turns out if the helper method returns void rather than bool it should result in optimal codegen.

namespace System
{
    [StackTraceHidden]
    internal static class ArgValidation
    {
        /// <summary>
        /// Calls <see cref="ThrowHelper.ThrowArgumentOutOfRangeException"/> if the input arguments
        /// <paramref name="desiredStart"/> and <paramref name="desiredLength"/> aren't within the range
        /// of <paramref name="actualLength"/>.
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static void ThrowIfArgsOutOfRangeForSlice(int desiredStart, int desiredLength, int actualLength)
        {
            Debug.Assert(actualLength >= 0);

#if BIT64
            // See comment on over overload of ThrowIfArgsOutOfRangeForSlice.
            if ((ulong)(uint)desiredStart + (ulong)(uint)desiredLength > (ulong)(uint)actualLength)
                ThrowHelper.ThrowArgumentOutOfRangeException();
#else
            if ((uint)desiredStart > (uint)actualLength || (uint)desiredLength > (uint)(actualLength - desiredStart))
                ThrowHelper.ThrowArgumentOutOfRangeException();
#endif
        }

        /// <summary>
        /// Calls <see cref="ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument)"/> if the input arguments
        /// <paramref name="desiredStart"/> and <paramref name="desiredLength"/> aren't within the range
        /// of <paramref name="actualLength"/>.
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static void ThrowIfArgsOutOfRangeForSlice(int desiredStart, int desiredLength, int actualLength, ExceptionArgument argument)
        {
            Debug.Assert(actualLength >= 0);

#if BIT64
            // Since start and length are both 32-bit, their sum can be computed across a 64-bit domain
            // without loss of fidelity. The cast to uint before the cast to ulong ensures that the
            // extension from 32- to 64-bit is zero-extending rather than sign-extending. The end result
            // of this is that if either input is negative or if the input sum overflows past Int32.MaxValue,
            // that information is captured correctly in the comparison against the backing _length field.
            // We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.
            if ((ulong)(uint)desiredStart + (ulong)(uint)desiredLength > (ulong)(uint)actualLength)
                ThrowHelper.ThrowArgumentOutOfRangeException(argument);
#else
            if ((uint)desiredStart > (uint)actualLength || (uint)desiredLength > (uint)(actualLength - desiredStart))
                ThrowHelper.ThrowArgumentOutOfRangeException(argument);
#endif
        }
    }
}

This means that the callers no longer have the if statement in their own code; it's instead delegated to this helper method. I don't think this is a pattern we commonly use in the framework, but it may be worth looking into further.

benaadams · 2018-11-06T18:59:05Z

I don't think this is a pattern we commonly use in the framework, but it may be worth looking into further.

It might be more used than you realise 😉 e.g. ThrowHelper.cs#L287-L303

* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Signed-off-by: dotnet-bot <[email protected]>

* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments

* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Commit migrated from dotnet/coreclr@e8438d3

Perf improvements to Span and Memory

f141b67

- Improves perf of AsSpan, AsMemory, ctor, and Slice

jkotas approved these changes Nov 2, 2018

View reviewed changes

ahsonkhan added the area-System.Memory label Nov 2, 2018

ahsonkhan reviewed Nov 2, 2018

View reviewed changes

src/System.Private.CoreLib/shared/System/Memory.cs Show resolved Hide resolved

ahsonkhan approved these changes Nov 2, 2018

View reviewed changes

PR feedback - add comments

f60c871

stephentoub approved these changes Nov 3, 2018

View reviewed changes

ahsonkhan reviewed Nov 3, 2018

View reviewed changes

ahsonkhan approved these changes Nov 3, 2018

View reviewed changes

ahsonkhan merged commit e8438d3 into dotnet:master Nov 3, 2018

mikedn mentioned this pull request Nov 4, 2018

Recognize unsigned compare with 0 #20661

Merged

GrabYourPitchforks deleted the span_slice branch November 5, 2018 19:16

ahsonkhan mentioned this pull request Nov 7, 2018

Add SequenceReader<T> dotnet/corefx#33288

Merged

A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018

Perf improvements to Span and Memory (dotnet#20771)

63765c9

* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments

ahsonkhan mentioned this pull request Mar 3, 2019

Memory spec question dotnet/corefxlab#2650

Closed

gfoidl mentioned this pull request Sep 16, 2019

Introduce Utf8Span, which is a span of UTF-8 text #26711

Merged

GrabYourPitchforks mentioned this pull request Jan 31, 2020

JIT: int64 comparisons can be optimized when both arguments are known to fit into int32 dotnet/runtime#11394

Open

Perf improvements to Span and Memory #20771

Perf improvements to Span and Memory #20771

Uh oh!

Conversation

GrabYourPitchforks commented Nov 2, 2018

Uh oh!

jkotas left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ahsonkhan left a comment

Choose a reason for hiding this comment

Uh oh!

GrabYourPitchforks commented Nov 2, 2018

Uh oh!

stephentoub left a comment

Choose a reason for hiding this comment

Uh oh!

ahsonkhan Nov 3, 2018

Choose a reason for hiding this comment

Uh oh!

benaadams commented Nov 3, 2018

Uh oh!

GrabYourPitchforks commented Nov 3, 2018

Uh oh!

ahsonkhan commented Nov 3, 2018

Uh oh!

gfoidl commented Nov 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkotas commented Nov 3, 2018

Uh oh!

gfoidl commented Nov 3, 2018

Uh oh!

gfoidl commented Nov 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mikedn commented Nov 5, 2018

Uh oh!

GrabYourPitchforks commented Nov 6, 2018

Uh oh!

benaadams commented Nov 6, 2018

Uh oh!

Uh oh!

gfoidl commented Nov 3, 2018 •

edited

Loading

gfoidl commented Nov 3, 2018 •

edited

Loading