Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Perf improvements to Span and Memory #20771

Merged
merged 2 commits into from
Nov 3, 2018

Conversation

GrabYourPitchforks
Copy link
Member

This improves the performance of MemoryExtensions.AsSpan<T>, Span<T>..ctor, and Span<T>.Slice (and the read-only and Memory<T> equivalents) when the caller is responsible for providing both a start index and an offset parameter. Currently this is implemented via two branches that the JIT can't always elide. With this PR, on 64-bit machines, this is implemented as a single branch and allows the JIT to perform constant-time folding where applicable, which offers performance gains in span-heavy parsing / formatting code.

As an example, I've provided metrics for Guid.Parse(string) : Guid and WebUtility.HtmlEncode(string) : string. Guid.Parse uses Span.Slice heavily, which is shown by the substantial increase in overall throughput between the baseline and this PR. WebUtility.HtmlEncode uses ValueStringBuilder under the covers, which uses Span.Slice internally but is a bit more conservative with those calls, so you'll see some throughput increase but nothing quite as substantial as Guid's code did.

Method Toolchain value Mean Error StdDev Ratio
Guid_Parse baseline (27012-03) 2153e(...)c7c13 [36] 23.62 ms 0.3638 ms 0.3225 ms 1.00
Guid_Parse experimental 2153e(...)c7c13 [36] 13.61 ms 0.1180 ms 0.1104 ms 0.58
HtmlEncode baseline (27012-03) &<>&<(...)<>&<> [48] 71.50 ms 0.9397 ms 0.8331 ms 1.00
HtmlEncode experimental &<>&<(...)<>&<> [48] 68.85 ms 1.0039 ms 0.9391 ms 0.96

- Improves perf of AsSpan, AsMemory, ctor, and Slice
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick

Copy link

@ahsonkhan ahsonkhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ratio: 0.58

Awesome!

@GrabYourPitchforks
Copy link
Member Author

I opened a separate linked issue (https://github.com/dotnet/coreclr/issues/20776) to help the JIT optimize this even further. In my brief testing this JIT optimization would further improve Guid.Parse battery performance by approx. another 10%.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet.

@@ -344,6 +345,12 @@ public Span<T> Slice(int start)
public Span<T> Slice(int start, int length)
{
#if BIT64
// Since start and length are both 32-bit, their sum can be computed across a 64-bit domain

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. Thanks!

@ahsonkhan ahsonkhan merged commit e8438d3 into dotnet:master Nov 3, 2018
@benaadams
Copy link
Member

Is it worth making this a helper method somewhere? e.g.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
#if BIT64
    // Since offset and length are both 32-bit, their sum can be computed across a 64-bit domain
    // without loss of fidelity. The cast to uint before the cast to ulong ensures that the
    // extension from 32- to 64-bit is zero-extending rather than sign-extending. The end result
    // of this is that if either input is negative or if the input sum overflows past Int32.MaxValue,
    // that information is captured correctly in the comparison against the availableLength field.
    // We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.
    return ((ulong)(uint)offset + (ulong)(uint)length > (ulong)(uint)availableLength) ? false : true;
#else
    return ((uint)offset > (uint)availableLength || (uint)length > (uint)(availableLength - offset)) ? false : true;
#endif
}

@GrabYourPitchforks
Copy link
Member Author

Perhaps. I spoke with Ahson regarding this, and I couldn't figure out how to move it into a separate helper method without getting the JIT to emit the typical "setb al / movzx eax, al" anti-pattern. If you can figure out a good way to coax it to do the right thing inside of a standalone method send a PR. :)

@ahsonkhan
Copy link

Is it worth making this a helper method somewhere?

Yes, that would definitely be helpful, but it may not be feasible as @GrabYourPitchforks stated.

and I couldn't figure out how to move it into a separate helper method without getting the JIT to emit the typical "setb al / movzx eax, al" anti-pattern.

Maybe writing it the way @benaadams suggested (i.e. return condition ? false : true rather than just return condition) could help work around the JIT issue. Could be worth a shot seeing what code the JIT emits in that case. @AndyAyersMS, if that exercise fails, do you have some other suggestions that could work while we wait for a resolution for issues like https://github.com/dotnet/coreclr/issues/914?

@gfoidl
Copy link
Member

gfoidl commented Nov 3, 2018

return condition ? false : true

emits a branch

G_M61453_IG01:

G_M61453_IG02:
       8BC7                 mov      eax, edi
       8BFE                 mov      edi, esi
       4803C7               add      rax, rdi
       8BFA                 mov      edi, edx
       483BC7               cmp      rax, rdi
       7706                 ja       SHORT G_M61453_IG04
       B801000000           mov      eax, 1

G_M61453_IG03:
       C3                   ret      

G_M61453_IG04:
       33C0                 xor      eax, eax

G_M61453_IG05:
       C3                   ret   

When written as just return cond the "setb al / movzx eax, al" is generated.

G_M61452_IG01:

G_M61452_IG02:
       8BC7                 mov      eax, edi
       8BFE                 mov      edi, esi
       4803C7               add      rax, rdi
       8BFA                 mov      edi, edx
       483BC7               cmp      rax, rdi
       0F96C0               setbe    al
       0FB6C0               movzx    rax, al

G_M61452_IG03:
       C3                   ret      

But when this methods gets inlined, this patterns vanishes. Trivial example:

internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
    return (ulong)(uint)offset + (ulong)(uint)length <= (ulong)(uint)availableLength;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Foo(int offset, int len, int availLen)
{
    if (!IsRangeInBounds(offset, len, availLen)) Throw();

    return availLen - (offset + len);

    void Throw() => throw new Exception();
}

generates

G_M7900_IG01:
       50                   push     rax

G_M7900_IG02:
       8BC7                 mov      eax, edi
       8BCE                 mov      ecx, esi
       4803C1               add      rax, rcx
       8BCA                 mov      ecx, edx
       483BC1               cmp      rax, rcx
       770B                 ja       SHORT G_M7900_IG05

G_M7900_IG03:
       03FE                 add      edi, esi
       8BC2                 mov      eax, edx
       2BC7                 sub      eax, edi

G_M7900_IG04:
       4883C408             add      rsp, 8
       C3                   ret      

G_M7900_IG05:
       E869F9FFFF           call     Program:<Foo>g__Throw|7_0()
       CC                   int3     

@jkotas
Copy link
Member

jkotas commented Nov 3, 2018

But when this methods gets inlined, this patterns vanishe

Does it vanish even for 32-bit variant of the method?

@gfoidl
Copy link
Member

gfoidl commented Nov 3, 2018

In the 32-bit variant this pattern ("setb al / movzx eax, al") doesn't show up, because of the explicit jumps from the comparison. For 32-bit I can just view the dasm in SharpLab (have only a setup for x64). Here the sbb + inc-part is interesting, but I can't judge the cost of this compared to the version with the ? false : true-hack.

@gfoidl
Copy link
Member

gfoidl commented Nov 3, 2018

When IsRangeInBounds is written without #if BIT64 (for use in "general" libraries) as

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
    if (IntPtr.Size == 8)
    {
        // ...
        return (ulong)(uint)offset + (ulong)(uint)length <= (ulong)(uint)availableLength;
    }
    else
    {
        return ((uint)offset > (uint)availableLength || (uint)length > (uint)(availableLength - offset)) ? false : true;
    }
}

"setb al / movzx eax, al" does not vanish. So if this methods is added, I propose to make it public, in order that other places can benefit from this trick.

@mikedn
Copy link

mikedn commented Nov 5, 2018

// We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.

@GrabYourPitchforks I'm curious how big the overhead is. Did you measure it?

dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corefx that referenced this pull request Nov 5, 2018
* Perf improvements to Span and Memory
- Improves perf of AsSpan, AsMemory, ctor, and Slice

* PR feedback - add comments

Signed-off-by: dotnet-bot <[email protected]>
dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corert that referenced this pull request Nov 5, 2018
* Perf improvements to Span and Memory
- Improves perf of AsSpan, AsMemory, ctor, and Slice

* PR feedback - add comments

Signed-off-by: dotnet-bot <[email protected]>
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Nov 5, 2018
* Perf improvements to Span and Memory
- Improves perf of AsSpan, AsMemory, ctor, and Slice

* PR feedback - add comments

Signed-off-by: dotnet-bot <[email protected]>
@GrabYourPitchforks GrabYourPitchforks deleted the span_slice branch November 5, 2018 19:16
@GrabYourPitchforks
Copy link
Member Author

@mikedn I did not measure it. TBH it was just an assumption on my part and not something based on empirical measurements.

@ahsonkhan @jkotas @gfoidl @benaadams I experimented a bit more, and it turns out if the helper method returns void rather than bool it should result in optimal codegen.

namespace System
{
    [StackTraceHidden]
    internal static class ArgValidation
    {
        /// <summary>
        /// Calls <see cref="ThrowHelper.ThrowArgumentOutOfRangeException"/> if the input arguments
        /// <paramref name="desiredStart"/> and <paramref name="desiredLength"/> aren't within the range
        /// of <paramref name="actualLength"/>.
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static void ThrowIfArgsOutOfRangeForSlice(int desiredStart, int desiredLength, int actualLength)
        {
            Debug.Assert(actualLength >= 0);

#if BIT64
            // See comment on over overload of ThrowIfArgsOutOfRangeForSlice.
            if ((ulong)(uint)desiredStart + (ulong)(uint)desiredLength > (ulong)(uint)actualLength)
                ThrowHelper.ThrowArgumentOutOfRangeException();
#else
            if ((uint)desiredStart > (uint)actualLength || (uint)desiredLength > (uint)(actualLength - desiredStart))
                ThrowHelper.ThrowArgumentOutOfRangeException();
#endif
        }

        /// <summary>
        /// Calls <see cref="ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument)"/> if the input arguments
        /// <paramref name="desiredStart"/> and <paramref name="desiredLength"/> aren't within the range
        /// of <paramref name="actualLength"/>.
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static void ThrowIfArgsOutOfRangeForSlice(int desiredStart, int desiredLength, int actualLength, ExceptionArgument argument)
        {
            Debug.Assert(actualLength >= 0);

#if BIT64
            // Since start and length are both 32-bit, their sum can be computed across a 64-bit domain
            // without loss of fidelity. The cast to uint before the cast to ulong ensures that the
            // extension from 32- to 64-bit is zero-extending rather than sign-extending. The end result
            // of this is that if either input is negative or if the input sum overflows past Int32.MaxValue,
            // that information is captured correctly in the comparison against the backing _length field.
            // We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.
            if ((ulong)(uint)desiredStart + (ulong)(uint)desiredLength > (ulong)(uint)actualLength)
                ThrowHelper.ThrowArgumentOutOfRangeException(argument);
#else
            if ((uint)desiredStart > (uint)actualLength || (uint)desiredLength > (uint)(actualLength - desiredStart))
                ThrowHelper.ThrowArgumentOutOfRangeException(argument);
#endif
        }
    }
}

This means that the callers no longer have the if statement in their own code; it's instead delegated to this helper method. I don't think this is a pattern we commonly use in the framework, but it may be worth looking into further.

@benaadams
Copy link
Member

I don't think this is a pattern we commonly use in the framework, but it may be worth looking into further.

It might be more used than you realise 😉 e.g. ThrowHelper.cs#L287-L303

MichalStrehovsky pushed a commit to dotnet/corert that referenced this pull request Nov 8, 2018
* Perf improvements to Span and Memory
- Improves perf of AsSpan, AsMemory, ctor, and Slice

* PR feedback - add comments

Signed-off-by: dotnet-bot <[email protected]>
A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018
* Perf improvements to Span and Memory
- Improves perf of AsSpan, AsMemory, ctor, and Slice

* PR feedback - add comments
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Perf improvements to Span and Memory
- Improves perf of AsSpan, AsMemory, ctor, and Slice

* PR feedback - add comments


Commit migrated from dotnet/coreclr@e8438d3
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants