-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Perf improvements to Span and Memory #20771
Conversation
- Improves perf of AsSpan, AsMemory, ctor, and Slice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice trick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ratio: 0.58
Awesome!
I opened a separate linked issue (https://github.com/dotnet/coreclr/issues/20776) to help the JIT optimize this even further. In my brief testing this JIT optimization would further improve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet.
@@ -344,6 +345,12 @@ public Span<T> Slice(int start) | |||
public Span<T> Slice(int start, int length) | |||
{ | |||
#if BIT64 | |||
// Since start and length are both 32-bit, their sum can be computed across a 64-bit domain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. Thanks!
Is it worth making this a helper method somewhere? e.g. [MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
#if BIT64
// Since offset and length are both 32-bit, their sum can be computed across a 64-bit domain
// without loss of fidelity. The cast to uint before the cast to ulong ensures that the
// extension from 32- to 64-bit is zero-extending rather than sign-extending. The end result
// of this is that if either input is negative or if the input sum overflows past Int32.MaxValue,
// that information is captured correctly in the comparison against the availableLength field.
// We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.
return ((ulong)(uint)offset + (ulong)(uint)length > (ulong)(uint)availableLength) ? false : true;
#else
return ((uint)offset > (uint)availableLength || (uint)length > (uint)(availableLength - offset)) ? false : true;
#endif
} |
Perhaps. I spoke with Ahson regarding this, and I couldn't figure out how to move it into a separate helper method without getting the JIT to emit the typical "setb al / movzx eax, al" anti-pattern. If you can figure out a good way to coax it to do the right thing inside of a standalone method send a PR. :) |
Yes, that would definitely be helpful, but it may not be feasible as @GrabYourPitchforks stated.
Maybe writing it the way @benaadams suggested (i.e. |
emits a branch G_M61453_IG01:
G_M61453_IG02:
8BC7 mov eax, edi
8BFE mov edi, esi
4803C7 add rax, rdi
8BFA mov edi, edx
483BC7 cmp rax, rdi
7706 ja SHORT G_M61453_IG04
B801000000 mov eax, 1
G_M61453_IG03:
C3 ret
G_M61453_IG04:
33C0 xor eax, eax
G_M61453_IG05:
C3 ret When written as just G_M61452_IG01:
G_M61452_IG02:
8BC7 mov eax, edi
8BFE mov edi, esi
4803C7 add rax, rdi
8BFA mov edi, edx
483BC7 cmp rax, rdi
0F96C0 setbe al
0FB6C0 movzx rax, al
G_M61452_IG03:
C3 ret But when this methods gets inlined, this patterns vanishes. Trivial example: internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
return (ulong)(uint)offset + (ulong)(uint)length <= (ulong)(uint)availableLength;
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Foo(int offset, int len, int availLen)
{
if (!IsRangeInBounds(offset, len, availLen)) Throw();
return availLen - (offset + len);
void Throw() => throw new Exception();
} generates G_M7900_IG01:
50 push rax
G_M7900_IG02:
8BC7 mov eax, edi
8BCE mov ecx, esi
4803C1 add rax, rcx
8BCA mov ecx, edx
483BC1 cmp rax, rcx
770B ja SHORT G_M7900_IG05
G_M7900_IG03:
03FE add edi, esi
8BC2 mov eax, edx
2BC7 sub eax, edi
G_M7900_IG04:
4883C408 add rsp, 8
C3 ret
G_M7900_IG05:
E869F9FFFF call Program:<Foo>g__Throw|7_0()
CC int3 |
Does it vanish even for 32-bit variant of the method? |
In the 32-bit variant this pattern ("setb al / movzx eax, al") doesn't show up, because of the explicit jumps from the comparison. For 32-bit I can just view the dasm in SharpLab (have only a setup for x64). Here the |
When [MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsRangeInBounds(int offset, int length, int availableLength)
{
if (IntPtr.Size == 8)
{
// ...
return (ulong)(uint)offset + (ulong)(uint)length <= (ulong)(uint)availableLength;
}
else
{
return ((uint)offset > (uint)availableLength || (uint)length > (uint)(availableLength - offset)) ? false : true;
}
} "setb al / movzx eax, al" does not vanish. So if this methods is added, I propose to make it public, in order that other places can benefit from this trick. |
@GrabYourPitchforks I'm curious how big the overhead is. Did you measure it? |
* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Signed-off-by: dotnet-bot <[email protected]>
* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Signed-off-by: dotnet-bot <[email protected]>
* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Signed-off-by: dotnet-bot <[email protected]>
@mikedn I did not measure it. TBH it was just an assumption on my part and not something based on empirical measurements. @ahsonkhan @jkotas @gfoidl @benaadams I experimented a bit more, and it turns out if the helper method returns namespace System
{
[StackTraceHidden]
internal static class ArgValidation
{
/// <summary>
/// Calls <see cref="ThrowHelper.ThrowArgumentOutOfRangeException"/> if the input arguments
/// <paramref name="desiredStart"/> and <paramref name="desiredLength"/> aren't within the range
/// of <paramref name="actualLength"/>.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void ThrowIfArgsOutOfRangeForSlice(int desiredStart, int desiredLength, int actualLength)
{
Debug.Assert(actualLength >= 0);
#if BIT64
// See comment on over overload of ThrowIfArgsOutOfRangeForSlice.
if ((ulong)(uint)desiredStart + (ulong)(uint)desiredLength > (ulong)(uint)actualLength)
ThrowHelper.ThrowArgumentOutOfRangeException();
#else
if ((uint)desiredStart > (uint)actualLength || (uint)desiredLength > (uint)(actualLength - desiredStart))
ThrowHelper.ThrowArgumentOutOfRangeException();
#endif
}
/// <summary>
/// Calls <see cref="ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument)"/> if the input arguments
/// <paramref name="desiredStart"/> and <paramref name="desiredLength"/> aren't within the range
/// of <paramref name="actualLength"/>.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void ThrowIfArgsOutOfRangeForSlice(int desiredStart, int desiredLength, int actualLength, ExceptionArgument argument)
{
Debug.Assert(actualLength >= 0);
#if BIT64
// Since start and length are both 32-bit, their sum can be computed across a 64-bit domain
// without loss of fidelity. The cast to uint before the cast to ulong ensures that the
// extension from 32- to 64-bit is zero-extending rather than sign-extending. The end result
// of this is that if either input is negative or if the input sum overflows past Int32.MaxValue,
// that information is captured correctly in the comparison against the backing _length field.
// We don't use this same mechanism in a 32-bit process due to the overhead of 64-bit arithmetic.
if ((ulong)(uint)desiredStart + (ulong)(uint)desiredLength > (ulong)(uint)actualLength)
ThrowHelper.ThrowArgumentOutOfRangeException(argument);
#else
if ((uint)desiredStart > (uint)actualLength || (uint)desiredLength > (uint)(actualLength - desiredStart))
ThrowHelper.ThrowArgumentOutOfRangeException(argument);
#endif
}
}
} This means that the callers no longer have the |
It might be more used than you realise 😉 e.g. ThrowHelper.cs#L287-L303 |
* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Signed-off-by: dotnet-bot <[email protected]>
* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments
* Perf improvements to Span and Memory - Improves perf of AsSpan, AsMemory, ctor, and Slice * PR feedback - add comments Commit migrated from dotnet/coreclr@e8438d3
This improves the performance of
MemoryExtensions.AsSpan<T>
,Span<T>..ctor
, andSpan<T>.Slice
(and the read-only andMemory<T>
equivalents) when the caller is responsible for providing both a start index and an offset parameter. Currently this is implemented via two branches that the JIT can't always elide. With this PR, on 64-bit machines, this is implemented as a single branch and allows the JIT to perform constant-time folding where applicable, which offers performance gains in span-heavy parsing / formatting code.As an example, I've provided metrics for
Guid.Parse(string) : Guid
andWebUtility.HtmlEncode(string) : string
.Guid.Parse
usesSpan.Slice
heavily, which is shown by the substantial increase in overall throughput between the baseline and this PR.WebUtility.HtmlEncode
usesValueStringBuilder
under the covers, which usesSpan.Slice
internally but is a bit more conservative with those calls, so you'll see some throughput increase but nothing quite as substantial asGuid
's code did.