Workaround memset alignment sensitivity #24302
Conversation
memset is up to 2x slower on misaligned blocks on some types of hardware. The problem is the uneven performance of "rep stosb", which is used to implement memset in some cases. The exact matrix of when it is slower, and by how much, is very complex. This change works around the issue by aligning the memory block before it is passed to memset and filling in the potentially misaligned part manually. This workaround will regress performance by a few percent (<10%) in some cases, but we gain up to a 2x improvement in other cases. Fixes #24300
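The idea, in a minimal hypothetical sketch (not the PR's exact code; the helper name and the 8-byte boundary are illustrative assumptions):

```csharp
using System.Runtime.CompilerServices;

// Hypothetical illustration of the workaround, not the PR's exact code.
// Assumes byteLength is large enough to contain the misaligned head.
static unsafe void ClearAlignedSketch(byte* p, nuint byteLength)
{
    // Bytes needed to reach the next 8-byte boundary (0..7).
    nuint misalign = (nuint)(-(nint)p) & 7;
    for (nuint i = 0; i < misalign; i++)
        p[i] = 0;                       // fill the misaligned head by hand

    // The remainder starts on an aligned address, so the memset-backed
    // block initialization avoids the slow misaligned "rep stosb" case.
    Unsafe.InitBlock(p + misalign, 0, (uint)(byteLength - misalign));
}
```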
I wonder if it makes sense for big arrays to implement the clear with AVX. E.g.: …
```diff
@@ -24,7 +24,7 @@ public static unsafe void ClearWithoutReferences(ref byte b, nuint byteLength)
         return;

 #if CORECLR && (AMD64 || ARM64)
-        if (byteLength > 4096)
+        if (byteLength > 768)
```
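For context, a hedged sketch of how this threshold sits inside ClearWithoutReferences, reconstructed only from the diff hunk above; the real coreclr method has more size tiers, and the small-block loop below is a stand-in, not the actual implementation:

```csharp
using System.Runtime.CompilerServices;

public static unsafe void ClearWithoutReferences(ref byte b, nuint byteLength)
{
    if (byteLength == 0)
        return;

#if CORECLR && (AMD64 || ARM64)
    // Blocks above the threshold go to the memset-backed path, which is
    // where the "rep stosb" alignment sensitivity shows up.
    if (byteLength > 768)
    {
        Unsafe.InitBlockUnaligned(ref b, 0, (uint)byteLength);
        return;
    }
#endif

    // Smaller blocks: a simple byte loop stands in here for the real
    // unrolled pointer-sized stores.
    for (nuint i = 0; i < byteLength; i++)
        Unsafe.Add(ref b, (nint)i) = 0;
}
```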
Nit: could you add a comment as to how this number was deduced?
ran the benchmark for different array sizes:
| Method | arraySize | Mean |
|--------------- |---------- |---------:|
| Clear | 500 | 13.00 ns |
| ClearUnaligned | 500 | 11.32 ns |
| Clear | 600 | 16.99 ns |
| ClearUnaligned | 600 | 16.94 ns |
| Clear | 700 | 17.21 ns |
| ClearUnaligned | 700 | 23.03 ns |
| Clear | 800 | 17.60 ns |
| ClearUnaligned | 800 | 18.15 ns |
| Clear | 900 | 18.02 ns |
| ClearUnaligned | 900 | 19.22 ns |
But note these results are for the old code (`if (byteLength > 4096)`), so all of them use `Unsafe.InitBlockUnaligned`; the problem appears after count > 600-700 on my machine (i7-8700K, Windows 10 x64).
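A microbenchmark of roughly this shape reproduces the comparison (a reconstruction, not the author's actual harness; the one-byte offset used to force misalignment is an assumption):

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ClearBenchmark
{
    [Params(500, 600, 700, 800, 900)]
    public int arraySize;

    private byte[] _data;

    [GlobalSetup]
    public void Setup() => _data = new byte[arraySize + 1];

    // Clears from the start of the array (aligned relative to the array base).
    [Benchmark]
    public void Clear() => _data.AsSpan(0, arraySize).Clear();

    // Starts one byte in, so the cleared block is misaligned.
    [Benchmark]
    public void ClearUnaligned() => _data.AsSpan(1, arraySize).Clear();
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<ClearBenchmark>();
}
```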
An aside, as it's not what this PR is doing, but …
It is rocket science to do this well. These things need to be tuned against real workloads; microbenchmarks are not enough.
@benaadams will try but that was just a proof of concept :)
Co-Authored-By: jkotas <[email protected]>
@benaadams just tried - I am not sure, but according to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vmovdqa&expand=5666,5656,5666,5666,5668,5657,5656,5590,3338&techs=AVX,AVX2 the throughput of …
@dotnet-bot test this please
AVX with force align:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe void MemsetLargeArrayAvx(byte* dst, int length)
{
    // This code is supposed to be called only for large arrays,
    // so we don't have to care about small ones (length must be >= 32).

    // Zero the first 32 bytes manually; this covers the potentially
    // misaligned head of the block.
    *((nuint*)dst + 0) = 0;
    *((nuint*)dst + 1) = 0;
    *((nuint*)dst + 2) = 0;
    *((nuint*)dst + 3) = 0;

    // Round dst up to the next 32-byte boundary for the aligned stores.
    byte* dsta = (byte*)(((nuint)dst + 31) & ~(nuint)31);
    int remaining = length - (int)(dsta - dst);

    var zero = Vector256<byte>.Zero;
    int i = 0;
    // Bound the loop by the bytes remaining past the aligned start,
    // so the last unrolled iteration never writes past dst + length.
    for (; i <= remaining - 128; i += 128)
    {
        Avx.StoreAligned(dsta + i, zero);
        Avx.StoreAligned(dsta + i + 32, zero);
        Avx.StoreAligned(dsta + i + 64, zero);
        Avx.StoreAligned(dsta + i + 96, zero);
    }

    // Fill whatever is left at the end (fewer than 128 bytes).
    var endElements = (uint)(remaining - i);
    if (endElements > 0)
        Unsafe.InitBlockUnaligned(dsta + i, 0, endElements);
}
```

Results:
I understand it should be tested on different machines and with different test data, including some real-world apps, so I am leaving it here in case anybody is interested.