This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Workaround memset alignment sensitivity #24302

Merged: 3 commits merged into dotnet:master from jkotas:issue-18101 on Apr 30, 2019

Conversation

@jkotas jkotas commented Apr 29, 2019

memset is up to 2x slower on misaligned blocks on some types of hardware. The problem is the uneven performance of "rep stosb",
which is used to implement memset in some cases. The exact matrix of when it is slower and by how much is very complex.

This change works around the issue by aligning the memory block before it is passed to memset and filling in the potentially
misaligned part manually. The workaround will regress performance by a few percent (<10%) in some cases, but we gain up to a 2x
improvement in other cases.

Fixes #24300

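For illustration, here is a minimal sketch of the align-then-memset idea described above. This is not the code in this PR; the helper name, the 8-byte alignment target, and the use of Unsafe.InitBlock are assumptions made for the sketch.

    // Sketch only (hypothetical helper): zero the potentially misaligned head by
    // hand, then pass the aligned remainder to a memset-style block clear.
    // Unsafe = System.Runtime.CompilerServices.Unsafe.
    static unsafe void ClearAlignedSketch(byte* dst, nuint byteLength)
    {
        // Bytes needed to reach the next 8-byte boundary (0 if already aligned).
        nuint head = (nuint)dst & 7;
        if (head != 0)
        {
            head = 8 - head;
            if (head > byteLength)
                head = byteLength;

            // Fill the potentially misaligned part manually, one byte at a time.
            byteLength -= head;
            while (head-- != 0)
                *dst++ = 0;
        }

        // The remainder starts at an 8-byte-aligned address, so a "rep stosb"-based
        // memset no longer pays the misaligned-destination penalty.
        if (byteLength != 0)
            Unsafe.InitBlock(dst, 0, (uint)byteLength); // sketch assumes the length fits in uint
    }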

jkotas commented Apr 29, 2019

cc @stephentoub @EgorBo


EgorBo commented Apr 29, 2019

I wonder if it makes sense, for big arrays, to implement memset in managed code via AVX if available, instead of rep stosb (e.g. this is what clang generates for a plain C loop: https://godbolt.org/z/Tl-YFi).

E.g.:

    [Benchmark]
    public unsafe void Clear8192AvxUnrolled()
    {
        // _arr8192 is an 8192-byte array field on the benchmark class (setup not shown).
        var zerov = Vector256<byte>.Zero;
        var c = _arr8192;
        fixed (byte* p = c)
        {
            // Clear 128 bytes per iteration using four unrolled 32-byte AVX stores.
            for (int i = 0; i < c.Length; i += 32 * 4)
            {
                Avx.Store(p + i, zerov);
                Avx.Store(p + i + 32, zerov);
                Avx.Store(p + i + 64, zerov);
                Avx.Store(p + i + 96, zerov);
            }
        }
    }

    [Benchmark]
    public unsafe void Clear8192AvxUnrolledUnaligned()
    {
        var zerov = Vector256<byte>.Zero;
        var c = _arr8192;
        fixed (byte* p = c)
        {
            // TODO: align input
            for (int i = 0; i < c.Length; i += 32 * 4)
            {
                Avx.Store(p + i, zerov);
                Avx.Store(p + i + 32, zerov);
                Avx.Store(p + i + 64, zerov);
                Avx.Store(p + i + 96, zerov);
            }
        }
    }
|                        Method |      Mean |     Error |    StdDev |
|------------------------------ |----------:|----------:|----------:|
|                     Clear8192 |  71.18 ns | 0.5130 ns | 0.4798 ns |
|            Clear8192Unaligned | 101.68 ns | 0.0083 ns | 0.0077 ns |
|          Clear8192AvxUnrolled |  54.55 ns | 0.0267 ns | 0.0250 ns |
| Clear8192AvxUnrolledUnaligned |  90.52 ns | 0.0139 ns | 0.0130 ns |

@@ -24,7 +24,7 @@ public static unsafe void ClearWithoutReferences(ref byte b, nuint byteLength)
 return;

 #if CORECLR && (AMD64 || ARM64)
-if (byteLength > 4096)
+if (byteLength > 768)
Nit: could you add a comment as to how this number was deduced?

@EgorBo EgorBo Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ran the benchmark for different array sizes:

|         Method | arraySize |     Mean |
|--------------- |---------- |---------:|
|          Clear |       500 | 13.00 ns |
| ClearUnaligned |       500 | 11.32 ns |
|          Clear |       600 | 16.99 ns |
| ClearUnaligned |       600 | 16.94 ns |
|          Clear |       700 | 17.21 ns |
| ClearUnaligned |       700 | 23.03 ns |
|          Clear |       800 | 17.60 ns |
| ClearUnaligned |       800 | 18.15 ns |
|          Clear |       900 | 18.02 ns |
| ClearUnaligned |       900 | 19.22 ns |

but these results are for the old if (byteLength > 4096) code path, so all of them use Unsafe.InitBlockUnaligned; the problem appears after count > 600-700 on my machine (i7-8700K, Windows 10 x64)

@benaadams

> Avx.Store

Aside, as it is not what this PR is doing, but wouldn't Avx.StoreAlignedNonTemporal be better, since clear after use is more common than clear before use?
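
As a rough sketch of what that substitution would look like in the loop above (assumption for illustration: the pointer p would first have to be rounded up to a 32-byte boundary, which the benchmark above does not do):

    // Sketch only: same unrolled loop, but with non-temporal stores (vmovntdq),
    // which bypass the cache and require a 32-byte-aligned destination.
    for (int i = 0; i < c.Length; i += 32 * 4)
    {
        Avx.StoreAlignedNonTemporal(p + i, zerov);
        Avx.StoreAlignedNonTemporal(p + i + 32, zerov);
        Avx.StoreAlignedNonTemporal(p + i + 64, zerov);
        Avx.StoreAlignedNonTemporal(p + i + 96, zerov);
    }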


jkotas commented Apr 29, 2019

> I wonder if it makes sense, for big arrays, to implement memset in managed code via AVX if available, instead of rep stosb

It is rocket science to do this well. These things need to be tuned against real workloads; microbenchmarks are not enough.


EgorBo commented Apr 29, 2019

@benaadams will try but that was just a proof of concept :)


EgorBo commented Apr 29, 2019

@benaadams just tried it - Avx.StoreAlignedNonTemporal is 5x slower for the benchmark above (aligned data, 8192 bytes). The only difference in the asm output is vmovntdq instead of vmovdqa (or vmovdqu).

I am not sure, but according to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vmovdqa&expand=5666,5656,5666,5666,5668,5657,5656,5590,3338&techs=AVX,AVX2 the throughput of vmovdqa is 4x better.


jkotas commented Apr 29, 2019

@dotnet-bot test this please


EgorBo commented Apr 29, 2019

AVX with forced alignment:

    public static void MemsetLargeArrayAvx(byte* dst, int length)
    {
        // this code is supposed to be called only for large arrays
        // so we don't have to care about small arrays.
        *((nuint*)dst + 0) = 0;
        *((nuint*)dst + 1) = 0;
        *((nuint*)dst + 2) = 0;
        *((nuint*)dst + 3) = 0;
        byte* dsta = (byte*)(((nuint) (dst + 1) + 31) & ~(nuint) 31);

        var zero = Vector256<byte>.Zero;
        int i = 0;
        for (; i < length - 128; i += 128)
        {
            Avx.StoreAligned(dsta + i, zero);
            Avx.StoreAligned(dsta + i + 32, zero);
            Avx.StoreAligned(dsta + i + 64, zero);
            Avx.StoreAligned(dsta + i + 96, zero);
        }

        var endElements = (uint) (length - i);
        if (endElements > 0)
            Unsafe.InitBlockUnaligned(dst + i, 0, endElements);
    }

Results:

|                        Method |         array | arraySize |      Mean |
|------------------------------ |-------------- |---------- |----------:|
|                     Clear8192 | System.Byte[] |      8192 |  72.56 ns |
|            Clear8192Unaligned | System.Byte[] |      8192 | 102.74 ns |
|     Clear8192Unaligned_JanFix | System.Byte[] |      8192 |  74.92 ns |
|         Clear819Unaligned_Avx | System.Byte[] |      8192 |  56.45 ns |
|                 Clear8192_Avx | System.Byte[] |      8192 |  56.13 ns |

I understand it should be tested on different machines and different test data, including some real-world apps, so I am leaving it here in case anybody is interested.

@jkotas jkotas merged commit 3661584 into dotnet:master Apr 30, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Apr 30, 2019
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Apr 30, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Apr 30, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Apr 30, 2019
jkotas added a commit to dotnet/corert that referenced this pull request Apr 30, 2019
marek-safar pushed a commit to mono/mono that referenced this pull request Apr 30, 2019
@jkotas jkotas deleted the issue-18101 branch May 2, 2019 21:29
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Successfully merging this pull request may close these issues.

Array.Clear performance is sensitive to alignment