This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Workaround memset alignment sensitivity #24302

Merged: 3 commits merged into dotnet:master from jkotas:issue-18101 on Apr 30, 2019

Conversation

@jkotas jkotas commented Apr 29, 2019

memset is up to 2x slower on misaligned blocks on some types of hardware. The problem is the uneven performance of "rep stosb",
which is used to implement memset in some cases. The exact matrix of when it is slower and by how much is very complex.

This change works around the issue by aligning the memory block before it is passed to memset and filling in the potentially
misaligned part manually. The workaround will regress performance by a few percent (<10%) in some cases, but we gain up to a 2x
improvement in other cases.

Fixes #24300

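For illustration, here is a minimal sketch of the align-then-memset idea described above. This is not the code in this PR; the helper name, the 8-byte alignment target, and the use of Unsafe.InitBlock are assumptions made for the sketch.

    // Sketch only (hypothetical helper): zero the potentially misaligned head by
    // hand, then pass the aligned remainder to a memset-style block clear.
    // Unsafe = System.Runtime.CompilerServices.Unsafe.
    static unsafe void ClearAlignedSketch(byte* dst, nuint byteLength)
    {
        // Bytes needed to reach the next 8-byte boundary (0 if already aligned).
        nuint head = (nuint)dst & 7;
        if (head != 0)
        {
            head = 8 - head;
            if (head > byteLength)
                head = byteLength;

            // Fill the potentially misaligned part manually, one byte at a time.
            byteLength -= head;
            while (head-- != 0)
                *dst++ = 0;
        }

        // The remainder starts at an 8-byte-aligned address, so a "rep stosb"-based
        // memset no longer pays the misaligned-destination penalty.
        if (byteLength != 0)
            Unsafe.InitBlock(dst, 0, (uint)byteLength); // sketch assumes the length fits in uint
    }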

jkotas commented Apr 29, 2019

cc @stephentoub @EgorBo


EgorBo commented Apr 29, 2019

I wonder if it makes sense, for big arrays, to implement memset in managed code via AVX if available, instead of rep stosb (e.g. this is what clang generates for a plain C loop: https://godbolt.org/z/Tl-YFi).

E.g.:

    [Benchmark]
    public unsafe void Clear8192AvxUnrolled()
    {
        // _arr8192 is an 8192-byte array field on the benchmark class (setup not shown).
        var zerov = Vector256<byte>.Zero;
        var c = _arr8192;
        fixed (byte* p = c)
        {
            // Clear 128 bytes per iteration using four unrolled 32-byte AVX stores.
            for (int i = 0; i < c.Length; i += 32 * 4)
            {
                Avx.Store(p + i, zerov);
                Avx.Store(p + i + 32, zerov);
                Avx.Store(p + i + 64, zerov);
                Avx.Store(p + i + 96, zerov);
            }
        }
    }

    [Benchmark]
    public unsafe void Clear8192AvxUnrolledUnaligned()
    {
        var zerov = Vector256<byte>.Zero;
        var c = _arr8192;
        fixed (byte* p = c)
        {
            // TODO: align input
            for (int i = 0; i < c.Length; i += 32 * 4)
            {
                Avx.Store(p + i, zerov);
                Avx.Store(p + i + 32, zerov);
                Avx.Store(p + i + 64, zerov);
                Avx.Store(p + i + 96, zerov);
            }
        }
    }
|                        Method |      Mean |     Error |    StdDev |
|------------------------------ |----------:|----------:|----------:|
|                     Clear8192 |  71.18 ns | 0.5130 ns | 0.4798 ns |
|            Clear8192Unaligned | 101.68 ns | 0.0083 ns | 0.0077 ns |
|          Clear8192AvxUnrolled |  54.55 ns | 0.0267 ns | 0.0250 ns |
| Clear8192AvxUnrolledUnaligned |  90.52 ns | 0.0139 ns | 0.0130 ns |

@@ -24,7 +24,7 @@ public static unsafe void ClearWithoutReferences(ref byte b, nuint byteLength)
 return;

 #if CORECLR && (AMD64 || ARM64)
-if (byteLength > 4096)
+if (byteLength > 768)
Nit: could you add a comment as to how this number was deduced?

@EgorBo EgorBo Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ran the benchmark for different array sizes:

|         Method | arraySize |     Mean |
|--------------- |---------- |---------:|
|          Clear |       500 | 13.00 ns |
| ClearUnaligned |       500 | 11.32 ns |
|          Clear |       600 | 16.99 ns |
| ClearUnaligned |       600 | 16.94 ns |
|          Clear |       700 | 17.21 ns |
| ClearUnaligned |       700 | 23.03 ns |
|          Clear |       800 | 17.60 ns |
| ClearUnaligned |       800 | 18.15 ns |
|          Clear |       900 | 18.02 ns |
| ClearUnaligned |       900 | 19.22 ns |

but these results are for the old if (byteLength > 4096) code path, so all of them use Unsafe.InitBlockUnaligned; the problem appears after count > 600-700 on my machine (i7-8700K, Windows 10 x64)

@benaadams

> Avx.Store

Aside, as it is not what this PR is doing, but wouldn't Avx.StoreAlignedNonTemporal be better, since clear after use is more common than clear before use?
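
As a rough sketch of what that substitution would look like in the loop above (assumption for illustration: the pointer p would first have to be rounded up to a 32-byte boundary, which the benchmark above does not do):

    // Sketch only: same unrolled loop, but with non-temporal stores (vmovntdq),
    // which bypass the cache and require a 32-byte-aligned destination.
    for (int i = 0; i < c.Length; i += 32 * 4)
    {
        Avx.StoreAlignedNonTemporal(p + i, zerov);
        Avx.StoreAlignedNonTemporal(p + i + 32, zerov);
        Avx.StoreAlignedNonTemporal(p + i + 64, zerov);
        Avx.StoreAlignedNonTemporal(p + i + 96, zerov);
    }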


jkotas commented Apr 29, 2019

> I wonder if it makes sense, for big arrays, to implement memset in managed code via AVX if available, instead of rep stosb

It is rocket science to do this well. These things need to be tuned against real workloads; microbenchmarks are not enough.


EgorBo commented Apr 29, 2019

@benaadams will try but that was just a proof of concept :)


EgorBo commented Apr 29, 2019

@benaadams just tried it - Avx.StoreAlignedNonTemporal is 5x slower for the benchmark above (aligned data, 8192 bytes). The only difference in the asm output is vmovntdq instead of vmovdqa (or vmovdqu).

I am not sure, but according to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vmovdqa&expand=5666,5656,5666,5666,5668,5657,5656,5590,3338&techs=AVX,AVX2 the throughput of vmovdqa is 4x better.


jkotas commented Apr 29, 2019

@dotnet-bot test this please


EgorBo commented Apr 29, 2019

AVX with forced alignment:

    public static void MemsetLargeArrayAvx(byte* dst, int length)
    {
        // this code is supposed to be called only for large arrays
        // so we don't have to care about small arrays.
        *((nuint*)dst + 0) = 0;
        *((nuint*)dst + 1) = 0;
        *((nuint*)dst + 2) = 0;
        *((nuint*)dst + 3) = 0;
        byte* dsta = (byte*)(((nuint) (dst + 1) + 31) & ~(nuint) 31);

        var zero = Vector256<byte>.Zero;
        int i = 0;
        for (; i < length - 128; i += 128)
        {
            Avx.StoreAligned(dsta + i, zero);
            Avx.StoreAligned(dsta + i + 32, zero);
            Avx.StoreAligned(dsta + i + 64, zero);
            Avx.StoreAligned(dsta + i + 96, zero);
        }

        var endElements = (uint) (length - i);
        if (endElements > 0)
            Unsafe.InitBlockUnaligned(dst + i, 0, endElements);
    }

Results:

|                        Method |         array | arraySize |      Mean |
|------------------------------ |-------------- |---------- |----------:|
|                     Clear8192 | System.Byte[] |      8192 |  72.56 ns |
|            Clear8192Unaligned | System.Byte[] |      8192 | 102.74 ns |
|     Clear8192Unaligned_JanFix | System.Byte[] |      8192 |  74.92 ns |
|         Clear819Unaligned_Avx | System.Byte[] |      8192 |  56.45 ns |
|                 Clear8192_Avx | System.Byte[] |      8192 |  56.13 ns |

I understand it should be tested on different machines and different test data, including some real-world apps, so I am leaving it here in case anybody is interested.

@jkotas jkotas merged commit 3661584 into dotnet:master Apr 30, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Apr 30, 2019
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Apr 30, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Apr 30, 2019
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Apr 30, 2019
jkotas added a commit to dotnet/corert that referenced this pull request Apr 30, 2019
marek-safar pushed a commit to mono/mono that referenced this pull request Apr 30, 2019
@jkotas jkotas deleted the issue-18101 branch May 2, 2019 21:29
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Successfully merging this pull request may close these issues.

Array.Clear performance is sensitive to alignment