37266 – Performance regression 4.0 to 6.0 due to unrolling the first trip through an SSE2 ASCII validation loop

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	new-bugs
Classification:	Unclassified
Component:	new bugs
Version:	6.0
Hardware:	PC All

Importance:	P normal
Assignee:	Unassigned LLVM Bugs

Reported:	2018-04-27 01:54 PDT by Henri Sivonen
Modified:	2018-05-21 01:42 PDT
CC List:	6 users

See Also:	https://github.com/rust-lang/rust/issues/49873 https://bugzilla.mozilla.org/show_bug.cgi?id=1451703

Description Henri Sivonen 2018-04-27 01:54:44 PDT

Minimized test case: https://github.com/hsivonen/llvm_ascii_validation (assumes that https://rustup.rs/ is installed.)

When rustc switched from LLVM 4.0 to LLVM 6.0, the performance of Firefox's Rust-based, SSE2-using UTF-8 validation function regressed by up to 12.5% (measured in CI on EC2).

The change is that LLVM 6.0 unrolls the first trip through the innermost SSE2 ASCII validation loop, whereas LLVM 4.0 did not unroll it and compiled the loop to its most obvious form.

The basic block of the loop as produced by LLVM 4.0:

.LBB0_6:
        movdqu  (%rdi,%rax), %xmm0
        pmovmskb        %xmm0, %edx
        testl   %edx, %edx
        jne     .LBB0_7
        addq    $16, %rax
        cmpq    %rcx, %rax
        jbe     .LBB0_6
        jmp     .LBB0_2

The unrolled part and the actual loop as produced by LLVM 6.0:

        .cfi_startproc
        cmpq    $16, %rsi
        jb      .LBB0_1
        movdqu  (%rdi), %xmm0
        pmovmskb        %xmm0, %ecx
        testl   %ecx, %ecx
        je      .LBB0_10
        xorl    %esi, %esi
        testl   %ecx, %ecx
        je      .LBB0_7
.LBB0_8:
        bsfl    %ecx, %eax
        jmp     .LBB0_9
.LBB0_1:
        xorl    %eax, %eax
        cmpq    %rsi, %rax
        jb      .LBB0_13
.LBB0_15:
        movq    %rsi, %rax
        retq
.LBB0_10:
        leaq    -16(%rsi), %rdx
        movl    $16, %eax
        .p2align        4, 0x90
.LBB0_11:
        cmpq    %rdx, %rax
        ja      .LBB0_12
        movdqu  (%rdi,%rax), %xmm0
        pmovmskb        %xmm0, %ecx
        addq    $16, %rax
        testl   %ecx, %ecx
        je      .LBB0_11
        addq    $-16, %rax
        movq    %rax, %rsi
        testl   %ecx, %ecx
        jne     .LBB0_8

The unrolling is already visible at the LLVM IR level. Both cases were compiled with opt-level 2, which is what Firefox ships with.
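
For readers without the test case checked out, the loop under discussion has roughly the following shape when written in C. This is only a sketch under assumptions: the name ascii_valid_up_to is borrowed from the C translation discussed in comment 4, the tail handling and the scalar fallback are guesses added so that the block compiles on any target, and __builtin_ctz is a GCC/Clang builtin rather than standard C. It is not the code from the repository linked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first non-ASCII byte in buf[0..len),
 * or len if every byte is ASCII. */
size_t ascii_valid_up_to(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t offset = 0;
#if defined(__SSE2__)
    if (len >= 16) {
        size_t last_full_chunk = len - 16;
        /* The loop the report is about: unaligned 16-byte load, gather the
         * high bit of every byte, stop at the first nonzero mask. */
        while (offset <= last_full_chunk) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + offset));
            int mask = _mm_movemask_epi8(chunk);
            if (mask != 0) {
                /* Lowest set bit marks the first non-ASCII byte in the chunk. */
                return offset + (size_t)__builtin_ctz((unsigned)mask);
            }
            offset += 16;
        }
    }
#endif
    /* Scalar tail (assumed), also used for the whole input without SSE2. */
    while (offset < len) {
        if (buf[offset] >= 0x80) {
            return offset;
        }
        offset++;
    }
    return len;
}
```

The while loop over 16-byte chunks is the part that corresponds to .LBB0_6 in the LLVM 4.0 output: load, pmovmskb, test, advance by 16, compare against the last full chunk.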

# Benchmarking the minimized test case gives me these results:

x86_64 code running on Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz (Broadwell-EP)

Similar results with both the powersave and performance governors.

$ rustup default 1.24.0
$ ./bench.sh
[...]
test bench ... bench:   1,539,341 ns/iter (+/- 216,985)

$ rustup default 1.25.0
$ ./bench.sh
[...]
test bench ... bench:   1,865,801 ns/iter (+/- 22,297)

x86_64 code running on Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell-DT)

$ rustup default 1.24.0
$ ./bench.sh
[...]
test bench ... bench:   1,491,560 ns/iter (+/- 65,163)

$ rustup default 1.25.0
$ ./bench.sh
[...]
test bench ... bench:   1,673,239 ns/iter (+/- 15,355)
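
To put the ns/iter figures above on a common scale, the relative slowdown is (after - before) / before. A small helper in C, purely as an illustration (slowdown_percent is a made-up name, not part of bench.sh or the test case):

```c
#include <assert.h>

/* Hypothetical helper: percentage by which after_ns exceeds before_ns. */
double slowdown_percent(double before_ns, double after_ns) {
    assert(before_ns > 0.0);
    return (after_ns - before_ns) / before_ns * 100.0;
}
```

With the figures reported above, slowdown_percent(1539341, 1865801) comes to about 21.2 on the Broadwell-EP machine and slowdown_percent(1491560, 1673239) to about 12.2 on the Haswell-DT machine.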

# Links to other bug databases

Rust: https://github.com/rust-lang/rust/issues/49873
Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=1451703

Comment 1 Henri Sivonen 2018-04-27 02:17:53 PDT

This affects both x86 and x86_64.

Comment 2 Alex Crichton 2018-04-27 08:10:49 PDT

I've made an attempt at https://gist.github.com/alexcrichton/a2bdcaed3d87743893b6c5ab0a2e92cb to translate the Rust to C and reproduce the slowdown. Using compilers from http://releases.llvm.org/ I get:



$ ./clang+llvm-4.0.0-x86_64-linux-gnu-ubuntu-16.04/bin/clang foo.c -O -o before
$ time ./before
./before  0.88s user 0.00s system 99% cpu 0.879 total
$ ./clang+llvm-6.0.0-x86_64-linux-gnu-ubuntu-16.04/bin/clang foo.c -O -o after
$ time ./after
./after  1.08s user 0.00s system 99% cpu 1.085 total


The generated assembly appears to show the same behavior as the Rust assembly.

Comment 3 Alex Crichton 2018-04-27 08:14:20 PDT

Er sorry that was the wrong code, the corrected C code is https://gist.github.com/alexcrichton/dcd5186cf282fd6246c398b0f276a3e5

Comment 4 Henri Sivonen 2018-04-29 23:47:21 PDT

(In reply to Alex Crichton from comment #3)
> Er sorry that was the wrong code, the corrected C code is
> https://gist.github.com/alexcrichton/dcd5186cf282fd6246c398b0f276a3e5

Thank you.

(It seems to me that this version of main() reproduces the alignment used in the benchmark; the change doesn't affect the assembly generated for ascii_valid_up_to:

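/* black_box() and ascii_valid_up_to() come from the gist linked in
   comment 3 and are not repeated here. */
int main() {
    /* 108 bytes: 100 bytes of input plus the 8-byte offset applied below. */
    char* buf = (char*)malloc(108);
    if (!buf) {
        return 1;
    }
    for (int i = 0; i < 108; i++) {
        buf[i] = 'a';
    }
    for (int i = 0; i < 200000000; i++) {
        black_box((void*) buf);
        size_t len = 100;
        black_box(&len);
        /* Start 8 bytes into the allocation to reproduce the alignment. */
        size_t ret = ascii_valid_up_to(buf + 8, len);
        black_box((void*) &ret);
    }
    free(buf);
    return 0;
}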

)

Findings from looking at the code on godbolt.org (https://godbolt.org/g/4Lg4PM):

 * The change happened between clang 4.0.1 and 5.0.0.
 * #pragma nounroll doesn't affect the behavior.
 * -mllvm -unroll-allow-peeling=false doesn't affect the behavior.

Comment 5 Henri Sivonen 2018-05-21 01:42:00 PDT

This was fixed in rustc by cherry-picking https://github.com/llvm-mirror/llvm/commit/4c64dfea37812732f39e68ca444b2eb809d78dcc into the copy of LLVM used by rustc.