Minimized test case: https://github.com/hsivonen/llvm_ascii_validation (assumes that https://rustup.rs/ is installed.)

When rustc switched from LLVM 4.0 to LLVM 6.0, Firefox's Rust-based SSE2-using UTF-8 validation function regressed up to 12.5% in performance (in CI on EC2).

What changed was that LLVM 6.0 unrolls the first trip through the innermost SSE2 ASCII validation loop, but LLVM 4.0 didn't unroll it and compiled it to the most obvious form.

The basic block of the loop as produced by LLVM 4.0:

.LBB0_6:
    movdqu    (%rdi,%rax), %xmm0
    pmovmskb  %xmm0, %edx
    testl     %edx, %edx
    jne       .LBB0_7
    addq      $16, %rax
    cmpq      %rcx, %rax
    jbe       .LBB0_6
    jmp       .LBB0_2

The unrolled part and the actual loop as produced by LLVM 6.0:

    .cfi_startproc
    cmpq      $16, %rsi
    jb        .LBB0_1
    movdqu    (%rdi), %xmm0
    pmovmskb  %xmm0, %ecx
    testl     %ecx, %ecx
    je        .LBB0_10
    xorl      %esi, %esi
    testl     %ecx, %ecx
    je        .LBB0_7
.LBB0_8:
    bsfl      %ecx, %eax
    jmp       .LBB0_9
.LBB0_1:
    xorl      %eax, %eax
    cmpq      %rsi, %rax
    jb        .LBB0_13
.LBB0_15:
    movq      %rsi, %rax
    retq
.LBB0_10:
    leaq      -16(%rsi), %rdx
    movl      $16, %eax
    .p2align  4, 0x90
.LBB0_11:
    cmpq      %rdx, %rax
    ja        .LBB0_12
    movdqu    (%rdi,%rax), %xmm0
    pmovmskb  %xmm0, %ecx
    addq      $16, %rax
    testl     %ecx, %ecx
    je        .LBB0_11
    addq      $-16, %rax
    movq      %rax, %rsi
    testl     %ecx, %ecx
    jne       .LBB0_8

The unrolling is visible already on the LLVM-IR level.

Both cases used opt-level 2, which is what Firefox ships with.

# Benchmarking the minimized test case looks like this to me:

x86_64 code running on Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz (Broadwell-EP)
Similar results with both the powersave and performance governors.

$ rustup default 1.24.0
$ ./bench.sh
[...]
test bench ... bench:   1,539,341 ns/iter (+/- 216,985)
$ rustup default 1.25.0
$ ./bench.sh
[...]
test bench ... bench:   1,865,801 ns/iter (+/- 22,297)

x86_64 code running on Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell-DT)

$ rustup default 1.24.0
$ ./bench.sh
[...]
test bench ... bench:   1,491,560 ns/iter (+/- 65,163)
$ rustup default 1.25.0
$ ./bench.sh
[...]
test bench ... bench:   1,673,239 ns/iter (+/- 15,355)

# Links to other bug databases

Rust: https://github.com/rust-lang/rust/issues/49873
Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=1451703
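For readers who do not want to open the test case, the kind of loop the report describes can be sketched in Rust roughly as follows. This is a minimal illustration, not the code from the linked repository: the function body, the scalar tail, and the choice to restrict the SSE2 path to `x86_64` (where SSE2 is part of the baseline) are assumptions made for the sketch. The name `ascii_valid_up_to` is borrowed from the C port discussed later in this thread.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

/// Returns the length of the longest all-ASCII prefix of `bytes`.
/// Illustrative sketch only; not the code from the linked test case.
pub fn ascii_valid_up_to(bytes: &[u8]) -> usize {
    let len = bytes.len();
    let mut offset = 0usize;

    // SSE2 is part of the x86_64 baseline, so no runtime detection is done
    // here. On other targets only the scalar loop below is compiled.
    #[cfg(target_arch = "x86_64")]
    {
        while offset + 16 <= len {
            // SAFETY: the loop condition guarantees 16 readable bytes at
            // `offset`, and the unaligned load has no alignment requirement.
            let mask = unsafe {
                let chunk = _mm_loadu_si128(bytes.as_ptr().add(offset) as *const __m128i);
                // Collects the high bit of each of the 16 bytes (pmovmskb).
                _mm_movemask_epi8(chunk)
            };
            if mask != 0 {
                // The lowest set bit marks the first non-ASCII byte (bsf).
                return offset + mask.trailing_zeros() as usize;
            }
            offset += 16;
        }
    }

    // Scalar tail for the remaining bytes (fewer than 16 on x86_64).
    while offset < len {
        if bytes[offset] >= 0x80 {
            return offset;
        }
        offset += 1;
    }
    len
}
```

The `while offset + 16 <= len` loop is the part that corresponds to the `movdqu` / `pmovmskb` / `testl` basic block shown above; whether a given compiler version peels its first iteration depends on the optimizer, not on the source.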
This affects both x86 and x86_64.
I've made an attempt at https://gist.github.com/alexcrichton/a2bdcaed3d87743893b6c5ab0a2e92cb to translate the Rust to C and reproduce the slowdown. Using compilers from http://releases.llvm.org/ I get:

$ ./clang+llvm-4.0.0-x86_64-linux-gnu-ubuntu-16.04/bin/clang foo.c -O -o before
$ time ./before
./before  0.88s user 0.00s system 99% cpu 0.879 total
$ ./clang+llvm-6.0.0-x86_64-linux-gnu-ubuntu-16.04/bin/clang foo.c -O -o after
$ time ./after
./after  1.08s user 0.00s system 99% cpu 1.085 total

The generated assembly appears to show the same behavior as the Rust assembly.
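To put the timings on a common scale, the relative slowdowns implied by the measurements in this report can be worked out with a few lines of Rust. The helper names (`slowdown_percent`, `MEASUREMENTS`, `format_report`) are made up for this sketch; the numbers are copied from the benchmark output in the original report and from the `time` output above.

```rust
/// Relative slowdown of `after` compared with `before`, in percent.
fn slowdown_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// (label, before, after) triples copied from the measurements in this report.
const MEASUREMENTS: [(&str, f64, f64); 3] = [
    ("Broadwell-EP, rustc 1.24.0 vs 1.25.0 (ns/iter)", 1_539_341.0, 1_865_801.0),
    ("Haswell-DT, rustc 1.24.0 vs 1.25.0 (ns/iter)", 1_491_560.0, 1_673_239.0),
    ("C port, clang 4.0.0 vs 6.0.0 (seconds)", 0.879, 1.085),
];

/// Formats one line per measurement, rounded to one decimal place.
fn format_report() -> Vec<String> {
    MEASUREMENTS
        .iter()
        .map(|(label, before, after)| {
            format!("{}: {:.1}% slower", label, slowdown_percent(*before, *after))
        })
        .collect()
}
```

Calling `format_report()` yields roughly 21.2% for the Broadwell-EP run, 12.2% for the Haswell-DT run, and 23.4% for the C port. These are single runs with the noise margins shown in the benchmark output, so the figures are indicative rather than precise.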
Er sorry that was the wrong code, the corrected C code is https://gist.github.com/alexcrichton/dcd5186cf282fd6246c398b0f276a3e5
(In reply to Alex Crichton from comment #3)
> Er sorry that was the wrong code, the corrected C code is
> https://gist.github.com/alexcrichton/dcd5186cf282fd6246c398b0f276a3e5

Thank you.

(It seems to me that this version of main() reproduces the alignment used for benchmarking; it doesn't affect the assembly generated for ascii_valid_up_to:

int main() {
    /* 108 bytes so that a 100-byte window can start 8 bytes into the buffer. */
    char* buf = (char*)malloc(108);
    for (int i = 0; i < 108; i++) {
        buf[i] = 'a';
    }
    for (int i = 0; i < 200000000; i++) {
        black_box((void*) buf);
        size_t len = 100;
        black_box(&len);
        /* Validate the window that starts at offset 8. */
        size_t ret = ascii_valid_up_to(buf + 8, len);
        black_box((void*) &ret);
    }
    free(buf);
    return 0;
}
)

Findings from looking at the code on godbolt.org (https://godbolt.org/g/4Lg4PM):

* The change happened between clang 4.0.1 and 5.0.0.
* #pragma nounroll doesn't affect the behavior.
* -mllvm -unroll-allow-peeling=false doesn't affect the behavior.
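For comparison, the shape of the main() harness above can be sketched in Rust using `std::hint::black_box`. This is an assumption-laden illustration, not the benchmark from the test case: `run_harness` is a hypothetical name, the scalar `ascii_valid_up_to` is only a stand-in for the SSE2 function under test, and a `Vec` allocation is not guaranteed to have the same alignment as the `malloc` buffer, so the 8-byte offset mirrors the C code only approximately.

```rust
use std::hint::black_box;

/// Scalar stand-in for the function under test (hypothetical; the real
/// function uses SSE2 loads).
fn ascii_valid_up_to(bytes: &[u8]) -> usize {
    bytes
        .iter()
        .position(|&b| b >= 0x80)
        .unwrap_or(bytes.len())
}

/// Runs the check `iterations` times on a 100-byte window that starts
/// 8 bytes into a 108-byte buffer, and returns the last result.
fn run_harness(iterations: u64) -> usize {
    let buf = vec![b'a'; 108];
    let mut last = 0;
    for _ in 0..iterations {
        // black_box is a best-effort hint that discourages the optimizer
        // from hoisting or removing the call; it is not a hard guarantee.
        let len = black_box(100usize);
        let window = black_box(&buf[8..8 + len]);
        last = black_box(ascii_valid_up_to(window));
    }
    last
}
```

With an all-ASCII buffer, `run_harness(n)` returns 100 for any `n` greater than zero; timing it for a large `n` plays the role of `time ./before` and `time ./after` in the C experiment.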
This was fixed in rustc by cherry-picking https://github.com/llvm-mirror/llvm/commit/4c64dfea37812732f39e68ca444b2eb809d78dcc into the copy of LLVM used by rustc.