Originally reported at: https://github.com/rust-lang/rust/issues/53823

The cmov conversion pass converts a cmov in Rust's binary search implementation into a branch. As binary search branches are badly predicted, this significantly hurts performance. For small data sets, which are not dominated by cache performance, the slowdown is around 2-3x.

https://gist.github.com/nikic/13e907d7097f74a92d082fcc61bc212c has the LLVM IR for the function, as well as generated assembly with cmov conversion enabled (default) and disabled (-x86-cmov-converter=false).

It would be great if the cmov conversion heuristics could be adjusted in some way to avoid converting this case.
For reference, the relevant line in the original Rust code is: https://github.com/rust-lang/rust/blob/059e6a6f57f4e80d527a3cd8a8afe7f51f01af8e/src/libcore/slice/mod.rs#L1423

The interesting part of the IR:

  %4 = lshr i64 %size.022.i.i, 1
  %5 = add i64 %4, %base.021.i.i
  %6 = getelementptr inbounds [0 x i64], [0 x i64]* %slice.0, i64 0, i64 %5
  %7 = load i64, i64* %6, align 8, !alias.scope !14, !noalias !19
  %8 = icmp ugt i64 %7, %0
  %base.0..i.i = select i1 %8, i64 %base.021.i.i, i64 %5

And assembly with cmov (what we want):

  movq %rsi, %r9
  shrq %r9
  leaq (%r9,%r8), %rcx
  cmpq %rdx, (%rdi,%rcx,8)
  cmovaq %r8, %rcx

And with branch (what we get):

  movq %rsi, %rcx
  shrq %rcx
  leaq (%rcx,%rdx), %r9
  cmpq %r8, (%rdi,%r9,8)
  ja .LBB0_7
# %bb.6:        # %bb9.i.i
                # in Loop: Header=BB0_5 Depth=1
  movq %r9, %rdx
.LBB0_7:        # %bb9.i.i
                # in Loop: Header=BB0_5 Depth=1
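For readers who would rather not parse the IR, the loop is essentially the following (a hand-written C++ rendering of the Rust code linked above, for illustration only):

#include <cstddef>
#include <cstdint>

// Illustrative C++ rendering of the Rust binary search loop. The conditional
// assignment to `base` is the select that cmov conversion turns back into a
// branch.
std::size_t lower_bound_index(const std::uint64_t* slice, std::size_t len,
                              std::uint64_t key) {
  std::size_t base = 0;
  std::size_t size = len;
  while (size > 1) {
    std::size_t half = size >> 1;        // lshr
    std::size_t mid = base + half;       // add
    // icmp ugt + select: both candidate values are already computed, we
    // only wait on the comparison result.
    base = (slice[mid] > key) ? base : mid;
    size -= half;
  }
  return base;
}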
I remember this problem from a while ago, and I wonder if the !unpredictable metadata attachment could solve this problem: https://llvm.org/docs/LangRef.html#unpredictable-metadata

I see a few instances of it in the test suite which gives me hope:

$ git grep 'select.*!unpredict' ../llvm
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:; CHECK-NEXT: [[SEL:%.*]] = select i1 [[CMP]], float [[DIV]], float 2.000000e+00, !unpredictable !0
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:  %sel = select i1 %cmp, float %div, float 2.0, !unpredictable !0
../llvm/test/Transforms/Inline/profile-meta.ll:  %sel = select i1 %c, i32 %a, i32 %b, !prof !0, !unpredictable !1
../llvm/test/Transforms/Inline/profile-meta.ll:; CHECK-NEXT: [[SEL:%.*]] = select i1 %C, i32 %A, i32 %B, !prof !0, !unpredictable !1
../llvm/test/Transforms/SimplifyCFG/PhiEliminate2.ll:; CHECK-NEXT: %V6 = select i1 %C, i32 %V4, i32 %V5, !prof !0, !unpredictable !1
Just gave !unpredictable a try, and unfortunately it didn't help. This does sound like something that the pass should take into account -- not sure if this information is still available at that point though.
Looks like a similar optimization in CodeGenPrepare can be disabled via !unpredictable: https://github.com/llvm-mirror/llvm/blob/f2511abd84e3aa730ed68ac30881b0895d03b6d4/lib/CodeGen/CodeGenPrepare.cpp#L5756 But this metadata is not preserved in SelectionDAG.
Yes, 'unpredictable' was created for situations like this, but I never looked at getting metadata to survive to a machine IR pass. Taming the heuristic cost calc might be the easier fix. Which CPU(s) are showing the 2-3x slowdown? It's possible that a particular sched model and/or generic defaults should be adjusted to avoid this problem.
@spatel: I'm seeing 2.8x slowdown on i5-4690 CPU (Haswell), with both default and native target CPU. On the Rust bug report two other people report similar numbers, though I don't know which CPUs those are. I don't think that changing scheduling models will help for this particular case (where the transform should be pretty much universally disadvantageous, as the branch is unpredictable).
(In reply to Nikita Popov from comment #6)
> @spatel: I'm seeing 2.8x slowdown on i5-4690 CPU (Haswell), with both
> default and native target CPU. On the Rust bug report two other people
> report similar numbers, though I don't know which CPUs those are.
>
> I don't think that changing scheduling models will help for this particular
> case (where the transform should be pretty much universally disadvantageous,
> as the branch is unpredictable).

Yes, I agree - and apparently so did the author of this pass. :)

There's a test in /test/CodeGen/X86/x86-cmov-converter.ll called "BinarySearch", and that's supposed to be guarding this transform from happening on code like yours. Seeing how your machine code differs from that reference should be informative.
So that test case has a structure like

  %Left.sink = select i1 %tobool, %struct.Node** %Left, %struct.Node** %Right
  %3 = load %struct.Node*, %struct.Node** %Left.sink, align 8

which I think is guarded by https://github.com/llvm-mirror/llvm/blob/0bbe50f2fb47207ac8093bb700edcb75616b7dd5/lib/Target/X86/X86CmovConversion.cpp#L536 in the cmov pass.

Notably, and as the comment there indicates, this is not actually a binary search, but rather a binary *tree* search, and the code uses "load depending on select" to detect that case.
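To make the difference concrete, the tree-search shape looks roughly like this (hand-written sketch, not the code from the test):

#include <cstdint>

struct Node {
  std::uint64_t value;
  Node* left;
  Node* right;
};

// Binary *tree* search: the select picks a pointer and the very next
// operation loads through it, which is the "load depending on select"
// pattern the guard recognizes.
const Node* tree_search(const Node* n, std::uint64_t key) {
  while (n != nullptr && n->value != key)
    n = (key < n->value) ? n->left : n->right;  // select, then load via result
  return n;
}

In the array case from the original report, the select only picks an index, and the next load address is computed from it via an add and scaled addressing, so the select result is never used directly as a load pointer -- which is presumably why the check doesn't fire here.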
This case seems pretty tricky to me, because on the surface, the pattern is exactly one where cmov conversion should be profitable (both select values are already computed, we're only waiting on the condition) -- under the assumption that the branch is predictable.

The cmov conversion cost modeling currently only models whether the conversion is profitable under the assumption of a reasonably well-predicted branch, but (apart from the tree-search case mentioned above) doesn't make an effort to determine whether or not the branch is expected to be predictable. And I'm not sure how it could really do that.

My best idea would be that branch predictability correlates with access locality, and that branches depending on random-access loads are more likely to be unpredictable, while branches depending on loads from a linear induction variable are likely to be predictable. But that's something that doesn't sound easy to check at the point where cmov conversion operates.
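To illustrate the kind of distinction I mean (hypothetical example, not something the pass can currently see):

#include <cstddef>

// Load address is a linear induction variable: the accesses are sequential,
// and on sorted or clustered data the comparison outcome changes rarely, so
// the branch has at least a chance of predicting well. Whether it actually
// does still depends on the data, which is part of why this is hard to turn
// into a heuristic.
std::size_t count_less_linear(const long* a, std::size_t n, long key) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < n; ++i)
    if (a[i] < key)
      ++count;
  return count;
}

The binary-search loop earlier in this report is the opposite extreme: every load address depends on prior comparison outcomes, so the comparison is essentially a coin flip for random queries.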
Looks like this issue has previously been reported at https://bugs.llvm.org/show_bug.cgi?id=37144.
This commit added a magic-number (why 12.5%?) restriction: https://reviews.llvm.org/rL310352 (bug 33954)

...so there's still a possibility that we could adjust the code to avoid this case. If not (or even if we do that), we should figure out how to do the plumbing to get 'unpredictable' down to this layer and have this pass use that info.
For reference, the computed values in the cost modelling (for generic target) are:

Diff[0]: 1
Diff[1]: 6
LoopDepth[0].Depth: 8
LoopDepth[1].Depth: 15
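If I'm reading the profitability check added in rL310352 correctly (someone should double-check against the current source), the group is converted when the gain gradient between the two modelled iterations is at least 50% and the second-iteration gain is at least 12.5% of the critical path, roughly:

  (Diff[1] - Diff[0]) * 2 >= LoopDepth[1].Depth - LoopDepth[0].Depth
  Diff[1] * 8             >= LoopDepth[1].Depth

Plugging in the numbers above:

  (6 - 1) * 2 = 10 >= 15 - 8 = 7    (passes)
  6 * 8       = 48 >= 15            (passes)

so both tests pass and the cmov gets converted.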
*** Bug 40646 has been marked as a duplicate of this bug. ***
I was wondering if there has been any progress on this.

One case very similar to binary search where this hits us is binary heap primitives. See for example this code: https://godbolt.org/z/h6hFxg The condition highlighted is unpredictable, and the cmov conversion has a significant negative impact. We have a workaround that uses inline asm to force a cmov, but supporting that is quite annoying (the real implementation is a template and we have an extension point to define per-comparator workarounds).

It also seems that branch probabilities from FDO data are not taken into account by the optimization. As in the godbolt link, disabling the optimization globally restores the cmov, but that may have other unwanted side effects. Is there any way to disable cmov conversion on a per-function basis?

Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether it's possible to express that in Clang (is there a corresponding builtin?) and it also looks like the optimization is still not honoring that either.
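For context, the inline-asm workaround is roughly of the following shape (a simplified sketch, not the actual templated implementation; the names are made up):

#include <cstddef>

// Force the selection through a cmov so the backend cannot turn it back
// into a branch. Sketch only, x86-64 GCC/Clang extended asm.
inline std::size_t cmov_select(bool cond, std::size_t if_true,
                               std::size_t if_false) {
  std::size_t result = if_false;
  std::size_t c = cond;  // widen so both asm operands have the same size
  asm("test %1, %1\n\t"    // set ZF from the condition
      "cmovne %2, %0"      // result = if_true when cond != 0
      : "+r"(result)
      : "r"(c), "r"(if_true)
      : "cc");
  return result;
}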
(In reply to Giuseppe Ottaviano from comment #14)
> Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether
> it's possible to express that in Clang (is there a corresponding builtin?)
> and it also looks like the optimization is still not honoring that either.

Yes, there is a clang builtin:

  if (__builtin_unpredictable(x > 0)) {
    foo();
  }

https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions

But no, it does not make a difference on your code example because that information is lost before the problematic pass. AFAIK, there have been no improvements on this since this bug was filed.
We did make the cost modelling a bit more conservative to avoid some additional regressions in LLVM 10. But nothing that would affect the binary search related cases. We probably should just make this fully conservative and assume worst-case probabilities for everything, until such a time as accurate profiling information is available this deep in the backend.