LLVM Bugzilla is read-only and serves as the historical archive of all LLVM issues filed before November 26, 2021. Use GitHub to submit new LLVM bugs.

Bug 40027 - [X86] cmov conversion hurts binary search performance
Summary: [X86] cmov conversion hurts binary search performance
Status: NEW
Alias: None
Product: libraries
Classification: Unclassified
Component: Backend: X86
Version: trunk
Hardware: PC
OS: All
Importance: P enhancement
Assignee: Unassigned LLVM Bugs
URL:
Keywords:
Duplicates: 40646
Depends on:
Blocks:
 
Reported: 2018-12-14 09:45 PST by Nikita Popov
Modified: 2021-10-22 07:50 PDT
CC: 13 users

See Also:
Fixed By Commit(s):


Description Nikita Popov 2018-12-14 09:45:51 PST
Originally reported at: https://github.com/rust-lang/rust/issues/53823

The cmov conversion pass converts a cmov in Rust's binary search implementation into a branch. As binary search branches are badly predicted, this significantly hurts performance. For small data sets, which are not dominated by cache performance, the slowdown is around 2-3x.

https://gist.github.com/nikic/13e907d7097f74a92d082fcc61bc212c has the LLVM IR for the function, as well as generated assembly with cmov conversion enabled (default) and disabled (-x86-cmov-converter=false).

It would be great if the cmov conversion heuristics could be adjusted in some way to avoid converting this case.
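To make the shape of the code concrete, here is a minimal C sketch of the binary-search step in question (hypothetical code, not the Rust standard library's actual implementation): both candidate values for `base` are already computed before the comparison, and only the select depends on the loaded data — exactly the pattern that the cmov converter here turns back into a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of `key` in sorted `slice`, or (size_t)-1 if absent.
 * Each iteration halves `size` and picks the new `base` with a select,
 * which the backend can lower to a single cmov. */
static size_t binary_search_u64(const uint64_t *slice, size_t len,
                                uint64_t key) {
    size_t base = 0, size = len;
    while (size > 1) {
        size_t half = size >> 1;
        size_t mid = base + half;
        /* Branchless select: ideally a cmov, since the load's result is
         * data-dependent and the branch would be unpredictable. */
        base = (slice[mid] > key) ? base : mid;
        size -= half;
    }
    return (len > 0 && slice[base] == key) ? base : (size_t)-1;
}
```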
Comment 1 Nikita Popov 2018-12-14 09:51:51 PST
For reference, the relevant line in the original Rust code is: https://github.com/rust-lang/rust/blob/059e6a6f57f4e80d527a3cd8a8afe7f51f01af8e/src/libcore/slice/mod.rs#L1423

The interesting part of the IR:

  %4 = lshr i64 %size.022.i.i, 1
  %5 = add i64 %4, %base.021.i.i
  %6 = getelementptr inbounds [0 x i64], [0 x i64]* %slice.0, i64 0, i64 %5
  %7 = load i64, i64* %6, align 8, !alias.scope !14, !noalias !19
  %8 = icmp ugt i64 %7, %0
  %base.0..i.i = select i1 %8, i64 %base.021.i.i, i64 %5

And assembly with cmov (what we want):

	movq	%rsi, %r9
	shrq	%r9
	leaq	(%r9,%r8), %rcx
	cmpq	%rdx, (%rdi,%rcx,8)
	cmovaq	%r8, %rcx

And with branch (what we get):

	movq	%rsi, %rcx
	shrq	%rcx
	leaq	(%rcx,%rdx), %r9
	cmpq	%r8, (%rdi,%r9,8)
	ja	.LBB0_7
# %bb.6:                                # %bb9.i.i
                                        #   in Loop: Header=BB0_5 Depth=1
	movq	%r9, %rdx
.LBB0_7:                                # %bb9.i.i
                                        #   in Loop: Header=BB0_5 Depth=1
Comment 2 Reid Kleckner 2018-12-14 11:26:21 PST
I remember this problem from a while ago, and I wonder if the !unpredictable metadata attachment could solve this problem:
https://llvm.org/docs/LangRef.html#unpredictable-metadata

I see a few instances of it in the test suite which gives me hope:

$ git grep 'select.*!unpredict' ../llvm
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:; CHECK-NEXT:    [[SEL:%.*]] = select i1 [[CMP]], float [[DIV]], float 2.000000e+00, !unpredictable !0
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:  %sel = select i1 %cmp, float %div, float 2.0, !unpredictable !0
../llvm/test/Transforms/Inline/profile-meta.ll:  %sel = select i1 %c, i32 %a, i32 %b, !prof !0, !unpredictable !1
../llvm/test/Transforms/Inline/profile-meta.ll:; CHECK-NEXT:  [[SEL:%.*]] = select i1 %C, i32 %A, i32 %B, !prof !0, !unpredictable !1
../llvm/test/Transforms/SimplifyCFG/PhiEliminate2.ll:; CHECK-NEXT:  %V6 = select i1 %C, i32 %V4, i32 %V5, !prof !0, !unpredictable !1
Comment 3 Nikita Popov 2018-12-14 11:40:05 PST
Just gave !unpredictable a try, and unfortunately it didn't help.

This does sound like something that the pass should take into account -- not sure if this information is still available at that point though.
Comment 4 Nikita Popov 2018-12-14 12:14:58 PST
Looks like a similar optimization in CodeGenPrepare can be disabled via !unpredictable: https://github.com/llvm-mirror/llvm/blob/f2511abd84e3aa730ed68ac30881b0895d03b6d4/lib/CodeGen/CodeGenPrepare.cpp#L5756

But this metadata is not preserved in SelectionDAG.
Comment 5 Sanjay Patel 2018-12-14 12:16:03 PST
Yes, 'unpredictable' was created for situations like this, but I never looked at getting metadata to survive to a machine IR pass. Taming the heuristic cost calc might be the easier fix.

Which CPU(s) are showing the 2-3x slowdown? 

It's possible that a particular sched model and/or generic defaults should be adjusted to avoid this problem.
Comment 6 Nikita Popov 2018-12-14 13:16:28 PST
@spatel: I'm seeing 2.8x slowdown on i5-4690 CPU (Haswell), with both default and native target CPU. On the Rust bug report two other people report similar numbers, though I don't know which CPUs those are.

I don't think that changing scheduling models will help for this particular case (where the transform should be pretty much universally disadvantageous, as the branch is unpredictable).
Comment 7 Sanjay Patel 2018-12-14 13:41:40 PST
(In reply to Nikita Popov from comment #6)
> @spatel: I'm seeing 2.8x slowdown on i5-4690 CPU (Haswell), with both
> default and native target CPU. On the Rust bug report two other people
> report similar numbers, though I don't know which CPUs those are.
> 
> I don't think that changing scheduling models will help for this particular
> case (where the transform should be pretty much universally disadvantageous,
> as the branch is unpredictable).

Yes, I agree - and apparently so did the author of this pass. :)
There's a test in /test/CodeGen/X86/x86-cmov-converter.ll called "BinarySearch", and that's supposed to be guarding this transform from happening on code like yours. Seeing how your machine code differs from that reference should be informative.
Comment 8 Nikita Popov 2018-12-14 13:52:00 PST
So that test case has a structure like

  %Left.sink = select i1 %tobool, %struct.Node** %Left, %struct.Node** %Right
  %3 = load %struct.Node*, %struct.Node** %Left.sink, align 8

which I think is guarded by https://github.com/llvm-mirror/llvm/blob/0bbe50f2fb47207ac8093bb700edcb75616b7dd5/lib/Target/X86/X86CmovConversion.cpp#L536 in the cmov pass.

Notably, and as the comment there indicates, this is not actually a binary search, but rather a binary *tree* search, and the code uses "load depending on select" to detect that case.
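For contrast, a minimal C sketch of that binary *tree* search shape (hypothetical code, not the pass's test case itself): here the pointer produced by the select is loaded through on the very next iteration, so a well-predicted branch lets the next load start speculatively, and keeping the branch can genuinely help.

```c
#include <stddef.h>

struct Node {
    long val;
    struct Node *left, *right;
};

/* Binary tree search: the select chooses between two child pointers,
 * and the chosen pointer is immediately dereferenced (n->val on the
 * next iteration). This "load depending on select" is what the pass's
 * guard detects, and why it keeps the branch in this shape. */
static const struct Node *tree_find(const struct Node *n, long key) {
    while (n && n->val != key) {
        n = (key < n->val) ? n->left : n->right;
    }
    return n;
}
```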
Comment 9 Nikita Popov 2018-12-15 02:25:14 PST
This case seems pretty tricky to me, because on the surface, the pattern is exactly one where cmov conversion should be profitable (both select values are already computed, we're only waiting on the condition) -- under the assumption that the branch is predictable.

The cmov conversion cost modeling currently only models whether the conversion is profitable under the assumption of a reasonably well-predicted branch, but (apart from the tree-search case mentioned above) doesn't make an effort to determine whether or not the branch is expected to be predictable.

And I'm not sure how it could really do that. My best idea would be that branch predictability correlates with access locality, and that branches depending on random-access loads are more likely to be unpredictable, while branches depending on loads from a linear induction variable are likely to be predictable. But that's something that doesn't sound easy to check at the point where cmov conversion operates.
Comment 10 Nikita Popov 2018-12-17 05:51:27 PST
Looks like this issue has previously been reported at https://bugs.llvm.org/show_bug.cgi?id=37144.
Comment 11 Sanjay Patel 2018-12-17 06:14:45 PST
This commit added a magic-number (why 12.5%?) restriction:
https://reviews.llvm.org/rL310352 (bug 33954)

...so there's still a possibility that we could adjust the code to avoid this case. 

If not (or even if we do that), we should figure out how to run the plumbing to get 'unpredictable' down to this layer and have this pass use that info.
Comment 12 Nikita Popov 2018-12-17 06:35:52 PST
For reference, the computed values in the cost modelling (for generic target) are:

Diff[0]: 1
Diff[1]: 6
LoopDepth[0].Depth: 8
LoopDepth[1].Depth: 15
Comment 13 Antonin Kriz 2019-02-19 05:00:57 PST
*** Bug 40646 has been marked as a duplicate of this bug. ***
Comment 14 Giuseppe Ottaviano 2020-06-30 14:49:19 PDT
I was wondering if there has been any progress on this. One case very similar to binary search where this hits us is binary heap primitives. See for example this code: https://godbolt.org/z/h6hFxg

The condition highlighted is unpredictable, and the cmov conversion has a significant negative impact. We have a workaround that uses inline asm to force a cmov, but supporting that is quite annoying (the real implementation is a template and we have an extension point to define per-comparator workarounds).

It also seems that branch probabilities from FDO data are not taken into account by the optimization.

As in the godbolt link, disabling the optimization globally restores the cmov, but that may have other unwanted side effects. Is there any way to disable cmov conversion on a per-function basis?

Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether it's possible to express that in Clang (is there a corresponding builtin?) and it also looks like the optimization is still not honoring that either.
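For reference, the inline-asm workaround mentioned above can be sketched in C roughly like this (an assumed implementation for GCC/Clang extended asm on x86-64, not the reporter's actual template code); the ternary fallback keeps it portable on other targets:

```c
#include <stdint.h>

/* Force a cmov so the backend cannot convert the select into a branch.
 * Assumed sketch: test sets ZF from cond, and cmovne overwrites result
 * with if_true only when cond is nonzero -- no branch is emitted. */
static inline uint64_t select_u64(int cond, uint64_t if_true,
                                  uint64_t if_false) {
#if defined(__x86_64__)
    uint64_t result = if_false;
    __asm__("testl %1, %1\n\t"
            "cmovneq %2, %0"
            : "+r"(result)
            : "r"(cond), "r"(if_true)
            : "cc");
    return result;
#else
    return cond ? if_true : if_false;
#endif
}
```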
Comment 15 Sanjay Patel 2020-07-01 05:53:22 PDT
(In reply to Giuseppe Ottaviano from comment #14) 
> Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether
> it's possible to express that in Clang (is there a corresponding builtin?)
> and it also looks like the optimization is still not honoring that either.

Yes, there is a clang builtin:

if (__builtin_unpredictable(x > 0)) {
   foo();
}

https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions

But no, it does not make a difference on your code example because that information is lost before the problematic pass. 

AFAIK, there have been no improvements on this since this bug was filed.
Comment 16 Nikita Popov 2020-07-01 06:00:00 PDT
We did make the cost modelling a bit more conservative to avoid some additional regressions in LLVM 10. But nothing that would affect the binary search related cases.

We probably should just make this fully conservative and assume worst-case probabilities for everything, until such a time as accurate profiling information is available this deep in the backend.