RFC: PGO Late instrumentation for LLVM

> Can you compare your results with another approach: simply do not
> instrument the top 1% hottest functions (by function entry count)? If this
> simple approach provides most of the benefits (my measurements on one
> codebase I tested show that it would eliminate over 97% of total function
> counts), we may be able to use a simpler approach.

For a static compiler, this is not possible. It also seems to defeat the
purpose of PGO -- the hottest functions are precisely those that need
profile guidance the most.

> The biggest thing I notice about this proposal is that although the focus
> of the proposal is on reducing profiling overhead, it requires using a
> different pass pipeline during *both* the "instr-generate" compilation and
> the "instr-use" compilation. Therefore it is more than simply a reduction
> in profiling overhead -- it can also change performance even ignoring the
> profiling information added to the IR. I think it would be interesting to
> test the modified pass pipeline even in the absence of profile information
> to better understand its effects in isolation (for example, maybe it would
> be good to add these passes during -O3 regardless of whether we are doing
> PGO).

The pipeline change is very PGO-specific. I expect it to have very little
impact on regular compilations:
1) LLVM's bottom-up inliner is already iterative.
2) The performance impact (on the instrumented build) can be as large as
4x -- which is implausible for any non-PGO pipeline change.

LLVM already supports running SCC passes iteratively, so an experiment
like this would be easy to do -- the data can be collected.

thanks,

David

>> Can you compare your results with another approach: simply do not
>> instrument the top 1% hottest functions (by function entry count)? If this
>> simple approach provides most of the benefits (my measurements on one
>> codebase I tested show that it would eliminate over 97% of total function
>> counts), we may be able to use a simpler approach.

> For a static compiler, this is not possible. It also seems to defeat the
> purpose of PGO -- the hottest functions are precisely those that need
> profile guidance the most.

In the program I looked at, the top 1% were just trivial getters,
constructors, and similar. We should already be "getting these right".

Stuff like:

class Foo {
  ...
  int m_bar;
public:
  Foo(int bar) : m_bar(bar) {}
  int getBar() const { return m_bar; }
  ...
};

Are the results different for your codebases? Have you tried something like
simply not instrumenting the hottest 1% or 0.5% of functions? (Maybe
restrict the instrumentation skipping to functions consisting of a single
BB with fewer than, say, 10 instructions.)
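
A minimal sketch of such a filter against the LLVM C++ API (the helper name
and the exact thresholds here are illustrative, not an existing pass):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"

// Hypothetical heuristic: skip profile counters for trivial functions.
// A single-BB function has no branches, so the only information lost is
// its entry count; if it is also tiny, it is almost certainly a getter
// or a constructor like Foo above.
static bool shouldSkipInstrumentation(const llvm::Function &F) {
  if (F.size() != 1)                      // number of basic blocks
    return false;
  return F.getEntryBlock().size() < 10;   // number of instructions
}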

Rong's approach is quite sophisticated; I'm just interested in getting a
sanity check against a "naive" approach to see how much the sophisticated
approach is buying us.

>> The biggest thing I notice about this proposal is that although the focus
>> of the proposal is on reducing profiling overhead, it requires using a
>> different pass pipeline during *both* the "instr-generate" compilation and
>> the "instr-use" compilation. Therefore it is more than simply a reduction
>> in profiling overhead -- it can also change performance even ignoring the
>> profiling information added to the IR. I think it would be interesting to
>> test the modified pass pipeline even in the absence of profile information
>> to better understand its effects in isolation (for example, maybe it would
>> be good to add these passes during -O3 regardless of whether we are doing
>> PGO).

> The pipeline change is very PGO-specific. I expect it to have very little
> impact on regular compilations:
> 1) LLVM's bottom-up inliner is already iterative.
> 2) The performance impact (on the instrumented build) can be as large as
> 4x -- which is implausible for any non-PGO pipeline change.

With respect to adding extra passes, I'm actually more concerned about the
non-instrumented build, for which Rong did not show any data. For example,
will users find their program is X% faster with middle-end ("ME") PGO than
with no PGO, when really (X/2)% of that is due simply to the extra passes
and not to any profile guidance? In that case we should prefer to use the
extra passes during regular -O3 builds as well. Conversely, if they find
that their program is X% faster with "ME" PGO, but the extra passes alone
are making the program (X/2)% slower, then users could be getting (3X/2)%
faster instead. (Concretely, with made-up numbers: if "ME" PGO measures 10%
faster but the extra passes alone cost 5%, the profile guidance itself is
worth roughly 15%.) I am only concerned about having two variables change
simultaneously; I think that instrumenting after some amount of cleanup has
been done makes a lot of sense.

Could Rong's proposal be made to work within the existing pipeline, by
doing the instrumentation after a subset of the existing pass pipeline has
been run?

-- Sean Silva

>> Can you compare your results with another approach: simply do not
>> instrument the top 1% hottest functions (by function entry count)? If this
>> simple approach provides most of the benefits (my measurements on one
>> codebase I tested show that it would eliminate over 97% of total function
>> counts), we may be able to use a simpler approach.

> For a static compiler, this is not possible. It also seems to defeat the
> purpose of PGO -- the hottest functions are precisely those that need
> profile guidance the most.

> In the program I looked at, the top 1% were just trivial getters,
> constructors, and similar. We should already be "getting these right".
>
> Stuff like:
>
> class Foo {
>   ...
>   int m_bar;
> public:
>   Foo(int bar) : m_bar(bar) {}
>   int getBar() const { return m_bar; }
>   ...
> };
>
> Are the results different for your codebases? Have you tried something like
> simply not instrumenting the hottest 1% or 0.5% of functions? (Maybe
> restrict the instrumentation skipping to functions consisting of a single
> BB with fewer than, say, 10 instructions.)
>
> Rong's approach is quite sophisticated; I'm just interested in getting a
> sanity check against a "naive" approach to see how much the sophisticated
> approach is buying us.

I've seen plenty of these code patterns in the benchmarks. I did not
measure whether they are the top 1% of hottest functions, but I would not
be surprised if they are.
Your suggestion of not instrumenting these small single-BB functions is
interesting. I'm pretty sure it would cut a lot of the overhead. But it
would compromise the integrity of the profile (for example, the profile
would be harder to use for coverage testing).
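
(For reference, assuming the coverage flow in question is Clang's, where
coverage reporting is driven by the same profile counters -- a function
without counters would simply have no coverage data:)

clang++ -fprofile-instr-generate -fcoverage-mapping foo.cpp -o foo
./foo                                          # writes default.profraw
llvm-profdata merge default.profraw -o foo.profdata
llvm-cov show ./foo -instr-profile=foo.profdata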

I'm curious why you think my approach is sophisticated. At a high level
it's pretty simple: run a pre-inline pass that inlines all the trivial
inline candidates, and then do the instrumentation. This way the
instrumented binary is faster, and the resulting profile is smaller and
complete.
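
A rough sketch of that ordering with the legacy pass manager;
createPGOInstrumentationGenPass is a stand-in name for the proposed
instrumentation pass (it does not exist yet), and the inline threshold is
illustrative only:

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/IPO.h"

// Hypothetical: the late-instrumentation pass this RFC proposes.
llvm::ModulePass *createPGOInstrumentationGenPass();

// Sketch only: pre-inline trivial functions, then insert counters, so
// that counters are never emitted for bodies that disappear anyway.
void buildInstrGenPipeline(llvm::Module &M) {
  llvm::legacy::PassManager PM;
  // A conservative inliner run with a very low threshold: only tiny
  // getters/constructors get inlined before instrumentation.
  PM.add(llvm::createFunctionInliningPass(/*Threshold=*/25));
  PM.add(createPGOInstrumentationGenPass());
  PM.run(M);
}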

>> The biggest thing I notice about this proposal is that although the focus
>> of the proposal is on reducing profiling overhead, it requires using a
>> different pass pipeline during *both* the "instr-generate" compilation and
>> the "instr-use" compilation. [...]

> The pipeline change is very PGO-specific. I expect it to have very little
> impact on regular compilations:
> 1) LLVM's bottom-up inliner is already iterative.
> 2) The performance impact (on the instrumented build) can be as large as
> 4x -- which is implausible for any non-PGO pipeline change.

> With respect to adding extra passes, I'm actually more concerned about the
> non-instrumented build, for which Rong did not show any data. For example,
> will users find their program is X% faster with middle-end ("ME") PGO than
> with no PGO, when really (X/2)% of that is due simply to the extra passes
> and not to any profile guidance? In that case we should prefer to use the
> extra passes during regular -O3 builds as well. Conversely, if they find
> that their program is X% faster with "ME" PGO, but the extra passes alone
> are making the program (X/2)% slower, then users could be getting (3X/2)%
> faster instead. I am only concerned about having two variables change
> simultaneously; I think that instrumenting after some amount of cleanup
> has been done makes a lot of sense.

I agree with David that LLVM's bottom-up, iterative inliner should make the
pre-inline pass have little performance impact in a regular build, at least
in theory.
The performance swings caused by the pre-inline pass are likely due to
inliner implementation/heuristic issues, and should be handled in the
inliner.

Here is the performance data for enabling the pre-inliner in a regular
compilation (i.e. O2).

SPEC2006 and SPEC2000 C++ programs (train input):

program          O2_+_preinline_time / O2_time - 1
471.omnetpp        0.40%
473.astar         11.62%
483.xalancbmk     -3.58%
444.namd           0.00%
447.dealII         1.02%
450.soplex         0.83%
453.povray        -1.81%
252.eon            0.33%

Note this metric is run time (lower is better): 473.astar runs 11.62%
longer with the pre-inline pass.

For the Google-internal benchmarks, we use speedup (vs O2, higher is
better):

program           O2_+_pre_inline speedup vs O2
C++benchmark01     -6.88%
C++benchmark02     +4.42%
C++benchmark03     +0.74%
C++benchmark04     +0.56%
C++benchmark05     -1.02%
C++benchmark06    -14.98%
C++benchmark07     +4.21%
C++benchmark08     -1.00%
C_benchmark09      +0.92%
C_benchmark10      -0.43%
C++benchmark11    -12.23%
C++benchmark12     +0.65%
C++benchmark13     -0.96%
C++benchmark14    -12.89%
C_benchmark15      -0.12%
C++benchmark16     -0.07%
C++benchmark17     -0.18%
C++benchmark18     +0.24%
C++benchmark19     -1.86%
C++benchmark20     +3.63%
C_benchmark21      +0.40%
C++benchmark22     -0.08%
--------------------------------
geometric mean     -1.82%

For many benchmarks, we get a negative speedup (i.e., a slowdown).

> I'm curious why you think my approach is sophisticated.

It's a relative term. Compared to a naive approach like simply not
instrumenting the hottest 1% of functions if they are single-BB, it is
"sophisticated" (both in terms of lines of code to implement and in the
amount of thought that needs to be put into understanding it).

My interest is in putting your instrumented-binary speedups in perspective.
Currently, you are comparing your approach with the existing approach,
which clearly leaves a lot of low-hanging fruit -- it is not surprising
that your approach can do better. To put the improvements of your approach
in perspective, I think it would be useful to compare it to a "naive"
approach for minimizing the instrumentation overhead.

There are two outcomes which I think would be interesting and would make me
reconsider my current (mostly positive) perspective on your approach:

- You find that your approach is not a noticeable improvement over the
"naive" approach.
- You find that your approach is worse than the "naive" approach in a
non-trivial way.

> At a high level it's pretty simple: run a pre-inline pass that inlines all
> the trivial inline candidates, and then do the instrumentation. This way
> the instrumented binary is faster, and the resulting profile is smaller
> and complete.

> I agree with David that LLVM's bottom-up, iterative inliner should make
> the pre-inline pass have little performance impact in a regular build, at
> least in theory.
> The performance swings caused by the pre-inline pass are likely due to
> inliner implementation/heuristic issues, and should be handled in the
> inliner.
> [...]
> geometric mean     -1.82%
>
> For many benchmarks, we get a negative speedup (i.e., a slowdown).

These swings seem quite large in some cases, contrary to what was
theoretically expected. Do you know which part of the inliner is causing
the problem? Some of my users would be happy with these speedups even
without PGO; at the same time, for some of my users this could negate much
of the advantage of using PGO. Is there a way for your approach to simply
do the instrumentation after a subset of the existing pass pipeline has
been run?

Also, in your OP you discussed the advantages of the context-sensitive
profile enabled by your approach. Do you have performance numbers to
quantify how much this improves things?

-- Sean Silva

>> The pipeline change is very PGO-specific. I expect it to have very little
>> impact on regular compilations: [...]

> With respect to adding extra passes, I'm actually more concerned about the
> non-instrumented build, for which Rong did not show any data. For example,
> will users find their program is X% faster with middle-end ("ME") PGO than
> with no PGO, when really (X/2)% of that is due simply to the extra passes
> and not to any profile guidance? In that case we should prefer to use the
> extra passes during regular -O3 builds as well.

My reply above actually predicts that the pipeline change is unlikely to
have a large positive performance impact on a regular non-PGO build.
It would be useful to see that effect (of the extra passes) -- but instead
of using Rong's untuned new passes, it can be measured by forcing at least
2 iterations of CallGraphSCCPass (which currently iterates more than once
only when devirtualization happens).

> Conversely, if they find that their program is X% faster with "ME" PGO,
> but the extra passes alone are making the program (X/2)% slower, then
> users could be getting (3X/2)% faster instead.

Pre-cleanup is required for ME-PGO -- but the simple math you mention does
not apply. In your case, the (X/2)% slowdown is required to enable the X%
improvement over the original baseline (without the extra passes) --
otherwise only 0% can be materialized (analogy: a fist must be retracted
before it can punch).

> I am only concerned about having two variables change simultaneously; I
> think that instrumenting after some amount of cleanup has been done makes
> a lot of sense.
>
> Could Rong's proposal be made to work within the existing pipeline, by
> doing the instrumentation after a subset of the existing pass pipeline
> has been run?

Conceptually it is doable using an iterative pass manager (for which
support already exists). With PGO, though, the inliner would need to be
greatly tuned down so that it performs only small-function inlining.

David

> Also, in your OP you discussed the advantages of the context-sensitive
> profile enabled by your approach. Do you have performance numbers to
> quantify how much this improves things?

This experiment cannot be done with Clang/LLVM right now because there are
many issues related to LLVM's profile usage that need to be fixed first
(which we are working on): profile data not being used in the inliner,
block layout tuning, missing profile updates, etc.

However, we do have data from GCC if we care to know. We have seen PGO
performance (of the final optimized build, not the instrumented build)
improve by 1% to 4% across the set of internal benchmarks.

thanks,

David