exploring possibilities for unifying ThinLTO and FullLTO frontend + initial optimization pipeline

Hello,

I am exploring the possibility of unifying the BC file generation phase for ThinLTO and FullLTO. Our third party library providers prefer to give us only one version of the BC archives, rather than test and ship both Thin and Full LTO BC archives. We want to find a way to allow our users to pick either Thin or Full LTO, while having only one “unified” version of the BC archive.

Note, I am not necessarily proposing to do this work in the upstream compiler. If there is no interest from other companies, we might have to keep this as a private patch for Sony.

One of the ideas (not my preference) is to mix and match files in the Thin and Full BC formats. I’m not sure how well the “mix and match” scenario works in general. I was wondering if Apple or Google are doing this for production?

I wrote a toy example: I compiled one group of files with ThinLTO and the rest with FullLTO, then linked them with gold. I saw that irrespective of whether the Thin or Full LTO option was used at the link step, files are optimized within the Thin group and within the Full group separately, but they don’t know about the files in the other group (which makes sense). Basically, the border between Thin and Full LTO bitcode files created an artificial “barrier” which prevented cross-border optimization.
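Roughly, the experiment looked like this (a sketch; the file names and optimization flags are purely illustrative, and the LLVM gold plugin must be available):

$ clang -c -O2 -flto=thin a.cpp b.cpp    # “Thin” group: bitcode with summaries
$ clang -c -O2 -flto=full c.cpp d.cpp    # “Full” group: bitcode without summaries
$ clang -fuse-ld=gold -flto=thin a.o b.o c.o d.o -o test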

Obviously, I am not too fond of this idea. Even if mixing and matching ThinLTO and FullLTO bitcode files will work “as is”, I suspect we will see a non-trivial runtime performance degradation because of the “ThinLTO”/“FullLTO” border. Are you aware of any potential problems with this solution, other than performance?

Another, hopefully, better idea is to introduce a “unified” BC format, which could either be FullLTO, ThinLTO, or neither (e.g., something in between).

If the user chooses FullLTO at the link step, but some of the files are in the Thin BC format – the linker will call a special LTO API to convert these files to the Full LTO BC format (i.e., strip the module summary section and potentially run some additional optimizations from the FullLTO pass manager pipeline).

If the user chooses ThinLTO at the link step, but some of the files are in the Full BC format – the linker will call an LTO API to convert these files to the Thin LTO bitcode format (by regenerating the module summary section dynamically for the Full LTO bitcode files).

I think the most reasonable idea for the unification of the Thin and Full LTO compilation pipelines is to use Full LTO as the “unified” BC format. If the user requests FullLTO – no additional work is needed, the linker will perform FullLTO as usual. If the user requests ThinLTO, the linker will call an API to regenerate the module summary section for all the files in the FullLTO format and perform ThinLTO as usual.

In reality I suspect things will be much more complicated. The pipelines for the Thin and Full LTO compilation phases are quite different. ThinLTO can afford to do much more optimization in the linking phase (since it has parallel backends & smaller IR compared to FullLTO), while for FullLTO we are forced to move some optimizations from linking to the compilation phase.

So, if we pick FullLTO as our unified format, we would increase the build time for ThinLTO (we will be doing the FullLTO initial optimization pipeline in the compile phase, which is more than what ThinLTO is currently doing, but the pipeline of the optimizations in the backend will stay the same). It’s not clear what will happen with the runtime performance: we might improve it (because we repeat some of the optimizations several times), or we might make it worse (because we might do an optimization in the early compilation phase, potentially preventing more aggressive optimization later). What are your expectations? Will this approach work in general? If so, what do you think will happen with the runtime performance?

I also noticed that the pass manager pipeline is different for ThinLTO+Sample PGO (the profile-use case). This might create some additional complications for unifying the Thin and FullLTO BC generation phases too, but it’s too small a detail to worry about right now. I’m more interested in choosing the right general direction for solving this problem.

Please share your thoughts!

Thank you!

Katya.

Hi Katya,

[+Teresa since this is about ThinLTO & she’s the owner there]

I’m not sure how other folks feel, but terminologically I’m not sure I think of these as different formats (for example you mention the idea of stripping the summaries from ThinLTO BC files to then feed them in as FullLTO files - I would imagine it’d be reasonable to modify/fix/improve the linker integration to have it (perhaps optionally) /ignore/ the summaries, or use the summaries but in a non-siloed way (so that there’s not that optimization boundary between ThinLTO and FullLTO))

You’re dealing with a situation where you are shipped BC files offline and then do one, or multiple builds with these BC files?

If the scenario was more like a naive build: Multiple BC files generated on a single (multi-core/threaded) machine (but some Thin, some Full) & then fed to the linker, I would wonder if it’d be relatively cheap for the LTO step to support this by computing summaries for FullLTO files on the fly (without a separate tool/writing the summary to disk, etc). Though I suppose that’d produce a pretty wildly different behavior in the link when just a single ThinLTO BC file was added to an otherwise FullLTO build.

Anyway - just some (admittedly fairly uninformed) thoughts. I’m sure Teresa has more informed ideas about how this might all look.

Hi David,

Thank you so much for your reply!

You’re dealing with a situation where you are shipped BC files offline and then do one, or multiple builds with these BC files?
Yes, that’s exactly the case.

If the scenario was more like a naive build: Multiple BC files generated on a single (multi-core/threaded) machine (but some Thin, some Full) & then fed to the linker, I would wonder if it’d be relatively cheap for the LTO step to support this by computing summaries for FullLTO files on the fly (without a separate tool/writing the summary to disk, etc).

I think so. My understanding is that for FullLTO files, it’s possible to run the name-anonymous-globals pass and compute summaries on the fly, which should allow ThinLTO to be performed at the link phase.
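For example, something along these lines should produce a summary-carrying copy of an existing FullLTO bitcode file (a rough sketch using the legacy pass/flag names; the exact spelling may differ across LLVM versions):

$ opt -name-anon-globals -module-summary full.bc -o full-with-summary.bc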

Katya.

Hi,

It is non-trivial to recompute summaries (which is why we have summaries in the bitcode in the first place, by the way), because bitcode is expensive to load.

I think shipping two different variants of the bitcode, one with and one without summaries, isn’t providing much benefit while complicating the flow. We could achieve what you’re looking for by revisiting the flow a little.

I would try to consider if we can:

  1. Always generate summaries.
  2. Use the same compile-phase optimization pipeline for ThinLTO and FullLTO.
  3. Decide at link time whether to do FullLTO or ThinLTO.

We didn’t go this route two years ago because during the bringup we didn’t want to affect FullLTO in any way, but it may make sense now to have clang -flto=thin and clang -flto=full be identical, and to change the linker plugins to operate either in full-LTO mode or in ThinLTO mode, but not differentiate based on the availability of the summaries.

A possible behavior could be:

The -flto flag in the compile phase does not change the produced bitcode, except for a flag that records the preference (FullLTO vs ThinLTO) in the bitcode:

$ clang -c -flto=thin a.cpp

$ clang -c -flto=full b.cpp

$ clang -c -flto=full c.cpp

At link time the behavior depends on the -flto flag passed in.

No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects

$ clang a.o b.o c.o

Forces full LTO, merges all the objects, no cross module importing will happen.

$ clang a.o b.o c.o -flto=full

Forces ThinLTO for all objects, FullLTO won’t happen, no objects will be merged.

$ clang a.o b.o c.o -flto=thin

Cheers,

I think for ld64, you can mix thinLTO and fullLTO files and ld64 is going to compile them separately and combine the result (Mehdi can confirm). I think this is aligned with the fact that whether to use full or thin LTO is decided during the clang invocation, not the linker invocation. I am not against either model, but I think we need to do some research before making the effort to switch the model.

On the other hand, I think it should work if you feed a thinLTO object file into the FullLTO code generator (if not, it is probably easy to implement). The issue there is that thin and full LTO use different optimization pipelines. We probably need to do some benchmarking to figure out the impact of that.

Mehdi is correct that recomputing summaries is expensive. You will get either memory or disk I/O overhead that might make the compile time even slower than fullLTO. If you want to pick one format to use for both, it has to be the thinLTO format with the summary. But then you need to deal with what happens if there is a legacy library with fullLTO info and the user specifies thinLTO on the linker command line.

Steven

As Mehdi and Steven noted, regenerating the summaries on the fly will be prohibitively expensive, so it would be better to have the summaries always available, and simply ignore them if the user wants full LTO.

However, the biggest issue will be the different pipelines. As Katya notes, with Full LTO more is done in the compile step, whereas ThinLTO exits early since aggressive optimizations can be performed in the backends, and also we avoid bloating out the code due to things like loop unrolling, etc (which at the very least would require adjustment of the importing thresholds). Making ThinLTO use the Full LTO pipeline will reduce performance (even if we adjust all the thresholds due to the changed compile pipeline, the backend pipeline is currently more aggressive for ThinLTO). Making Full LTO use ThinLTO’s pipeline will increase its compile time. You’d have to do some performance experiments to see if, for example, we could make the ThinLTO compile step optimization pipeline the same as FullLTO’s for the purpose of sharing the bitcode(+summary), but then use either the Thin or Full LTO pipeline in the backend depending on the mode.
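One way to see exactly how the two compile-phase pipelines currently differ (a sketch; it assumes the legacy pass manager, which is the default today, and the file name is illustrative) is to dump the pass structure for each mode and diff it:

$ clang -c -O2 -flto=thin foo.cpp -mllvm -debug-pass=Structure 2> thin-pipeline.txt
$ clang -c -O2 -flto=full foo.cpp -mllvm -debug-pass=Structure 2> full-pipeline.txt
$ diff thin-pipeline.txt full-pipeline.txt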

We could, as Mehdi notes, allow importing from FullLTO modules if they had summaries, without too much difficulty.

Teresa

Hi Mehdi,

Awesome! It’s a very clear design. The only question left is which pipeline to choose for unified compile-phase optimization pipeline.

  • ThinLTO compile-phase pipeline? It might very negatively affect compile-time and the memory footprint for FullLTO link-phase. That was the reason why so many optimization were moved from the link-phase to the parallel compile-phase for FullLTO in the first place.

  • FullLTO compile-phase pipeline? More optimization passes at compile-phase will obviously increase compile time for ThinLTO, though I suspect it will be tolerable. It is not very clear how this choice will affect the overall runtime performance for ThinLTO. Assuming we keep well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and FullLTO, we will repeat some optimization passes for ThinLTO at compile-phase and later at link-phase which potentially could improve the performance… or it could make it worse, because we might perform an optimization early at compile-time, potentially preventing more aggressive optimization at link-phase when we see a larger scope. Any prediction on what would happen to the ThinLTO runtime performance at run-time?

  • New “unified” compile-phase pipeline?

I guess, there is not a definitive answer and we have to experiment, measure compile-time/run-time performance and potentially make some adjustments to the pipeline and to the thresholds. We have a few proprietary tests in Sony that we could use for the performance measurements, but it will be nicer if there are some open source benchmarks that we could use. What did you use in Google/Apple for ThinLTO/FullLTO measurements? Have you used some proprietary benchmarks also? It’s important to make sure we won’t have run-time/compile-time performance degradation, but it will be nicer if anyone can run previously used ThinLTO/FullLTO benchmarks oneself, while making changes to the optimization pipeline and heuristics.

No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects

$ clang a.o b.o c.o

If I understood you correctly, while doing ThinLTO on a.o, we could import from b.o and c.o (this is possible since the summaries are available), while we won’t see a.o when doing FullLTO for b.o/c.o. (i.e., the previous non-permeable barrier between ThinLTO and FullLTO groups will become permeable in one direction). However, do you think by doing this, we will achieve a better performance than doing ThinLTO backend for all of the files (a.o, b.o, c.o)?

Thank you!

Katya.

Hi Mehdi,

Awesome! It’s a very clear design. The only question left is which pipeline to choose for unified compile-phase optimization pipeline.

  • ThinLTO compile-phase pipeline? It might very negatively affect compile-time and the memory footprint for FullLTO link-phase. That was the reason why so many optimization were moved from the link-phase to the parallel compile-phase for FullLTO in the first place.

Just to clarify: “optimizations” were not “moved from the link-phase to the parallel compile-phase for FullLTO”, they have never been in the link phase for FullLTO. It has always been this way.

I think that the ThinLTO compile-phase pipeline will only affect FullLTO in the sense that we need to add more passes during the link phase, is this what you meant?

  • FullLTO compile-phase pipeline? More optimization passes at compile-phase will obviously increase compile time for ThinLTO, though I suspect it will be tolerable. It is not very clear how this choice will affect the overall runtime performance for ThinLTO. Assuming we keep well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and FullLTO, we will repeat some optimization passes for ThinLTO at compile-phase and later at link-phase which potentially could improve the performance… or it could make it worse, because we might perform an optimization early at compile-time, potentially preventing more aggressive optimization at link-phase when we see a larger scope. Any prediction on what would happen to the ThinLTO runtime performance at run-time?

Note: repeating optimization is not supposed to improve performance, at least this isn’t the goal of the pipeline.
The pipeline for ThinLTO has been modeled on O3, good or bad we felt there was no reason to really deviate and any improvement to one could (should!) reflect on the other.

The rationale behind the ThinLTO pipeline is not only compile time: it splits the O3 pipeline at the point where we stop the “function simplification” / inliner loop and before we get into unrolling/vectorization.
I remember even trying to stop the compile phase before inlining, but the generated IR was too big: the inliner CGSCC visit actually reduces the size of the IR considerably in some cases.

  • New “unified” compile-phase pipeline?

I guess, there is not a definitive answer and we have to experiment, measure compile-time/run-time performance and potentially make some adjustments to the pipeline and to the thresholds. We have a few proprietary tests in Sony that we could use for the performance measurements, but it will be nicer if there are some open source benchmarks that we could use. What did you use in Google/Apple for ThinLTO/FullLTO measurements? Have you used some proprietary benchmarks also? It’s important to make sure we won’t have run-time/compile-time performance degradation, but it will be nicer if anyone can run previously used ThinLTO/FullLTO benchmarks oneself, while making changes to the optimization pipeline and heuristics.

We benchmarked multiple variants of the pipeline two years ago. There were some regressions when adopting the ThinLTO pipeline in FullLTO (and some improvements), but when investigated we didn’t find any real regressions that couldn’t be solved by fixing the optimizer.
I.e. these are cases where FullLTO gets it right “by luck” and not by principle, and fixing such cases helps the non-LTO O3 (for example this test case https://bugs.llvm.org/show_bug.cgi?id=27395 )

No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects

$ clang a.o b.o c.o

If I understood you correctly, while doing ThinLTO on a.o, we could import from b.o and c.o (this is possible since the summaries are available), while we won’t see a.o when doing FullLTO for b.o/c.o. (i.e., the previous non-permeable barrier between ThinLTO and FullLTO groups will become permeable in one direction).

It could be permeable in both directions: b.o+c.o become “like a single ThinLTO object” after they get merged.

However, do you think by doing this, we will achieve a better performance than doing ThinLTO backend for all of the files (a.o, b.o, c.o)?

Performance is always very much use-case dependent.
One may know that a group of files performs better when they get merged together with FullLTO while the rest of the app does not?

I don’t know but this all needs to be carefully looked at from a user-interface point of view I think (will it be intuitive for the users? Will it fit in every (most) scenarios? etc.).

Cheers,

Hi Mehdi,

Awesome! It’s a very clear design. The only question left is which
pipeline to choose for unified compile-phase optimization pipeline.

- ThinLTO compile-phase pipeline? It might very negatively affect
compile-time and the memory footprint for FullLTO link-phase. That was the
reason why so many optimization were moved from the link-phase to the
parallel compile-phase for FullLTO in the first place.

Just to clarify: "optimizations" were not "moved from the link-phase to
the parallel compile-phase for FullLTO", they have never been in the link
phase for FullLTO. It has always been this way.

I think that the ThinLTO compile-phase pipeline will only affect FullLTO
in the sense that we need to add more passes during the link phase, is this
what you meant?

- FullLTO compile-phase pipeline? More optimization passes at
compile-phase will obviously increase compile time for ThinLTO, though I
suspect it will be tolerable. It is not very clear how this choice will
affect the overall runtime performance for ThinLTO. Assuming we keep
well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and
FullLTO, we will repeat some optimization passes for ThinLTO at
compile-phase and later at link-phase which potentially could improve the
performance… or it could make it worse, because we might perform an
optimization early at compile-time, potentially preventing more aggressive
optimization at link-phase when we see a larger scope. Any prediction on
what would happen to the ThinLTO runtime performance at run-time?

Note: repeating optimization is not supposed to improve performance, at
least this isn't the goal of the pipeline.
The pipeline for ThinLTO has been modeled on O3, good or bad we felt there
was no reason to really deviate and any improvement to one could (should!)
reflect on the other.

The rational behind the ThinLTO pipeline is not only compile time: it
split the O3 pipeline at the point where we stop the "function
simplification" / inliner loop and before we get into
unrolling/vectorization.

Right - see my reply on this from last night, at the very least the ThinLTO importing thresholds will need retuning if we will perform optimizations like unrolling/vectorization/etc that tend to increase code size.

I remember even trying to stop the compile-phase without inlining but the
generated IR was too big: the inliner CGSCC visit actually reduces the size
of the IR considerably in some cases.

To add on, this affects not only the importing thresholds, but also the cost of doing the thin link (which will have a graph with many more nodes/edges).

- New “unified” compile-phase pipeline?

I guess, there is not a definitive answer and we have to experiment,
measure compile-time/run-time performance and potentially make some
adjustments to the pipeline and to the thresholds. We have a few
proprietary tests in Sony that we could use for the performance
measurements, but it will be nicer if there are some open source benchmarks
that we could use. What did you use in Google/Apple for ThinLTO/FullLTO
measurements? Have you used some proprietary benchmarks also? It’s
important to make sure we won’t have run-time/compile-time performance
degradation, but it will be nicer if anyone can run previously used
ThinLTO/FullLTO benchmarks oneself, while making changes to the
optimization pipeline and heuristics.

We benchmarked multiple variants of the pipeline two years ago. There were
some regressions when adoption the ThinLTO pipeline in FullLTO (and some
improvements), but when investigated we didn't find any real regressions
that couldn't be solved by fixing the optimizer.
I.e. these are cases where FullLTO gets it right "by luck" and not by
principle, and fixing such cases helps the non-LTO O3 (for example this
test case https://bugs.llvm.org/show_bug.cgi?id=27395 )

We have a number of internal benchmarks/applications used to evaluate ThinLTO changes. We don't use Full LTO (with very limited exceptions).

>> # No flag: use the compile-phase preference, perform ThinLTO on a.o
and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and
the ThinLTO objects

>> $ clang a.o b.o c.o

If I understood you correctly, while doing ThinLTO on a.o, we could
import from b.o and c.o (this is possible since the summaries are
available), while we won’t see a.o when doing FullLTO for b.o/c.o. (i.e.,
the previous non-permeable barrier between ThinLTO and FullLTO groups will
become permeable in one direction).

It could be permeable in both direction: b.o+c.o become "like a single
ThinLTO object" after they get merged.

Yes, in fact right now (at least in the new LTO API, but certainly similar in the old one) the Full LTO partition gets fully merged/optimized/codegened before running any ThinLTO. If we wanted a mixed-mode LTO compilation, it would be better to do some of the Full LTO in parallel with the ThinLTO. The Thin Link will need to be split out and done before any Full LTO optimization/codegen. E.g., if we are importing from FullLTO into ThinLTO, then some symbols will need to be promoted in the Full LTO IR. And to import from ThinLTO into FullLTO, we will also need to have the Thin Link results. Either the ThinLink + ThinLTO optimizations like importing/promotion would need to be done before any Full LTO IR merging + backend, or, as Mehdi suggests, do the Full LTO IR merging, then treat the result as a new ThinLTO IR object during the ThinLink and beyond. For distributed builds, we would need to serialize out the Full LTO IR. I'm not sure if it will be worth it from an optimization standpoint in that case.

Before we start modifying this significantly, BTW, it would be good to revisit the idea of migrating the old LTO library to the new LTO API. Is anyone thinking of investing in that? I know Mehdi started this a long time back, but I'm sure he doesn't have the bandwidth.

However, do you think by doing this, we will achieve a better performance
than doing ThinLTO backend for all of the files (a.o, b.o, c.o)?

Performance is always very much use-case dependent.
One may know that a group of files performs better when they get merged
together with FullLTO while the rest of the app does not?

That's what I am wondering. There are still some places where Full LTO outperforms ThinLTO due to optimizations not ported to ThinLTO (e.g. some global variable optimizations). But you'd need to figure out how to detect these opportunities ahead of time.

Katya - did you have a particular use case in mind?

Thanks,
Teresa

Hi Mehdi,

Awesome! It’s a very clear design. The only question left is which pipeline to choose for unified compile-phase optimization pipeline.

- ThinLTO compile-phase pipeline? It might very negatively affect compile-time and the memory footprint for FullLTO link-phase. That was the reason why so many optimization were moved from the link-phase to the parallel compile-phase for FullLTO in the first place.

Just to clarify: "optimizations" were not "moved from the link-phase to the parallel compile-phase for FullLTO", they have never been in the link phase for FullLTO. It has always been this way.

I see. What I meant was the following comment from the phabricator review about defining the ThinLTO pipeline, but I didn’t remember its exact wording.
https://reviews.llvm.org/D17115
“On the contrary to Full LTO, ThinLTO can afford to shift compile time from the frontend to the linker: both phases are parallel”.

I think that the ThinLTO compile-phase pipeline will only affect FullLTO in the sense that we need to add more passes during the link phase, is this what you meant?

Yes, that’s exactly what I meant.

- FullLTO compile-phase pipeline? More optimization passes at compile-phase will obviously increase compile time for ThinLTO, though I suspect it will be tolerable. It is not very clear how this choice will affect the overall runtime performance for ThinLTO. Assuming we keep well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and FullLTO, we will repeat some optimization passes for ThinLTO at compile-phase and later at link-phase which potentially could improve the performance… or it could make it worse, because we might perform an optimization early at compile-time, potentially preventing more aggressive optimization at link-phase when we see a larger scope. Any prediction on what would happen to the ThinLTO runtime performance at run-time?

Note: repeating optimization is not supposed to improve performance, at least this isn't the goal of the pipeline.
The pipeline for ThinLTO has been modeled on O3, good or bad we felt there was no reason to really deviate and any improvement to one could (should!) reflect on the other.

The rationale behind the ThinLTO pipeline is not only compile time: it splits the O3 pipeline at the point where we stop the "function simplification" / inliner loop and before we get into unrolling/vectorization.
I remember even trying to stop the compile phase before inlining, but the generated IR was too big: the inliner CGSCC visit actually reduces the size of the IR considerably in some cases.

Thank you for sharing! It’s very helpful.

Mehdi, it seems that you have spent significant time experimenting with the ThinLTO pipeline and determining where exactly the compile phase should end and the link phase should start. How do you envision a unified ThinLTO/FullLTO compile-phase pipeline? We might tune/improve this pipeline in the future, but having a good starting point is very important too.

- New “unified” compile-phase pipeline?

I guess, there is not a definitive answer and we have to experiment, measure compile-time/run-time performance and potentially make some adjustments to the pipeline and to the thresholds. We have a few proprietary tests in Sony that we could use for the performance measurements, but it will be nicer if there are some open source benchmarks that we could use. What did you use in Google/Apple for ThinLTO/FullLTO measurements? Have you used some proprietary benchmarks also? It’s important to make sure we won’t have run-time/compile-time performance degradation, but it will be nicer if anyone can run previously used ThinLTO/FullLTO benchmarks oneself, while making changes to the optimization pipeline and heuristics.

We benchmarked multiple variants of the pipeline two years ago. There were some regressions when adopting the ThinLTO pipeline in FullLTO (and some improvements), but when investigated we didn't find any real regressions that couldn't be solved by fixing the optimizer.

When referring to ThinLTO and FullLTO pipelines here do you mean compile-phase pipeline, link-phase pipeline or full pipeline (i.e., compile-phase + link-phase)? The terminology is slightly confusing here.

I.e. these are cases where FullLTO gets it right "by luck" and not by principle, and fixing such cases helps the non-LTO O3 (for example this test case https://bugs.llvm.org/show_bug.cgi?id=27395 )

# No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects
$ clang a.o b.o c.o

If I understood you correctly, while doing ThinLTO on a.o, we could import from b.o and c.o (this is possible since the summaries are available), while we won’t see a.o when doing FullLTO for b.o/c.o. (i.e., the previous non-permeable barrier between ThinLTO and FullLTO groups will become permeable in one direction).

It could be permeable in both directions: b.o+c.o become "like a single ThinLTO object" after they get merged.

I see…
However, do you think by doing this, we will achieve a better performance than doing ThinLTO backend for all of the files (a.o, b.o, c.o)?

Performance is always very much use-case dependent.
One may know that a group of files performs better when they get merged together with FullLTO while the rest of the app does not?

I don't know but this all needs to be carefully looked at from a user-interface point of view I think (will it be intuitive for the users? Will it fit in every (most) scenarios? etc.).

# No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects
$ clang a.o b.o c.o

I wonder if we have a use-case for the “mix and match compile-phase preference” situation that you described above? Maybe the linker should simply report an error in this case? Or do we have to accept this because of backwards compatibility?

Cheers,

From: Mehdi AMINI <[email protected]>
Sent: Tuesday, April 10, 2018 11:53 PM
To: Romanova, Katya <[email protected]>
Cc: David Blaikie <[email protected]>; Teresa Johnson <[email protected]>; llvm-dev <[email protected]>
Subject: Re: [llvm-dev] exploring possibilities for unifying ThinLTO and FullLTO frontend + initial optimization pipeline

Hi Mehdi,

Awesome! It’s a very clear design. The only question left is which pipeline to choose for unified compile-phase optimization pipeline.

  • ThinLTO compile-phase pipeline? It might very negatively affect compile-time and the memory footprint for FullLTO link-phase. That was the reason why so many optimization were moved from the link-phase to the parallel compile-phase for FullLTO in the first place.

Just to clarify: “optimizations” were not “moved from the link-phase to the parallel compile-phase for FullLTO”, they have never been in the link phase for FullLTO. It has always been this way.

I see. What I meant was the following comment from the phabricator review about defining the ThinLTO pipeline, but I didn’t remember its exact wording.

https://reviews.llvm.org/D17115

“On the contrary to Full LTO, ThinLTO can afford to shift compile time from the frontend to the linker: both phases are parallel”.

I think that the ThinLTO compile-phase pipeline will only affect FullLTO in the sense that we need to add more passes during the link phase, is this what you meant?

Yes, that’s exactly what I meant.

  • FullLTO compile-phase pipeline? More optimization passes at compile-phase will obviously increase compile time for ThinLTO, though I suspect it will be tolerable. It is not very clear how this choice will affect the overall runtime performance for ThinLTO. Assuming we keep well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and FullLTO, we will repeat some optimization passes for ThinLTO at compile-phase and later at link-phase which potentially could improve the performance… or it could make it worse, because we might perform an optimization early at compile-time, potentially preventing more aggressive optimization at link-phase when we see a larger scope. Any prediction on what would happen to the ThinLTO runtime performance at run-time?

Note: repeating optimization is not supposed to improve performance, at least this isn’t the goal of the pipeline.

The pipeline for ThinLTO has been modeled on O3, good or bad we felt there was no reason to really deviate and any improvement to one could (should!) reflect on the other.

The rationale behind the ThinLTO pipeline is not only compile time: it splits the O3 pipeline at the point where we stop the “function simplification” / inliner loop and before we get into unrolling/vectorization.

I remember even trying to stop the compile phase before inlining, but the generated IR was too big: the inliner CGSCC visit actually reduces the size of the IR considerably in some cases.

Thank you for sharing! It’s very helpful.

Mehdi, it seems that you have spent significant time experimenting with the ThinLTO pipeline and determining where exactly the compile phase should end and the link phase should start. How do you envision a unified ThinLTO/FullLTO compile-phase pipeline? We might tune/improve this pipeline in the future, but having a good starting point is very important too.

I don’t know: it is all about tradeoffs :-)
I was in favor of using a single pipeline based on ~O3, the reason being mainly that it is easier to maintain/validate/evolve: when folks improve the O3 pipeline, you get the benefit immediately in the ThinLTO optimization phase, in contrast to FullLTO. The tradeoff is about compile time: it can become really long for FullLTO in some extreme cases. I suggested in the past that such cases could be handled by running the FullLTO linker optimization phase with O1 to reduce the amount of optimization.
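With lld, for example, that could look roughly like the following (a sketch; it assumes lld is used as the linker, where --lto-O controls the optimization level of the LTO pipeline, and the file names are illustrative):

$ clang -flto=full -fuse-ld=lld a.o b.o c.o -Wl,--lto-O1 -o app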

  • New “unified” compile-phase pipeline?

I guess, there is not a definitive answer and we have to experiment, measure compile-time/run-time performance and potentially make some adjustments to the pipeline and to the thresholds. We have a few proprietary tests in Sony that we could use for the performance measurements, but it will be nicer if there are some open source benchmarks that we could use. What did you use in Google/Apple for ThinLTO/FullLTO measurements? Have you used some proprietary benchmarks also? It’s important to make sure we won’t have run-time/compile-time performance degradation, but it will be nicer if anyone can run previously used ThinLTO/FullLTO benchmarks oneself, while making changes to the optimization pipeline and heuristics.

We benchmarked multiple variants of the pipeline two years ago. There were some regressions when adopting the ThinLTO pipeline in FullLTO (and some improvements), but when investigated we didn’t find any real regressions that couldn’t be solved by fixing the optimizer.

When referring to ThinLTO and FullLTO pipelines here do you mean compile-phase pipeline, link-phase pipeline or full pipeline (i.e., compile-phase + link-phase)? The terminology is slightly confusing here.

Here I meant everything: trying to use the exact same pipeline in both phases.

I.e. these are cases where FullLTO gets it right “by luck” and not by principle, and fixing such cases helps the non-LTO O3 (for example this test case https://bugs.llvm.org/show_bug.cgi?id=27395 )

No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects

$ clang a.o b.o c.o

If I understood you correctly, while doing ThinLTO on a.o, we could import from b.o and c.o (this is possible since the summaries are available), while we won’t see a.o when doing FullLTO for b.o/c.o. (i.e., the previous non-permeable barrier between ThinLTO and FullLTO groups will become permeable in one direction).

It could be permeable in both directions: b.o+c.o become “like a single ThinLTO object” after they get merged.

I see…

However, do you think by doing this, we will achieve a better performance than doing ThinLTO backend for all of the files (a.o, b.o, c.o)?

Performance is always very much use-case dependent.

One may know that a group of files performs better when they get merged together with FullLTO while the rest of the app does not?

I don’t know but this all needs to be carefully looked at from a user-interface point of view I think (will it be intuitive for the users? Will it fit in every (most) scenarios? etc.).

No flag: use the compile-phase preference, perform ThinLTO on a.o and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the ThinLTO objects

$ clang a.o b.o c.o

I wonder if we have a use-case for the “mix and match compile-phase preference” situation that you described above? Maybe the linker should simply report an error in this case? Or do we have to accept this because of backwards compatibility?

I don’t know :-)
We need to consider the cases of “old” bitcode that wouldn’t have summaries (maybe they could get merged in the LTO partition but not participate in cross-module optimizations?)
We should hear from Apple folks as well.

See attached some quick slides (backup from the dev meeting talk) about the pass pipeline.

ThinLTO Pipeline.pdf (374 KB)

Hi Teresa,

Thank you so much for your reply!

I am on vacation until the end of this week and at EuroLLVM next week, so I have to apologize in advance that my replies are delayed.

Right - see my reply on this from last night, at the very least the ThinLTO importing thresholds will need retuning if we will perform optimizations like unrolling/vectorization/etc that tend to increase code size.

Have you only used internal benchmarks for tuning the ThinLTO’s importing thresholds? A couple of old postings mentioned public benchmarks for perf measurements.

“The measurements on the public test suite as well as on our internal
suite show an overall net improvement.”

https://reviews.llvm.org/D17115

We have a number of internal benchmarks/applications used to evaluate ThinLTO changes. We don’t use Full LTO (with very limited exceptions).

I would rather use the set of benchmarks that was previously utilized for importing-threshold/pipeline tuning as the “main” acceptance criterion, and Sony’s internal benchmarks as auxiliary.

Before we start modifying this significantly, BTW, it would be good to revisit the idea of migrating the old LTO library to the new LTO API. Is anyone thinking of investing in that? I know Mehdi started this a long time back, but I’m sure he doesn’t have the bandwidth.

I was wondering what the current status of it is. Do you have a rough idea of how much time it would take to implement the new LTO API for someone who is new to the project? What are the main benefits of the new LTO API, and why is it important/beneficial to finish this before starting to work on unifying the Thin/FullLTO pipeline?

Thank you!

Katya.

Hi Teresa,

Thank you so much for your reply!

I am on vacation until the end of this week and on EuroLLVM next week, so
I have to apologize in advance that my replies are delayed.

>>Right - see my reply on this from last night, at the very least the
ThinLTO importing thresholds will need retuning if we will

>>perform optimizations like unrolling/vectorization/etc that tend to
increase code size.

Have you only used internal benchmarks for tuning the ThinLTO’s importing
thresholds?

Sorry, I forgot since I haven't run it in a while, but we used SPEC cpu2006 to help tune aspects of ThinLTO, including the importing thresholds.

A couple of old postings mentioned public benchmarks for perf measurements.

Mehdi used the LLVM test suite, which I believe is what he is referring to below, for some of his experiments such as the pipeline tuning.
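For reference, a ThinLTO run of the LLVM test suite can be set up roughly like this (a sketch; it assumes clang/clang++ on PATH and an LTO-capable linker, and omits the usual benchmarking-only configuration knobs):

$ cmake -G Ninja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_FLAGS=-flto=thin -DCMAKE_CXX_FLAGS=-flto=thin ../test-suite
$ ninja
$ llvm-lit -j1 -o results.json .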

“The measurements on the public test suite as well as on our internal
suite show an overall net improvement.”

https://reviews.llvm.org/D17115

>> We have a number of internal benchmarks/applications used to evaluate
ThinLTO changes. We don't use Full LTO (with very limited

>> exceptions).

I would rather use the set of benchmarks that was previously utilized for
importing threshold/pipeline tuning as the “main” acceptance criteria and
Sony’s internal benchmarks as an auxiliary.

>> Before we start modifying this significantly, BTW, it would be good to
revisit the idea of migrating the old LTO library to the new LTO

>> API. Is anyone thinking of investing in that? I know Mehdi started this
a long time back, but I'm sure doesn't have the bandwidth.

I was wondering what is the current status of it?

I found the patch:
https://reviews.llvm.org/D31898

Do you have a rough idea of how much time will it take to implement the
new LTO API for someone who is new to the project?

I haven't reviewed the patch in detail so I'm not sure how close it was. Hopefully Mehdi and/or Peter can comment.

What are the main benefits of the new LTO API and why it’s
important/beneficial to finish this before starting to work on unifying
Thin/FullLTO pipeline?

The optimization pipeline work doesn't really need it, since that is all shared code. However, work to enable importing between Full and Thin LTO clusters would require some changes to the LTO code, and that would have to be replicated in the two implementations. It would be nice to unify these before doing that. The new LTO API provides a richer and more robust way for the linker to supply symbol resolutions used for LTO optimizations.

Teresa

Thanks Mehdi, for the slides about the pass pipeline.

Teresa, regarding:

>> Performance is always very much use-case dependent.
>> One may know that a group of files performs better when they get merged together with FullLTO while the rest of the app does not?
>
> That’s what I am wondering. There are still some places where Full LTO outperforms ThinLTO due to optimizations not ported to ThinLTO (e.g. some global variable optimizations). But you’d need to figure out how to detect these opportunities ahead of time.
>
> Katya - did you have a particular use case in mind?

Sorry for the silence here on the part of Sony. Katya is out for a bit longer, but to answer your question: we don’t have a specific piece of code as a use case, but we do have a general desire to allow cross-module optimizations to happen if a user has a library of FullLTO bitcode and they are doing a ThinLTO build (and vice versa). More generally (and longer term), we have a desire to ship only one form of bitcode.

Thanks,

-Warren Ristow