PGO profile reproducibility

zamazan4ik · October 31, 2024, 1:35pm

Hi!

One of the biggest concerns from OS maintainers about enabling PGO for their packages is the reproducibility question. Since we introduce an additional input into the compilation pipeline - a PGO profile file - it can also introduce an additional source of non-reproducibility into a package. Since reproducibility has become more and more adopted across the industry (for good reasons, IMO), it has become a significant blocker for PGO adoption.

There are multiple ways of mitigating this problem. One way is “simple” - commit collected PGO profiles into a VCS, and then reuse the profiles during each compilation. However, it requires the introduction of some additional storage for PGO profiles, managing their lifecycles, updating them from time to time (since they become outdated), etc.

Also, that approach doesn’t resolve another problem: making the PGO profile itself reproducible. GCC has a dedicated flag for this, -fprofile-reproducible=<option> (docs), but I didn’t find anything similar for LLVM. Is there such an option? I know about the -fprofile-update flag in Clang (docs), but the documentation doesn’t make clear whether it guarantees PGO profile reproducibility.

There is a comment from @davidxl that PGO profile reproducibility is not achievable in practice even with atomic counter updates. If this is true, we need to discuss how GCC folks achieved this (or they “hide” some important details, huh). However, I think @davidxl just meant here that the workload for server applications is not static so we cannot guarantee profile equality.

Thank you.

davidxl · November 1, 2024, 5:52am

The value profile data depends on the update order in the current implementation, so enabling atomic updates does not guarantee full reproducibility. Disabling value profiling and enabling atomic updates together may achieve it, but this is not fully validated.
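To illustrate why order dependence defeats atomic updates, here is a toy Python model of a value-profile site. It is only a sketch, not LLVM's runtime code: the function name `record_values`, the capacity of 2, and the "drop new values when full" policy are all assumptions chosen for illustration. The point is that two runs observing exactly the same set of events, in a different interleaving, can end up with different recorded profiles.

```python
def record_values(stream, capacity=2):
    """Toy value-profile site that tracks at most `capacity` distinct values.

    Assumption for illustration: once the table is full, values that are
    not already tracked are simply dropped.
    """
    counts = {}
    for value in stream:
        if value in counts:
            counts[value] += 1
        elif len(counts) < capacity:
            counts[value] = 1
        # else: table is full, so the new value is not recorded
    return counts


# Same multiset of observed call targets, different arrival order
# (for example, because two threads were scheduled differently).
run_a = ["f", "g", "h", "h", "h"]
run_b = ["h", "h", "h", "f", "g"]

profile_a = record_values(run_a)
profile_b = record_values(run_b)

print(profile_a)  # {'f': 1, 'g': 1}
print(profile_b)  # {'h': 3, 'f': 1}
```

Every individual increment here could be atomic and the two profiles would still differ, because the difference comes from which values get a slot, not from lost updates.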

zamazan4ik · November 1, 2024, 7:47am

Thanks!

What is the current way to disable value profiling in LLVM (e.g. Clang)?

zamazan4ik · November 1, 2024, 8:32am

Another question about disabling value profiling is whether it affects the effectiveness of PGO. Do the resulting optimizations become better, worse, or stay the same? Are there benchmarks in this area?

rnk · November 1, 2024, 6:26pm

In the general case, creating a fully reproducible profile requires that the instrumented program has fully deterministic execution during load testing. This is rarely achieved in practice. We hold build tools to a high standard of determinism, but application code usually has threads, hash tables of pointers, and other things that can create non-deterministic profiles. I’m curious to know what guarantees GCC’s reproducibility flag actually provides.

peterwaller-arm · November 2, 2024, 5:52pm

TL;DR: I discuss some sources of non-determinism I’ve seen in the clang-as-workload PGO+LTO build, even though clang is considered a deterministic(ish?) workload. I share a wishlist of items which could improve the clang build situation. I note that the situation for an arbitrary non-trivial workload, based on what I’ve seen with clang, is likely worse.

This is rarely achieved in practice. We hold build tools to a high standard of determinism […]

Indeed, for example, in my experience clang’s externally observable determinism seems pretty good in practice. (And I understand it’s considered a bug if clang does not produce the same output for the same input).

For profiling though, internal determinism matters. That is to say, the control flow needs to be deterministic. If you profile Clang, you see variations in the instrumented profile unless you run with setarch -R to disable ASLR. With ASLR disabled, the determinism of the profile improves. Presumably this is because pointers can find their way into hash maps, where their values affect control flow, and those values vary with ASLR.
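The ASLR effect can be sketched with a toy Python model of a pointer-keyed hash set. This is not clang code: the addresses, the bucket count, and the hash function `(addr >> 4) % buckets` are made up for illustration. The same three objects, laid out identically relative to each other, are visited in a different order once the base address changes, and any code that branches while iterating would then execute its blocks in a different pattern.

```python
def iteration_order(base, offsets, buckets=8):
    """Return names in the order a toy bucketed hash set would visit them.

    Each object lives at `base + offset`; its bucket is derived from that
    address, as it would be for a hash set keyed on pointers.
    """
    table = [[] for _ in range(buckets)]
    for name, off in offsets.items():
        addr = base + off
        table[(addr >> 4) % buckets].append(name)
    return [name for bucket in table for name in bucket]


# Hypothetical objects, 16 bytes apart.
offsets = {"a": 0x00, "b": 0x10, "c": 0x20}

order_1 = iteration_order(0x1000, offsets)  # one load address
order_2 = iteration_order(0x1060, offsets)  # another load address

print(order_1)  # ['a', 'b', 'c']
print(order_2)  # ['c', 'a', 'b']
```

With ASLR disabled the base address is the same on every run, which corresponds to calling `iteration_order` with a fixed `base`.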

Other non-obvious things creep into the ‘input’ which are experienced by the user as non-determinism. So, the workload can be deterministic (from run to run with no external changes, it is internally deterministic), but two users running it will get different results.

For example, clang with carefully arranged LTO+PGO produces bitwise identical outputs most of the time, but you can get variations. If you have absolute paths to the build directory, the build directory name can leak into some (but not all) bitcode. As a result, multiple builds from the same source tree in different build directories differ when compared. The out-of-the-box PGO cmake configuration points LLVM_PROFDATA_FILE directly into the build directory, so the instrumented binaries vary according to the absolute path of the build directory.

Differing absolute paths to the source tree also cause variations in the compiler output (at least with LTO), because unfortunately cmake always takes the absolute paths to source files as best as I can tell. I couldn’t find examples of this being configurable. This has the unfortunate effect for example that two users compiling PGO+LTO from different home directories can’t expect the same clang binaries.

Curiously, for bin/clang, the build directory leaks only through clang-driver.cpp compilation, because this trivial driver wrapper is generated and stored in the build directory, and cmake has no way to pass this as a relative path to the compiler, so it ends up in the source_filename attribute of the IR module, and then into the ThinLTO summary hash. Consequently, somehow, this results in quite different binaries. I’m interested to understand the mechanism of this but haven’t got to the bottom of it, if anyone has ideas.

Less obvious than paths, another interesting source of non-determinism for clang-as-a-workload is that the inodes of input files and directories go into hash maps inside clang. Therefore, if you repeat a fresh build from scratch with semantically identical inputs, you may still see variations in the profile from build to build, because the filesystem assigns inodes arbitrarily. Whether or not this results in profile differences appears to be subtle; sometimes it does, sometimes it doesn’t (presumably depending on things like whether the hash map entries keyed on inodes collide). This gives the appearance of ‘almost working’: the build process gives the same clang binary output, except when it doesn’t.

When the profile differs, even by a trivial amount, the difference appears to find its way into the ThinLTO summary hash and, consequently, into the compiled binaries.

I don’t know if that covers all of the sources of non-determinism for clang; there may be more. The commit “[Windows] Avoid using FileIndex for unique IDs · llvm/llvm-project@02a3754 · GitHub” was interesting because it dropped the ‘inode equivalent’ on Windows and replaced it with a deterministic hash derived from the path. This was a fix for a subtle bug where a filesystem could return the same FileIndex for different files, making clang conclude it had already seen a file when it had not. I imagine that was a maddening bug to track down.

My wishlist (unicorns and rainbows!) to improve the situation for repeatable PGO/LTO clang builds would be:

- Don’t let inodes become a part of the input with respect to the internal determinism of clang, since inodes are not reproducible between users. (This would require a change to the VFS layer’s implementation of getUniqueID).
- Don’t let the build directory become a part of the input, since a user may wish to have multiple side-by-side builds and determine if they are the same.
  - It’s really close to having this property already; it appears only the generated driver causes this.
- Don’t let the source directories become a part of the input.
  - This seems harder; you would have most of this property already if CMake would pass source files by relative path (and you ensured a constant relative path between build directory and source directory).
  - Alternatively, I’m not sure whether clang could have a feature to assist in its treatment of paths, something along the lines of -ffile-prefix-map. Perhaps that flag could already help if the clang cmake build had a way to use it to drop the parent directory of the source and build directories from the effective input.
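The idea behind the last wishlist item can be sketched in Python. This is only a toy model of prefix mapping, not how clang or cmake implement anything: `apply_prefix_map` and `module_fingerprint` are hypothetical helpers, the checkout paths are invented, and SHA-256 stands in for whatever hash ends up covering the recorded source path. It shows that two users with different checkout locations get different fingerprints from the raw absolute paths, and identical ones once the user-specific prefix is mapped away.

```python
import hashlib


def apply_prefix_map(path, prefix_map):
    """Rewrite `path` using the first matching (old, new) prefix pair."""
    for old, new in prefix_map:
        if path.startswith(old):
            return new + path[len(old):]
    return path


def module_fingerprint(source_filename):
    """Stand-in for any hash that covers the recorded source path."""
    return hashlib.sha256(source_filename.encode()).hexdigest()


# Hypothetical checkouts of the same tree by two users.
alice = "/home/alice/src/llvm-project/clang/tools/driver/driver.cpp"
bob = "/work/bob/llvm-project/clang/tools/driver/driver.cpp"

# Each user maps their own checkout root to the same neutral prefix.
alice_map = [("/home/alice/src/llvm-project", ".")]
bob_map = [("/work/bob/llvm-project", ".")]

raw_equal = module_fingerprint(alice) == module_fingerprint(bob)
mapped_alice = apply_prefix_map(alice, alice_map)
mapped_bob = apply_prefix_map(bob, bob_map)
mapped_equal = module_fingerprint(mapped_alice) == module_fingerprint(mapped_bob)

print(raw_equal)     # False
print(mapped_equal)  # True
print(mapped_alice)  # ./clang/tools/driver/driver.cpp
```

Whether the real flag reaches every place a path is recorded (for example the source_filename attribute mentioned above) is exactly the open question; the sketch only shows why removing the prefix from the effective input would help.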

For workloads other than clang though, I can imagine the situation is equally bad, even if a program ‘has a deterministic output’. As a thought experiment, as soon as a program involves filenames, inodes, or non-deterministic hash-map iteration order, you can’t expect it to ‘be internally deterministic’, so its profile may not be deterministic either.
