TL;DR: I discuss some sources of non-determinism I've seen in the clang-as-workload PGO+LTO build, even though clang is considered a deterministic(ish?) workload. I share a wishlist of items which could improve the clang build situation. I note that the situation for an arbitrary non-trivial workload, based on what I've seen with clang, is likely worse.
This is rarely achieved in practice. We hold build tools to a high standard of determinism […]
Indeed, for example, in my experience clang's externally observable determinism seems pretty good in practice. (And I understand it's considered a bug if clang does not produce the same output for the same input.)
For profiling though, internal determinism matters. That is to say, the control flow needs to be deterministic. If you profile clang, you see variations in the instrumented profile unless you run with setarch -R to disable ASLR. With this, the determinism of the profile improves. Presumably this is explained by pointers finding their way into hash maps: pointer values vary with ASLR, and that causes changes in control flow.
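A toy illustration of that suspected mechanism, not clang's actual data structures: the hypothetical `iteration_order` helper below buckets simulated object addresses the way a pointer-keyed hash table might. The base addresses and offsets are made-up values standing in for what ASLR would change between runs.

```python
def iteration_order(base, offsets, n_buckets=7):
    """Order in which a toy pointer-keyed hash table would visit its entries.

    `base` stands in for the ASLR-dependent load address and `offsets` for
    the fixed layout of objects relative to it. Both are made-up values.
    """
    # Toy pointer hash: drop the low alignment bits, then pick a bucket.
    buckets = [((base + off) // 16) % n_buckets for off in offsets]
    # Iterating the table visits entries bucket by bucket.
    return sorted(range(len(offsets)), key=lambda i: buckets[i])


offsets = [0, 16, 32, 48]             # the same four objects in both runs
run_a = iteration_order(0, offsets)   # one hypothetical load address
run_b = iteration_order(80, offsets)  # another hypothetical load address

print(run_a)  # [0, 1, 2, 3]
print(run_b)  # [2, 3, 0, 1]
```

Any code that walks such a table and branches per entry would execute its branches in a different sequence in the two runs, which is enough to perturb an instrumented profile.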
Other non-obvious things creep into the "input" which are experienced by the user as non-determinism. So, the workload can be deterministic (from run to run with no external changes, it is internally deterministic), but two users running it will get different results.
For example, clang with carefully arranged LTO+PGO produces bitwise identical outputs most of the time, but you can get variations. If you have absolute paths to the build directory, the build directory name can leak into some (but not all) bitcode. As a result, multiple builds done from the same source tree can differ when you compare them. The out-of-the-box PGO cmake configuration points LLVM_PROFDATA_FILE directly into the build directory, which results in the instrumented binaries varying according to the absolute path of the build directory.
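For comparing two side-by-side builds, a minimal sketch along these lines can list which outputs differ. The helper names `file_digests` and `differing_files` are hypothetical, and the sketch reads each file fully into memory, which is fine for a quick check but not tuned for very large trees.

```python
import hashlib
import os


def file_digests(root):
    """Map each file's path relative to `root` to the SHA-256 of its bytes."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def differing_files(build_a, build_b):
    """Relative paths that are missing from one tree or differ in content."""
    a, b = file_digests(build_a), file_digests(build_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Calling `differing_files("build-a", "build-b")` on two build directories (names assumed here) returns an empty list only when the trees are bitwise identical.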
Differing absolute paths to the source tree also cause variations in the compiler output (at least with LTO), because unfortunately cmake always passes absolute paths to source files, as best as I can tell. I couldn't find examples of this being configurable. This has the unfortunate effect, for example, that two users compiling PGO+LTO from different home directories can't expect the same clang binaries.
Curiously, for bin/clang, the build directory leaks only through the compilation of clang-driver.cpp. This trivial driver wrapper is generated and stored in the build directory, and cmake has no way to pass it as a relative path to the compiler, so the path ends up in the source_filename attribute of the IR module, and then in the ThinLTO summary hash. Consequently, somehow, this results in quite different binaries. I'm interested to understand the mechanism of this but haven't got to the bottom of it, if anyone has ideas.
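To make the first half of that chain concrete, here is a toy sketch. It is not LLVM's ThinLTO hashing; `toy_module_hash` and the paths are made up. It only shows that any hash covering the source_filename field changes when the build directory changes, even if the IR body is identical.

```python
import hashlib


def toy_module_hash(source_filename, ir_body):
    """Hash a module's source_filename together with its contents.

    NOT LLVM's real hashing; a stand-in to show the dependency on the path.
    """
    h = hashlib.sha256()
    h.update(source_filename.encode())
    h.update(b"\0")  # separator so the two fields cannot run together
    h.update(ir_body.encode())
    return h.hexdigest()


body = "define i32 @main() { ret i32 0 }"  # identical IR in both builds
hash_a = toy_module_hash("/home/alice/build-a/clang-driver.cpp", body)
hash_b = toy_module_hash("/home/alice/build-b/clang-driver.cpp", body)
hash_rel = toy_module_hash("clang-driver.cpp", body)

print(hash_a == hash_b)  # False: only the build directory differs
```

With a relative name such as the one used for `hash_rel`, the hash would no longer depend on where the build directory lives. How a differing hash then turns into quite different binaries is the part the sketch does not explain.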
Less obviously than paths, another interesting source of non-determinism for clang-as-a-workload is that the inodes of input files/directories go into hash maps inside clang. Therefore if you were to repeat a build with semantically identical inputs, you may still see variations in the profile from build to build if doing a fresh build from scratch, because the filesystem assigns inodes arbitrarily. And whether or not this results in profile differences appears to be subtle; sometimes it does, sometimes it doesn't. (Presumably depending on things like whether the hash map entries on inode keys have collisions or not.) This gives the appearance of "almost working" and giving the same clang binary output from a build process, except when it doesn't.
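A small sketch of why inodes count as hidden input: two checkouts with byte-identical headers still get distinct (device, inode) identities from the filesystem. The directory and file names below are made up, and `inode_key` is a hypothetical helper, not a clang function.

```python
import os
import tempfile


def inode_key(path):
    """(device, inode) pair, the kind of identity a file-system layer may use."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    keys, contents = [], []
    for checkout in ("checkout_a", "checkout_b"):
        os.makedirs(os.path.join(tmp, checkout))
        header = os.path.join(tmp, checkout, "config.h")
        with open(header, "w") as fh:
            fh.write("#pragma once\n")
        with open(header) as fh:
            contents.append(fh.read())
        keys.append(inode_key(header))

same_content = contents[0] == contents[1]
same_identity = keys[0] == keys[1]
print(same_content, same_identity)  # True False
```

Any hash map keyed on values like these can lay out its entries differently from one fresh build to the next, even though the file contents did not change.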
When the profile differs, even by a trivial amount, the difference appears to find its way into the ThinLTO summary hash and, consequently, into the compiled binaries.
I don't know if that covers all of the sources of non-determinism for clang; there may be more. [Windows] Avoid using FileIndex for unique IDs · llvm/llvm-project@02a3754 · GitHub was interesting because it dropped the "inode equivalent" on Windows and replaced it with a deterministic hash derived from the path. This was a fix for a subtle bug where a filesystem could return the same FileIndex for different files, making clang conclude it had already seen a file where it had not. I imagine that was a maddening bug to track down.
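The general idea of a path-derived ID can be sketched as follows. This is an illustration only, not the code from that commit; `path_unique_id` and the choice of SHA-256 truncated to 64 bits are my assumptions.

```python
import hashlib
import os


def path_unique_id(path):
    """Derive a 64-bit ID from a normalized path instead of from the inode."""
    normalized = os.path.normpath(path).replace(os.sep, "/")
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


id_one = path_unique_id("include/./a.h")
id_two = path_unique_id("include/a.h")
id_other = path_unique_id("include/b.h")
print(id_one == id_two, id_one == id_other)  # True False
```

Such an ID is stable across runs and machines for the same path string, but it treats two paths to the same file (hard links, symlinks) as different files, and an absolute path would bring the directory name back into the input.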
My wishlist (unicorns and rainbows!) to improve the situation for repeatable PGO/LTO clang builds would be:
- Don't let inodes become a part of the input with respect to the internal determinism of clang, since inodes are not reproducible between users. (This would require a change to the VFS layer's implementation of getUniqueID.)
- Don't let the build directory become a part of the input, since a user may wish to have multiple side-by-side builds and determine if they are the same.
  - It's really close to having this property already; it appears only the generated driver causes this.
- Don't let the source directories become a part of the input.
  - This seems harder; you would have most of this property already if CMake would pass source files by relative path (and you ensured a constant relative path between build directory and source directory).
  - Alternatively, I'm not sure whether clang could offer a feature to assist in its treatment of paths, maybe something along the lines of -ffile-prefix-map. Or maybe that flag could already help, if the clang cmake build had a way to use it to drop the parent directory of the source and build directories from the effective input.
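As a sketch of what the last wishlist item is after, here is a toy model of OLD=NEW prefix remapping. It is not the compiler's implementation; `apply_prefix_map`, `prefix_map_flag`, and the user directories are hypothetical. Whether -ffile-prefix-map actually reaches source_filename or the ThinLTO hash is untested here.

```python
def apply_prefix_map(path, old_prefix, new_prefix):
    """Rewrite `path` if it lies under `old_prefix` (toy OLD=NEW remapping)."""
    old = old_prefix.rstrip("/")
    if path == old or path.startswith(old + "/"):
        return new_prefix + path[len(old):]
    return path


def prefix_map_flag(old_prefix, new_prefix):
    """Spell the mapping as a -ffile-prefix-map=OLD=NEW compiler argument."""
    return f"-ffile-prefix-map={old_prefix}={new_prefix}"


# Two users with different checkout locations (made-up paths).
alice = apply_prefix_map(
    "/home/alice/llvm-project/clang/lib/Basic/FileManager.cpp",
    "/home/alice/llvm-project", ".")
bob = apply_prefix_map(
    "/work/bob/llvm-project/clang/lib/Basic/FileManager.cpp",
    "/work/bob/llvm-project", ".")

print(alice)         # ./clang/lib/Basic/FileManager.cpp
print(alice == bob)  # True
print(prefix_map_flag("/home/alice/llvm-project", "."))
```

If every path-derived input were remapped this way before being hashed or embedded, the two users' effective inputs would coincide.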
For workloads other than clang though, I can imagine the situation is equally bad, even if a program "has a deterministic output". As a thought experiment, as soon as it involves filenames, inodes, or non-deterministic hash-map iteration order, you can't expect a program to "be internally deterministic", depending on various circumstances, so its profile also may not be deterministic.