Basics of using instrumented BOLT

I appear to have successfully generated a myprogram.bolt executable from the original binary, which was PGO’ed with gcc. However, I did not see any improvement in performance. In fact, it is a bit slower, even on the exact same input.

  1. Using gcc, when does the -fno-reorder-blocks-and-partition flag need to be included? Is this before PGO, after PGO, or both? Also, should I expect an error if I did this incorrectly, or would I just fail to see any improvement?

  2. When the program was run multiple times it could not update prof.fdata. I saw:
    Error while trying to open profile file for writing: /tmp/prof.fdata
    Basically I need to run the same program with various inputs. Am I correct that, in contrast with using gcc for PGO, it is required to generate a different fdata file for each run then combine them (as described in the readme)?

Thanks for any help.

Using gcc, when does the -fno-reorder-blocks-and-partition flag need to be included? Is this before PGO, after PGO, or both?

The flag needs to be used at least for the pre-BOLT binary build, but it would make sense to apply it to all builds (training and PGO).
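As a rough sketch of what that could look like with gcc, assuming a single-file program (myprogram.c is a placeholder) and -O2 as the base optimization level. The -Wl,--emit-relocs linker option is not part of the question; it comes from the BOLT README, which recommends keeping relocations so that BOLT can reorder functions. The script only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Flags shared by every build stage, so that no stage emits split (hot/cold) functions.
# -O2 and the file name myprogram.c are illustrative assumptions.
COMMON="-O2 -fno-reorder-blocks-and-partition"

# Training build for gcc PGO (run the result on representative inputs afterwards).
TRAIN_CMD="gcc $COMMON -fprofile-generate myprogram.c -o myprogram"

# PGO build: this is the pre-BOLT binary. --emit-relocs keeps relocations for BOLT.
PGO_CMD="gcc $COMMON -fprofile-use myprogram.c -o myprogram -Wl,--emit-relocs"

# Dry run: print the commands; execute one with e.g. eval "$TRAIN_CMD".
echo "$TRAIN_CMD"
echo "$PGO_CMD"
```

Both stages use the same output name here so that gcc looks for the profile data under the same name it was written with.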

Also, should I expect an error if I did this incorrectly, or would I just fail to see any improvement?

If the input binary contains split functions, BOLT issues a warning: “split function detected on input”, and such functions may not be optimized.
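If you want to check a binary before handing it to BOLT, one possible approach is to look for the symbols gcc emits for cold partitions. This is a sketch under an assumption: gcc names the cold part of a split function name.cold (or name.cold.N in older releases), and the binary is not stripped. The helper name count_cold_symbols is made up for this example.

```shell
#!/bin/sh
# Count symbols that look like gcc cold partitions ("name.cold" or "name.cold.N").
# Reads a symbol listing (one symbol per line, name last) on stdin
# and prints the number of matching lines.
count_cold_symbols() {
    grep -c -E '\.cold(\.[0-9]+)?$'
}
```

Usage would be something like nm myprogram | count_cold_symbols. A nonzero count suggests the binary still contains split functions, and the BOLT warning remains the authoritative check.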

When the program was run multiple times it could not update prof.fdata. I saw:
Error while trying to open profile file for writing: /tmp/prof.fdata
Basically I need to run the same program with various inputs.

You can keep the profiles from separate runs apart by having a suffix appended to the prof.fdata file name: pass -instrumentation-file-append-pid when instrumenting the binary.

The profiles can then be merged into one using BOLT's merge-fdata tool (it should be built as part of the bolt component, or you can just run ninja merge-fdata in a configured LLVM build directory).
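Put together, the whole flow might look roughly like the sketch below. The file names (myprogram, input1, input2, /tmp/combined.fdata) are placeholders, and the optimization flags in the last step are the ones suggested in the BOLT README rather than anything specific to this thread. The glob assumes that every per-process profile keeps the /tmp/prof.fdata prefix; remove stale profiles from earlier runs first so they are not merged in. The script only prints the commands for review.

```shell
#!/bin/sh
# Placeholder names: myprogram, input1/input2, and the /tmp paths.
PROF=/tmp/prof.fdata

# 1. Instrument; each run then writes its own profile with the PID in the file name.
INSTR_CMD="llvm-bolt myprogram -instrument -o myprogram.inst -instrumentation-file=$PROF -instrumentation-file-append-pid"

# 2. One run per training input (input1 and input2 are hypothetical).
RUN_CMDS="./myprogram.inst input1; ./myprogram.inst input2"

# 3. Merge every per-process profile into one file.
MERGE_CMD="merge-fdata $PROF.* > /tmp/combined.fdata"

# 4. Optimize the original (non-instrumented) binary with the merged profile.
OPT_CMD="llvm-bolt myprogram -o myprogram.bolt -data=/tmp/combined.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats"

# Dry run: print each step; execute one with e.g. eval "$MERGE_CMD".
printf '%s\n' "$INSTR_CMD" "$RUN_CMDS" "$MERGE_CMD" "$OPT_CMD"
```

Because the commands are held in quoted strings, the glob in step 3 is expanded only when that step is actually run.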


I didn’t end up seeing a benefit but it all worked great, thanks. Perhaps my program/runtime was too small/short.