Skip to content

[ci] limit parallel windows compile jobs to 24 #93329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

lnihlen
Copy link
Contributor

@lnihlen lnihlen commented May 24, 2024

This is an experiment to see if we can prevent some of the compiler OOMs happening without unduly impacting the Windows build latency.

@lnihlen lnihlen requested review from tstellar and AaronBallman May 24, 2024 18:37
@lnihlen
Copy link
Contributor Author

lnihlen commented May 24, 2024

I want to roll this for a few days and see how it impacts things, please take a look.

Copy link
Collaborator

@tstellar tstellar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also LLVM_RAM_PER_COMPILE_JOB which might scale better if the machine sizes change, but I'm not sure if this works for Windows.

You may also want to use LLVM_RAM_PER_LINK_JOB/LLVM_PARALLEL_LINK_JOBS too.

@lnihlen
Copy link
Contributor Author

lnihlen commented May 24, 2024

I wasn't aware of the *_RAM_PER_COMPILE_JOB parameters, that's interesting and might be worth a try, although not sure it would help with compiles OOMing. I haven't seen link OOMs so I didn't want to limit that phase, even though in my own experience its the link that causes OOMs the most. But this is the MSVC toolchain, so it's a bit different of a situation, I suppose.

@Endilll
Copy link
Contributor

Endilll commented May 25, 2024

One thing to be aware of is https://learn.microsoft.com/en-us/cpp/build/reference/cgthreads-code-generation-threads?view=msvc-170:

By default, cl.exe uses four threads, as if /cgthreads4 were specified.

As far as I can see, we don't override it anywhere. /cgthreads1 should reduce the thread contention (ninja already launches enough to put all the cores to work), and, hopefully, RAM usage.

@lnihlen
Copy link
Contributor Author

lnihlen commented May 27, 2024

I think the problem is this is all guesswork. I'm looking in to some of the performance monitoring steps suggested by others, to see if we can collect some actual data about what's going on with the Windows build servers. But, in the mean time, if builds are unreliable due to OOM, this may be a band-aid we can put on the problem until we have a more informed solution.

@Meinersbur
Copy link
Member

@Endilll I think this PR is to avoid OOM errors (Compiling Flang requires >8GB RAM per .cpp). msvc being multi-threaded (i.e. NOT specifying /cgthreads1) still enables more parallelism that those 24 compile jobs. In my experience, /cgthreads does not change memory usage.

@lnihlen
Copy link
Contributor Author

lnihlen commented May 27, 2024

If compiling flang really does require >8 GB per .cpp file on MSVC, it probably makes sense to drop this to 16 threads. Although none of the analytics show memory pressure as a problem (see the chart I posted in the Discourse thread), I'm wondering if we're running slowly because we're swap thrashing, because reducing paralellism doesn't seem to negatively impact build times, at least based on my limited testing.

@Meinersbur
Copy link
Member

Meinersbur commented May 27, 2024

I collected statistics some time ago: https://discourse.llvm.org/t/build-time-comparision/56024 (including for /cgthreds1). Usage may have grown since then.

According to the Discourse post, there is a 20 min wait where nothing happens. This matches the --timeout=1200 LIT parameter. That is, there may be a deadlock in some test that is eventually given up on by LIT.

Copy link
Collaborator

FWIW, I've also found I need to set LLVM_PARALLEL_LINK_JOBS to something low (like 2-4) unless building with lld. link.exe tends to be quite memory hungry as well, so maybe that's another option to experiment with here?

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as far as the experiment goes, in case you want to land it to see if it solves the OOM issues immediately. We can explore other options in a follow-up if desired.

Copy link
Collaborator

I collected statistics some time ago: https://discourse.llvm.org/t/build-time-comparision/56024 (including for /cgthreds1). Usage may have grown since then.

According to the Discourse post, there is a 20 min wait where nothing happens. This matches the --timeout=1200 LIT parameter. That is, there may be a deadlock in some test that is eventually given up on by LIT.

FWIW, I see this from a clangd test with some regularity:

timeout=1200 --time-tests C:/ws/src/build/tools/clang/tools/extra/include-cleaner/test C:/ws/src/build/tools/clang/tools/extra/pseudo/test C:/ws/src/build/tools/clang/tools/extra/clangd/test/../unittests C:/ws/src/build/tools/clang/tools/extra/clangd/test C:/ws/src/build/tools/clang/tools/extra/test"

from https://buildkite.com/llvm-project/github-pull-requests/builds/65934

@Endilll
Copy link
Contributor

Endilll commented May 28, 2024

FWIW, I see this from a clangd test with some regularity:

timeout=1200 --time-tests C:/ws/src/build/tools/clang/tools/extra/include-cleaner/test C:/ws/src/build/tools/clang/tools/extra/pseudo/test C:/ws/src/build/tools/clang/tools/extra/clangd/test/../unittests C:/ws/src/build/tools/clang/tools/extra/clangd/test C:/ws/src/build/tools/clang/tools/extra/test"

from https://buildkite.com/llvm-project/github-pull-requests/builds/65934

This was fixed in #92888. Make sure your branches are up to date.

@AaronBallman
Copy link
Collaborator

FWIW, I see this from a clangd test with some regularity:

timeout=1200 --time-tests C:/ws/src/build/tools/clang/tools/extra/include-cleaner/test C:/ws/src/build/tools/clang/tools/extra/pseudo/test C:/ws/src/build/tools/clang/tools/extra/clangd/test/../unittests C:/ws/src/build/tools/clang/tools/extra/clangd/test C:/ws/src/build/tools/clang/tools/extra/test"

from https://buildkite.com/llvm-project/github-pull-requests/builds/65934

This was fixed in #92888. Make sure your branches are up to date.

Good to know, thank you! Yeah, it seems that PR was updated before the fix went in, so it was missing the fix.

@lnihlen lnihlen merged commit d9dec10 into llvm:main May 28, 2024
5 checks passed
@lnihlen lnihlen deleted the topic/lnihlen-limit-parallel-windows-ci-jobs branch May 28, 2024 19:53
vg0204 pushed a commit to vg0204/llvm-project that referenced this pull request May 29, 2024
This is an experiment to see if we can prevent some of the compiler OOMs
happening without unduly impacting the Windows build latency.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants