Description
To help python/docsbuild-scripts#169.
Current situation
Right now the docs server is taking over 40 hours to build a full set of 3.12-3.14 docs, plus 12 translations each:
List of versions/languages
3.14/zh-tw
3.14/zh-cn
3.14/uk
3.14/tr
3.14/pt-br
3.14/pl
3.14/ko
3.14/ja
3.14/it
3.14/id
3.14/fr
3.14/es
3.14/en
3.13/zh-tw
3.13/zh-cn
3.13/uk
3.13/tr
3.13/pt-br
3.13/pl
3.13/ko
3.13/ja
3.13/it
3.13/id
3.13/fr
3.13/es
3.13/en
3.12/zh-tw
3.12/zh-cn
3.12/uk
3.12/tr
3.12/pt-br
3.12/pl
3.12/ko
3.12/ja
3.12/it
3.12/id
3.12/fr
3.12/es
3.12/en
Nearly all of these include HTML, plain text, PDF, Texinfo and EPUB (Ukrainian is HTML only). HTML-only is fast to build, about 3-4 minutes. The full set of artifacts is much slower, between 40 minutes and two hours depending on the language, mostly because of the LaTeX run needed to produce the PDFs.
What happens is:
A cron job fires at 7 minutes past the hour and tries to start a new full build loop:
- If there's a build running (= lockfile found), the new one exits and allows the running one to continue.
- If there's no build running, it creates its own lockfile and starts a new build.
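The lockfile guard above can be sketched roughly like this; the lockfile path and function name are illustrative, not the real docsbuild-scripts code:

```shell
# Hypothetical sketch of the hourly entry point's lockfile guard.
LOCKFILE="${TMPDIR:-/tmp}/docsbuild-example.lock"   # assumed path

run_build_if_free() {
    if [ -e "$LOCKFILE" ]; then
        # A build is already running (lockfile found): exit and let it finish.
        echo "build already running, exiting"
        return 0
    fi
    touch "$LOCKFILE"      # claim the lock
    echo "building docs"   # placeholder for the real full build loop
    rm -f "$LOCKFILE"      # release the lock when done
}
```

In practice `flock(1)` or a `trap ... EXIT` is safer than a bare check-then-touch, since the check has a race and a crashed build would leave a stale lockfile behind.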
For each language/version, we only do a build if the docs or the translation have changed since last time. This is good: there's no point rebuilding something that hasn't changed.
However, because the full loop takes over 40 hours, inevitably there have been docs or translation changes since the last time, and we get a full rebuild each time.
This results in long delays before docs updates appear, not to mention high server resource usage.
HTML vs. PDF
We have download stats for the HTML docs, but we don't have download numbers for the other artifacts to compare.
However, I'm certain the HTML is by far the most used, and there's the most benefit to getting fresh HTML up quickly.
An affordance of websites is being able to look up just the pages you need, on demand. A PDF, by contrast, is something you download once and use as an offline reference. You might re-download it later, but there's less benefit in updating it often, because the copy you usually consult is an old, offline one.
Proposal
I suggest we have two cron jobs:
- The current hourly job only builds HTML.
- A new job builds everything except HTML.
1. HTML only
The HTML-only job will run much more quickly, so new changes will be built and uploaded much sooner.
It's more likely that on the next pass, some languages can be skipped because there's nothing to update this time round.
2. Everything but HTML
This will be much slower than the HTML-only job, and will take about as long as the current full loop does now.
It may be a bit quicker because it no longer builds HTML, or a bit slower because the HTML job will sometimes be using CPU at the same time. However, the majority of the time is spent in a LaTeX command running on a single CPU, so it might not make much difference.
We also don't need to update the non-HTML as often, so its cron could be every few days?
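Concretely, the two jobs could look something like this in the crontab. The schedules, script name and flags are purely illustrative, not the actual docsbuild-scripts interface:

```crontab
# Hourly, HTML only: fast, keeps the website fresh.
7 * * * *    /srv/docsbuild/build_docs.sh --html-only

# Every third day: the slow artifacts (PDF, EPUB, Texinfo, plain text).
17 2 */3 * * /srv/docsbuild/build_docs.sh --skip-html
```

The exact cadence of the second job is open; anything from daily to weekly would still be a big improvement over tying the HTML refresh to a 40-hour loop.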