use -flto=thin for clang-cl on Windows #131035

Closed
@chris-eibl

This started off as a build time analysis (#130090 (comment)), but since I now have the infrastructure, I tried -flto=thin, too:

  • builds faster: 520.6 vs 651.2 seconds in total, about 20% less
  • is neutral on the pyperformance benchmarks (geometric mean 1.00x)
  • would bring us in sync with Linux: when I configure for clang in WSL Ubuntu-24.04, CONFIGURE_CFLAGS_NODIST and CONFIGURE_LDFLAGS_NOLTO both use -flto=thin. See also the discussion of why not to use full -flto in "Revert to default fullLTO on Clang" (#130048).
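
To check which LTO mode an existing POSIX build was configured with, something along these lines might help. It is only a sketch: `lto_mode` is a hypothetical helper, it recognizes just the clang spellings `-flto`, `-flto=full` and `-flto=thin`, and on Windows `sysconfig.get_config_var` returns `None` for these Makefile variables.

```python
import sysconfig


def lto_mode(flags):
    """Classify a flag string as 'thin', 'full', or None (no LTO flag seen).

    Hypothetical helper: only the clang spellings -flto, -flto=full and
    -flto=thin are recognized. The last matching flag wins.
    """
    mode = None
    for token in (flags or "").split():
        if token == "-flto=thin":
            mode = "thin"
        elif token in ("-flto", "-flto=full"):
            mode = "full"
    return mode


if __name__ == "__main__":
    # Variable names are the ones mentioned in the issue text.
    for name in ("CONFIGURE_CFLAGS_NODIST", "CONFIGURE_LDFLAGS_NOLTO"):
        value = sysconfig.get_config_var(name)  # None if the variable is absent
        print(name, "->", lto_mode(value), repr(value))
```

Run with the interpreter built in WSL, this prints the classification next to the raw flag string, so the claim about `-flto=thin` can be checked directly.
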

| Benchmark | clang.pgo.20.1.0-rc2 | clang.pgo.thin.20.1.0-rc2 |
|---|---|---|
| Geometric mean | (ref) | 1.00x faster |

Detailed pybenchmark results

| Benchmark | clang.pgo.20.1.0-rc2 | clang.pgo.thin.20.1.0-rc2 |
|---|---|---|
| float | 95.0 ms | 89.7 ms: 1.06x faster |
| json_loads | 29.8 us | 28.6 us: 1.04x faster |
| mdp | 2.86 sec | 2.77 sec: 1.03x faster |
| html5lib | 68.3 ms | 66.2 ms: 1.03x faster |
| async_tree_none_tg | 330 ms | 320 ms: 1.03x faster |
| pyflate | 518 ms | 505 ms: 1.03x faster |
| sqlite_synth | 3.21 us | 3.13 us: 1.03x faster |
| pidigits | 228 ms | 223 ms: 1.02x faster |
| bench_mp_pool | 168 ms | 165 ms: 1.02x faster |
| async_tree_eager_io | 742 ms | 727 ms: 1.02x faster |
| generators | 34.5 ms | 33.8 ms: 1.02x faster |
| comprehensions | 18.3 us | 17.9 us: 1.02x faster |
| async_tree_cpu_io_mixed | 641 ms | 629 ms: 1.02x faster |
| scimark_sparse_mat_mult | 4.51 ms | 4.43 ms: 1.02x faster |
| async_tree_memoization | 425 ms | 417 ms: 1.02x faster |
| sympy_expand | 538 ms | 529 ms: 1.02x faster |
| unpack_sequence | 57.0 ns | 56.0 ns: 1.02x faster |
| regex_dna | 209 ms | 205 ms: 1.02x faster |
| async_generators | 465 ms | 458 ms: 1.02x faster |
| scimark_sor | 140 ms | 137 ms: 1.02x faster |
| sympy_str | 319 ms | 314 ms: 1.02x faster |
| async_tree_io_tg | 751 ms | 740 ms: 1.01x faster |
| regex_effbot | 3.14 ms | 3.10 ms: 1.01x faster |
| async_tree_eager_tg | 272 ms | 268 ms: 1.01x faster |
| pickle_dict | 27.3 us | 27.0 us: 1.01x faster |
| async_tree_eager_memoization_tg | 363 ms | 359 ms: 1.01x faster |
| sympy_integrate | 22.5 ms | 22.2 ms: 1.01x faster |
| sympy_sum | 181 ms | 179 ms: 1.01x faster |
| 2to3 | 390 ms | 386 ms: 1.01x faster |
| hexiom | 6.68 ms | 6.61 ms: 1.01x faster |
| docutils | 3.03 sec | 3.00 sec: 1.01x faster |
| sqlglot_normalize | 121 ms | 120 ms: 1.01x faster |
| async_tree_memoization_tg | 392 ms | 389 ms: 1.01x faster |
| async_tree_cpu_io_mixed_tg | 614 ms | 609 ms: 1.01x faster |
| tomli_loads | 2.20 sec | 2.18 sec: 1.01x faster |
| spectral_norm | 102 ms | 101 ms: 1.01x faster |
| python_startup_no_site | 34.4 ms | 34.2 ms: 1.01x faster |
| genshi_text | 24.6 ms | 24.5 ms: 1.01x faster |
| dulwich_log | 119 ms | 118 ms: 1.00x faster |
| go | 128 ms | 128 ms: 1.00x faster |
| deltablue | 3.62 ms | 3.63 ms: 1.00x slower |
| unpickle_pure_python | 247 us | 248 us: 1.00x slower |
| xml_etree_generate | 107 ms | 107 ms: 1.01x slower |
| django_template | 39.2 ms | 39.4 ms: 1.01x slower |
| coroutines | 24.8 ms | 25.0 ms: 1.01x slower |
| mako | 13.3 ms | 13.5 ms: 1.01x slower |
| unpickle | 15.9 us | 16.1 us: 1.01x slower |
| nbody | 119 ms | 121 ms: 1.01x slower |
| fannkuch | 465 ms | 472 ms: 1.01x slower |
| crypto_pyaes | 81.3 ms | 82.6 ms: 1.02x slower |
| json_dumps | 11.5 ms | 11.7 ms: 1.02x slower |
| deepcopy | 285 us | 291 us: 1.02x slower |
| pprint_safe_repr | 858 ms | 876 ms: 1.02x slower |
| xml_etree_iterparse | 136 ms | 139 ms: 1.02x slower |
| gc_traversal | 5.03 ms | 5.14 ms: 1.02x slower |
| meteor_contest | 115 ms | 117 ms: 1.02x slower |
| deepcopy_memo | 33.8 us | 34.7 us: 1.03x slower |
| richards_super | 51.1 ms | 52.6 ms: 1.03x slower |
| scimark_fft | 327 ms | 337 ms: 1.03x slower |
| richards | 44.9 ms | 46.3 ms: 1.03x slower |
| pickle_list | 4.83 us | 4.99 us: 1.03x slower |
| deepcopy_reduce | 2.93 us | 3.03 us: 1.03x slower |
| pprint_pformat | 1.74 sec | 1.80 sec: 1.03x slower |
| logging_simple | 10.9 us | 11.4 us: 1.05x slower |
| logging_format | 12.1 us | 12.6 us: 1.05x slower |
| xml_etree_parse | 197 ms | 208 ms: 1.05x slower |
| Geometric mean | (ref) | 1.00x faster |
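
As a rough illustration of how a geometric-mean summary can come out at 1.00x while individual benchmarks move by up to about 5% either way, the sketch below averages the per-benchmark ratios for a handful of rows copied from the table. It assumes the summary is the geometric mean of reference/new ratios, and `geometric_mean_speedup` is a hypothetical helper, not pyperf's own code. Because only a subset is used, the result is not expected to reproduce the reported figure exactly.

```python
from statistics import geometric_mean

# (reference mean, thin mean) for a few rows of the table above.
# Units differ between rows but cancel inside each ratio.
SAMPLE = {
    "float": (95.0, 89.7),
    "json_loads": (29.8, 28.6),
    "go": (128, 128),
    "logging_simple": (10.9, 11.4),
    "xml_etree_parse": (197, 208),
}


def geometric_mean_speedup(timings):
    """Geometric mean of ref/new ratios; above 1 means the new build is faster."""
    return geometric_mean([ref / new for ref, new in timings.values()])


if __name__ == "__main__":
    print(f"{geometric_mean_speedup(SAMPLE):.2f}x")
```

Gains and losses of similar size cancel under a geometric mean, which is why a table with both "faster" and "slower" rows can still summarize as neutral.
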

| Step (times in seconds) | pgo_clang_20.1.0-rc2 | pgo_clang_thin_20.1.0-rc2 |
|---|---|---|
| pginstr | 297.2 | 219.3 |
| pgo | 70.0 | 69.0 |
| kill | 1.2 | 0.5 |
| pgupd | 282.8 | 231.7 |
| total time | 651.2 | 520.6 |

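
The per-step numbers can be cross-checked against the totals with a few lines of arithmetic. This is a sketch with hypothetical helper names. The thin column sums to 520.5 from the rounded per-step values, 0.1 below the listed 520.6, which is presumably a rounding effect.

```python
# Step times in seconds, copied from the table above: (full LTO, thin LTO).
STEPS = {
    "pginstr": (297.2, 219.3),
    "pgo": (70.0, 69.0),
    "kill": (1.2, 0.5),
    "pgupd": (282.8, 231.7),
}


def column_totals(steps):
    """Sum each column, rounded to the table's one-decimal precision."""
    full = sum(times[0] for times in steps.values())
    thin = sum(times[1] for times in steps.values())
    return round(full, 1), round(thin, 1)


def percent_saved(full, thin):
    """Percentage of the full-LTO time that the thin-LTO build saves."""
    return round(100 * (full - thin) / full, 1)


if __name__ == "__main__":
    print("column totals:", column_totals(STEPS))
    for name, (full, thin) in STEPS.items():
        print(f"{name}: {percent_saved(full, thin)}% less time")
    print(f"overall: {percent_saved(651.2, 520.6)}% less time")
```

Most of the saving comes from the two compile-and-link steps (pginstr and pgupd), while the profile-collection step (pgo) is essentially unchanged.
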
Details pginstrument

| Project | pgo_clang_20.1.0-rc2 | pgo_clang_thin_20.1.0-rc2 |
|---|---|---|
| _freeze_module | 38.5 | 40.0 |
| python314 | 141.5 | 81.3 |
| pyexpat | 52.7 | 3.9 |
| _elementtree | 51.8 | 5.3 |
| sqlite3 | 46.0 | 42.4 |
| liblzma | 18.2 | 16.5 |
| _decimal | 12.4 | 7.7 |
| _testcapi | 8.3 | 7.1 |
| _bz2 | 7.0 | 4.9 |
| _ctypes | 6.9 | 7.5 |
| _testlimitedcapi | 4.9 | 4.3 |
| _wmi | 4.5 | 3.0 |
| _overlapped | 4.5 | 3.2 |
| _asyncio | 4.0 | 5.2 |
| _lzma | 3.8 | 1.8 |
| _ssl | 3.7 | 5.5 |
| _ctypes_test | 3.7 | 3.4 |
| _multiprocessing | 3.5 | 2.7 |
| _sqlite3 | 3.4 | 2.8 |
| venvwlauncher | 3.3 | 2.7 |
| _zoneinfo | 3.1 | 3.4 |
| unicodedata | 2.7 | 3.0 |
| pyshellext | 2.7 | 2.6 |
| pyw | 2.7 | 2.7 |
| py | 2.6 | 2.5 |
| _socket | 2.4 | 3.7 |
| _testinternalcapi | 2.4 | 2.2 |
| _tkinter | 2.2 | 4.1 |
| _testclinic | 2.0 | 1.9 |
| _hashlib | 1.8 | 3.1 |
| select | 1.8 | 2.2 |
| venvlauncher | 1.8 | 1.7 |
| winsound | 1.7 | 3.3 |
| _uuid | 1.6 | 3.2 |
| _queue | 1.6 | 2.3 |
| _testembed | 1.5 | 1.5 |
| _testbuffer | 1.4 | 1.3 |
| pythonw | 1.1 | 1.1 |
| _testconsole | 1.1 | 1.1 |
| _testmultiphase | 1.0 | 1.0 |
| _testsinglephase | 1.0 | 1.0 |
| python | 1.0 | 0.9 |
| _testclinic_limited | 0.9 | 0.9 |
| _testimportmultiple | 0.9 | 0.9 |
| python3 | 0.5 | 0.5 |
| total | 465.8 | 303.3 |
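
To see which projects account for the difference in the pginstrument step, the rows can be ranked by how much time each one saves. The sketch below uses only a hand-picked subset of rows from the table above, and `rank_by_saving` is a hypothetical helper.

```python
# (full LTO, thin LTO) times for a subset of the pginstrument rows above.
ROWS = {
    "python314": (141.5, 81.3),
    "pyexpat": (52.7, 3.9),
    "_elementtree": (51.8, 5.3),
    "sqlite3": (46.0, 42.4),
    "_ssl": (3.7, 5.5),
}


def rank_by_saving(rows):
    """Return (project, full - thin) pairs, largest saving first."""
    deltas = {name: round(full - thin, 1) for name, (full, thin) in rows.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, saved in rank_by_saving(ROWS):
        print(f"{name}: {saved:+.1f}")
```

In this subset, python314, pyexpat and _elementtree dominate the saving, while a small project such as _ssl gets slightly slower under thin LTO.
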

Details pgupdate

| Project | pgo_clang_20.1.0-rc2 | pgo_clang_thin_20.1.0-rc2 |
|---|---|---|
| _freeze_module | 38.0 | 39.5 |
| python314 | 141.9 | 95.4 |
| sqlite3 | 44.4 | 42.9 |
| liblzma | 17.3 | 16.5 |
| _decimal | 11.2 | 8.7 |
| _testcapi | 8.6 | 7.3 |
| _ctypes | 8.0 | 7.2 |
| _bz2 | 7.8 | 5.5 |
| _ssl | 5.2 | 5.6 |
| _testlimitedcapi | 5.0 | 4.2 |
| pyexpat | 4.6 | 3.6 |
| _asyncio | 4.5 | 4.6 |
| _socket | 4.3 | 3.5 |
| _tkinter | 4.0 | 4.2 |
| _ctypes_test | 3.7 | 3.4 |
| _overlapped | 3.5 | 3.7 |
| _elementtree | 3.5 | 4.5 |
| _wmi | 3.5 | 3.1 |
| _zoneinfo | 3.2 | 3.2 |
| _lzma | 3.2 | 1.9 |
| unicodedata | 3.2 | 3.0 |
| _sqlite3 | 3.1 | 2.7 |
| _hashlib | 3.1 | 3.3 |
| venvwlauncher | 3.1 | 3.0 |
| _multiprocessing | 2.8 | 2.6 |
| pyshellext | 2.7 | 2.6 |
| pyw | 2.6 | 2.6 |
| _uuid | 2.6 | 2.8 |
| py | 2.6 | 2.7 |
| _testinternalcapi | 2.4 | 2.2 |
| _testclinic | 2.0 | 1.9 |
| _queue | 1.9 | 2.2 |
| winsound | 1.8 | 3.0 |
| venvlauncher | 1.7 | 1.5 |
| select | 1.6 | 2.0 |
| _testembed | 1.5 | 1.4 |
| _testbuffer | 1.4 | 1.3 |
| _testconsole | 1.1 | 1.0 |
| pythonw | 1.1 | 1.1 |
| _testmultiphase | 1.0 | 1.1 |
| _testsinglephase | 1.0 | 1.0 |
| python | 1.0 | 0.9 |
| _testclinic_limited | 0.9 | 0.9 |
| _testimportmultiple | 0.9 | 0.9 |
| python3 | 0.5 | 0.5 |
| total | 372.9 | 316.8 |

Labels: OS-windows, build, performance, type-feature