bpo-46372: Try to mutate the LHS during fast float ops #30594


Closed · 18 commits

Conversation


@brandtbucher commented Jan 14, 2022

This modifies the existing int and float specializations to mutate the LHS operand in-place when possible. It also unifies the implementations of all of these operations.

https://bugs.python.org/issue46372

@sweeneyde (Member)

I think _PyLong_Add and the like could potentially be deleted as well.

@markshannon (Member) left a comment

A few general comments:

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow?
Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

It would be worth adding stats to give us some idea what fraction of operations can use this optimization or something like it. In other words, what fraction of operations have the lhs with a ref-count of 1 (or 2 when assigning to same variable)? That would be generally useful information beyond this particular optimization.

As a general rule, if something can be done at compile time, it should be. If not, then it should be done at specialization time. Only do stuff at runtime if we can't avoid it.
We can determine at compile time whether a binary operator is of the form s += x or s = s + x (where s is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the s = s + x and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise, the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher (Member, Author)

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

Yeah, that makes sense. That way they can also be updated as the implementation changes.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow? Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

Looks like your intuition is correct. Here is a comparison of main, int-only, float-only, and both:

| Benchmark | main-0 | inplace-ops-int-0 | inplace-ops-float-0 | inplace-ops-0 |
|---|---|---|---|---|
| 2to3 | 264 ms | not significant | not significant | 266 ms: 1.01x slower |
| chameleon | 7.64 ms | 7.40 ms: 1.03x faster | 7.43 ms: 1.03x faster | 7.45 ms: 1.03x faster |
| chaos | 72.9 ms | 72.3 ms: 1.01x faster | 74.2 ms: 1.02x slower | not significant |
| crypto_pyaes | 84.9 ms | 85.5 ms: 1.01x slower | 85.6 ms: 1.01x slower | 84.2 ms: 1.01x faster |
| deltablue | 4.21 ms | not significant | not significant | 4.24 ms: 1.01x slower |
| django_template | 35.7 ms | 36.2 ms: 1.02x slower | not significant | not significant |
| dulwich_log | 65.3 ms | 65.7 ms: 1.01x slower | 65.0 ms: 1.00x faster | 65.8 ms: 1.01x slower |
| fannkuch | 404 ms | 385 ms: 1.05x faster | 395 ms: 1.02x faster | 394 ms: 1.02x faster |
| float | 78.0 ms | 77.4 ms: 1.01x faster | 76.8 ms: 1.02x faster | not significant |
| go | 149 ms | 151 ms: 1.01x slower | 150 ms: 1.01x slower | not significant |
| hexiom | 6.65 ms | 6.58 ms: 1.01x faster | 6.71 ms: 1.01x slower | 6.76 ms: 1.02x slower |
| json_dumps | 12.7 ms | not significant | 12.9 ms: 1.02x slower | 12.8 ms: 1.01x slower |
| json_loads | 25.4 us | 25.6 us: 1.01x slower | 25.2 us: 1.01x faster | 25.5 us: 1.00x slower |
| logging_format | 6.51 us | 6.36 us: 1.02x faster | 6.42 us: 1.02x faster | 6.40 us: 1.02x faster |
| logging_silent | 111 ns | 108 ns: 1.03x faster | 110 ns: 1.01x faster | 107 ns: 1.04x faster |
| logging_simple | 5.90 us | 5.87 us: 1.00x faster | 5.86 us: 1.01x faster | not significant |
| mako | 11.4 ms | 11.4 ms: 1.00x faster | 11.5 ms: 1.01x slower | not significant |
| meteor_contest | 105 ms | 103 ms: 1.01x faster | 103 ms: 1.01x faster | 103 ms: 1.02x faster |
| nbody | 96.5 ms | 98.0 ms: 1.02x slower | 95.1 ms: 1.01x faster | not significant |
| nqueens | 83.9 ms | not significant | 85.2 ms: 1.02x slower | 85.5 ms: 1.02x slower |
| pathlib | 19.4 ms | not significant | not significant | 19.2 ms: 1.01x faster |
| pickle | 9.86 us | 9.71 us: 1.02x faster | 9.70 us: 1.02x faster | 10.1 us: 1.02x slower |
| pickle_dict | 27.2 us | 28.6 us: 1.05x slower | not significant | 28.2 us: 1.04x slower |
| pickle_list | 4.55 us | 4.61 us: 1.01x slower | 4.61 us: 1.01x slower | 4.41 us: 1.03x faster |
| pickle_pure_python | 335 us | 331 us: 1.01x faster | 333 us: 1.01x faster | 332 us: 1.01x faster |
| pidigits | 194 ms | 195 ms: 1.00x slower | 197 ms: 1.02x slower | 189 ms: 1.02x faster |
| pyflate | 447 ms | 442 ms: 1.01x faster | 441 ms: 1.01x faster | 441 ms: 1.01x faster |
| python_startup | 8.16 ms | 8.23 ms: 1.01x slower | 8.17 ms: 1.00x slower | 8.20 ms: 1.00x slower |
| python_startup_no_site | 5.79 ms | 5.83 ms: 1.01x slower | 5.80 ms: 1.00x slower | 5.82 ms: 1.00x slower |
| raytrace | 315 ms | 310 ms: 1.02x faster | 310 ms: 1.02x faster | 309 ms: 1.02x faster |
| regex_compile | 139 ms | 136 ms: 1.02x faster | 138 ms: 1.01x faster | 138 ms: 1.01x faster |
| regex_dna | 216 ms | 221 ms: 1.03x slower | not significant | 216 ms: 1.00x slower |
| regex_effbot | 3.26 ms | not significant | 3.33 ms: 1.02x slower | not significant |
| regex_v8 | 24.1 ms | not significant | 23.5 ms: 1.03x faster | 24.0 ms: 1.01x faster |
| richards | 53.1 ms | 51.6 ms: 1.03x faster | 51.4 ms: 1.03x faster | 51.6 ms: 1.03x faster |
| scimark_fft | 326 ms | 341 ms: 1.04x slower | 314 ms: 1.04x faster | 333 ms: 1.02x slower |
| scimark_lu | 112 ms | not significant | 110 ms: 1.01x faster | 109 ms: 1.03x faster |
| scimark_monte_carlo | 72.1 ms | 69.5 ms: 1.04x faster | 71.3 ms: 1.01x faster | 68.6 ms: 1.05x faster |
| scimark_sor | 125 ms | 121 ms: 1.04x faster | 124 ms: 1.01x faster | 121 ms: 1.04x faster |
| scimark_sparse_mat_mult | 4.65 ms | 4.76 ms: 1.02x slower | 4.41 ms: 1.05x faster | 4.30 ms: 1.08x faster |
| spectral_norm | 103 ms | 93.1 ms: 1.10x faster | 93.1 ms: 1.10x faster | 91.7 ms: 1.12x faster |
| sympy_expand | 500 ms | not significant | not significant | 502 ms: 1.01x slower |
| sympy_integrate | 21.3 ms | 21.3 ms: 1.00x faster | 21.4 ms: 1.00x slower | 21.4 ms: 1.00x slower |
| sympy_sum | 168 ms | not significant | 168 ms: 1.00x faster | 169 ms: 1.01x slower |
| sympy_str | 302 ms | 299 ms: 1.01x faster | 300 ms: 1.01x faster | not significant |
| telco | 6.52 ms | not significant | 6.63 ms: 1.02x slower | 6.35 ms: 1.03x faster |
| unpack_sequence | 44.4 ns | 45.4 ns: 1.02x slower | not significant | 45.9 ns: 1.03x slower |
| unpickle | 14.2 us | 14.4 us: 1.01x slower | 14.0 us: 1.02x faster | not significant |
| unpickle_list | 5.08 us | 5.16 us: 1.02x slower | 5.03 us: 1.01x faster | 5.12 us: 1.01x slower |
| unpickle_pure_python | 251 us | 247 us: 1.02x faster | 253 us: 1.01x slower | 247 us: 1.01x faster |
| xml_etree_parse | 155 ms | not significant | 157 ms: 1.01x slower | 157 ms: 1.01x slower |
| xml_etree_iterparse | 107 ms | not significant | not significant | 108 ms: 1.01x slower |
| xml_etree_generate | 81.5 ms | 80.6 ms: 1.01x faster | 80.7 ms: 1.01x faster | 80.9 ms: 1.01x faster |
| xml_etree_process | 57.9 ms | 57.7 ms: 1.00x faster | not significant | 58.4 ms: 1.01x slower |
| Geometric mean | (ref) | 1.00x faster | 1.01x faster | 1.01x faster |

As a general rule, if something can be done at compile time, it should be. If not, then it should be done at specialization time. Only do stuff at runtime if we can't avoid it. We can determine at compile time whether a binary operator is of the form s += x or s = s + x (where s is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the s = s + x and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I've updated the PR to use the adaptive entry for this purpose. I understand the benefits of a compile-time check, but doing the check at specialization time seems a lot simpler in this particular case.

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise, the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.

Shall I drop the int changes, then?

@markshannon (Member)

Looks good.
Do you have up to date performance numbers?


brandtbucher commented Jan 20, 2022

Do you have time for a quick Teams call @markshannon? Might be easier to chat in person about the stats I gathered.

(If not, maybe tomorrow or Monday.)


markshannon commented Jan 21, 2022

Somewhat surprising benchmarking results.
An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

@sweeneyde (Member)

Somewhat surprising benchmarking results. An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

pidigits slowdown should be somewhat expected, since it's mostly huge-integer operations, which are specialized and handled by _PyLong_Add and such before this PR but not after.

@brandtbucher changed the title from "bpo-46372: Try to mutate the LHS during fast int/float ops" to "bpo-46372: Try to mutate the LHS during fast float ops" on Jan 24, 2022
@brandtbucher (Member, Author)

Hm, looks like removing the refcount == 2 optimization from the float ops wiped out most of the performance gains.

I’ll go ahead and try the less-branchy version that we discussed this morning (an adaptive superinstruction based on the existing in-place string addition op).

@markshannon (Member)

LGTM. Feel free to merge once the merge conflicts are fixed.

@gvanrossum (Member) left a comment

Nice and simple!

Python/ceval.c (outdated):

```c
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
    PyFloat_AS_DOUBLE(lhs) = d; \
```

Presumably (see PEP 674) Victor won't like this use of a macro as a target. :-)

Nevertheless, this whole macro is so nice and straightforward compared to ints!

Python/ceval.c (outdated):

```c
double d = l OP r; \
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
```

Would there be any benefit in checking whether perhaps rhs has a refcount of 1, in case lhs doesn't? While people are probably more likely to write return x+1 rather than return 1+x (though even I could sometimes see a reason to do that, from "rhyming" with some other operations or just how the textbook presents the formula), writing return 2*x is more likely than return x*2. And of course for asymmetric operations you may not have a choice, like return 1/x.

Of course, there's a cost (more code, more branches). There's also the fact that += and friends steer this in the direction you're taking here.

Reply from the PR author:

Of course, there's a cost (more code, more branches). There's also the fact that += and friends steer this in the direction you're taking here.

Yeah, those are my main motivations for sticking with the LHS here.

@brandtbucher (Member, Author)

Okay, I just pushed variants of float += float and float -= float that optimize the case where lhs has two references: one on the stack, and one named local reference that is about to be overwritten by the next instruction. It costs one extra (fast, predictable) branch, but lets us both mutate lhs in-place and skip the following store entirely!

This gets us back up to a 1% improvement:

Slower (15):
- crypto_pyaes: 83.5 ms +- 0.9 ms -> 85.1 ms +- 1.2 ms: 1.02x slower
- sqlite_synth: 2.71 us +- 0.04 us -> 2.74 us +- 0.07 us: 1.01x slower
- django_template: 34.8 ms +- 0.4 ms -> 35.2 ms +- 0.5 ms: 1.01x slower
- richards: 49.5 ms +- 0.7 ms -> 50.1 ms +- 0.8 ms: 1.01x slower
- pickle_list: 4.51 us +- 0.06 us -> 4.56 us +- 0.05 us: 1.01x slower
- scimark_sor: 122 ms +- 1 ms -> 123 ms +- 2 ms: 1.01x slower
- fannkuch: 385 ms +- 2 ms -> 389 ms +- 5 ms: 1.01x slower
- float: 76.1 ms +- 0.8 ms -> 76.8 ms +- 0.9 ms: 1.01x slower
- deltablue: 4.07 ms +- 0.05 ms -> 4.10 ms +- 0.05 ms: 1.01x slower
- chameleon: 7.34 ms +- 0.09 ms -> 7.39 ms +- 0.12 ms: 1.01x slower
- pickle_pure_python: 326 us +- 2 us -> 328 us +- 3 us: 1.00x slower
- regex_compile: 135 ms +- 1 ms -> 135 ms +- 1 ms: 1.00x slower
- xml_etree_generate: 79.0 ms +- 0.6 ms -> 79.2 ms +- 0.9 ms: 1.00x slower
- python_startup_no_site: 5.87 ms +- 0.00 ms -> 5.87 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.23 ms +- 0.01 ms -> 8.24 ms +- 0.01 ms: 1.00x slower

Faster (25):
- scimark_sparse_mat_mult: 4.56 ms +- 0.10 ms -> 4.27 ms +- 0.04 ms: 1.07x faster
- regex_dna: 220 ms +- 1 ms -> 210 ms +- 1 ms: 1.05x faster
- spectral_norm: 99.0 ms +- 0.9 ms -> 94.8 ms +- 2.8 ms: 1.04x faster
- pyflate: 456 ms +- 5 ms -> 438 ms +- 3 ms: 1.04x faster
- scimark_lu: 116 ms +- 5 ms -> 112 ms +- 3 ms: 1.04x faster
- scimark_fft: 326 ms +- 2 ms -> 318 ms +- 2 ms: 1.03x faster
- nbody: 94.2 ms +- 4.0 ms -> 91.7 ms +- 0.7 ms: 1.03x faster
- scimark_monte_carlo: 70.9 ms +- 1.1 ms -> 69.1 ms +- 1.5 ms: 1.03x faster
- nqueens: 85.7 ms +- 0.8 ms -> 83.7 ms +- 1.0 ms: 1.02x faster
- logging_simple: 5.93 us +- 0.07 us -> 5.80 us +- 0.06 us: 1.02x faster
- mako: 11.5 ms +- 0.1 ms -> 11.3 ms +- 0.1 ms: 1.02x faster
- json_dumps: 12.6 ms +- 0.2 ms -> 12.4 ms +- 0.2 ms: 1.02x faster
- pathlib: 19.2 ms +- 0.2 ms -> 18.9 ms +- 0.3 ms: 1.02x faster
- logging_format: 6.44 us +- 0.08 us -> 6.35 us +- 0.07 us: 1.01x faster
- hexiom: 6.62 ms +- 0.05 ms -> 6.54 ms +- 0.10 ms: 1.01x faster
- chaos: 71.0 ms +- 0.6 ms -> 70.3 ms +- 0.5 ms: 1.01x faster
- json_loads: 25.3 us +- 0.8 us -> 25.0 us +- 0.3 us: 1.01x faster
- xml_etree_iterparse: 107 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x faster
- unpickle: 14.0 us +- 0.2 us -> 13.9 us +- 0.2 us: 1.01x faster
- pickle_dict: 27.4 us +- 0.1 us -> 27.1 us +- 0.2 us: 1.01x faster
- go: 145 ms +- 1 ms -> 144 ms +- 1 ms: 1.01x faster
- raytrace: 307 ms +- 5 ms -> 305 ms +- 2 ms: 1.01x faster
- regex_effbot: 3.21 ms +- 0.03 ms -> 3.18 ms +- 0.04 ms: 1.01x faster
- dulwich_log: 64.9 ms +- 0.4 ms -> 64.5 ms +- 0.4 ms: 1.01x faster
- pidigits: 192 ms +- 0 ms -> 192 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (18): 2to3, logging_silent, meteor_contest, pickle, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_process

Geometric mean: 1.01x faster

@gvanrossum (Member)

Seems you have a merge conflict. Also, Mark should probably re-review this?

@markshannon (Member) left a comment

Not a fan of the big macro; otherwise it looks good.

```c
@@ -857,6 +857,43 @@ static const binaryfunc binary_ops[] = {
    [NB_INPLACE_XOR] = PyNumber_InPlaceXor,
};

// NOTE: INPLACE must be a compile-time constant to avoid runtime branching!
#define BINARY_OP_FAST_FLOAT(OP, INPLACE) \
```

I'm not a fan of giant macros, and I don't like the hidden compile-time control flow.
Given that the inplace and non-inplace specialized forms differ in a fundamental way, I'd like that to be explicit in the instructions themselves.

Maybe have two different macros, one for the inplace form and one for the non-inplace form.
You should be able to factor out some of the common code into helper macros.
That might be clearer. Up to you.

Reply from the PR author:

Makes sense. I'll go ahead and split it up.

@iritkatriel (Member) left a comment

This has merge conflicts now.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

Labels: awaiting changes, performance (Performance or resource usage)

8 participants