bpo-46372: Try to mutate the LHS during fast float ops #30594


Closed · 18 commits

Conversation


@brandtbucher commented Jan 14, 2022

This modifies the existing int and float specializations to mutate the LHS operand in-place when possible. It also unifies the implementations of all of these operations.

https://bugs.python.org/issue46372

@sweeneyde (Member)

I think _PyLong_Add and the like could potentially be deleted as well.

@markshannon (Member) left a comment

A few general comments:

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow?
Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

It would be worth adding stats to give us some idea what fraction of operations can use this optimization or something like it. In other words, what fraction of operations have the lhs with a ref-count of 1 (or 2 when assigning to same variable)? That would be generally useful information beyond this particular optimization.

As a general rule, if something can be done at compile time, it should be. If not, then it should be done at specialization time. Only do stuff at runtime if we can't avoid it.
We can determine at compile time whether a binary operator is of the form s += x or s = s + x (where s is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the s = s + x and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise, the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher (Member, Author)

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

Yeah, that makes sense. That way they can also be updated as the implementation changes.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow? Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

Looks like your intuition is correct. Here is a comparison of main, int-only, float-only, and both:

| Benchmark | main-0 | inplace-ops-int-0 | inplace-ops-float-0 | inplace-ops-0 |
|---|---|---|---|---|
| 2to3 | 264 ms | not significant | not significant | 266 ms: 1.01x slower |
| chameleon | 7.64 ms | 7.40 ms: 1.03x faster | 7.43 ms: 1.03x faster | 7.45 ms: 1.03x faster |
| chaos | 72.9 ms | 72.3 ms: 1.01x faster | 74.2 ms: 1.02x slower | not significant |
| crypto_pyaes | 84.9 ms | 85.5 ms: 1.01x slower | 85.6 ms: 1.01x slower | 84.2 ms: 1.01x faster |
| deltablue | 4.21 ms | not significant | not significant | 4.24 ms: 1.01x slower |
| django_template | 35.7 ms | 36.2 ms: 1.02x slower | not significant | not significant |
| dulwich_log | 65.3 ms | 65.7 ms: 1.01x slower | 65.0 ms: 1.00x faster | 65.8 ms: 1.01x slower |
| fannkuch | 404 ms | 385 ms: 1.05x faster | 395 ms: 1.02x faster | 394 ms: 1.02x faster |
| float | 78.0 ms | 77.4 ms: 1.01x faster | 76.8 ms: 1.02x faster | not significant |
| go | 149 ms | 151 ms: 1.01x slower | 150 ms: 1.01x slower | not significant |
| hexiom | 6.65 ms | 6.58 ms: 1.01x faster | 6.71 ms: 1.01x slower | 6.76 ms: 1.02x slower |
| json_dumps | 12.7 ms | not significant | 12.9 ms: 1.02x slower | 12.8 ms: 1.01x slower |
| json_loads | 25.4 us | 25.6 us: 1.01x slower | 25.2 us: 1.01x faster | 25.5 us: 1.00x slower |
| logging_format | 6.51 us | 6.36 us: 1.02x faster | 6.42 us: 1.02x faster | 6.40 us: 1.02x faster |
| logging_silent | 111 ns | 108 ns: 1.03x faster | 110 ns: 1.01x faster | 107 ns: 1.04x faster |
| logging_simple | 5.90 us | 5.87 us: 1.00x faster | 5.86 us: 1.01x faster | not significant |
| mako | 11.4 ms | 11.4 ms: 1.00x faster | 11.5 ms: 1.01x slower | not significant |
| meteor_contest | 105 ms | 103 ms: 1.01x faster | 103 ms: 1.01x faster | 103 ms: 1.02x faster |
| nbody | 96.5 ms | 98.0 ms: 1.02x slower | 95.1 ms: 1.01x faster | not significant |
| nqueens | 83.9 ms | not significant | 85.2 ms: 1.02x slower | 85.5 ms: 1.02x slower |
| pathlib | 19.4 ms | not significant | not significant | 19.2 ms: 1.01x faster |
| pickle | 9.86 us | 9.71 us: 1.02x faster | 9.70 us: 1.02x faster | 10.1 us: 1.02x slower |
| pickle_dict | 27.2 us | 28.6 us: 1.05x slower | not significant | 28.2 us: 1.04x slower |
| pickle_list | 4.55 us | 4.61 us: 1.01x slower | 4.61 us: 1.01x slower | 4.41 us: 1.03x faster |
| pickle_pure_python | 335 us | 331 us: 1.01x faster | 333 us: 1.01x faster | 332 us: 1.01x faster |
| pidigits | 194 ms | 195 ms: 1.00x slower | 197 ms: 1.02x slower | 189 ms: 1.02x faster |
| pyflate | 447 ms | 442 ms: 1.01x faster | 441 ms: 1.01x faster | 441 ms: 1.01x faster |
| python_startup | 8.16 ms | 8.23 ms: 1.01x slower | 8.17 ms: 1.00x slower | 8.20 ms: 1.00x slower |
| python_startup_no_site | 5.79 ms | 5.83 ms: 1.01x slower | 5.80 ms: 1.00x slower | 5.82 ms: 1.00x slower |
| raytrace | 315 ms | 310 ms: 1.02x faster | 310 ms: 1.02x faster | 309 ms: 1.02x faster |
| regex_compile | 139 ms | 136 ms: 1.02x faster | 138 ms: 1.01x faster | 138 ms: 1.01x faster |
| regex_dna | 216 ms | 221 ms: 1.03x slower | not significant | 216 ms: 1.00x slower |
| regex_effbot | 3.26 ms | not significant | 3.33 ms: 1.02x slower | not significant |
| regex_v8 | 24.1 ms | not significant | 23.5 ms: 1.03x faster | 24.0 ms: 1.01x faster |
| richards | 53.1 ms | 51.6 ms: 1.03x faster | 51.4 ms: 1.03x faster | 51.6 ms: 1.03x faster |
| scimark_fft | 326 ms | 341 ms: 1.04x slower | 314 ms: 1.04x faster | 333 ms: 1.02x slower |
| scimark_lu | 112 ms | not significant | 110 ms: 1.01x faster | 109 ms: 1.03x faster |
| scimark_monte_carlo | 72.1 ms | 69.5 ms: 1.04x faster | 71.3 ms: 1.01x faster | 68.6 ms: 1.05x faster |
| scimark_sor | 125 ms | 121 ms: 1.04x faster | 124 ms: 1.01x faster | 121 ms: 1.04x faster |
| scimark_sparse_mat_mult | 4.65 ms | 4.76 ms: 1.02x slower | 4.41 ms: 1.05x faster | 4.30 ms: 1.08x faster |
| spectral_norm | 103 ms | 93.1 ms: 1.10x faster | 93.1 ms: 1.10x faster | 91.7 ms: 1.12x faster |
| sympy_expand | 500 ms | not significant | not significant | 502 ms: 1.01x slower |
| sympy_integrate | 21.3 ms | 21.3 ms: 1.00x faster | 21.4 ms: 1.00x slower | 21.4 ms: 1.00x slower |
| sympy_sum | 168 ms | not significant | 168 ms: 1.00x faster | 169 ms: 1.01x slower |
| sympy_str | 302 ms | 299 ms: 1.01x faster | 300 ms: 1.01x faster | not significant |
| telco | 6.52 ms | not significant | 6.63 ms: 1.02x slower | 6.35 ms: 1.03x faster |
| unpack_sequence | 44.4 ns | 45.4 ns: 1.02x slower | not significant | 45.9 ns: 1.03x slower |
| unpickle | 14.2 us | 14.4 us: 1.01x slower | 14.0 us: 1.02x faster | not significant |
| unpickle_list | 5.08 us | 5.16 us: 1.02x slower | 5.03 us: 1.01x faster | 5.12 us: 1.01x slower |
| unpickle_pure_python | 251 us | 247 us: 1.02x faster | 253 us: 1.01x slower | 247 us: 1.01x faster |
| xml_etree_parse | 155 ms | not significant | 157 ms: 1.01x slower | 157 ms: 1.01x slower |
| xml_etree_iterparse | 107 ms | not significant | not significant | 108 ms: 1.01x slower |
| xml_etree_generate | 81.5 ms | 80.6 ms: 1.01x faster | 80.7 ms: 1.01x faster | 80.9 ms: 1.01x faster |
| xml_etree_process | 57.9 ms | 57.7 ms: 1.00x faster | not significant | 58.4 ms: 1.01x slower |
| Geometric mean | (ref) | 1.00x faster | 1.01x faster | 1.01x faster |

As a general rule, if something can be done at compile time, it should be. If not, then it should be done at specialization time. Only do stuff at runtime if we can't avoid it. We can determine at compile time whether a binary operator is of the form s += x or s = s + x (where s is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the s = s + x and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I've updated the PR to use the adaptive entry for this purpose. I understand the benefits of a compile-time check, but doing the check at specialization time seems a lot simpler in this particular case.

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise, the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.

Shall I drop the int changes, then?

@markshannon (Member)

Looks good.
Do you have up to date performance numbers?


brandtbucher commented Jan 20, 2022

Do you have time for a quick Teams call @markshannon? Might be easier to chat in person about the stats I gathered.

(If not, maybe tomorrow or Monday.)


markshannon commented Jan 21, 2022

Somewhat surprising benchmarking results.
An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

@sweeneyde (Member)

Somewhat surprising benchmarking results. An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

pidigits slowdown should be somewhat expected, since it's mostly huge-integer operations, which are specialized and handled by _PyLong_Add and such before this PR but not after.

@brandtbucher changed the title from "bpo-46372: Try to mutate the LHS during fast int/float ops" to "bpo-46372: Try to mutate the LHS during fast float ops" on Jan 24, 2022
@brandtbucher (Member, Author)

Hm, looks like removing the refcount == 2 optimization from the float ops wiped out most of the performance gains.

I’ll go ahead and try the less-branchy version that we discussed this morning (an adaptive superinstruction based on the existing in-place string addition op).

@markshannon (Member)

LGTM. Feel free to merge once the merge conflicts are fixed.

@gvanrossum (Member) left a comment

Nice and simple!

Python/ceval.c (outdated):

```c
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
    PyFloat_AS_DOUBLE(lhs) = d; \
```

Presumably (see PEP 674) Victor won't like this use of a macro as a target. :-)

Nevertheless, this whole macro is so nice and straightforward compared to ints!

Python/ceval.c (outdated):

```c
double d = l OP r; \
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
```

Would there be any benefit in checking whether perhaps rhs has a refcount of 1, in case lhs doesn't? While people are probably more likely to write return x+1 rather than return 1+x (though even I could sometimes see a reason to do that, from "rhyming" with some other operations or just how the textbook presents the formula), writing return 2*x is more likely than return x*2. And of course for asymmetric operations you may not have a choice, like return 1/x.

Of course, there's a cost (more code, more branches). There's also the fact that += and friends steer this in the direction you're taking here.

Reply from the PR author:

Of course, there's a cost (more code, more branches). There's also the fact that += and friends steer this in the direction you're taking here.

Yeah, those are my main motivations for sticking with the LHS here.

@brandtbucher (Member, Author)

Okay, I just pushed variants of float += float and float -= float that optimize the case where lhs has two references: one on the stack, and one named local reference that is about to be overwritten by the next instruction. It costs one extra (fast, predictable) branch, but lets us both mutate lhs in-place and skip the following store entirely!

This gets us back up to a 1% improvement:

Slower (15):
- crypto_pyaes: 83.5 ms +- 0.9 ms -> 85.1 ms +- 1.2 ms: 1.02x slower
- sqlite_synth: 2.71 us +- 0.04 us -> 2.74 us +- 0.07 us: 1.01x slower
- django_template: 34.8 ms +- 0.4 ms -> 35.2 ms +- 0.5 ms: 1.01x slower
- richards: 49.5 ms +- 0.7 ms -> 50.1 ms +- 0.8 ms: 1.01x slower
- pickle_list: 4.51 us +- 0.06 us -> 4.56 us +- 0.05 us: 1.01x slower
- scimark_sor: 122 ms +- 1 ms -> 123 ms +- 2 ms: 1.01x slower
- fannkuch: 385 ms +- 2 ms -> 389 ms +- 5 ms: 1.01x slower
- float: 76.1 ms +- 0.8 ms -> 76.8 ms +- 0.9 ms: 1.01x slower
- deltablue: 4.07 ms +- 0.05 ms -> 4.10 ms +- 0.05 ms: 1.01x slower
- chameleon: 7.34 ms +- 0.09 ms -> 7.39 ms +- 0.12 ms: 1.01x slower
- pickle_pure_python: 326 us +- 2 us -> 328 us +- 3 us: 1.00x slower
- regex_compile: 135 ms +- 1 ms -> 135 ms +- 1 ms: 1.00x slower
- xml_etree_generate: 79.0 ms +- 0.6 ms -> 79.2 ms +- 0.9 ms: 1.00x slower
- python_startup_no_site: 5.87 ms +- 0.00 ms -> 5.87 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.23 ms +- 0.01 ms -> 8.24 ms +- 0.01 ms: 1.00x slower

Faster (25):
- scimark_sparse_mat_mult: 4.56 ms +- 0.10 ms -> 4.27 ms +- 0.04 ms: 1.07x faster
- regex_dna: 220 ms +- 1 ms -> 210 ms +- 1 ms: 1.05x faster
- spectral_norm: 99.0 ms +- 0.9 ms -> 94.8 ms +- 2.8 ms: 1.04x faster
- pyflate: 456 ms +- 5 ms -> 438 ms +- 3 ms: 1.04x faster
- scimark_lu: 116 ms +- 5 ms -> 112 ms +- 3 ms: 1.04x faster
- scimark_fft: 326 ms +- 2 ms -> 318 ms +- 2 ms: 1.03x faster
- nbody: 94.2 ms +- 4.0 ms -> 91.7 ms +- 0.7 ms: 1.03x faster
- scimark_monte_carlo: 70.9 ms +- 1.1 ms -> 69.1 ms +- 1.5 ms: 1.03x faster
- nqueens: 85.7 ms +- 0.8 ms -> 83.7 ms +- 1.0 ms: 1.02x faster
- logging_simple: 5.93 us +- 0.07 us -> 5.80 us +- 0.06 us: 1.02x faster
- mako: 11.5 ms +- 0.1 ms -> 11.3 ms +- 0.1 ms: 1.02x faster
- json_dumps: 12.6 ms +- 0.2 ms -> 12.4 ms +- 0.2 ms: 1.02x faster
- pathlib: 19.2 ms +- 0.2 ms -> 18.9 ms +- 0.3 ms: 1.02x faster
- logging_format: 6.44 us +- 0.08 us -> 6.35 us +- 0.07 us: 1.01x faster
- hexiom: 6.62 ms +- 0.05 ms -> 6.54 ms +- 0.10 ms: 1.01x faster
- chaos: 71.0 ms +- 0.6 ms -> 70.3 ms +- 0.5 ms: 1.01x faster
- json_loads: 25.3 us +- 0.8 us -> 25.0 us +- 0.3 us: 1.01x faster
- xml_etree_iterparse: 107 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x faster
- unpickle: 14.0 us +- 0.2 us -> 13.9 us +- 0.2 us: 1.01x faster
- pickle_dict: 27.4 us +- 0.1 us -> 27.1 us +- 0.2 us: 1.01x faster
- go: 145 ms +- 1 ms -> 144 ms +- 1 ms: 1.01x faster
- raytrace: 307 ms +- 5 ms -> 305 ms +- 2 ms: 1.01x faster
- regex_effbot: 3.21 ms +- 0.03 ms -> 3.18 ms +- 0.04 ms: 1.01x faster
- dulwich_log: 64.9 ms +- 0.4 ms -> 64.5 ms +- 0.4 ms: 1.01x faster
- pidigits: 192 ms +- 0 ms -> 192 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (18): 2to3, logging_silent, meteor_contest, pickle, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_process

Geometric mean: 1.01x faster

@gvanrossum (Member)

Seems you have a merge conflict. Also, Mark should probably re-review this?

@markshannon (Member) left a comment

Not a fan of the big macro; otherwise it looks good.

```c
@@ -857,6 +857,43 @@ static const binaryfunc binary_ops[] = {
    [NB_INPLACE_XOR] = PyNumber_InPlaceXor,
};

// NOTE: INPLACE must be a compile-time constant to avoid runtime branching!
#define BINARY_OP_FAST_FLOAT(OP, INPLACE) \
```

I'm not a fan of giant macros, and I don't like the hidden compile-time control flow.
Given that the inplace and non-inplace specialized forms differ in a fundamental way, I'd like that to be explicit in the instructions themselves.

Maybe have two different macros, one for the inplace form and one for the non-inplace form.
You should be able to factor out some of the common code into helper macros.
That might be clearer. Up to you.

Reply from the PR author:

Makes sense. I'll go ahead and split it up.

@iritkatriel (Member) left a comment

This has merge conflicts now.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

Labels: awaiting changes, performance (Performance or resource usage)

8 participants