Delete !FEATURE_IMPLICIT_TLS #14398
Perf results:
Linux and Windows arm64 are using the regular C/C++ thread local statics. This change unifies the remaining Windows architectures to be on the same plan.
There are a lot of changes in this PR, and it is impressive to see that most of them are actually deleting code, which is a very good thing. The fact that thread static access is better is also awesome. We need fast thread-local access since lots of perf optimizations (e.g. caches) really need it. The one thing I noticed is that the 'fast' object allocators were removed as well, presumably falling back to some portable version (I did not confirm that). This is arguably an independent change (you could imagine keeping these fast ones and still removing the generated TLS accessors). The 1% drop in allocation performance that your microbenchmark shows is probably due to this (although it is small enough that it may be noise). Still, this is the only obvious downside of the change, and one could imagine putting the 'fast' allocators back to fix it. What are the tradeoffs there? (How hard would it be to leave the optimized allocators, at least for now?) We can swallow the 1%, but I really would like to know what we got for that. Otherwise I am OK with the change.
I have not removed the optimized assembly allocation helpers on x86 and x64. The 1% loss comes from switching to a different kind of thread-local storage that requires extra indirections to access. Before this change, the instruction sequence to fetch the thread-local allocation pointer was:
Note that this instruction sequence only worked for a small gThreadTLSIndex. We needed complex fallback paths for the case where gThreadTLSIndex points to the TlsSlots overflow area, which kicked in when components loaded before CoreCLR had already allocated the fast TLS slots. After this change, the minimum instruction sequence to fetch the thread-local allocation pointer is:
The extra indirections are where we are losing the ~1%. However, note that this instruction sequence works all the time. There is no need for complex fallback paths. It also allows the C/C++ compiler to inline the thread-local access into statically compiled code, which is where we are getting the 20% gain for thread statics. The actual instruction sequence implemented is two instructions longer (look for
I have removed the assembly allocation helpers on ARM because the C++ code generated for them looks almost the same. Also, the peak allocation throughput on a typical ARM32 device is limited by memory access (small caches, slow bus), not by the instructions in the allocation helper.
We should be able to get rid of one of the indirections by changing the thread local storage from
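To illustrate the kind of C/C++ thread-local access discussed above, here is a minimal sketch of a bump-pointer allocation fast path built on a `thread_local` variable. It is not the instruction sequence or the helper from this PR: the names `AllocContext`, `t_alloc_context`, `try_alloc_fast`, and `set_thread_buffer` are hypothetical and chosen only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-thread allocation context (illustrative, not CoreCLR's type).
struct AllocContext {
    std::uint8_t* alloc_ptr = nullptr;    // next free byte in this thread's buffer
    std::uint8_t* alloc_limit = nullptr;  // one past the end of the buffer
};

// Implicit TLS: the compiler emits the TLS access sequence itself and can
// inline it into any statically compiled caller.
thread_local AllocContext t_alloc_context;

// Hands the calling thread a buffer to allocate from.
inline void set_thread_buffer(std::uint8_t* begin, std::size_t size) {
    t_alloc_context.alloc_ptr = begin;
    t_alloc_context.alloc_limit = begin + size;
}

// Bump-pointer fast path. Returns nullptr when the buffer is exhausted, so
// the caller can fall back to a slower path.
inline void* try_alloc_fast(std::size_t size) {
    AllocContext& ctx = t_alloc_context;  // single thread-local lookup
    std::size_t remaining =
        static_cast<std::size_t>(ctx.alloc_limit - ctx.alloc_ptr);
    if (size > remaining) {
        return nullptr;
    }
    void* result = ctx.alloc_ptr;
    ctx.alloc_ptr += size;
    return result;
}
```

Because the variable is an ordinary `thread_local`, there is no slot index to reserve at startup and therefore no overflow case to handle. The trade-off is that the generated access sequence is whatever the platform's implicit TLS model requires.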
That's awesome.
@jkotas Thanks for the explanation. Just to confirm I understand things: _tls_index represents a small integer ID given to the DLL when the DLL is loaded, and it is not known at compile time. Thus the actual instructions emitted by the compiler have to include that additional fetch of _tls_index, like this.
Is that correct? I certainly like the simplicity and standardization of using the C++ (implicit) mechanism for thread-local variables. I also see that the compiler can do some interesting optimizations (certainly inlining, and then CSE...), and we get real benefit: managed threadStatic and monitors get faster (threadStatic significantly). I also like the idea of getting rid of the last indirection by making the Thread class itself a thread-local variable. That should reduce the 1% further. It does look like we will pay a small price in our allocation helpers. I think that is acceptable.
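The "last indirection" idea can be sketched roughly as follows, assuming a stand-in type. `RuntimeThread`, `t_current_thread`, and `t_thread_inline` are hypothetical names, not the runtime's actual declarations; the sketch only contrasts the two layouts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime's per-thread object.
struct RuntimeThread {
    std::uint8_t* alloc_ptr = nullptr;
};

// Variant A: TLS holds a pointer to a separately allocated object. Reading
// alloc_ptr needs the TLS lookup plus one more load through the pointer.
thread_local RuntimeThread* t_current_thread = nullptr;

// Variant B: the object itself lives in TLS. alloc_ptr sits at a fixed
// offset inside the thread's TLS block, so the extra load goes away.
thread_local RuntimeThread t_thread_inline;

inline std::uint8_t* alloc_ptr_via_pointer() {
    return t_current_thread->alloc_ptr;  // TLS lookup, then dereference
}

inline std::uint8_t* alloc_ptr_inline() {
    return t_thread_inline.alloc_ptr;    // TLS lookup only
}
```

Variant B ties the object's lifetime and address to the thread's TLS block. Whether that is acceptable depends on how the rest of the runtime refers to the object, which this sketch does not model.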
Yes, that's correct. Thanks for the review!