
Delete !FEATURE_IMPLICIT_TLS #14398

Merged
merged 1 commit into from
Oct 11, 2017

Conversation


@jkotas jkotas commented Oct 9, 2017

Linux and Windows arm64 are using the regular C/C++ thread local statics. This change unifies the remaining Windows architectures to be on the same plan.
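For context, "regular C/C++ thread local statics" means compiler-supported thread-local variables, roughly like the following minimal sketch (illustrative names, not code from this PR):

    // C++11 spelling; MSVC also accepts the equivalent __declspec(thread).
    thread_local int t_counter = 0;

    int BumpCounter()
    {
        // Each thread sees its own copy of t_counter. The compiler and linker
        // emit the TLS access sequence inline; no TlsAlloc/TlsGetValue slot
        // managed by the runtime is involved.
        return ++t_counter;
    }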


jkotas commented Oct 9, 2017

Perf results:

  • Thread static access: 20% faster
    • [ThreadStatic] static int t_x; [MethodImpl(MethodImplOptions.NoInlining)] static int foo() { return t_x; } called in a loop (a C++ analogue is sketched below)
  • Monitor.TryEnter w/ contention: 2% faster
    • Monitor.TryEnter called in a loop with a lock taken on a different thread
  • Raw allocation throughput: 1% slower
    • GC.KeepAlive(new object()) called in a loop
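For the first bullet, a rough C++ analogue of the described C# microbenchmark might look like this (a sketch only; the iteration count and names are made up, and the quoted numbers were measured on the C# version):

    #include <chrono>
    #include <cstdio>

    thread_local int t_x;

    // Counterpart of the C# accessor marked MethodImplOptions.NoInlining:
    // keep the call so the loop measures call overhead plus the TLS access.
    __declspec(noinline) int foo() { return t_x; }

    int main()
    {
        const long long kIters = 100000000; // arbitrary iteration count
        long long sum = 0;
        auto start = std::chrono::steady_clock::now();
        for (long long i = 0; i < kIters; i++)
            sum += foo();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%lld iterations in %lld ns (sum=%lld)\n",
                    kIters, static_cast<long long>(ns), sum);
        return 0;
    }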

@jkotas jkotas requested review from vancem and kouvel October 9, 2017 22:49
@jkotas jkotas force-pushed the tls branch 2 times, most recently from 6d94a7c to 281aa40 on October 10, 2017 03:31
@vancem

vancem commented Oct 11, 2017

There are a lot of changes in this PR, and it is impressive to see that most of them actually delete code, which is a very good thing. The fact that thread static access got faster is also awesome. We need fast thread-local access, since lots of perf optimizations (e.g. caches) really depend on it.

The one thing I noticed is that the 'fast' object allocators were removed as well, presumably falling back to some portable version (I did not confirm that). This is arguably an independent change (you could imagine keeping these fast ones and still removing the generated TLS accessors).

The 1% drop in allocation performance in your microbenchmark is probably due to this (although it is small enough that it may be noise). Still, this is the only obvious 'downside' of the change, and one could imagine putting the 'fast' allocators back to fix it. What are the tradeoffs there? How hard would it be to keep the optimized allocators, at least for now? We can swallow the 1%, but I really would like to know what we got for it.

Otherwise I am OK with the change



jkotas commented Oct 11, 2017

I have not removed the optimized assembly allocation helpers on x86 and x64. The 1% loss is coming from switching to a different kind of thread-local storage that requires extra indirections to access.

Before this change, the instruction sequence to fetch the thread-local allocation pointer was:

    mov  rax, gs:[offsetof(TEB, TlsSlots) + gThreadTLSIndex * sizeof(void*)] // pThread
    mov  rcx, [rax + offsetof(Thread, m_alloc_context.alloc_ptr)]

Note that this instruction sequence only worked for a small gThreadTLSIndex. We needed complex fallback paths for the case where gThreadTLSIndex points into the TlsSlots overflow area; they kicked in when components loaded before CoreCLR had allocated the fast TLS slots.

After this change, the minimum instruction sequence to fetch the thread-local allocation pointer is:

    mov   rax, gs:[offsetof(TEB, ThreadLocalStoragePointer)]
    mov   rax, [rax + _tls_index * sizeof(void*)]
    mov   rax, [rax + SECTIONREL gCurrentThreadInfo] // pThread
    mov   rcx, [rax + offsetof(Thread, m_alloc_context.alloc_ptr)]

The extra indirections are where we are losing the ~1%. However, note that this instruction sequence works all the time; there is no need for complex fallback paths. It also allows the C/C++ compiler to inline the thread-local access into statically compiled code, which is where we are getting the 20% gain for thread statics.

The actual instruction sequence implemented is two instructions longer (look for INLINE_GETTHREAD), because _tls_index is a variable and SECTIONREL gCurrentThreadInfo cannot be used as an offset directly (a masm limitation). I tried to get rid of these two instructions by patching the code at runtime, but it did not help; I believe that is because the number of dependent indirections on the critical path has not changed. Since patching statically compiled code at runtime is something we should be working to get rid of (RWX pages), I have left the runtime patching out.

I have removed the assembly allocation helpers on ARM because the C++ code generated for them looks almost the same, and the peak allocation throughput on a typical ARM32 device is limited by memory access (small caches, slow bus), not by the instructions in the allocation helper.

We should be able to get rid of one of the indirections by changing the thread-local storage from __declspec(thread) Thread * g_pThread to __declspec(thread) Thread g_Thread. I would like to do that in a separate change.
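A sketch of that follow-up idea, using stand-in types (the real CoreCLR declarations differ):

    struct gc_alloc_context { void* alloc_ptr; void* alloc_limit; };
    struct Thread { gc_alloc_context m_alloc_context; /* ... */ };

    // Current scheme: the TLS variable is a pointer, so reaching a field pays
    // one extra dependent load to follow g_pThread.
    __declspec(thread) Thread* g_pThread;
    void* AllocPtrViaPointer() { return g_pThread->m_alloc_context.alloc_ptr; }

    // Proposed scheme: the Thread object itself lives in TLS, so its fields
    // sit at a fixed offset from the thread's TLS block and one load goes away.
    __declspec(thread) Thread g_Thread;
    void* AllocPtrInline() { return g_Thread.m_alloc_context.alloc_ptr; }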

@stephentoub

> Thread static access: 20% faster

That's awesome.

@vancem

vancem commented Oct 11, 2017

@jkotas Thanks for the explanation. Just so I can confirm I understand things: _tls_index represents a small integer ID given to the DLL when it is loaded, and is not known at compile time. Thus the actual instructions emitted by the compiler have to include an additional fetch of _tls_index, like this:

    mov   rax, gs:[offsetof(TEB, ThreadLocalStoragePointer)]
    mov   rcx, [_tls_index]                          // fetch the index that the loader assigned to this module
    mov   rax, [rax + rcx * sizeof(void*)]           // fetch the TLS data for this module
    mov   rax, [rax + SECTIONREL gCurrentThreadInfo] // get the specific TLS variable (pThread)
    mov   rcx, [rax + offsetof(Thread, m_alloc_context.alloc_ptr)] // fetch the allocation context field

Is that correct?

I certainly like the simplicity/standardization of using the C++ (implicit) mechanism for thread-local variables. I also see that the compiler can do some interesting optimizations (certainly inlining, and then CSE), and we get real benefit: managed thread statics and monitors get faster (thread statics significantly).
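An illustrative example of that CSE opportunity (not from the PR): once the TLS access is ordinary compiler-generated code, the base computation can be shared across accesses.

    thread_local int t_a;
    thread_local int t_b;

    int Sum()
    {
        // The compiler can fetch ThreadLocalStoragePointer/_tls_index once and
        // reuse the resulting TLS base address for both variables; an opaque
        // TlsGetValue-style call would have to be made twice.
        return t_a + t_b;
    }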

I also like the idea of getting rid of the last indirection by making the Thread class itself a thread-local variable. That should reduce the 1% further.

It does look like we will pay a small price in our allocation helpers. I think that is acceptable.


jkotas commented Oct 11, 2017

Yes, that's correct. Thanks for the review!
