
Delete !FEATURE_IMPLICIT_TLS #14398

Merged
merged 1 commit into from
Oct 11, 2017

Conversation


@jkotas jkotas commented Oct 9, 2017

Linux and Windows arm64 are using the regular C/C++ thread local statics. This change unifies the remaining Windows architectures to be on the same plan.
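For context, "regular C/C++ thread local statics" means compiler-supported thread-local variables, roughly like the following minimal sketch (illustrative names, not code from this PR):

    // C++11 spelling; MSVC also accepts the equivalent __declspec(thread).
    thread_local int t_counter = 0;

    int BumpCounter()
    {
        // Each thread sees its own copy of t_counter. The compiler and linker
        // emit the TLS access sequence inline; no TlsAlloc/TlsGetValue slot
        // managed by the runtime is involved.
        return ++t_counter;
    }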


jkotas commented Oct 9, 2017

Perf results:

  • Thread static access: 20% faster
    • [ThreadStatic] static int t_x; [MethodImpl(MethodImplOptions.NoInlining)] static int foo() { return t_x; } called in a loop (a C++ analogue is sketched below)
  • Monitor.TryEnter w/ contention: 2% faster
    • Monitor.TryEnter called in a loop with a lock taken on a different thread
  • Raw allocation throughput: 1% slower
    • GC.KeepAlive(new object()) called in a loop
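For the first bullet, a rough C++ analogue of the described C# microbenchmark might look like this (a sketch only; the iteration count and names are made up, and the quoted numbers were measured on the C# version):

    #include <chrono>
    #include <cstdio>

    thread_local int t_x;

    // Counterpart of the C# accessor marked MethodImplOptions.NoInlining:
    // keep the call so the loop measures call overhead plus the TLS access.
    __declspec(noinline) int foo() { return t_x; }

    int main()
    {
        const long long kIters = 100000000; // arbitrary iteration count
        long long sum = 0;
        auto start = std::chrono::steady_clock::now();
        for (long long i = 0; i < kIters; i++)
            sum += foo();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%lld iterations in %lld ns (sum=%lld)\n",
                    kIters, static_cast<long long>(ns), sum);
        return 0;
    }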

@jkotas jkotas requested review from vancem and kouvel October 9, 2017 22:49
@jkotas jkotas force-pushed the tls branch 2 times, most recently from 6d94a7c to 281aa40 on October 10, 2017 03:31
@vancem

vancem commented Oct 11, 2017

There are a lot of changes in this PR, and it is impressive to see that most of them actually delete code, which is a very good thing. The fact that thread static access got faster is also awesome. We need fast thread-local access, since lots of perf optimizations (e.g. caches) really depend on it.

The one thing I noticed is that the 'fast' object allocators were removed as well, presumably falling back to some portable version (I did not confirm that). This is arguably an independent change (you could imagine keeping these fast ones and still removing the generated TLS accessors).

The 1% drop in allocation performance in your microbenchmark is probably due to this (although it is small enough that it may be noise). Still, this is the only obvious 'downside' of the change, and one could imagine putting the 'fast' allocators back to fix it. What are the tradeoffs there? How hard would it be to keep the optimized allocators, at least for now? We can swallow the 1%, but I really would like to know what we got for it.

Otherwise I am OK with the change



jkotas commented Oct 11, 2017

I have not removed the optimized assembly allocation helpers on x86 and x64. The 1% loss is coming from switching to a different kind of thread-local storage that requires extra indirections to access.

Before this change, the instruction sequence to fetch the thread-local allocation pointer was:

    mov  rax, gs:[offsetof(TEB, TlsSlots) + gThreadTLSIndex * sizeof(void*)] // pThread
    mov  rcx, [rax + offsetof(Thread, m_alloc_context.alloc_ptr)]

Note that this instruction sequence only worked for a small gThreadTLSIndex. We needed complex fallback paths for the case where gThreadTLSIndex points into the TlsSlots overflow area; they kicked in when components loaded before CoreCLR had allocated the fast TLS slots.

After this change, the minimum instruction sequence to fetch the thread-local allocation pointer is:

    mov   rax, gs:[offsetof(TEB, ThreadLocalStoragePointer)]
    mov   rax, [rax + _tls_index * sizeof(void*)]
    mov   rax, [rax + SECTIONREL gCurrentThreadInfo] // pThread
    mov   rcx, [rax + offsetof(Thread, m_alloc_context.alloc_ptr)]

The extra indirections are where we are losing the ~1%. However, note that this instruction sequence works all the time; there is no need for complex fallback paths. It also allows the C/C++ compiler to inline the thread-local access into statically compiled code, which is where we are getting the 20% gain for thread statics.

The actual instruction sequence implemented is two instructions longer (look for INLINE_GETTHREAD), because _tls_index is a variable and SECTIONREL gCurrentThreadInfo cannot be used as an offset directly (a masm limitation). I tried to get rid of these two instructions by patching the code at runtime, but it did not help; I believe that is because the number of dependent indirections on the critical path has not changed. Since patching statically compiled code at runtime is something we should be working to get rid of (RWX pages), I have left the runtime patching out.

I have removed the assembly allocation helpers on ARM because the C++ code generated for them looks almost the same, and the peak allocation throughput on a typical ARM32 device is limited by memory access (small caches, slow bus), not by the instructions in the allocation helper.

We should be able to get rid of one of the indirections by changing the thread-local storage from __declspec(thread) Thread * g_pThread to __declspec(thread) Thread g_Thread. I would like to do that in a separate change.
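A sketch of that follow-up idea, using stand-in types (the real CoreCLR declarations differ):

    struct gc_alloc_context { void* alloc_ptr; void* alloc_limit; };
    struct Thread { gc_alloc_context m_alloc_context; /* ... */ };

    // Current scheme: the TLS variable is a pointer, so reaching a field pays
    // one extra dependent load to follow g_pThread.
    __declspec(thread) Thread* g_pThread;
    void* AllocPtrViaPointer() { return g_pThread->m_alloc_context.alloc_ptr; }

    // Proposed scheme: the Thread object itself lives in TLS, so its fields
    // sit at a fixed offset from the thread's TLS block and one load goes away.
    __declspec(thread) Thread g_Thread;
    void* AllocPtrInline() { return g_Thread.m_alloc_context.alloc_ptr; }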

@stephentoub

> Thread static access: 20% faster

That's awesome.

@vancem

vancem commented Oct 11, 2017

@jkotas Thanks for the explanation. Just so I can confirm I understand things: _tls_index represents a small integer ID given to the DLL when it is loaded, and is not known at compile time. Thus the actual instructions emitted by the compiler have to include an additional fetch of _tls_index, like this:

    mov   rax, gs:[offsetof(TEB, ThreadLocalStoragePointer)]
    mov   rcx, [_tls_index]                          // fetch the index that the loader assigned to this module
    mov   rax, [rax + rcx * sizeof(void*)]           // fetch the TLS data for this module
    mov   rax, [rax + SECTIONREL gCurrentThreadInfo] // get the specific TLS variable (pThread)
    mov   rcx, [rax + offsetof(Thread, m_alloc_context.alloc_ptr)] // fetch the allocation context field

Is that correct?

I certainly like the simplicity/standardization of using the C++ (implicit) mechanism for thread-local variables. I also see that the compiler can do some interesting optimizations (certainly inlining, and then CSE), and we get real benefit: managed thread statics and monitors get faster (thread statics significantly).
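An illustrative example of that CSE opportunity (not from the PR): once the TLS access is ordinary compiler-generated code, the base computation can be shared across accesses.

    thread_local int t_a;
    thread_local int t_b;

    int Sum()
    {
        // The compiler can fetch ThreadLocalStoragePointer/_tls_index once and
        // reuse the resulting TLS base address for both variables; an opaque
        // TlsGetValue-style call would have to be made twice.
        return t_a + t_b;
    }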

I also like the idea of getting rid of the last indirection by making the Thread class itself a thread-local variable. That should reduce the 1% further.

It does look like we will pay a small price in our allocation helpers. I think that is acceptable.


jkotas commented Oct 11, 2017

Yes, that's correct. Thanks for the review!
