This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Improve Monitor scaling #14216

Merged
merged 10 commits into from
Nov 27, 2017

Conversation


@kouvel kouvel commented Sep 27, 2017

Fixes https://github.com/dotnet/coreclr/issues/13978

  • Refactored AwareLock::m_MonitorHeld into a class LockState with operations to mutate the state (a sketch of the state layout follows this list)
  • Allowed the lock to be taken by a non-waiter even when there is a waiter, to prevent creating lock convoys
  • Added a bit to LockState to indicate that a waiter is signaled to wake, to avoid waking more than one waiter at a time. A waiter that wakes by observing the signal unsets this bit. See AwareLock::EnterEpilogHelper().
  • Added a spinner count to LockState. Spinners now register and unregister themselves, and lock releasers don't wake a waiter when there is a registered spinner (the spinner guarantees to take the lock if it's available when unregistering itself).
    • This was necessary mostly on Windows to reduce CPU usage to the expected level in contended cases with several threads. I believe it's the priority boost Windows gives to signaled threads, which seems to cause waiters to succeed in acquiring the lock much more frequently. This causes a CPU usage problem because once the woken waiter releases the lock, on the next lock attempt it will become a spinner. This keeps repeating, converting several waiters into spinners unnecessarily. Before registering spinners, I typically saw 4-6 spinners under contention (with delays inside and outside the lock) when I expected to have only 1-2 spinners at most.
    • This costs an interlocked operation before and after the spin loop. That doesn't seem too significant since spinning is a relatively slow path anyway, and the reduction in CPU usage in turn reduces contention on the lock and lets more useful work get done.
  • Updated waiters to spin a bit before going back to waiting; the reasons are explained in AwareLock::EnterEpilogHelper()
  • Removed AwareLock::Contention() and any references (this removes the 10 repeats of the entire spin loop in that function). With the lock convoy issue gone, this appears to no longer be necessary.
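
A minimal sketch of the kind of state layout this describes (illustrative only; the names, bit positions, and widths here are assumptions, not the actual coreclr layout):

#include <atomic>
#include <cstdint>

// Illustrative LockState: the lock bit, the "a waiter has been signaled to
// wake" bit, the spinner count, and the waiter count are packed into one word
// so that all of them can be mutated by a single interlocked operation.
class LockState
{
    static const uint32_t IsLockedMask       = 0x1;
    static const uint32_t WaiterSignaledMask = 0x2;                       // at most one woken waiter at a time
    static const uint32_t SpinnerCountShift  = 2;
    static const uint32_t SpinnerCountMask   = 0x7 << SpinnerCountShift;  // small saturating count
    static const uint32_t WaiterCountShift   = 5;

    std::atomic<uint32_t> m_state{0};

public:
    // Any thread may try to take the lock even if waiters exist; this is what
    // prevents lock convoys.
    bool TryLock()
    {
        uint32_t state = m_state.load(std::memory_order_relaxed);
        while ((state & IsLockedMask) == 0)
        {
            if (m_state.compare_exchange_weak(state, state | IsLockedMask, std::memory_order_acquire))
                return true;
        }
        return false;
    }

    // A releaser wakes a waiter only if there is one, no waiter is already
    // signaled, and no spinner is registered (a registered spinner guarantees
    // to take the lock if it is available when it unregisters).
    static bool ShouldWakeWaiter(uint32_t state)
    {
        return (state >> WaiterCountShift) != 0 &&
               (state & WaiterSignaledMask) == 0 &&
               (state & SpinnerCountMask) == 0;
    }
};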

Perf

  • On Windows, throughput has increased significantly starting at thread counts slightly below the proc count. On Linux, latency and throughput have increased more significantly at similar thread counts.
  • Most of the larger regressions are in the unlocked fast paths. The code there hasn't changed and is almost identical (minor layout differences); I'm just considering this noise until we figure out how to get consistently faster code generated.
  • The smaller regressions are within noise range.

Processor for the numbers below: Core i7-6700 (4-core, 8-thread)

Monitor spin tests

Spin (Windows x64)                                      Left score        Right score       ∆ Score %
------------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 02T                                665.73 ±0.46%     660.39 ±0.54%     -0.80%
MonitorEnterExitLatency 04T                               1499.47 ±0.52%    1502.81 ±0.29%      0.22%
MonitorEnterExitLatency 08T                               1731.28 ±0.10%    1743.89 ±0.14%      0.73%
MonitorEnterExitLatency 16T                               1707.25 ±0.31%    1747.36 ±0.24%      2.35%
MonitorEnterExitThroughput Delay 01T                      5138.81 ±0.09%    5120.83 ±0.10%     -0.35%
MonitorEnterExitThroughput Delay 02T                      4959.48 ±0.15%    4981.27 ±0.10%      0.44%
MonitorEnterExitThroughput Delay 04T                      4462.47 ±0.69%    4760.78 ±0.05%      6.68%
MonitorEnterExitThroughput Delay 08T                      3745.69 ±0.03%    4698.01 ±0.26%     25.42%
MonitorEnterExitThroughput Delay 16T                      3711.20 ±0.34%    4725.44 ±0.38%     27.33%
MonitorEnterExitThroughput_AwareLock 1T                  61200.72 ±0.03%   58933.89 ±0.06%     -3.70%
MonitorEnterExitThroughput_ThinLock 1T                   59430.78 ±0.05%   59396.10 ±0.03%     -0.06%
MonitorReliableEnterExitLatency 02T                        706.79 ±0.24%     705.74 ±0.41%     -0.15%
MonitorReliableEnterExitLatency 04T                       1491.37 ±0.26%    1525.77 ±0.17%      2.31%
MonitorReliableEnterExitLatency 08T                       1722.46 ±0.06%    1703.50 ±0.08%     -1.10%
MonitorReliableEnterExitLatency 16T                       1679.43 ±0.29%    1710.93 ±0.19%      1.88%
MonitorReliableEnterExitThroughput Delay 01T              5064.57 ±0.14%    5083.21 ±0.16%      0.37%
MonitorReliableEnterExitThroughput Delay 02T              4917.33 ±0.11%    4962.94 ±0.09%      0.93%
MonitorReliableEnterExitThroughput Delay 04T              4710.53 ±0.22%    4728.52 ±0.12%      0.38%
MonitorReliableEnterExitThroughput Delay 08T              3733.62 ±0.04%    4745.75 ±0.12%     27.11%
MonitorReliableEnterExitThroughput Delay 16T              3648.78 ±0.37%    4718.97 ±0.31%     29.33%
MonitorReliableEnterExitThroughput_AwareLock 1T          58397.83 ±0.06%   58527.30 ±0.11%      0.22%
MonitorReliableEnterExitThroughput_ThinLock 1T           58441.90 ±0.03%   56825.30 ±0.03%     -2.77%
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T   57540.11 ±0.05%   58440.14 ±0.05%      1.56%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T    57684.81 ±0.04%   57747.39 ±0.06%      0.11%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T        261834.12 ±0.12%  244767.50 ±0.07%     -6.52%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T         241360.92 ±0.15%  261689.44 ±0.04%      8.42%
------------------------------------------------------  ----------------  ----------------  ---------
Total                                                     7513.73 ±0.19%    7828.46 ±0.16%      4.19%
Spin (Linux x64)                                        Left score        Right score       ∆ Score %
------------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 02T                               3561.29 ±0.64%    3606.19 ±0.31%      1.26%
MonitorEnterExitLatency 04T                               3440.76 ±0.12%    3464.48 ±0.12%      0.69%
MonitorEnterExitLatency 08T                               2292.54 ±0.50%    3429.38 ±0.46%     49.59%
MonitorEnterExitLatency 16T                               2095.67 ±0.67%    3433.30 ±0.30%     63.83%
MonitorEnterExitThroughput Delay 01T                      5043.59 ±0.31%    5008.86 ±0.26%     -0.69%
MonitorEnterExitThroughput Delay 02T                      4972.94 ±0.04%    4955.36 ±0.03%     -0.35%
MonitorEnterExitThroughput Delay 04T                      4707.27 ±0.08%    4650.77 ±0.06%     -1.20%
MonitorEnterExitThroughput Delay 08T                      2637.27 ±0.20%    4601.87 ±0.22%     74.49%
MonitorEnterExitThroughput Delay 16T                      2312.48 ±0.81%    4650.57 ±0.11%    101.11%
MonitorEnterExitThroughput_AwareLock 1T                  58822.27 ±0.03%   57910.71 ±0.10%     -1.55%
MonitorEnterExitThroughput_ThinLock 1T                   57274.55 ±0.15%   56441.84 ±0.22%     -1.45%
MonitorReliableEnterExitLatency 02T                       3558.43 ±0.27%    3553.18 ±0.61%     -0.15%
MonitorReliableEnterExitLatency 04T                       2920.09 ±0.20%    3440.91 ±0.14%     17.84%
MonitorReliableEnterExitLatency 08T                       2269.82 ±0.05%    3052.08 ±6.08%     34.46%
MonitorReliableEnterExitLatency 16T                       2086.47 ±0.75%    3275.11 ±2.67%     56.97%
MonitorReliableEnterExitThroughput Delay 01T              5169.16 ±0.64%    5189.40 ±0.14%      0.39%
MonitorReliableEnterExitThroughput Delay 02T              5075.91 ±0.26%    5074.15 ±0.05%     -0.03%
MonitorReliableEnterExitThroughput Delay 04T              4602.26 ±1.49%    4767.14 ±0.05%      3.58%
MonitorReliableEnterExitThroughput Delay 08T              2773.58 ±0.07%    4744.31 ±0.28%     71.05%
MonitorReliableEnterExitThroughput Delay 16T              2438.12 ±0.66%    4776.94 ±0.13%     95.93%
MonitorReliableEnterExitThroughput_AwareLock 1T          56198.53 ±0.11%   55248.53 ±0.07%     -1.69%
MonitorReliableEnterExitThroughput_ThinLock 1T           56322.95 ±0.11%   55207.00 ±0.24%     -1.98%
MonitorTryEnterExitWhenUnlockedThroughput_AwareLock 1T   54299.42 ±0.10%   54580.95 ±0.07%      0.52%
MonitorTryEnterExitWhenUnlockedThroughput_ThinLock 1T    55292.39 ±0.20%   53787.62 ±0.26%     -2.72%
MonitorTryEnterWhenLockedThroughput_AwareLock 1T        216569.65 ±0.26%  213135.11 ±0.32%     -1.59%
MonitorTryEnterWhenLockedThroughput_ThinLock 1T         225565.05 ±0.68%  230333.25 ±0.22%      2.11%
------------------------------------------------------  ----------------  ----------------  ---------
Total                                                     8698.50 ±0.36%   10232.00 ±0.53%     17.63%

Raw numbers

Code is in https://github.com/dotnet/coreclr/issues/13978

Default spin heuristics

Tc = Thread count
WB = Windows baseline
WC = Windows changed
LB = Linux baseline
LC = Linux changed
L/ms = Locks taken per millisecond
Cpu = Number of threads worth of full CPU usage
Tc WB L/ms Cpu WC L/ms Cpu LB L/ms Cpu LC L/ms Cpu
1 5344.76 1 5262.77 1 5148.87 1 5044.62 1
1 5391.14 1 5433.34 1 5368.91 1 5262.98 1
1 5348.51 1 5296.45 1 5111.95 1 4988.24 1
2 5168.75 2 5110.51 2 4938.35 2 4882.07 2
2 5130.30 2 5130.79 2 4989.40 2 4857.45 2
2 5156.88 2 5117.37 2 4958.15 2 4932.81 2
4 4671.16 4 4885.46 4 4732.73 4 4655.51 3
4 4846.14 4 4847.92 4 4390.75 4 4597.65 3
4 9572.59 4 4839.56 4 4731.39 4 4652.74 3
8 3967.98 8 4832.56 4 2827.30 8 4621.80 3
8 4379.97 8 4901.81 4 2824.45 8 4574.13 3
8 4413.90 8 4826.43 4 2819.81 8 4642.79 3
16 3867.15 8 4826.82 4 2413.83 8 4634.02 3
16 4846.04 8 4815.79 4 2524.47 8 4604.29 3
16 5695.38 8 4853.34 4 2363.39 8 4633.61 3
32 3454.42 8 4615.44 4 1150.34 8 4665.71 3
32 6477.07 8 4508.67 4 1607.75 8 4665.56 3
32 7059.96 8 4479.78 4 1777.20 8 4659.60 3
64 3282.26 8 4611.93 4 0.70 8 4573.43 3
64 4991.26 8 4883.61 4 0.72 8 4672.35 3
64 7521.77 8 4907.73 4 0.56 8 4654.06 3
128 3149.11 8 4836.03 4 0.62 8 4621.83 3
128 8962.21 8 4872.29 4 0.61 8 4565.08 3
128 7503.35 8 4858.68 4 0.72 8 4667.60 3
256 3310.44 8 4755.44 4 82.47 8 4648.06 3
256 6567.14 8 4852.55 4 0.82 8 4626.99 3
256 5420.92 8 4884.91 4 0.79 8 4657.88 3
512 2081.56 8 4900.86 4 0.61 8 4648.02 3
512 4125.43 8 4863.88 4 0.71 8 4578.18 3
512 3685.25 8 4898.62 4 0.77 8 4601.07 3
1024 1436.46 8 4842.57 4 0.61 8 4593.89 3
1024 2900.22 8 4880.87 4 0.91 8 4627.16 3
1024 3808.18 8 4917.50 4 0.81 8 4615.44 3

Lower spin counts

set COMPlus_SpinInitialDuration=0x1
set COMPlus_SpinLimitProcFactor=0x80
set COMPlus_SpinBackoffFactor=0x2
set COMPlus_SpinRetryCount=0x0
Tc WB L/ms Cpu WC L/ms Cpu LB L/ms Cpu LC L/ms Cpu
1 5335.34 1 5264.16 1 4493.90 1 5019.77 1
1 5448.57 1 5439.20 1 4866.98 1 4872.09 1
1 5325.00 1 5297.86 1 5078.65 1 4541.78 1
2 4923.72 2 4859.70 2 4494.79 2 4565.30 2
2 4929.58 2 4834.64 2 4052.52 2 4582.16 2
2 4888.14 2 4865.00 2 3864.39 2 4572.25 2
4 4546.01 4 4357.90 3 90.96 2 3448.15 3
4 4531.11 4 4367.87 3 100.28 2 3449.68 3
4 4541.45 4 4357.42 3 125.34 2 3542.46 3
8 3752.67 8 4317.60 3 86.85 2 4700.69 3
8 3765.35 8 4367.94 3 83.59 2 4592.01 3
8 3716.64 8 4289.56 3 81.16 2 4554.52 3
16 97.13 8 4241.93 3 74.24 2 4550.29 2
16 96.85 8 4242.78 3 84.72 2 4416.01 2
16 97.21 8 4239.82 3 78.49 2 4560.80 2
32 97.11 8 4382.21 3 89.36 2 4405.91 2
32 96.83 8 4371.04 3 85.16 2 4255.65 2
32 97.18 8 4330.79 3 86.11 2 4158.98 2
64 96.86 8 4385.51 3 71.96 2 3839.43 2
64 97.08 8 4375.39 3 83.92 2 4145.21 2
64 96.84 8 4369.57 3 77.99 2 4261.37 2

Spinning disabled

set COMPlus_SpinInitialDuration=0x1
set COMPlus_SpinLimitProcFactor=0x0
set COMPlus_SpinBackoffFactor=0x2
set COMPlus_SpinRetryCount=0x0
Tc WB L/ms Cpu WC L/ms Cpu LB L/ms Cpu LC L/ms Cpu
1 5268.41 1.0 5343.09 1 5030.34 1.0 5034.13 1
1 5464.13 1.0 5348.70 1 5361.13 1.0 5028.44 1
1 5255.47 1.0 5317.01 1 5055.16 1.0 4867.83 1
2 157.00 1.5 2930.07 2 92.38 1.5 2475.49 2
2 272.23 1.5 2964.10 2 93.58 1.5 2418.42 2
2 269.83 1.5 2939.90 2 89.18 1.5 2374.65 2
4 215.94 1.5 2835.81 2 101.05 1.5 2125.89 2
4 213.10 1.5 2713.06 2 99.78 1.5 2002.67 2
4 214.34 1.5 2725.35 2 119.61 1.5 2089.86 2
8 215.03 1.5 2908.95 2 102.79 1.5 4650.46 2
8 219.86 1.5 2853.62 2 86.06 1.5 4718.31 2
8 209.58 1.5 2882.41 2 110.83 1.5 4561.14 2
16 217.18 1.5 2886.65 2 96.64 1.5 4770.49 2
16 213.08 1.5 2884.71 2 94.27 1.5 4687.61 2
16 220.32 1.5 2870.19 2 96.02 1.5 4735.26 2

@kouvel kouvel added area-System.Threading tenet-performance Performance related issue labels Sep 27, 2017
@kouvel kouvel added this to the 2.1.0 milestone Sep 27, 2017
@kouvel kouvel self-assigned this Sep 27, 2017
@kouvel
Author

kouvel commented Oct 3, 2017

@vancem, would you be able to validate the design? If the design is ok then maybe @tarekgh or @stephentoub could review the code.

@vancem

vancem commented Oct 3, 2017

I will take a look later today.

@vancem

vancem commented Oct 5, 2017

Sorry for the delay. I started to look at this on Tuesday but got preempted. I will continue today.

@vancem

vancem commented Oct 5, 2017

@dotnet-bot test Ubuntu x64 Formatting
@dotnet-bot test Ubuntu x64 Checked Build and Test

@@ -3726,7 +3726,7 @@ ClrDataAccess::GetSyncBlockData(unsigned int SBNumber, struct DacpSyncBlockData
}
#endif // FEATURE_COMINTEROP

- pSyncBlockData->MonitorHeld = pBlock->m_Monitor.m_MonitorHeld;
+ pSyncBlockData->MonitorHeld = pBlock->m_Monitor.m_lockState.VolatileLoad().GetMonitorHeldState();

I would like 'ProbablyMonitorHeld' to be a property and then define that to be m_lockState.VolatileLoad().GetMonitorHeldState(). In general I would like m_lockState to be private to AwareLock.

Author

Yeah, I should have done that, will do.

@@ -675,7 +675,7 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"),
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcFactor, W("SpinLimitProcFactor"), 0x4E20, "Hex value specifying the multiplier on NumProcs to use when calculating the maximum spin duration", EEConfig_default)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitConstant, W("SpinLimitConstant"), 0x0, "Hex value specifying the constant to add when calculating the maximum spin duration", EEConfig_default)
- RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0xA, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)
+ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinRetryCount, W("SpinRetryCount"), 0x0, "Hex value specifying the number of times the entire spin process is repeated (when applicable)", EEConfig_default)

This is now only used in UTSemReadWrite. Is that your intent? It seems like we have not benchmarked the effect of that.

Author

No, I should revert this change (it has no impact on Monitor anymore). I had filed an issue to deal with other spin loops that use the same constants, but I forgot about the impact of this change. Thanks for pointing it out.

@kouvel
Author

kouvel commented Oct 6, 2017

@dotnet-bot test CentOS7.1 x64 Release Priority 1 Build and Test

@@ -675,7 +675,7 @@ RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinBackoffFactor, W("SpinBackoffFactor"),
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_SpinLimitProcCap, W("SpinLimitProcCap"), 0xFFFFFFFF, "Hex value specifying the largest value of NumProcs to use when calculating the maximum spin duration", EEConfig_default)

I realize that your change is not about tuning the spin counts, and this comment should certainly not block merge, but of all the numbers that are 'out of whack', I think SpinLimitProcCap is one that causes grief (on machines with > 8 procs). The fact that you have more processors really is only weakly correlated with how valuable additional spinning might be (it requires all the processors to be beating on the same lock, which frankly is never going to work well, so we should not be tuning for it). Simply changing the cap to something like 8 (I would prefer 4) would make me significantly happier. Consider experimenting with this when you have the time.

Author

I'm currently working on reducing the spin counts. I was planning on removing the proc count multiplier altogether and just making it a simple configurable max number of iterations; I found that the multiplier didn't work very well in other cases. It seems like more procs is not, by itself, a good reason to spin more.
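
For reference, a rough sketch of how the legacy maximum spin duration is derived from the SpinLimit* config values, inferred from the config descriptions above (the exact coreclr computation may differ):

#include <algorithm>
#include <cstdint>

// Defaults are the documented hex values from the config table above.
uint32_t LegacyMaxSpinDuration(uint32_t numProcs,
                               uint32_t spinLimitProcCap    = 0xFFFFFFFF, // SpinLimitProcCap
                               uint32_t spinLimitProcFactor = 0x4E20,     // SpinLimitProcFactor
                               uint32_t spinLimitConstant   = 0x0)        // SpinLimitConstant
{
    // The proc count is capped, multiplied by the factor, then the constant is added.
    return spinLimitProcFactor * std::min(numProcs, spinLimitProcCap) + spinLimitConstant;
}
// Spinning starts at SpinInitialDuration and backs off by SpinBackoffFactor each
// iteration until the limit above is reached; SpinRetryCount says how many times
// the entire spin process is repeated.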


I have always been suspicious that the proc-count multiplier was causing WAY too much spinning (at least for large CPU count machines). Theory says you should spin for the expected time of the delay. Now, more CPUs contending on the same lock (that is otherwise held for a very short time) would argue for more spinning, but probably not by a FACTOR of the MAXIMUM number of CPUs possible (if you have more than 4 CPUs constantly contending for a single lock you already have a scalability problem to fix, so increasing the spinning by more than that seems pointless).

For what it is worth, I still have not given up on the idea that the amount of spinning should be based on HOW LONG IT TOOK THE LAST TIME. Unlike the reader-writer lock (where we didn't want the complexity of doing Stopwatch/QPC timing), you CAN do that easily for monitor. Thus snap a timestamp when contention starts and stops (when the wait wakes up from any spinning/waits), and you now have a PRECISE estimate of a good amount of time to spin. Keep a running average and you should have a VERY good amount for THAT lock, which seems like something close to optimal. If this time is over some threshold (somewhere between 1 and 30 usec), bail and don't bother spinning at all. This is what I would suggest.

You would only do this for AwareLocks, but frankly any lock that you DO wait on becomes an AwareLock (indeed I would argue that if the first interlocked operation does not succeed and the AwareLock exists, that is, you waited in the past, then you should do this history-based amount of spinning, since you have a place to put the timestamps and other logic).
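
A minimal sketch of that history-based idea, purely illustrative (the names, the averaging scheme, and where the ~30 usec cutoff is applied are assumptions, not an implementation):

#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical per-AwareLock spin budget based on how long past contention took.
class AdaptiveSpinBudget
{
    std::atomic<int64_t> m_avgContentionNs{0};       // running average of past contention durations
    static const int64_t SpinCutoffNs = 30 * 1000;   // beyond ~30 usec, don't bother spinning

public:
    // Call with (time lock acquired) - (time contention started), i.e. how long
    // it took the last time, including any spinning and waiting.
    void RecordContention(std::chrono::nanoseconds observed)
    {
        int64_t prev = m_avgContentionNs.load(std::memory_order_relaxed);
        int64_t next = prev - prev / 8 + observed.count() / 8;   // simple running average
        m_avgContentionNs.store(next, std::memory_order_relaxed);
    }

    // How long a newly contending thread should spin; zero means go straight to waiting.
    std::chrono::nanoseconds SpinBudget() const
    {
        int64_t avg = m_avgContentionNs.load(std::memory_order_relaxed);
        return std::chrono::nanoseconds(avg > SpinCutoffNs ? 0 : avg);
    }
};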

Author

Now, more CPUs contending on the same lock (that is otherwise held for a very short time) would argue for more spinning, but probably not by a FACTOR of the MAXIMUM number of CPUs possible

It may be better to spin a little more, or it may be better to spin a bit less and have more of those threads wait. Since only one thread can hold the lock at a time, it may also be better to cap the total number of spinners to some low number. I'll do some experiments around these in the future if I have some time.

For what it is worth, I still have not given up on the idea that the amount of spinning should be based on HOW LONG IT TOOK THE LAST TIME.

Agreed, this could be the best thing to do for monitor. I plan on looking into that idea as part of issue https://github.com/dotnet/coreclr/issues/13981 once we have a working low spin count that the dynamic tuning can work with.

@sharwell

sharwell commented Oct 6, 2017

@kouvel Is there a way to provide a measure of "balance" (fairness) in the testing?

@tannergooding Your machine is 16C/32T, right? I'm interested in seeing hard numbers for bigger machines, if someone is able to do that.

@tannergooding
Member

Yeah, I have an 8c/16t and a 16c/32t. I can try to test this weekend.

@kouvel
Author

kouvel commented Oct 6, 2017

@sharwell:

Is there a way to provide a measure of "balance" (fairness) in the testing?

The latency tests give a measure of how quickly another spinning/waiting thread responds to the lock being released. To measure fairness I can try to add tests that measure the total wait duration, including the spinning and waiting portions. I expect fairness to decrease with this change; at least, it should decrease, since that's the idea.

@kouvel
Author

kouvel commented Oct 6, 2017

@tannergooding thanks, that would be great. I can send you PGO'ed binaries (profiled only on some of the tests above) and my perf harness for the tests too, if that would be useful.

@kouvel
Author

kouvel commented Oct 6, 2017

The last commit adds back contention history, ETW events, and logging into the wait path. The other commits were just rebased.

@sharwell

sharwell commented Oct 6, 2017

@kouvel I'd like to see your test harness if it's available, and write an equivalent test in Java to see how the numbers you posted above compare to the same operation running in a JVM.

@kouvel
Author

kouvel commented Oct 6, 2017

@sharwell, RE fairness:

I added a couple of tests (updated test code in PR #13670) that measure the duration of each Enter across all threads, calculate sqrt(average of the squared durations) (squaring penalizes longer waits), and convert that to a score such that higher is better.
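
For concreteness, a sketch of the metric (the exact conversion to a score isn't spelled out here, so the inversion below is an assumption):

#include <chrono>
#include <cmath>
#include <ratio>
#include <vector>

// Root-mean-square of the measured Monitor.Enter durations across all threads;
// squaring penalizes long waits more heavily than short ones.
double FairnessScore(const std::vector<std::chrono::duration<double, std::micro>>& enterDurations)
{
    double sumOfSquares = 0;
    for (const auto& d : enterDurations)
        sumOfSquares += d.count() * d.count();
    double rms = std::sqrt(sumOfSquares / enterDurations.size());
    return 1.0 / rms;   // assumed: lower RMS wait => higher (better) score
}

Here are the results: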

Default spin counts

Spin                                  Left score     Right score    ∆ Score %
------------------------------------  -------------  -------------  ---------
MonitorEnterExitFairness 02T          358.57 ±0.60%  358.80 ±0.97%      0.06%
MonitorEnterExitFairness 04T          203.41 ±0.46%  137.27 ±1.30%    -32.52%
MonitorEnterExitFairness 08T          144.51 ±0.43%   30.73 ±1.17%    -78.73%
MonitorEnterExitFairness 16T          14.30 ±11.82%   13.02 ±2.73%     -8.92%
MonitorReliableEnterExitFairness 02T  354.67 ±0.72%  360.63 ±0.68%      1.68%
MonitorReliableEnterExitFairness 04T  203.61 ±0.26%  138.32 ±0.76%    -32.06%
MonitorReliableEnterExitFairness 08T  144.69 ±0.41%   30.06 ±0.97%    -79.22%
MonitorReliableEnterExitFairness 16T  13.46 ±10.10%   12.35 ±1.26%     -8.27%
------------------------------------  -------------  -------------  ---------
Total                                 109.85 ±3.21%   66.12 ±1.23%    -39.81%

With this change, fairness decreases a lot more quickly with increasing thread count (increasing contention). At 16 threads though (this machine is 4-core 8-thread), the baseline's CPU usage is so high that it looks like threads are suddenly waiting much longer just to get a time slice (and the machine locks up pretty badly at this stage).

Lower spin counts

set COMPlus_SpinInitialDuration=0x1
set COMPlus_SpinLimitProcFactor=0x80
set COMPlus_SpinBackoffFactor=0x2
set COMPlus_SpinRetryCount=0x0
Spin                                  Left score      Right score     ∆ Score %
------------------------------------  --------------  --------------  ---------
MonitorEnterExitFairness 02T          4964.11 ±0.19%  4980.83 ±0.59%      0.34%
MonitorEnterExitFairness 04T          1979.84 ±0.09%  1062.52 ±0.42%    -46.33%
MonitorEnterExitFairness 08T          1102.74 ±0.26%   264.05 ±0.12%    -76.06%
MonitorEnterExitFairness 16T           442.42 ±0.14%   105.04 ±0.29%    -76.26%
MonitorReliableEnterExitFairness 02T  5097.17 ±0.61%  5125.45 ±0.20%      0.55%
MonitorReliableEnterExitFairness 04T  1969.22 ±0.47%  1088.44 ±0.50%    -44.73%
MonitorReliableEnterExitFairness 08T  1113.24 ±0.22%   272.59 ±0.17%    -75.51%
MonitorReliableEnterExitFairness 16T   442.25 ±0.13%   107.45 ±0.12%    -75.70%
------------------------------------  --------------  --------------  ---------
Total                                 1485.36 ±0.26%   627.33 ±0.30%    -57.77%

At lower spin counts (which is not a reasonable thing to do with the baseline's design, as shown by the numbers above), the changed version still shows faster-decreasing fairness with increasing contention, but compared with the baseline at default spin counts, it actually experiences shorter waits at all thread counts above. I'm currently working on significantly reducing the total spin duration in my next PR as part of issue https://github.com/dotnet/coreclr/issues/13980, so that should hopefully improve fairness on average at realistic levels of contention compared with the baseline.

@sharwell

sharwell commented Oct 6, 2017

@kouvel Thank you for that update; it's very informative. Cases where it appears the new code has a disadvantage are the following:

  1. (Minor) Non-concurrent code executing with locks, e.g. a library which uses synchronized collections used by an application that never uses the library on multiple threads
  2. (Major*) Work throttling algorithms using locks, where individual operations are not related to each other

The asterisk here, and likely the redeeming factor, is due to the exceptionally poor scalability of the current algorithm under heavy contention. I would expect that the majority of users who attempt this approach would find the overall scalability sufficiently poor that alternatives are used instead.

Can you run an additional round-robin test with the following parameters (the pattern is sketched below):

  • N threads with N locks
  • Each thread acquires and releases the N locks in order (looping)

In its theoretical limit, this benchmark in a steady state would acquire locks at an overall rate of Min(N, #CPU) times the rate of the result of N=1.
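
To make the requested pattern concrete, a standalone sketch (plain C++ with std::mutex just to illustrate the access pattern; the actual tests would of course use .NET Monitor):

#include <mutex>
#include <thread>
#include <vector>

// N threads each loop over the same N locks, acquiring and releasing them in
// order, so in steady state the threads chase each other around the ring.
void RoundRobinBenchmark(int n, int iterations)
{
    std::vector<std::mutex> locks(n);
    std::vector<std::thread> threads;
    for (int t = 0; t < n; ++t)
    {
        threads.emplace_back([&locks, n, iterations]
        {
            for (int i = 0; i < iterations; ++i)
            {
                for (int j = 0; j < n; ++j)
                {
                    std::lock_guard<std::mutex> hold(locks[j]);
                    // body intentionally empty: the interesting number is the
                    // overall acquire/release rate
                }
            }
        });
    }
    for (std::thread& th : threads)
        th.join();
}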

@kouvel
Author

kouvel commented Oct 7, 2017

@sharwell:

(Minor) Non-concurrent code executing with locks, e.g. a library which uses synchronized collections used by an application that never uses the library on multiple threads

There should be no difference here; all of the differences in the single-thread numbers above are within noise range. The differences in the ThinLock/AwareLock no-delay tests are also mostly noise (code-gen noise or otherwise); the actual code that is hit on those paths has not changed. Also, in the no-contention cases the monitors will typically remain thin locks, and most of the changes are in AwareLock.

(Major*) Work throttling algorithms using locks, where individual operations are not related to each other

Do you mean unrelated locks, or the same lock and unrelated work? If it's the same lock, then the way in which waiting threads are released would be affected and they may be starved for much longer than before, but code like that intends to throttle some threads anyway. I'm not sure if throttling work using locks is a common thing to do, but it doesn't seem like a good thing to do; an event or some sort of semaphore would probably work better (sketched below).
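
For example, a sketch of throttling with a counting semaphore instead of a lock (C++20 std::counting_semaphore here purely to illustrate the idea; the concurrency limit is arbitrary):

#include <semaphore>

// Allow at most four operations in flight; the rest block on the semaphore
// rather than all contending on one lock that is taken and released rapidly.
std::counting_semaphore<> throttle(4);

void DoThrottledWork()
{
    throttle.acquire();          // waits when four operations are already running
    // ... the unrelated work for this operation ...
    throttle.release();
}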

Can you run an additional round-robin test with the following parameters

Added a test and updated the code in the same place. I don't see much difference from baseline; here are the numbers:

Spin                                                    Left score       Right score      ∆ Score %
------------------------------------------------------  ---------------  ---------------  ---------
MonitorReliableEnterExitRoundRobinThroughput 01T         4843.24 ±0.12%   4861.15 ±0.11%      0.37%
MonitorReliableEnterExitRoundRobinThroughput 02T         4668.02 ±0.13%   4751.62 ±0.17%      1.79%
MonitorReliableEnterExitRoundRobinThroughput 04T         5771.62 ±0.72%   5814.81 ±0.53%      0.75%
MonitorReliableEnterExitRoundRobinThroughput 08T         8914.22 ±0.11%   8875.42 ±0.14%     -0.44%
MonitorReliableEnterExitRoundRobinThroughput 16T        11363.94 ±0.77%  11244.56 ±0.98%     -1.05%
------------------------------------------------------  ---------------  ---------------  ---------

In its theoretical limit, this benchmark in a steady state would acquire locks at an overall rate of Min(N, #CPU) times the rate of the result of N=1.

I think there is a possibility for a thread to go through a large part of, or even the whole, round robin several times before getting into contention.

@kouvel
Author

kouvel commented Oct 7, 2017

@sharwell / @tannergooding, I have copied everything necessary to run the tests here: https://1drv.ms/f/s!AvtRwG9CobRTpwmUKY2-xtXNzuXh

See instructions.txt first for info.

I decided not to PGO the binaries because it was doing some strange layout for some hot paths for some reason; I didn't see that before.

@sharwell:

I'd like to see your test harness if it's available, and write an equivalent test in Java to see how the numbers you posted above compare to the same operation running in a JVM.

I have included the test code in the zip above. My test runner (PerfTester) is a WPF app, but if you want to use the same runner, you can add a test suite to Settings.xml and add an attribute like rightExecutable="path_to_exe" to the TestSuite element; you can also change the exe path and arguments in the UI.

If you create a Java version of the test code, could you please share it? I'd like to compare it after reducing spin counts as well.

@sharwell

sharwell commented Oct 7, 2017

(Minor) Non-concurrent code executing with locks

There should be no difference here, all of the differences in single-thread numbers above are within noise range.

💯

(Major*) Work throttling algorithms using locks

Do you mean unrelated locks, or same lock and unrelated work?

Yes, I was talking about the same lock with unrelated work. You are correct that this is a bad pattern, but I still wanted to be clear about the potential negative impact of the change on such an operation.

If you create a Java version of the test code, could you please share it?

Absolutely 👍

@tannergooding
Member

Sorry, was out sick essentially the whole weekend and did not get around to benching this. I'm going to let it run overnight and should have results tomorrow.

@kouvel
Author

kouvel commented Oct 10, 2017

Hope you feel better @tannergooding. Actually you may want to hold off on testing this until I reduce the spinning as well, since that would be the intended end result of these changes (this PR only takes it part way there). I'll probably add those changes to this same PR since it's not fully reviewed yet and a lot of the questions raised are relevant to and would be resolved by that. I'll try to have those changes up by tomorrow.

@kouvel
Author

kouvel commented Oct 12, 2017

The first commit's description was updated, all commits were rebased, and the last two commits are new.

Part 2: Reduce Monitor spinning

Fixes https://github.com/dotnet/coreclr/issues/13980

  • Added a new config value, Monitor_SpinCount, and Monitor now spins for that many iterations (default is 30, 0x1e). This seems to give a somewhat decent balance between latency, fairness, and throughput. Lower spin counts improve latency and fairness significantly and regress throughput slightly, and higher spin counts improve throughput slightly and regress latency and fairness significantly. (A sketch of the resulting spin loop appears below, after the list of things that didn't pan out.)
    • The other constants can still be used to disable spinning, but otherwise they are no longer used by Monitor
  • Decreased the number of bits used for tracking the spinner count to 3. This seems to be more than enough since only one thread can take a lock at a time, and it prevents spikes of unnecessary CPU usage.

Tried some things that didn't pan out:

  • Sleep(0) doesn't seem to add anything to the spin loop, so I left it out. Instead of Sleep(0) it can just proceed to waiting. Waiting is more expensive than Sleep(0), but I didn't see that benefit in the tests. Omitting Sleep(0) also keeps the spin loop very short (a few microseconds max).
  • Increasing the average YieldProcessor() duration per spin iteration improved throughput slightly but regressed latency and fairness very quickly. Given that fairness is generally worse with part 1 of this change above, it felt like a better compromise to take a small reduction in throughput for larger improvements in latency and fairness.
  • Tried making a very small % of lock releases randomly wake a waiter despite there being spinners, to improve fairness. This improved fairness noticeably, but not as much as decreasing the spin count slightly, and it made latency and throughput worse more quickly. After reducing the % to a point where I was hardly seeing fairness improvements, there were still noticeable latency and throughput regressions.
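
A minimal sketch of the resulting spin loop (the lock type and helper names are stand-ins, YieldProcessorNormalized is reduced to a comment, and only the overall shape follows the description above):

#include <atomic>

// Stand-in lock with just the pieces needed to show the loop shape.
struct SketchLock
{
    std::atomic<bool> locked{false};
    std::atomic<int>  spinners{0};

    bool TryEnter()
    {
        bool expected = false;
        return !locked.load(std::memory_order_relaxed) &&
               locked.compare_exchange_strong(expected, true, std::memory_order_acquire);
    }
};

// Returns false when the caller should fall back to waiting on the event.
bool TryEnterWithSpin(SketchLock& lock, unsigned monitorSpinCount = 30 /* Monitor_SpinCount default */)
{
    lock.spinners.fetch_add(1, std::memory_order_relaxed);   // releasers now skip waking a waiter
    bool acquired = false;
    for (unsigned i = 0; i < monitorSpinCount && !acquired; ++i)
    {
        // YieldProcessorNormalized(): a normalized pause per iteration; note that
        // there is no Sleep(0) anywhere in the loop, keeping it a few microseconds max.
        acquired = lock.TryEnter();
    }
    if (!acquired)
    {
        // A spinner must take the lock if it is available while unregistering,
        // since releasers skipped waking waiters on its behalf.
        acquired = lock.TryEnter();
    }
    lock.spinners.fetch_sub(1, std::memory_order_relaxed);
    return acquired;
}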

Miscellaneous

  • Moved the YieldProcessorNormalized code into separate files so that it can be included earlier and where needed
  • Added a max for the "optimal max normalized yields per spin iteration", since it has the potential to be very large on machines where YieldProcessor may be implemented as a no-op, in which case it's probably not worth spinning for the full duration
  • Refactored duplicate code in the portable versions of MonEnterWorker, MonEnter, and MonReliableEnter. MonTryEnter has a slightly different structure, so I did not refactor that.

Perf

  • Throughput is a bit lower than before at lower thread counts and better at medium-high thread counts. It's a bit lower at lower thread counts for two reasons:
    • A shorter spin loop means the lock will be polled more frequently because the exponential backoff does not get as high, making it more likely for a spinner to steal the lock from another thread and causing the other thread to sometimes wait early
    • The duration of YieldProcessor() calls per spin iteration has decreased, so a spinner or spinning waiter is more likely to take the lock; the rest is similar to the above
  • For the same reasons as above, latency is better than before. Fairness is better on Windows and worse on Linux compared to the baseline, due to the baseline behaving differently on these platforms. Latency also differs between Windows and Linux in the baseline; I suspect those differences are due to differences in scheduling.
  • Performance now scales appropriately on processors with different pause delays

@tannergooding, I have updated the zip file with new binaries and tests. It would be great if you could run this version; see instructions.txt first.

Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread)

Spin (Windows x64)                                   Left score       Right score      ∆ Score %
---------------------------------------------------  ---------------  ---------------  ---------
MonitorReliableEnterExit1PcTOtherWorkThroughput 02T   3601.57 ±0.44%   3669.27 ±0.17%      1.88%
MonitorReliableEnterExit1PcTOtherWorkThroughput 03T   3473.99 ±0.89%   3452.75 ±0.58%     -0.61%
MonitorReliableEnterExit1PcTOtherWorkThroughput 06T   3108.48 ±0.40%   3294.53 ±0.33%      5.99%
MonitorReliableEnterExit1PcTOtherWorkThroughput 12T   2855.57 ±1.10%   3223.51 ±0.48%     12.88%
MonitorReliableEnterExit1PcTOtherWorkThroughput 24T   2426.56 ±1.29%   3158.87 ±0.17%     30.18%
MonitorReliableEnterExitFairness 02T                 59636.33 ±0.46%  51680.15 ±0.41%    -13.34%
MonitorReliableEnterExitFairness 03T                 43231.28 ±0.48%  50139.66 ±0.27%     15.98%
MonitorReliableEnterExitFairness 06T                 26863.18 ±0.35%  27331.16 ±0.72%      1.74%
MonitorReliableEnterExitFairness 12T                 16498.26 ±0.44%   9107.69 ±0.49%    -44.80%
MonitorReliableEnterExitFairness 24T                   917.89 ±1.50%   3947.68 ±3.24%    330.08%
MonitorReliableEnterExitLatency 02T                   2455.98 ±0.13%   2954.59 ±0.30%     20.30%
MonitorReliableEnterExitLatency 03T                   2361.80 ±0.11%   3331.66 ±0.43%     41.06%
MonitorReliableEnterExitLatency 06T                   2303.51 ±0.35%   3200.04 ±0.10%     38.92%
MonitorReliableEnterExitLatency 12T                   1794.39 ±0.07%   3053.88 ±0.02%     70.19%
MonitorReliableEnterExitLatency 24T                   1979.18 ±0.07%   3221.45 ±0.24%     62.77%
MonitorReliableEnterExitRoundRobinThroughput 02T      4879.01 ±0.41%   4969.44 ±0.24%      1.85%
MonitorReliableEnterExitRoundRobinThroughput 03T      5846.00 ±0.21%   6856.36 ±0.10%     17.28%
MonitorReliableEnterExitRoundRobinThroughput 06T      9632.26 ±0.48%  11075.35 ±0.26%     14.98%
MonitorReliableEnterExitRoundRobinThroughput 12T     13486.48 ±0.16%  15495.07 ±0.18%     14.89%
MonitorReliableEnterExitRoundRobinThroughput 24T     15981.35 ±0.66%  16325.04 ±0.33%      2.15%
MonitorReliableEnterExitThroughput 01T                3981.77 ±0.08%   3990.61 ±0.14%      0.22%
MonitorReliableEnterExitThroughput 02T                3918.84 ±0.07%   4065.31 ±0.25%      3.74%
MonitorReliableEnterExitThroughput 03T                3830.99 ±0.36%   3637.33 ±0.18%     -5.06%
MonitorReliableEnterExitThroughput 06T                3512.34 ±0.87%   3695.12 ±0.09%      5.20%
MonitorReliableEnterExitThroughput 12T                2428.19 ±0.11%   3638.53 ±0.29%     49.85%
MonitorReliableEnterExitThroughput 24T                2423.08 ±0.06%   3718.06 ±0.12%     53.44%
---------------------------------------------------  ---------------  ---------------  ---------
Total                                                 4920.28 ±0.45%   5865.08 ±0.39%     19.20%
Spin (Linux x64)                                     Left score       Right score      ∆ Score %
---------------------------------------------------  ---------------  ---------------  ---------
MonitorReliableEnterExit1PcTOtherWorkThroughput 02T   3386.49 ±0.32%   3577.29 ±0.20%      5.63%
MonitorReliableEnterExit1PcTOtherWorkThroughput 03T   3298.58 ±0.49%   3535.88 ±0.21%      7.19%
MonitorReliableEnterExit1PcTOtherWorkThroughput 06T   2952.87 ±0.25%   3253.12 ±0.35%     10.17%
MonitorReliableEnterExit1PcTOtherWorkThroughput 12T   2193.76 ±1.39%   3507.20 ±0.43%     59.87%
MonitorReliableEnterExit1PcTOtherWorkThroughput 24T   1489.82 ±0.53%   3517.09 ±0.17%    136.07%
MonitorReliableEnterExitFairness 02T                 53339.39 ±0.13%  46741.15 ±0.14%    -12.37%
MonitorReliableEnterExitFairness 03T                 40796.59 ±0.21%  49633.35 ±0.65%     21.66%
MonitorReliableEnterExitFairness 06T                 26351.18 ±0.35%   9982.45 ±0.76%    -62.12%
MonitorReliableEnterExitFairness 12T                 16061.63 ±0.47%   3267.09 ±1.60%    -79.66%
MonitorReliableEnterExitFairness 24T                  2591.08 ±0.82%   1333.03 ±1.35%    -48.55%
MonitorReliableEnterExitLatency 02T                   2466.61 ±1.00%   2860.57 ±0.52%     15.97%
MonitorReliableEnterExitLatency 03T                   2506.31 ±0.24%   3427.80 ±0.44%     36.77%
MonitorReliableEnterExitLatency 06T                   2339.34 ±0.69%   3170.80 ±0.30%     35.54%
MonitorReliableEnterExitLatency 12T                   1826.63 ±0.87%   3155.68 ±0.82%     72.76%
MonitorReliableEnterExitLatency 24T                   1569.18 ±0.91%   3252.57 ±0.68%    107.28%
MonitorReliableEnterExitRoundRobinThroughput 02T      4917.09 ±0.46%   5304.75 ±0.33%      7.88%
MonitorReliableEnterExitRoundRobinThroughput 03T      6309.67 ±0.37%   7310.92 ±0.44%     15.87%
MonitorReliableEnterExitRoundRobinThroughput 06T      8926.38 ±2.10%  10067.00 ±0.84%     12.78%
MonitorReliableEnterExitRoundRobinThroughput 12T     12373.17 ±0.79%  13775.49 ±0.47%     11.33%
MonitorReliableEnterExitRoundRobinThroughput 24T      8800.82 ±1.62%  12091.51 ±0.60%     37.39%
MonitorReliableEnterExitThroughput 01T                4147.62 ±0.31%   4181.48 ±0.36%      0.82%
MonitorReliableEnterExitThroughput 02T                4130.19 ±0.27%   3595.22 ±0.63%    -12.95%
MonitorReliableEnterExitThroughput 03T                4035.08 ±0.26%   4006.80 ±0.37%     -0.70%
MonitorReliableEnterExitThroughput 06T                3800.91 ±0.63%   3671.25 ±0.35%     -3.41%
MonitorReliableEnterExitThroughput 12T                2340.23 ±0.07%   3798.45 ±0.44%     62.31%
MonitorReliableEnterExitThroughput 24T                2060.55 ±0.46%   3914.75 ±0.17%     89.99%
---------------------------------------------------  ---------------  ---------------  ---------
Total                                                 4756.76 ±0.62%   5166.57 ±0.52%      8.62%

Core i7-6700 (Skylake, 4-core, 8-thread)

Spin (Windows x64)                                   Left score       Right score      ∆ Score %
---------------------------------------------------  ---------------  ---------------  ---------
MonitorReliableEnterExit1PcTOtherWorkThroughput 02T   2380.40 ±0.55%   2408.90 ±0.31%      1.20%
MonitorReliableEnterExit1PcTOtherWorkThroughput 04T   2139.07 ±0.92%   2090.72 ±0.87%     -2.26%
MonitorReliableEnterExit1PcTOtherWorkThroughput 08T   1814.14 ±1.35%   2047.91 ±0.63%     12.89%
MonitorReliableEnterExit1PcTOtherWorkThroughput 16T   1466.09 ±1.85%   1953.88 ±0.45%     33.27%
MonitorReliableEnterExitFairness 02T                 21507.56 ±0.71%  53818.11 ±0.42%    150.23%
MonitorReliableEnterExitFairness 04T                 12625.90 ±0.42%  50981.21 ±0.28%    303.78%
MonitorReliableEnterExitFairness 08T                  7844.67 ±0.34%  17843.39 ±1.26%    127.46%
MonitorReliableEnterExitFairness 16T                   952.16 ±2.04%   6820.58 ±0.40%    616.32%
MonitorReliableEnterExitLatency 02T                    541.00 ±0.92%   3826.30 ±0.70%    607.26%
MonitorReliableEnterExitLatency 04T                   1361.54 ±0.49%   3586.32 ±0.16%    163.40%
MonitorReliableEnterExitLatency 08T                   1591.01 ±0.47%   3441.13 ±0.12%    116.29%
MonitorReliableEnterExitLatency 16T                   1579.71 ±0.44%   3763.92 ±0.07%    138.27%
MonitorReliableEnterExitRoundRobinThroughput 02T      4507.71 ±0.44%   5925.06 ±0.69%     31.44%
MonitorReliableEnterExitRoundRobinThroughput 04T      5572.57 ±0.39%   8946.49 ±0.47%     60.55%
MonitorReliableEnterExitRoundRobinThroughput 08T      9031.12 ±0.20%  14435.55 ±0.12%     59.84%
MonitorReliableEnterExitRoundRobinThroughput 16T     11540.97 ±0.49%  15730.55 ±0.29%     36.30%
MonitorReliableEnterExitThroughput 01T                4751.57 ±0.44%   4752.57 ±0.42%      0.02%
MonitorReliableEnterExitThroughput 02T                4485.06 ±0.22%   4272.16 ±0.43%     -4.75%
MonitorReliableEnterExitThroughput 04T                4328.26 ±0.37%   3974.57 ±0.45%     -8.17%
MonitorReliableEnterExitThroughput 08T                3725.16 ±0.22%   4009.33 ±0.11%      7.63%
MonitorReliableEnterExitThroughput 16T                3669.12 ±0.46%   4028.55 ±0.10%      9.80%
---------------------------------------------------  ---------------  ---------------  ---------
Total                                                 3423.90 ±0.65%   5955.54 ±0.42%     73.94%
Spin (Linux x64)                                     Left score       Right score      ∆ Score %
---------------------------------------------------  ---------------  ---------------  ---------
MonitorReliableEnterExit1PcTOtherWorkThroughput 02T   2386.87 ±0.32%   2481.74 ±0.08%      3.97%
MonitorReliableEnterExit1PcTOtherWorkThroughput 04T   2109.09 ±1.00%   2391.76 ±0.14%     13.40%
MonitorReliableEnterExit1PcTOtherWorkThroughput 08T   1576.46 ±1.08%   2137.35 ±2.74%     35.58%
MonitorReliableEnterExit1PcTOtherWorkThroughput 16T   1067.17 ±0.60%   2419.25 ±0.07%    126.70%
MonitorReliableEnterExitFairness 02T                 54226.98 ±0.14%  46849.14 ±0.11%    -13.61%
MonitorReliableEnterExitFairness 04T                 34307.77 ±0.91%  30693.91 ±0.77%    -10.53%
MonitorReliableEnterExitFairness 08T                 20917.59 ±0.18%   6664.24 ±2.34%    -68.14%
MonitorReliableEnterExitFairness 16T                  3099.32 ±1.37%   2287.26 ±1.74%    -26.20%
MonitorReliableEnterExitLatency 02T                   2952.08 ±0.45%   3979.68 ±0.30%     34.81%
MonitorReliableEnterExitLatency 04T                   2854.36 ±0.24%   3433.97 ±0.42%     20.31%
MonitorReliableEnterExitLatency 08T                   2149.09 ±0.05%   3620.85 ±0.61%     68.48%
MonitorReliableEnterExitLatency 16T                   1897.45 ±0.67%   3838.72 ±0.47%    102.31%
MonitorReliableEnterExitRoundRobinThroughput 02T      5294.95 ±0.57%   5918.15 ±0.55%     11.77%
MonitorReliableEnterExitRoundRobinThroughput 04T      8202.10 ±1.84%   8914.79 ±1.20%      8.69%
MonitorReliableEnterExitRoundRobinThroughput 08T     11862.08 ±0.23%  13337.19 ±0.25%     12.44%
MonitorReliableEnterExitRoundRobinThroughput 16T      9145.08 ±0.70%  12728.64 ±0.61%     39.19%
MonitorReliableEnterExitThroughput 01T                4961.81 ±0.27%   5036.57 ±0.32%      1.51%
MonitorReliableEnterExitThroughput 02T                4867.11 ±0.14%   4551.73 ±0.25%     -6.48%
MonitorReliableEnterExitThroughput 04T                4657.16 ±0.11%   3880.82 ±0.14%    -16.67%
MonitorReliableEnterExitThroughput 08T                2617.26 ±0.07%   3824.41 ±0.34%     46.12%
MonitorReliableEnterExitThroughput 16T                2287.17 ±0.44%   4381.41 ±0.38%     91.56%
---------------------------------------------------  ---------------  ---------------  ---------
Total                                                 4665.37 ±0.54%   5312.95 ±0.66%     13.88%

@kouvel
Author

kouvel commented Oct 12, 2017

@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness

@kouvel
Author

kouvel commented Oct 18, 2017

@dotnet-bot test this

@kouvel
Author

kouvel commented Oct 18, 2017

@vancem, do these numbers look reasonable to proceed with this direction? Based on my understanding there is a tradeoff between throughput and latency/fairness; I felt that this is a reasonable compromise, but I'd like others to weigh in. We can choose to not regress throughput but instead significantly regress latency/fairness by going back to the previous spinning scheme and just switching to YieldProcessorNormalized to normalize behavior across different processors. Let me know if you want to see numbers for some specific parameters.

@kouvel kouvel removed the request for review from tarekgh October 18, 2017 02:33
@kouvel
Author

kouvel commented Oct 18, 2017

@dotnet-bot test this please

@kouvel
Author

kouvel commented Oct 18, 2017

@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness

1 similar comment
@kouvel
Author

kouvel commented Oct 24, 2017

@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness

@kouvel
Author

kouvel commented Oct 29, 2017

@dotnet-bot test Ubuntu armlb Cross Debug Innerloop Build

Koundinya Veluri added 7 commits November 14, 2017 12:10
Part 1: Improve Monitor scaling

Part 2: Reduce Monitor spinning

@kouvel
Author

kouvel commented Nov 16, 2017

Added a commit that mitigates waiter starvation and rebased other commits.

Normally, threads are allowed to preempt waiters to acquire the lock. There are cases where waiters can be easily starved as a result. For example, a thread that holds a lock for a significant amount of time (much longer than the time it takes to do a context switch), then releases and reacquires the lock in quick succession, and repeats. Though a waiter would be woken upon lock release, usually it will not have enough time to context-switch-in and take the lock, and can be starved for an unreasonably long duration.

In order to prevent such starvation and force a bit of fair forward progress, it is sometimes necessary to change the normal policy and disallow threads from preempting waiters. A new bit was added to LockState, and ShouldNotPreemptWaiters() indicates the current state of the policy (a sketch of the policy follows below).

  • When the first waiter begins waiting, it records the current time as a "waiter starvation start time". That is a point in time after which no forward progress has occurred for waiters. When a waiter acquires the lock, the time is updated to the current time.
  • Before a spinner begins spinning, and when a waiter is signaled to wake, it checks whether the starvation duration has crossed a threshold (currently one second) and if so, sets ShouldNotPreemptWaiters()

When unreasonable starvation is occurring, the lock will still be released occasionally, and if the starvation is caused by spinners, spinners will keep starting to spin.

  • Before starting to spin, if ShouldNotPreemptWaiters() is set, the spinner will skip spinning and wait instead. Spinners that are already registered at the time ShouldNotPreemptWaiters() is set will stop spinning as necessary. Eventually, all spinners will drain and no new ones will be registered.
  • After spinners have drained, only a waiter will be able to acquire the lock. When a waiter acquires the lock, or when the last waiter unregisters itself, ShouldNotPreemptWaiters() is cleared to restore the normal policy.

There was no change to perf from before. In the example above, where before the change a waiter could be starved for a very long time (on the order of tens of seconds to minutes), after the change the waiter acquires the lock once per second.
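
A sketch of that policy in isolation (the one-second threshold and the ShouldNotPreemptWaiters() name are from the description above; everything else is an illustrative assumption):

#include <atomic>
#include <chrono>

struct WaiterStarvationPolicy
{
    // Starvation threshold after which waiters may no longer be preempted.
    static constexpr std::chrono::seconds Threshold{1};

    std::atomic<bool> m_shouldNotPreemptWaiters{false};
    std::chrono::steady_clock::time_point m_waiterStarvationStartTime{};

    // Called when the first waiter begins waiting and when a waiter acquires
    // the lock: any forward progress for waiters resets the clock and policy.
    void RecordWaiterForwardProgress()
    {
        m_waiterStarvationStartTime = std::chrono::steady_clock::now();
        m_shouldNotPreemptWaiters.store(false, std::memory_order_relaxed);
    }

    // Called before a spinner starts spinning and when a waiter is signaled to
    // wake: if waiters have made no progress for too long, stop letting other
    // threads (spinners or new entrants) preempt them.
    void CheckForStarvation()
    {
        if (std::chrono::steady_clock::now() - m_waiterStarvationStartTime >= Threshold)
            m_shouldNotPreemptWaiters.store(true, std::memory_order_relaxed);
    }

    bool ShouldNotPreemptWaiters() const
    {
        return m_shouldNotPreemptWaiters.load(std::memory_order_relaxed);
    }
};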

Koundinya Veluri added 2 commits November 16, 2017 01:30

kouvel commented Nov 16, 2017

@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test

@@ -4427,69 +4427,18 @@ NOINLINE static void JIT_MonEnter_Helper(Object* obj, BYTE* pbLockTaken, LPVOID
}

/*********************************************************************/
NOINLINE static void JIT_MonContention_Helper(Object* obj, BYTE* pbLockTaken, LPVOID __me)

Reviewer:

I am trying to rationalize the value of this MonContention helper. This is uniformly called as a 'second last chance' before bailing and calling the 'slow' MonEnter_Helper path. Why do we not simply skip this and call the slow helper and be done with it? At this point we don't care about CPU as we need to kill some time anyway.

kouvel (Author):

Not sure I follow, as that's kind of what I did: I removed the MonContention helper. The main purpose of it before the changes was to repeat the entire spin loop 10 more times before waiting. I suspect it was necessary mainly to avoid lock convoys, as spinning more helps to avoid them with the previous design. After the changes, there is a first-level fast-path check for the unlocked case (one try for the lock), followed by a spin loop with a default of 30 iterations, then the slow path that waits.
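A rough sketch of that flow is below, using a trivial stand-in for the monitor rather than the real EnterObjMonitorHelper/EnterObjMonitorHelperSpin helpers; all types and names here are illustrative.

```cpp
// Sketch only: the shape of the enter flow described above (fast-path try, short spin,
// then slow wait path). Types and names are stand-ins, not CoreCLR's.
#include <atomic>
#include <thread>

struct MonitorSketch { std::atomic<bool> locked{false}; };

static bool TryAcquire(MonitorSketch& m)
{
    bool expected = false;
    return m.locked.compare_exchange_strong(expected, true);
}

enum class EnterResult { Acquired, UseSlowPath };

static EnterResult TryEnterWithSpin(MonitorSketch& m)
{
    // First-level fast path: one try for the unlocked case.
    if (TryAcquire(m))
        return EnterResult::Acquired;

    // Spin loop, 30 iterations by default, pausing briefly between attempts.
    for (int i = 0; i < 30; i++)
    {
        std::this_thread::yield();   // stand-in for the real pause/back-off
        if (TryAcquire(m))
            return EnterResult::Acquired;
    }

    // Slow path: the caller registers as a waiter and blocks until woken.
    return EnterResult::UseSlowPath;
}
```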

Reviewer:

Sorry, my mistake, ignore my comment. I was looking at the wrong branch on my machine and confused myself about the fact that this code is being removed.

SyncBlock *syncBlock = g_pSyncTable[oldValue & MASK_SYNCBLOCKINDEX].m_SyncBlock;
_ASSERTE(syncBlock != NULL);
AwareLock *awareLock = &syncBlock->m_Monitor;
if (awareLock->EnterHelper(pCurThread, true /* checkRecursiveCase */))

AwareLock::EnterHelperResult result = awareLock->TryEnterBeforeSpinLoopHelper(pCurThread);
Reviewer:

I realize that what I am about to comment on is not part of the delta in this PR, but I am trying to rationalize the whole code flow here. My mental model for the EnterObjMonitorHelperSpin helper is that it was there to avoid making an awareLock (otherwise even ONE contention on a lock at any time would force the awareLock to be created). However, we have an awareLock at this point, so the question is why can't we just return UseSlowPath and avoid this code? This code should be a clone of what is done in AwareLock, and the extra cost of getting there (making an awareLock (but we have one) and setting up a transition frame (but we need to spin anyway)) should not be an issue. It would be nice to simply remove this code. What prevents this?

kouvel (Author), Nov 18, 2017:

I was thinking about that; it's kind of a pity to have to force-inline the moderately slow-path spinning code (which it used to do in the asm helpers to avoid calls). The overhead of the transition is not something I fully understand, but my high-level guess is that taking the slow path is equivalent in overhead to a p/invoke, which is fairly high. I'm not sure if it will matter though, probably worth a test. Moving spinning to the slow path would also significantly improve register allocation in the fast path since it removes the last offending call. I plan on looking into this more closely as part of https://github.com/dotnet/coreclr/issues/14571.

Yes avoiding creating the aware lock is another advantage of spinning more (when it's a thin lock anyway). With this change a monitor would transition into an aware lock much more easily, there are perf/memory tradeoffs involved and I think this is a decent one.

Spinning can't be avoided altogether once we have an aware lock. Spinning just a little bit significantly improves perf under contention and avoids excessive context switching in scenarios involving smaller time scales, which I'm sure are very common.

Regarding moving the spin path to the slow path, I don't think it matters whether we have an aware lock already or not; it's the same kind of situation for a thin lock.

Reviewer:

Just FYI: yes, a HelperMethodFrame transition is like a PINVOKE, but it is slower (because it does something called epilog walking to set up the frame). I have not measured it recently, but a normal optimized PINVOKE is 20-30 clock ticks; HelperMethodFrames are probably 2-3 times that. There is some goo to get from there to AwareLock.Enter(), where we again try to take the lock with an interlocked operation. But it seems like that overhead is basically on the same order as what we would spin for, so we should not be too concerned about the overhead.

As you say it has the benefit of making the register allocation leaner for the paths that remain. But I also care about the fact that we are cloning non-trivial code (we have a very similar spin wait loop in EnterEpilogHelper). The less cloned code we have the better (for example we don't do the ETW contention logging in TryEnterObjMonitorSpinHelper, and arguably we should. It would also be SO easy for the logic that counts spinners to be off in one place but not the other).

I agree we need spinning in the aware-lock case. I just want it done one place if possible, and it seems like we are doing it in EnterEpilogHelper so that should be enough.

The value of trying this now, is that we don't have to deal with any bugs associated with the code we are about to remove. I suppose it could wait until #14571, but if you see no issue, and removing it does not cause regressions it is a very nice improvement in my opinion...

On the issue of spinning to avoid creating an awareLock: I would not spin more just to avoid AwareLock creation. If spinning just a bit more is what saved you, odds are that sooner or later you will 'fall off' that edge and need to make the lock. The key part is to handle the case where there is little to no contention (which is common in many cases).

Reviewer:

I take back some of what I said above. I see that the logic in EnterEpilogHelper does the Wait before it does any spinning (it does the spinning on waking up). So you can't just remove the code in EnterEpilogHelper without losing spinning in the AwareLock case before waiting.

I must admit I would prefer that it work the way I thought it worked (that there is no AwareLock logic in EnterObjMonitorHelperSpin; we always bail to the slow helper). This would require AwareLock.EnterEpilog to spin before waiting (it is the same loop, you just spin before you wait). Thus all AwareLock logic would be centralized (LockState actually does not need to be in a header file; it is only used in AwareLock's code in SyncBlk.cpp).

But that is not a trivial change (not huge, but not trivial), and it would need to be perf-tested. I will not ask you to make that change to get this fix checked in.

Please do consider it, however. I think it would simplify the code and as you mention, may help #14571.

Reviewer:

For what it is worth, it seems that EnterObjMonitorHelper and EnterObjMonitorHelperSpin are only called from one place, TryEnterObjMonitorSpinHelper, and there are some tricky conditions that are cloned in each of these methods. You could effectively combine them into EnterObjMonitorHelperSpin, which is pretty much just EnterObjMonitorHelper with a loop that calls SpinWait.

This of course only works well if we first kick all the AwareLock stuff out of EnterObjMonitorHelperSpin suggested above.

While I like this idea, I am not suggesting you do it, just giving you food for thought.

kouvel (Author):

RE HelperMethodFrame overhead, the overhead may even be worth it as it would add a potentially beneficial delay before the first try in the spin loop. Can't be sure until I try it.

> The value of trying this now, is that we don't have to deal with any bugs associated with the code we are about to remove. I suppose it could wait until #14571, but if you see no issue, and removing it does not cause regressions it is a very nice improvement in my opinion...

I would like to keep this effort separate from this PR due to time constraints. My estimation is that this is not a short task. If it would be beneficial in avoiding bugs, we could consider postponing this PR but in that event a smaller PR that fixes the YieldProcessor issue for Skylake processors would be necessary in the interim in case the full set of changes don't make it into 2.1.

> But I also care about the fact that we are cloning non-trivial code (we have a very similar spin wait loop in EnterEpilogHelper). The less cloned code we have the better
> It would also be SO easy for the logic that counts spinners to be off in one place but not the other

The spin loops in EnterObjMonitorHelperSpin and EnterEpilogHelper are distinct and most of their logic is not shared:

  • EnterObjMonitorHelperSpin spin loop for AwareLock
    • Before spin loop, atomically attempt to acquire lock or register spinner (must not register spinner if lock is available)
    • In spin loop, if lock is available, atomically acquire lock and unregister spinner
    • After spin loop, unregister spinner, but if lock was available at the time of unregistering the spinner, the lock must be acquired and the spinner unregistered atomically unless the lock is stolen by another thread; the comment there explains why
  • EnterEpilogHelper spin loop
    • Precondition: waiter is registered
    • In spin loop, if lock is available, atomically acquire lock and clear waiter-signaled-to-wake bit
    • After spin loop, clear waiter-signaled-to-wake bit, but if lock was available at the time of clearing the bit, the lock must be acquired and the waiter unregistered atomically unless the lock is stolen by another thread. If lock is not acquired, it will proceed to waiting again.

There are also differences relevant to mitigating waiter starvation. There is some similarity in the flow and it's likely possible to refactor the flow to be shared, but to me it doesn't seem worth it.
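To make the first of those protocols concrete, here is a sketch of the spinner register/unregister flow from the EnterObjMonitorHelperSpin bullets above, under the assumption of a single atomic word holding one lock bit plus a spinner count; the names and bit layout are illustrative, not the actual LockState code.

```cpp
// Sketch only: the spinner protocol from the EnterObjMonitorHelperSpin bullets above.
// One atomic word holds the lock bit plus a spinner count; layout and names are
// illustrative, not CoreCLR's LockState.
#include <atomic>
#include <cstdint>
#include <thread>

static const uint32_t LockedBit  = 1u;
static const uint32_t SpinnerOne = 2u;   // spinner count occupies the remaining bits

// Before the spin loop: acquire the lock if it is free, otherwise register a spinner
// (never register a spinner while the lock is available).
static bool TryAcquireOrRegisterSpinner(std::atomic<uint32_t>& state, bool& registered)
{
    uint32_t old = state.load(std::memory_order_relaxed);
    while (true)
    {
        uint32_t desired = (old & LockedBit) ? old + SpinnerOne : old | LockedBit;
        if (state.compare_exchange_weak(old, desired, std::memory_order_acquire))
        {
            registered = (old & LockedBit) != 0;
            return !registered;                       // true => lock acquired
        }
    }
}

// The spin loop itself, followed by unregistration. Returns true if the lock was taken.
static bool SpinThenUnregister(std::atomic<uint32_t>& state)
{
    for (int i = 0; i < 30; i++)
    {
        uint32_t old = state.load(std::memory_order_relaxed);
        // If the lock is free, acquire it and unregister the spinner in one CAS.
        if (!(old & LockedBit) &&
            state.compare_exchange_weak(old, (old - SpinnerOne) | LockedBit,
                                        std::memory_order_acquire))
            return true;
        std::this_thread::yield();                    // stand-in for the real back-off
    }

    // Done spinning: unregister, but if the lock is free at that moment it must be
    // acquired in the same atomic step (a releaser skips waking a waiter while a
    // spinner is registered), unless another thread steals it first.
    uint32_t old = state.load(std::memory_order_relaxed);
    while (true)
    {
        uint32_t desired = (old & LockedBit) ? old - SpinnerOne
                                             : (old - SpinnerOne) | LockedBit;
        if (state.compare_exchange_weak(old, desired, std::memory_order_acquire))
            return !(old & LockedBit);
    }
}
```

A caller that registers a spinner and then gets false back from SpinThenUnregister would go on to register as a waiter and wait; the EnterEpilogHelper loop has the same overall shape, except it clears the waiter-signaled-to-wake bit on the way out rather than a spinner count.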

> we don't do the ETW contention logging in TryEnterObjMonitorSpinHelper, and arguably we should

This would be far too verbose. Considering how short the spin duration is, from the perspective of the event I think it's fair to consider that there is no contention until waiting is necessary. It may even already be too verbose compared to before, but it's something that can be fixed later if it turns out to be an issue.

> For what it is worth, it seems that EnterObjMonitorHelper and EnterObjMonitorHelperSpin are only called from one place, TryEnterObjMonitorSpinHelper, and there are some tricky conditions that are cloned in each of these methods.

Agreed, and there's certainly more refactoring that can be done to simplify this code. At this point though, I think a separate PR would be appropriate for those kinds of changes.

> You could effectively combine them into EnterObjMonitorHelperSpin, which is pretty much just EnterObjMonitorHelper with a loop that calls SpinWait

This is actually what I tried initially, but it showed significant perf regressions on Linux, so I decided to keep them separate and not inline the spin loop. On Windows there was not much difference in perf. Since we're thinking about moving the spinning to the slow helper path anyway, it may make sense to keep them separate at least until that idea can be tested.



vancem commented Nov 20, 2017

I have set my approval for this change. I do want to spend a bit more time looking over the code (I am basically creating a PR where I put a bunch of comments in to document the design), but what is clear is that this is a good step in the right direction. We should proceed.


kouvel commented Nov 27, 2017

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test


kouvel commented Nov 27, 2017

Thanks @vancem, I'll go ahead and merge the changes once everything passes. Let me know if you have any other feedback or if some parts could use more documentation.

@kouvel kouvel merged commit 93ddc4d into dotnet:master Nov 27, 2017
@kouvel kouvel deleted the MonitorScaleFix branch November 27, 2017 18:28