@kavya719
Let’s talk locks!
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/go-locks/
Presented at QCon New York
www.qconnewyork.com
kavya
locks.
“locks are slow”
…but they’re used everywhere,
from schedulers to databases and web servers.
lock contention causes ~10x latency
[chart: latency (ms) over time, spiking under lock contention]
?
let’s analyze its performance!
performance models for contention
let’s build a lock!
a tour through lock internals
let’s use it, smartly!
a few closing strategies
our case-study
Lock implementations are hardware, ISA, OS and language specific.
We assume an x86_64 SMP machine running a modern Linux.
We’ll look at the lock implementation in Go 1.12.

[diagram: simplified SMP system — CPU 0 and CPU 1, each with a cache,
connected by an interconnect to shared memory]
a brief go primer
The unit of concurrent execution: goroutines.
use as you would threads:
> go handle_request(r)
but they are user-space threads:
managed entirely by the Go runtime, not the operating system.
Data shared between goroutines must be synchronized.
One way is to use the blocking, non-recursive lock construct:
> var mu sync.Mutex
  mu.Lock()
  …
  mu.Unlock()
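A minimal, runnable sketch of that pattern (the string slice and the task names here are illustrative, not from the talk):

package main

import (
    "fmt"
    "sync"
)

var (
    mu    sync.Mutex
    tasks []string // shared data, protected by mu
)

func writer(t string) {
    mu.Lock()
    tasks = append(tasks, t)
    mu.Unlock()
}

func reader() (string, bool) {
    mu.Lock()
    defer mu.Unlock()
    if len(tasks) == 0 {
        return "", false
    }
    t := tasks[0]
    tasks = tasks[1:]
    return t, true
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            writer(fmt.Sprintf("task-%d", i))
        }(i)
    }
    wg.Wait()
    for {
        t, ok := reader()
        if !ok {
            break
        }
        fmt.Println(t)
    }
}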
let’s build a lock!
a tour through lock internals.
want: “mutual exclusion”
only one thread has access to shared data at any given time

// shared ring buffer
var tasks Tasks

func reader() {
    // Read a task.
    t := tasks.get()
    // Do something with it.
    ...
}

func writer() {
    // Write to tasks.
    tasks.put(t)
}

T1 running on CPU 1; T2 running on CPU 2.

…idea! use a flag?
// track whether tasks can be
// accessed (0) or not (1)
var flag int
var tasks Tasks

func reader() {
    for {
        /* If flag is 0, we can access tasks. */
        if flag == 0 {
            /* Set flag */
            flag++
            ...
            /* Unset flag */
            flag--
            return
        }
        /* Else, keep looping. */
    }
}

func writer() {
    for {
        /* If flag is 0, we can access tasks. */
        if flag == 0 {
            /* Set flag */
            flag++
            ...
            /* Unset flag */
            flag--
            return
        }
        /* Else, keep looping. */
    }
}

T1 running on CPU 1; T2 running on CPU 2.
flag++ is a read-modify-write (RMW) on the CPU:
1. Read (0)
2. Modify
3. Write (1)

[timeline of memory operations: T1 on CPU 1 performs the R and W of its flag++,
while T2 on CPU 2 performs the R of its if flag == 0 in between]

T2 may observe T1’s RMW half-complete
atomicity
A memory operation is non-atomic if it can be
observed half-complete by another thread.

An operation may be non-atomic because it:
• uses multiple CPU instructions:
  operations on a large data structure;
  compiler decisions.
  > o := Order {
        id:    10,
        name:  “yogi bear”,
        order: “pie”,
        count: 3,
    }
• uses a single non-atomic CPU instruction:
  RMW instructions; unaligned loads and stores.
  > flag++

An atomic operation is an “indivisible” memory access.
In x86_64: loads and stores that are naturally aligned up to 64b.*
  the alignment guarantees the data item fits within a cache line;
  cache coherency guarantees a consistent view for a single cache line.
* these are not the only guaranteed atomic operations.
…idea! use a flag?
nope; not atomic.
func reader() {
    for {
        /* If flag is 0, we can access tasks. */
        if flag == 0 {
            /* Set flag */
            flag = 1
            t := tasks.get()
            ...
            /* Unset flag */
            flag = 0
            return
        }
        /* Else, keep looping. */
    }
}

T1 running on CPU 1
the compiler may reorder operations.
the processor may reorder operations too:
StoreLoad reordering — the load of t may happen before the store flag = 1.

// Sets flag to 1 & reads data.
func reader() {
    flag = 1
    t := tasks.get()
    ...
    flag = 0
}
memory access reordering
The compiler and processor can reorder memory operations to optimize execution.
• The only cardinal rule is sequential consistency for single-threaded programs.
• Other guarantees about compiler reordering are captured by a language’s memory model:
  C++ and Go guarantee that data-race-free programs will be sequentially consistent.
• Guarantees about processor reordering are captured by the hardware memory model:
  x86_64 provides Total Store Ordering (TSO) —
  a relaxed consistency model:
  most reorderings are disallowed, but StoreLoad is permitted;
  it allows the processor to hide the latency of writes.
…idea! use a flag?
nope; not atomic and no memory order guarantees.
need a construct that provides atomicity and prevents memory reordering.
…the hardware provides!
special hardware instructions
For guaranteed atomicity:
  x86 example: XCHG (exchange).
To prevent memory reordering:
  instructions called memory barriers;
  they prevent reordering by the compiler too.
  x86 example: MFENCE, LFENCE, SFENCE.

The x86 LOCK instruction prefix provides both.
Used to prefix memory access instructions:
  LOCK ADD
  LOCK CMPXCHG
} atomic operations in languages like Go:
  atomic.Add
  atomic.CompareAndSwap

Atomic compare-and-swap (CAS) conditionally updates a variable:
checks if it has the expected value and if so, changes it to the desired value.
var flag int
var tasks Tasks

func reader() {
    for {
        // Try to atomically CAS flag from 0 -> 1.
        if atomic.CompareAndSwap(&flag, 0, 1) {
            // the CAS succeeded; we set flag to 1.
            ...
            // Atomically set flag back to 0.
            atomic.Store(&flag, 0)
            return
        }
        // flag was 1 so our CAS failed; try again :)
    }
}

baby’s first lock: a spinlock
This is a simplified spinlock.
Spinlocks are used extensively in the Linux kernel.
The atomic CAS is the quintessence of any lock implementation.
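A runnable Go version of that spinlock, as a sketch: sync/atomic works on concrete types, so this uses int32 and CompareAndSwapInt32 in place of the pseudocode names above; the Gosched call is just a courtesy while spinning.

package spin

import (
    "runtime"
    "sync/atomic"
)

// SpinLock is a minimal spinlock built on an atomic CAS.
type SpinLock struct {
    flag int32 // 0: unlocked, 1: locked
}

func (l *SpinLock) Lock() {
    // Try to atomically CAS flag from 0 -> 1; spin until it succeeds.
    for !atomic.CompareAndSwapInt32(&l.flag, 0, 1) {
        runtime.Gosched() // yield while spinning so other goroutines can run
    }
}

func (l *SpinLock) Unlock() {
    // Atomically set flag back to 0.
    atomic.StoreInt32(&l.flag, 0)
}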
cost of an atomic operation
Run on a 12-core x86_64 SMP machine.
Atomic store to a C _Atomic int, 10M times in a tight loop.
Measure average time taken per operation (from within the program).

With 1 thread: ~13ns (vs. regular operation: ~2ns)
With 12 cpu-pinned threads: ~110ns — threads are effectively serialized
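The talk’s numbers come from a C program; a rough Go analogue of the measurement, just to show its shape (with GOMAXPROCS ≥ 12 on such a machine, the 12 goroutines map onto 12 threads):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

func main() {
    const iters = 10_000_000
    var x int64

    for _, n := range []int{1, 12} {
        var wg sync.WaitGroup
        start := time.Now()
        for g := 0; g < n; g++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := 0; i < iters; i++ {
                    atomic.StoreInt64(&x, int64(i)) // the atomic store under test
                }
            }()
        }
        wg.Wait()
        // wall time / iterations ≈ time each goroutine spends per atomic store
        fmt.Printf("%2d goroutines: ~%v per atomic store\n",
            n, time.Since(start)/time.Duration(iters))
    }
}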
sweet.
We have a scheme for mutual exclusion that provides atomicity and
memory ordering guarantees.
…but
spinning for long durations is wasteful; it takes away CPU time from
other threads.
enter the operating system!
Linux’s futex
Interface and mechanism for user-space code to ask the kernel to suspend/resume threads.
futex syscall + kernel-managed queues

flag can be 0: unlocked
            1: locked
            2: locked, and there’s a waiter

var flag int
var tasks Tasks

T1’s CAS fails (because T2 has set the flag):

func reader() {
    for {
        if atomic.CompareAndSwap(&flag, 0, 1) {
            ...
        }
        // CAS failed; set flag to 2 (there’s a waiter),
        v := atomic.Xchg(&flag, 2)
        // and go to sleep: futex syscall to tell the kernel
        // to suspend us until flag changes.
        // when we’re resumed, we’ll CAS again.
        futex(&flag, FUTEX_WAIT, ...)
    }
}
in the kernel:
1. arrange for the thread to be resumed in the future:
   add an entry for this thread in the kernel queue for the address we care about.
   [diagram: keyA (from the userspace address &flag) —hash(keyA)—> a hash bucket
    chaining futex_q entries: (keyA, T1) -> (keyother, Tother)]
2. deschedule the calling thread to suspend it.
T2 is done (accessing the shared data):

func writer() {
    for {
        if atomic.CompareAndSwap(&flag, 0, 1) {
            ...
            // Set flag to unlocked.
            v := atomic.Xchg(&flag, 0)
            if v == 2 {
                // If there was a waiter, issue a wake up.
                futex(&flag, FUTEX_WAKE, ...)
            }
            return
        }
        v := atomic.Xchg(&flag, 2)
        futex(&flag, FUTEX_WAIT, …)
    }
}

if flag was 2, there’s at least one waiter:
futex syscall to tell the kernel to wake a waiter up.

the kernel then:
  hashes the key
  walks the hash bucket’s futex queue
  finds the first thread waiting on the address
  schedules it to run again!
That was a hella simplified futex,
…but we still have a nice, lightweight primitive to build synchronization constructs.
pretty convenient!
pthread mutexes use futexes.
cost of a futex
Run on a 12-core x86_64 SMP machine.
Lock & unlock a pthread mutex 10M times in a loop
(lock, increment an integer, unlock).
Measure average time taken per lock/unlock pair (from within the program).

uncontended case (1 thread): ~13ns
  } cost of the user-space atomic CAS = ~13ns
contended case (12 cpu-pinned threads): ~0.9µs
  } cost of the atomic CAS + syscall + thread context switch = ~0.9µs
spinning vs. sleeping
Spinning makes sense for short durations; it keeps the thread on the CPU.
The trade-off is it uses CPU cycles not making progress.
So at some point, it makes sense to pay the cost of the context switch to go to sleep.
There are smart “hybrid” futexes:
CAS-spin a small, fixed number of times —> if that didn’t acquire the lock, make the futex syscall.
Examples: the Go runtime’s futex implementation; a variant of the pthread_mutex.
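A sketch of that spin-then-sleep shape in Go, under the assumption that we can’t issue raw futex syscalls from ordinary user code, so it falls back to sleeping on a buffered channel instead (illustrative only; it is not starvation-proof):

package hybrid

import (
    "runtime"
    "sync/atomic"
)

// HybridLock spins briefly on a CAS, then sleeps until an unlock signals it.
type HybridLock struct {
    flag int32         // 0: unlocked, 1: locked
    wake chan struct{} // signalled on unlock to wake one sleeper
}

func NewHybridLock() *HybridLock {
    return &HybridLock{wake: make(chan struct{}, 1)}
}

func (l *HybridLock) Lock() {
    for {
        // CAS-spin a small, fixed number of times.
        for i := 0; i < 100; i++ {
            if atomic.CompareAndSwapInt32(&l.flag, 0, 1) {
                return
            }
            runtime.Gosched()
        }
        // Still locked: sleep until an unlock signals us, then retry.
        <-l.wake
    }
}

func (l *HybridLock) Unlock() {
    atomic.StoreInt32(&l.flag, 0)
    // Wake at most one sleeper (non-blocking if nobody is waiting).
    select {
    case l.wake <- struct{}{}:
    default:
    }
}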
…can we do better for user-space threads?
goroutines are user-space threads.
The Go runtime multiplexes them onto threads.
[diagram: goroutines g1, g2, g6 —Go scheduler—> thread —OS scheduler—> CPU core]
lighter-weight and cheaper than threads:
goroutine switches = ~tens of ns;
thread switches = ~a µs.
we can block the goroutine without blocking the underlying thread,
to avoid the thread context switch cost!
This is what the Go runtime’s semaphore does!

The semaphore is conceptually very similar to futexes in Linux*, but it is used to 

sleep/wake goroutines:
a goroutine that blocks on a mutex is descheduled, but not the underlying thread.
the goroutine wait queues are managed by the runtime, in user-space.
* There are, of course, differences in implementation though.
the goroutine wait queues are managed
by the Go runtime, in user-space.

G1’s CAS fails (because G2 has set the flag):

var flag int
var tasks Tasks

func reader() {
    for {
        // Attempt to CAS flag.
        if atomic.CompareAndSwap(&flag, ...) {
            ...
        }
        // CAS failed; add G1 as a waiter for flag,
        root.queue()
        // and suspend G1.
        gopark()
    }
}

the Go runtime deschedules the goroutine;
keeps the thread running!

the goroutine wait queues (in user-space, managed by the Go runtime):
[diagram: hash(&flag) —> hash bucket; the top-level waitlist for a hash bucket
 is implemented as a treap; there’s a second-level wait queue for each unique
 address, e.g. &flag: G1 -> G3 -> G4, &other: …]
G2’s done (accessing the shared data):

func writer() {
    for {
        if atomic.CompareAndSwap(&flag, 0, 1) {
            ...
            // Set flag to unlocked.
            atomic.Xadd(&flag, ...)
            // If there’s a waiter, reschedule it:
            // find the first waiter goroutine and reschedule it.
            waiter := root.dequeue(&flag)
            goready(waiter)
            return
        }
        root.queue()
        gopark()
    }
}
this is clever.
Avoids the hefty thread context switch cost in the contended case,
up to a point.
but…
func reader() {
    for {
        if atomic.CompareAndSwap(&flag, ...) {
            ...
        }
        // CAS failed; add G1 as a waiter for flag,
        semaroot.queue()
        // and suspend G1.
        gopark()
    }
}

once G1 is resumed, it will try to CAS again.

Resumed goroutines have to compete with any other goroutines trying to CAS.
They will likely lose:
there’s a delay between when the flag was set to 0 and when this goroutine was rescheduled.
So, the semaphore implementation may end up:
• unnecessarily resuming a waiter goroutine:
  results in a goroutine context switch again.
• causing goroutine starvation:
  can result in long wait times, high tail latencies.
the sync.Mutex implementation adds a layer that fixes these.
go’s sync.Mutex
is a hybrid lock that uses a semaphore to sleep / wake goroutines.
Additionally, it tracks extra state to:

prevent unnecessarily waking up a goroutine:
“there’s a goroutine actively trying to CAS”:
an unlock in this case does not wake a waiter.

prevent severe goroutine starvation:
“a waiter has been waiting”:
if a waiter is resumed but loses the CAS again, it’s queued at the head of the wait queue.
if a waiter fails to lock for 1ms, switch the mutex to “starvation mode”:
other goroutines cannot CAS, they must queue;
the unlock hands the mutex off to the first waiter,
i.e. the waiter does not have to compete.
how does it perform?
Run on a 12-core x86_64 SMP machine.
Lock & unlock a Go sync.Mutex 10M times in a loop
(lock, increment an integer, unlock).
Measure average time taken per lock/unlock pair (from within the program).

uncontended case (1 goroutine): ~13ns
contended case (12 goroutines): ~0.8µs

Contended case performance of C vs. Go:
Go initially performs better than C,
but they ~converge as concurrency gets high enough.
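A minimal stand-in for that measurement (the talk’s exact harness isn’t in the slides; this just shows the shape):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const iters = 10_000_000
    var (
        mu      sync.Mutex
        counter int
    )

    for _, n := range []int{1, 12} {
        counter = 0
        var wg sync.WaitGroup
        start := time.Now()
        for g := 0; g < n; g++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := 0; i < iters; i++ {
                    mu.Lock()
                    counter++ // the critical section: increment an integer
                    mu.Unlock()
                }
            }()
        }
        wg.Wait()
        elapsed := time.Since(start)
        fmt.Printf("%2d goroutines: counter=%d, ~%v per lock/unlock pair\n",
            n, counter, elapsed/time.Duration(iters))
    }
}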
It’s locks all the way down!
sync.Mutex uses a semaphore.
the Go runtime semaphore’s hash table for waiting goroutines:
  each hash bucket needs a lock. …it’s a futex!
the Linux kernel’s futex hash table for waiting threads:
  each hash bucket needs a lock. …it’s a spinlock!
let’s analyze its performance!
performance models for contention.
uncontended case:
cost of the atomic CAS.
contended case:
in the worst case, cost of failed atomic operations + spinning + goroutine context switch +
thread context switch.
…but really, it depends on the degree of contention.
how many threads do we need to support a target throughput? 

while keeping response time the same.
how does response time change with the number of threads?
assuming a constant workload.
“How does application performance change with concurrency?”
Amdahl’s Law
Speed-up depends on the fraction of the workload that can be parallelized (p):
speed-up with N threads = 1 / ((1 − p) + p / N)
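A worked example: with p = 0.75 and N = 12, speed-up = 1 / (0.25 + 0.75/12) = 1 / 0.3125 = 3.2×; even with unlimited threads the speed-up is capped at 1 / (1 − p) = 4×.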
a simple experiment
Measure the time taken to complete a fixed workload;
the serial fraction holds a lock (sync.Mutex).
Scale the parallel fraction (p) from 0.25 to 0.75;
measure the time taken for number of goroutines (N) = 1 —> 12.
[chart: measured speed-up vs. N for p = 0.25 and p = 0.75, tracking Amdahl’s Law]
Universal Scalability Law (USL)
Scalability depends on contention and crosstalk:
• contention penalty (the αN term)
  due to serialization for shared resources.
  examples: lock contention, database contention.
• crosstalk penalty (the βN² term)
  due to coordination for coherence.
  examples: servers coordinating to synchronize mutable state.
throughput of N threads = N / (αN + βN² + C)
[chart: throughput vs. concurrency —
 linear scaling: N / C;
 contention only: N / (αN + C);
 contention and crosstalk: N / (αN + βN² + C)]
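A small sketch of evaluating that curve; alpha, beta, and c here stand for whatever values you fit to measured data (e.g. with the R usl package mentioned below):

package usl

// Throughput evaluates the curve from the slide:
//   X(N) = N / (α·N + β·N² + C)
// alpha models contention, beta models crosstalk,
// and c sets the cost of the serial (single-thread) path.
func Throughput(n, alpha, beta, c float64) float64 {
    return n / (alpha*n + beta*n*n + c)
}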
[USL curves for p = 0.25 and p = 0.75, plotted using the R usl package;
 p = parallel fraction of the workload]
let’s use it, smartly!
a few closing strategies.
but first, profile!
Go:
• Go mutex contention profiler
  https://golang.org/doc/diagnostics.html
• pprof mutex contention profile
Linux:
• perf lock:
  perf examples by Brendan Gregg;
  Brendan Gregg’s article on off-cpu analysis
• eBPF:
  example bcc tool to measure user lock contention
• Dtrace, systemtap
• mutrace, Valgrind-drd
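To turn the Go mutex contention profiler on in a long-running program, a minimal sketch (the sampling rate of 5 and the localhost:6060 address are arbitrary choices):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/mutex
    "runtime"
)

func main() {
    // Sample roughly 1 in 5 mutex contention events
    // (0 disables the profile; 1 records every event).
    runtime.SetMutexProfileFraction(5)

    // ... the rest of the application ...

    // Then inspect with: go tool pprof http://localhost:6060/debug/pprof/mutex
    http.ListenAndServe("localhost:6060", nil)
}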
strategy I: don’t use a lock
• remove the need for synchronization from hot paths:
  typically involves rearchitecting.
• reduce the number of lock operations:
  doing more thread-local work, buffering, batching, copy-on-write.
• use atomic operations (see the sketch below).
• use lock-free data structures
  see: http://www.1024cores.net/
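For instance, a counter that only needs increments and reads can drop its mutex for sync/atomic (a minimal sketch):

package counter

import "sync/atomic"

// Before: mu.Lock(); count++; mu.Unlock()
// After: a single atomic RMW, no lock acquired.

var count int64

func Inc() {
    atomic.AddInt64(&count, 1)
}

func Read() int64 {
    return atomic.LoadInt64(&count)
}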
strategy II: granular locks
• shard data (see the sketch after this list):
  but ensure no false sharing, by padding to cache line size.
  examples:
    Go runtime semaphore’s hash table buckets;
    Linux scheduler’s per-CPU runqueues;
    Go scheduler’s per-CPU runqueues.
• use read-write locks

scheduler benchmark (CreateGoroutineParallel):
  modified scheduler: global lock, single runqueue
  go scheduler: per-CPU core, lock-free runqueues
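A sketch of sharding with padding; the 64-byte cache line size, the 16 shards, and picking a shard by worker index are all assumptions for illustration:

package sharded

import "sync"

const cacheLine = 64 // assumed x86_64 cache line size

// shard keeps each counter (and its lock) away from its neighbours
// so updates to different shards don’t falsely share a cache line.
type shard struct {
    mu sync.Mutex
    n  int64
    _  [cacheLine]byte // padding; generous, purely illustrative
}

// Counter spreads updates across shards to reduce contention.
type Counter struct {
    shards [16]shard
}

func (c *Counter) Inc(worker int) {
    s := &c.shards[worker%len(c.shards)]
    s.mu.Lock()
    s.n++
    s.mu.Unlock()
}

func (c *Counter) Total() int64 {
    var total int64
    for i := range c.shards {
        c.shards[i].mu.Lock()
        total += c.shards[i].n
        c.shards[i].mu.Unlock()
    }
    return total
}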
strategy III: do less serial work
• move computation out of the critical section (see the sketch below):
  typically involves rearchitecting.
[chart: latency over time before and after shrinking the critical section;
 lock contention caused ~10x latency before the change]
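A sketch of the idea: do the expensive work outside the lock and hold it only for the shared-state update (render, results, and the keys are illustrative names):

package report

import "sync"

var (
    mu      sync.Mutex
    results = map[string]string{}
)

// Before: the expensive render() happens while holding mu.
func updateSlow(key string, data []byte) {
    mu.Lock()
    results[key] = render(data) // expensive work inside the critical section
    mu.Unlock()
}

// After: compute outside the lock, hold mu only for the map write.
func updateFast(key string, data []byte) {
    r := render(data) // expensive work, no lock held
    mu.Lock()
    results[key] = r // small critical section
    mu.Unlock()
}

func render(data []byte) string {
    // stand-in for an expensive computation
    return string(data)
}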
bonus strategy:
• contention-aware schedulers
example: Contention-aware scheduling in MySQL 8.0 Innodb
Special thanks to Eben Freeman, Justin Delegard, Austin Duffield for reading drafts of this.
@kavya719
speakerdeck.com/kavya719/lets-talk-locks
References
• Jeff Preshing’s excellent blog series
• Memory Barriers: A Hardware View for Software Hackers
• LWN.net on futexes
• The Go source code
• The Universal Scalability Law Manifesto, Neil Gunther