Beyond The Critical Section
Introduction Tony Albrecht Senior Programmer for Pandemic Studios Brisbane Email: Tony.Albrecht0(at)gmail(dot)com
Overview Justify myself Start at the bottom Continue from the top Quick look in the middle
Parallel Programming: Why? Moore’s Law Limits to sequential CPUs – parallel processing is how we avoid those limits. Programs  must  be parallel to get Moore level speedups. Applies to programming in general.
Moore’s Law
“Waaaah!” “Parallel programming is hard.” “My code already runs incredibly fast – it doesn’t need to go any faster.” “It’s impossible to parallelise this algorithm.” “Only the rendering pipeline needs to be parallel.” “That’s only for super computers.”
Console trends
So? ~2011 ~6TFlop machine Next console will have between 64 and 128 processors  4 to 8GB of memory 128 processors!!!!
How can we utilise 100+ CPUs? Start now Design Implement Iterate Learn
The Problems Race conditions
Race Condition Example x++ x++ x=0 x=? Thread A Thread B
Race Condition Example R1 = 0 x=0 Thread A Thread B
Race Condition Example R1 = 0+1 x=0 Thread A Thread B
Race Condition Example R1 = 1 R1 = 0 x=0 Thread A Thread B
Race Condition Example R1 = 1 R1 = 0+1 x=1 Thread A Thread B
Race Condition Example Solution requires atomics or locking. R1 = 1 R1 = 1 x=1 Thread A Thread B
Atomics Atomic operations are uninterruptable, singular operations Get/Set Inc/Dec (Add/Sub) Compare And Swap Plus other variations
Compare And Swap CAS(memory, oldValue, newValue) if(memory==oldValue) memory=newValue; Surprisingly useful. Simple locking primitive: while(CAS(&lock,0,1)!=0) ;
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1 x=2
Locking Used to serialise access to code. Like a key to a coffee shop toilet  one key,  one toilet,  queue for access. Lock()/Unlock() … Code… Lock(); // protected region Unlock(); ...more code…
Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
The Problems Race conditions Deadlocks
Deadlock “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” — Kansas Legislature Deadlock can occur when 2 or more processes require resource(s) held by another.
Deadlock Thread 1   Thread 2 Generally can be considered a logic error Can be painfully subtle and rare. Lock A Lock B Lock B Lock A Unlock A Unlock B
The Problems Race conditions Deadlocks Read/write tearing
Read/write tearing More than one thread writing to the same memory at the same time. The more data, the more likely. Solve with synchronisation primitives. “AAAAAAAA” “BBBBBBBB” “AAAABBBB”
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
Priority Inversion Consider  threads with different priorities Low priority thread holds a shared resource High priority thread tries to acquire that resource High priority thread is blocked by the low Medium priority threads will execute at the expense of the low  and  the high threads.
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS) Often associated with dynamically allocated memory
Consider a list and a thread pool… head a c b … ..
Thread A about to CAS head from a to b head a c b … .. CAS(&head->next,a,b);
Thread B: dequeue a & b head c … .. a b a & b are released into thread-local pools
Thread B enqueues a (node reused) head a c … .. b a is added back
Thread A executes CAS head a c … .. b CAS(&head->next,a,b);
Thread A executes CAS successfully! head a c … .. b CAS(&head->next,a,b);
ABA Solution Tag each pointer with a count Each time you use the ptr, inc the tag Must do it atomically
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem Thread scheduling problems
Convoy/Stampede Convoy Multiple threads restricted by a bottleneck. Stampede Multiple threads being started at once.
Higher Level Locking Primitives SpinLock Mutex Barrier RWlock Semaphore
SpinLock Loop until a value is set. No OS overhead with thread management Doesn’t sleep thread Handy if you will never wait for long. Very bad if you need to wait for a long time Can embed sleep() or Yield() But these can be perilous
Mutex Mutual Exclusion A simple lock/unlock primitive Otherwise known as a CriticalSection Used to serialise access to code. Often overused. More than just a spinlock  can release thread Be aware of overhead
Barrier Will block until ‘n’ threads signal it Useful for ensuring that all threads have finished a particular task.
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Done
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Signal
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Do other stuff
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Done Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Signal Done
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(1) Use results Do stuff More code Signal
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff Calc pi
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff
RWLock Allows many readers But exclusive writing Writing blocks writers and readers. Writing waits until all readers have finished.
Semaphore Generalisation of mutex Allows  ‘c’ threads access  to critical code at once. Basically an atomic integer Wait() will block if value == 0; then dec & cont. Signal() increments value (allows a waiting thread to unblock) Conceptually,  Mutexes stop other threads from running code Semaphores tell other threads to run code
Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
Problem Decomposition Problem From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event-Based Coordination From “Patterns for Parallel Programming”
Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
Divide and Conquer Task dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally it’s easy to take a sequential Divide and Conquer implementation and parallelise it.
Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of the problem, i.e. operate on all tree elements in parallel More work, but distributed across more cores
Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
Event-Based Coordination Data flow, recursive Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks which send tasks… Can be highly complex Tricky to load balance
Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
Master/Worker Dominant force is the need to dynamically load balance Tasks are highly variable, i.e. duration/cost Program structure doesn’t map onto loops Cores vary in performance. “Bag of Tasks” Master sets up tasks and waits for completion Workers grab a task from the queue, execute it and then grab the next one.
Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks Either Direct task/core mapping Thread pool
Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
Shared Data Required when At least one data structure is accessed by multiple tasks At least one task modifies the shared data The tasks potentially need to use the modified value. Solutions Serialise execution – mutual exclusion Noninterfering sets of operations RWlocks
Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
Shared Queue Extremely valuable construct  Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
Lock free programming Locks  Simple, easy to use and implement But serialise code execution Lock Free Tricky to implement and debug
Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
Adding a node to a list head a c tail b
Adding a node: Step 1 head a c tail b Find where to insert
Adding a node: Step 2 head a c tail b newNode->Next = prev->Next;
Adding a node: Step 3 head a c tail b prev->Next = newNode;
Extending to multiple threads What could go wrong?
Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
Add ‘b’ and ‘c’ concurrently head a d tail b c
Extending to multiple threads What could go wrong? Add another node between a & c a or c could be deleted A concurrent read could reach a dangling pointer. Any combination of the above If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions
Coarse Grained Locking Lock the list for each add or remove Also lock for reads (find, iterators) Will effectively serialise the list Only one thread at a time can access the list. Correctness at the expense of performance.
A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
Coarse Grain head a c tail b
Step 1: Lock list b head a c tail
Step 2 & 3:Find then Insert  b head a c tail
Step 4:Unlock head a c tail b
Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
Fine Grained Locking Add and Remove only affects neighbours Give each Node a lock (So, creating a node creates a mutex) Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
Fine Grained Locking head a c tail b
Fine Grained Locking a c tail b head
Fine Grained Locking c tail b head a
Fine Grained Locking head tail b a c
Fine Grained Locking head tail b a c
Fine Grained Locking head a c tail b
Fine Grained Locking Blocking is much longer – due to the overhead of creating a mutex per node Very slow > 1200ms A better solution would have been a pool of mutexes that could be reused
Optimistic Locking Search without locking Lock nodes once found, then validate them Valid if you can navigate to it from head. If invalid, search from head again.
Optimistic: Add(“g”) head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 2: Lock head a c d tail m g f k
Step 3: Validate head a c d tail m g f k
Step 3: Validate head a c d tail m g f k
Step 3: Validate - FAIL head a tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (success) head a e tail m g d f k
Step 4: Add head a e tail m g d f k
Step 5: Unlock head a e tail f k m g d
Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use intrusive lists (supply your own nodes) Find() requires validation (Locks).
Delete Caveat: Validate head a e tail m g d f k
Delete Caveat: Validate head a e tail m g d f k
Delete Caveat: delete ‘d’ head a e tail m g f k d
Delete Caveat: Validate head a e tail m g f k d
Delete Caveat: Validate head a e tail m g f k d
Delete Caveat: Valid! head a e tail m g f k d
Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
Lazy: Add(“g”) head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1b: Search (lock) head d tail f k m g a c
Step 1c: Search (mark) head d tail f k m g a c
Step 1d: lock (skip/unlock) head a c d tail m g f k
Step 3: Add/Validate head a d tail m g c f k
Step 4: Unlock head a d tail f k m g c
Lazy Synchronisation Still need to keep the deleted nodes. Faster than Optimistic Still serialises.
Lazy Synchronisation ~330ms
Lock free (Non-Blocking) Can’t we just modify Lazy Sync to use CAS?
Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
Introducing the AtomicMarkedPtr<> Wrapper on uint32 Encapsulates an atomic pointer and a flag Allows testing of a flag and updating of a pointer atomically. Use LSB for the flag AtomicMarkedPtr<Node> next; next->CompareAndSet(eValue, nValue, eFlag, nFlag);
AtomicMarkedPtr<> We can now use CAS to set a pointer  and  check a flag in a single atomic action. ie. check deleted status and change pointer at same time. class Node { public: Node(); AtomicMarkedPtr<Node> m_Next; T m_Data; int32 m_Key; };
Lock Free: Remove ‘d’ head a c d tail f k m Start loop:
Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
LockFree: InternalFind() Finds pred and curr Skips marked nodes. Consider the list at Step 2 in previous example and, lets introduce a second thread calling InternalFind();
Second InternalFind() head a c tail f k m pred curr succ pred curr succ d
If succ is marked… head a c tail f k m pred curr succ pred curr succ d
… Skip it head a c tail f k m pred curr succ pred curr succ d
Lock Free Synchronisation No blocking at all List is always in a consistent state. Faster threads help out slower ones.
Lock free Full thread usage ~60ms High thread coherency
Performance comparison
Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and N. Shavit http://www.top500.org/ Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores - http://www.greenteapress.com/semaphores/ My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/

More Related Content

PPT
Parallel Programming Primer
PDF
Parallel Computing - Lec 5
PDF
Introduction to OpenMP (Performance)
PDF
Introduction to OpenMP
PDF
interfacing matlab with embedded systems
PPSX
Task Parallel Library Data Flows
PDF
A Domain-Specific Embedded Language for Programming Parallel Architectures.
PDF
Serial comm matlab
Parallel Programming Primer
Parallel Computing - Lec 5
Introduction to OpenMP (Performance)
Introduction to OpenMP
interfacing matlab with embedded systems
Task Parallel Library Data Flows
A Domain-Specific Embedded Language for Programming Parallel Architectures.
Serial comm matlab

What's hot (20)

PDF
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
PDF
Concurrent Programming OpenMP @ Distributed System Discussion
PDF
Open mp directives
PDF
The Challenges facing Libraries and Imperative Languages from Massively Paral...
PPTX
Coding For Cores - C# Way
PPT
Introduction to data structures and Algorithm
PPTX
Intro to OpenMP
PDF
Matlab Serial Port
PDF
Open mp
PDF
Introduction to OpenMP
PDF
Parallel computation
PPT
OpenMP And C++
PDF
XML / JSON Data Exchange with PLC
DOCX
Class notes(week 5) on command line arguments
PPTX
Unit v memory &amp; programmable logic devices
PPT
Nbvtalkataitamimageprocessingconf
PPTX
KEY
OpenMP
PDF
Parallel Programming in .NET
PPTX
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
Concurrent Programming OpenMP @ Distributed System Discussion
Open mp directives
The Challenges facing Libraries and Imperative Languages from Massively Paral...
Coding For Cores - C# Way
Introduction to data structures and Algorithm
Intro to OpenMP
Matlab Serial Port
Open mp
Introduction to OpenMP
Parallel computation
OpenMP And C++
XML / JSON Data Exchange with PLC
Class notes(week 5) on command line arguments
Unit v memory &amp; programmable logic devices
Nbvtalkataitamimageprocessingconf
OpenMP
Parallel Programming in .NET
Ad

Similar to Parallel Programming: Beyond the Critical Section (20)

PPT
what every web and app developer should know about multithreading
PDF
Peyton jones-2011-parallel haskell-the_future
PDF
Simon Peyton Jones: Managing parallelism
PDF
A Survey of Concurrency Constructs
PDF
The Need for Async @ ScalaWorld
PPTX
Concurrency Constructs Overview
PDF
Need for Async: Hot pursuit for scalable applications
PPTX
Inferno Scalable Deep Learning on Spark
PPT
10 Multicore 07
PDF
Here comes the Loom - Ya!vaConf.pdf
PDF
Performance and Predictability - Richard Warburton
PDF
Performance and predictability (1)
PDF
Atmosphere Conference 2015: Need for Async: In pursuit of scalable internet-s...
PDF
Lock free algorithms
PPTX
Lessons learnt on a 2000-core cluster
PPT
Migration To Multi Core - Parallel Programming Models
PPTX
Data oriented design and c++
PPTX
Medical Image Processing Strategies for multi-core CPUs
PPT
Java Core | Modern Java Concurrency | Martijn Verburg & Ben Evans
PDF
Towards a Scalable Non-Blocking Coding Style
what every web and app developer should know about multithreading
Peyton jones-2011-parallel haskell-the_future
Simon Peyton Jones: Managing parallelism
A Survey of Concurrency Constructs
The Need for Async @ ScalaWorld
Concurrency Constructs Overview
Need for Async: Hot pursuit for scalable applications
Inferno Scalable Deep Learning on Spark
10 Multicore 07
Here comes the Loom - Ya!vaConf.pdf
Performance and Predictability - Richard Warburton
Performance and predictability (1)
Atmosphere Conference 2015: Need for Async: In pursuit of scalable internet-s...
Lock free algorithms
Lessons learnt on a 2000-core cluster
Migration To Multi Core - Parallel Programming Models
Data oriented design and c++
Medical Image Processing Strategies for multi-core CPUs
Java Core | Modern Java Concurrency | Martijn Verburg & Ben Evans
Towards a Scalable Non-Blocking Coding Style
Ad

Recently uploaded (20)

PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Machine Learning_overview_presentation.pptx
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Encapsulation theory and applications.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Big Data Technologies - Introduction.pptx
PPTX
A Presentation on Artificial Intelligence
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Encapsulation_ Review paper, used for researhc scholars
Chapter 3 Spatial Domain Image Processing.pdf
Machine Learning_overview_presentation.pptx
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Machine learning based COVID-19 study performance prediction
Encapsulation theory and applications.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Per capita expenditure prediction using model stacking based on satellite ima...
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Empathic Computing: Creating Shared Understanding
Big Data Technologies - Introduction.pptx
A Presentation on Artificial Intelligence
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
sap open course for s4hana steps from ECC to s4
Review of recent advances in non-invasive hemoglobin estimation
Spectral efficient network and resource selection model in 5G networks
Mobile App Security Testing_ A Comprehensive Guide.pdf
20250228 LYD VKU AI Blended-Learning.pptx

Parallel Programming: Beyond the Critical Section

  • 2. Introduction Tony Albrecht Senior Programmer for Pandemic Studios Brisbane Email: Tony.Albrecht0(at)gmail(dot)com
  • 3. Overview Justify myself Start at the bottom Continue from the top Quick look in the middle
  • 4. Parallel Programming: Why? Moore’s Law Limits to sequential CPUs – parallel processing is how we avoid those limits. Programs must be parallel to get Moore level speedups. Applies to programming in general.
  • 6. “ Waaaah!” “ Parallel programming is hard.” “ My code already runs incredibly fast – it doesn’t need to go any faster.” “ It’s impossible to parallelise this algorithm.” “ Only the rendering pipeline needs to be parallel.” “ that’s only for super computers.”
  • 8. So? ~2011 ~6TFlop machine Next console will have between 64 and 128 processors 4 to 8GB of memory 128 processors!!!!
  • 9. How can we utilise 100+ CPUS? Start now Design Implement Iterate Learn
  • 10. The Problems Race conditions
  • 11. Race Condition Example x++ x++ x=0 x=? Thread A Thread B
  • 12. Race Condition Example R1 = 0 x=0 Thread A Thread B
  • 13. Race Condition Example R1 = 0+1 x=0 Thread A Thread B
  • 14. Race Condition Example R1 = 1 R1 = 0 x=0 Thread A Thread B
  • 15. Race Condition Example R1 = 1 R1 = 0+1 x=1 Thread A Thread B
  • 16. Race Condition Example Solution requires atomics or locking. R1 = 1 R1 = 1 x=1 Thread A Thread B
  • 17. Atomics Atomic operations are uninterruptable, singular operations Get/Set Inc/Dec (Add/Sub) Compare And Swap Plus other variations
  • 18. Compare And Swap CAS(memory, oldValue, newValue) If(memory==oldValue) memory=newValue; Surprisingly useful. Simple locking primitive while(CAS(&lock,0,1)!=0) ;
  • 19. Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B
  • 20. Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1
  • 21. Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1 x=2
  • 22. Locking Used to serialise access to code. Like a key to a coffee shop toilet one key, one toilet, queue for access. Lock()/Unlock() … Code… Lock(); // protected region Unlock(); ...more code…
  • 23. Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 24. Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 25. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 26. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 27. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 28. Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 29. Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 30. The Problems Race conditions Deadlocks
  • 31. Deadlock “ When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone. ”   — Kansas Legislature Deadlock can occur when 2 or more processes require resource(s) from another.
  • 32. Deadlock Thread 1 Thread 2 Generally can be considered to be a logic error Can be painfully subtle and rare. Lock A Lock B Lock B Lock A Unl0ck A Unlock B
  • 33. The Problems Race conditions Deadlocks Read/write tearing
  • 34. Read/write tearing More that one thread writing to the same memory at the same time. The more data, the more likely Solve with synchronisation primitives. “ AAAAAAAA” “ BBBBBBBB” “ AAAABBBB”
  • 35. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
  • 36. Priority Inversion Consider threads with different priorities Low priority thread holds a shared resource High priority thread tries to acquire that resource High priority thread is blocked by the low Medium priority threads will execute at the expense of the low and the high threads.
  • 37. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
  • 38. The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS) Often associated with dynamically allocated memory
  • 39. Consider a list and a thread pool… head a c b … ..
  • 40. Thread A about to CAS head from a to b head a c b … .. CAS(&head->next,a,b);
  • 41. Threads B: deq a & b head c … .. a b A & B are released into thread local pools
  • 42. Thread B enq A - reused head a c … .. b A is added back
  • 43. Thread A executes CAS head a c … .. b CAS(&head->next,a,b);
  • 44. Thread A executes CAS successfully! head a c … .. b CAS(&head->next,a,b);
  • 45. ABA Solution Tag each pointer with a count Each time you use the ptr, inc the tag Must do it atomically
  • 46. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem Thread scheduling problems
  • 47. Convoy/Stampede Convoy Multiple threads restricted by a bottleneck. Stampede Multiple threads being started at once.
  • 48. Higher Level Locking Primitives SpinLock Mutex Barrier RWlock Semaphore
  • 49. SpinLock Loop until a value is set. No OS overhead with thread management Doesn’t sleep thread Handy if you will never wait for long. Very bad if you need to wait for a long time Can embed sleep() or Yield() But these can be perilous
  • 50. Mutex Mutual Exclusion A simple lock/unlock primitive Otherwise known as a CriticalSection Used to serialise access to code. Often overused. More than just a spinlock can release thread Be aware of overhead
  • 51. Barrier Will block until ‘n’ threads signal it Useful for ensuring that all threads have finished a particular task.
  • 52. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff
  • 53. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Calculating
  • 54. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Done
  • 55. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Signal
  • 56. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Do other stuff
  • 57. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Calculating
  • 58. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Done Calculating
  • 59. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Signal Done
  • 60. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(1) Use results Do stuff More code Signal
  • 61. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff Calc pi
  • 62. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff
  • 63. RWLock Allows many readers But exclusive writing Writing blocks writers and readers. Writing waits until all readers have finished.
  • 64. Semaphore Generalisation of mutex Allows ‘c’ threads access to critical code at once. Basically an atomic integer Wait() will block if value == 0; then dec & cont. Signal() increments value (allows a waiting thread to unblock) Conceptually, Mutexes stop other threads from running code Semaphores tell other threads to run code
  • 65. Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
  • 66. So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
  • 67. Problem Decomposition Problem From “Patterns for Parallel Programming”
  • 68. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow
  • 69. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive
  • 70. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event-Based Coordination
  • 71. Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
  • 72. Divide and Conquer Task dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally it's easy to take a sequential Divide and Conquer implementation and parallelise it.
  • 73. Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
  • 74. Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of problem i.e. operate on all tree elements in parallel More work, but distributed across more cores
  • 75. Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
  • 76. Event-Based Coordination Data flow, recursive Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks which send tasks… Can be highly complex Tricky to load balance
  • 77. Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 78. Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 79. SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
  • 80. Master/Worker Dominant force is the need to dynamically load balance Tasks are highly variable i.e. duration/cost Program structure doesn’t map onto loops Cores vary in performance. “Bag of Tasks” Master sets up tasks and waits for completion Workers grab a task from the queue, execute it and then grab the next one.
  • 81. Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
  • 82. Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks Either Direct task/core mapping Thread pool
  • 83. Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 84. Shared Data Required when At least one data structure is accessed by multiple tasks At least one task modifies the shared data The tasks potentially need to use the modified value. Solutions Serialise execution – mutual exclusion Noninterfering sets of operations RWlocks
  • 85. Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
  • 86. Shared Queue Extremely valuable construct Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
  • 87. Lock free programming Locks Simple, easy to use and implement But serialise code execution Lock Free Tricky to implement and debug
  • 88. Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
  • 89. Adding a node to a list head a c tail b
  • 90. Adding a node: Step 1 head a c tail b Find where to insert
  • 91. Adding a node: Step 2 head a c tail b newNode->Next = prev->Next;
  • 92. Adding a node: Step 3 head a c tail b prev->Next = newNode;
  • 93. Extending to multiple threads What could go wrong?
  • 94. Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
  • 95. Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
  • 96. Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
  • 97. Add ‘b’ and ‘c’ concurrently head a d tail b c
  • 98. Extending to multiple threads What could go wrong? Add another node between ‘a’ and ‘c’ ‘a’ or ‘c’ could be deleted A concurrent read could reach a dangling pointer Any combination of the above If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions
  • 99. Coarse Grained Locking Lock the list for each add or remove Also lock for reads (find, iterators) Will effectively serialise the list Only one thread at a time can access the list. Correctness at the expense of performance .
  • 100. A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
  • 101. Coarse Grain head a c tail b
  • 102. Step 1: Lock list b head a c tail
  • 103. Step 2 & 3:Find then Insert b head a c tail
  • 104. Step 4:Unlock head a c tail b
  • 105. Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
  • 106. Fine Grained Locking Add and Remove only affects neighbours Give each Node a lock (So, creating a node creates a mutex) Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
  • 107. Fine Grained Locking head a c tail b
  • 108. Fine Grained Locking a c tail b head
  • 109. Fine Grained Locking c tail b head a
  • 110. Fine Grained Locking head tail b a c
  • 111. Fine Grained Locking head tail b a c
  • 112. Fine Grained Locking head a c tail b
  • 113. Fine Grained Locking Blocking is much longer – due to the overhead of creating a mutex per node Very slow > 1200ms A better solution would have been a pool of mutexes that could be reused
  • 114. Optimistic Locking Search without locking Lock nodes once found, then validate them Valid if you can navigate to it from head. If invalid, search from head again.
  • 115. Optimistic: Add(“g”) head a c d tail f k m g
  • 116. Step 1: Search head a c d tail f k m g
  • 117. Step 1: Search head a c d tail f k m g
  • 118. Step 1: Search head a c d tail f k m g
  • 119. Step 1: Search head a c d tail f k m g
  • 120. Step 1: Search head a c d tail f k m g
  • 121. Step 1: Search head a c d tail f k m g
  • 122. Step 2: Lock head a c d tail m g f k
  • 123. Step 3: Validate head a c d tail m g f k
  • 124. Step 3: Validate head a c d tail m g f k
  • 125. Step 3: Validate - FAIL head a tail m g d f k
  • 126. Step 3a: Validate (retry) head a e tail m g d f k
  • 127. Step 3a: Validate (retry) head a e tail m g d f k
  • 128. Step 3a: Validate (retry) head a e tail m g d f k
  • 129. Step 3a: Validate (retry) head a e tail m g d f k
  • 130. Step 3a: Validate (success) head a e tail m g d f k
  • 131. Step 4: Add head a e tail m g d f k
  • 132. Step 5: Unlock head a e tail f k m g d
  • 133. Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use invasive lists (supply own nodes) Find() requires validation (Locks).
  • 134. Delete Caveat: Validate head a e tail m g d f k
  • 135. Delete Caveat: Validate head a e tail m g d f k
  • 136. Delete Caveat: delete ‘d’ head a e tail m g f k d
  • 137. Delete Caveat: Validate head a e tail m g f k d
  • 138. Delete Caveat: Validate head a e tail m g f k d
  • 139. Delete Caveat: Valid! head a e tail m g f k d
  • 140. Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
  • 141. Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
  • 142. Lazy: Add(“g”) head a c d tail f k m g
  • 143. Step 1: Search head a c d tail f k m g
  • 144. Step 1: Search head a c d tail f k m g
  • 145. Step 1: Search head a c d tail f k m g
  • 146. Step 1a: Search (delete c) head a c d tail f k m g
  • 147. Step 1a: Search (delete c) head a c d tail f k m g
  • 148. Step 1a: Search (delete c) head a c d tail f k m g
  • 149. Step 1a: Search (delete c) head a c d tail f k m g
  • 150. Step 1b: Search (lock) head d tail f k m g a c
  • 151. Step 1c: Search (mark) head d tail f k m g a c
  • 152. Step 2d: lock (skip/unlock) head a c d tail m g f k
  • 153. Step 3: Add/Validate head a d tail m g c f k
  • 154. Step 4: Unlock head a d tail f k m g c
  • 155. Lazy Synchronisation Still need to keep the deleted nodes. Faster than Optimistic Still serialises.
  • 157. Lock free (Non-Blocking) Can’t we just modify Lazy Sync to use CAS?
  • 158. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
  • 159. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
  • 160. Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next; | prev->next=b;
  • 161. Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
  • 162. Introducing the AtomicMarkedPtr<> Wrapper on uint32 Encapsulates an atomic pointer and a flag Allows testing of a flag and updating of a pointer atomically. Use LSB for the flag AtomicMarkedPtr<Node> next; next->CompareAndSet(eValue, nValue, eFlag, nFlag);
  • 163. AtomicMarkedPtr<> We can now use CAS to set a pointer and check a flag in a single atomic action. ie. check deleted status and change pointer at same time. class Node { public: Node(); AtomicMarkedPtr<Node> m_Next; T m_Data; int32 m_Key; };
  • 164. Lock Free: Remove ‘d’ head a c d tail f k m Start loop:
  • 165. Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
  • 166. Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
  • 167. Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
  • 168. LockFree: InternalFind() Finds pred and curr Skips marked nodes. Consider the list at Step 2 in the previous example, and let’s introduce a second thread calling InternalFind();
  • 169. Second InternalFind() head a c tail f k m pred curr succ pred curr succ d
  • 170. If succ is marked… head a c tail f k m pred curr succ pred curr succ d
  • 171. … Skip it head a c tail f k m pred curr succ pred curr succ d
  • 172. Lock Free Synchronisation No blocking at all List is always in a consistent state. Faster threads help out slower ones.
  • 173. Lock free Full thread usage ~60ms High thread coherency
  • 175. Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
  • 176. Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
  • 177. References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and N. Shavit http://www.top500.org/ Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores - http://www.greenteapress.com/semaphores/ My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/