Beyond The Critical Section
Introduction Tony Albrecht Senior Programmer for Pandemic Studios Brisbane Email: Tony.Albrecht0(at)gmail(dot)com
Overview Justify myself Start at the bottom Continue from the top Quick look in the middle
Parallel Programming: Why? Moore’s Law Limits to sequential CPUs – parallel processing is how we avoid those limits. Programs  must  be parallel to get Moore level speedups. Applies to programming in general.
Moore’s Law
“Waaaah!” “Parallel programming is hard.” “My code already runs incredibly fast – it doesn’t need to go any faster.” “It’s impossible to parallelise this algorithm.” “Only the rendering pipeline needs to be parallel.” “That’s only for super computers.”
Console trends
So? ~2011 ~6TFlop machine Next console will have between 64 and 128 processors  4 to 8GB of memory 128 processors!!!!
How can we utilise 100+ CPUs? Start now Design Implement Iterate Learn
The Problems Race conditions
Race Condition Example x++ x++ x=0 x=? Thread A Thread B
Race Condition Example R1 = 0 x=0 Thread A Thread B
Race Condition Example R1 = 0+1 x=0 Thread A Thread B
Race Condition Example R1 = 1 R1 = 0 x=0 Thread A Thread B
Race Condition Example R1 = 1 R1 = 0+1 x=1 Thread A Thread B
Race Condition Example Solution requires atomics or locking. R1 = 1 R1 = 1 x=1 Thread A Thread B
Atomics Atomic operations are uninterruptable, singular operations Get/Set Inc/Dec (Add/Sub) Compare And Swap Plus other variations
Compare And Swap CAS(memory, oldValue, newValue) if(memory==oldValue) memory=newValue; Surprisingly useful. Simple locking primitive: while(CAS(&lock,0,1)!=0) ;
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1 x=2
Locking Used to serialise access to code. Like a key to a coffee shop toilet  one key,  one toilet,  queue for access. Lock()/Unlock() … Code… Lock(); // protected region Unlock(); ...more code…
Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
The Problems Race conditions Deadlocks
Deadlock “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” — Kansas Legislature Deadlock can occur when 2 or more processes require resource(s) held by another.
Deadlock Thread 1   Thread 2 Generally can be considered a logic error Can be painfully subtle and rare. Lock A Lock B Lock B Lock A Unlock A Unlock B
The Problems Race conditions Deadlocks Read/write tearing
Read/write tearing More than one thread writing to the same memory at the same time. The more data, the more likely. Solve with synchronisation primitives. “AAAAAAAA” “BBBBBBBB” “AAAABBBB”
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
Priority Inversion Consider  threads with different priorities Low priority thread holds a shared resource High priority thread tries to acquire that resource High priority thread is blocked by the low Medium priority threads will execute at the expense of the low  and  the high threads.
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS) Often associated with dynamically allocated memory
Consider a list and a thread pool… head a c b … ..
Thread A about to CAS head from a to b head a c b … .. CAS(&head->next,a,b);
Thread B: dequeue a & b head c … .. a b a & b are released into thread-local pools
Thread B enqueues a (node reused) head a c … .. b a is added back
Thread A executes CAS head a c … .. b CAS(&head->next,a,b);
Thread A executes CAS successfully! head a c … .. b CAS(&head->next,a,b);
ABA Solution Tag each pointer with a count Each time you use the ptr, inc the tag Must do it atomically
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem Thread scheduling problems
Convoy/Stampede Convoy Multiple threads restricted by a bottleneck. Stampede Multiple threads being started at once.
Higher Level Locking Primitives SpinLock Mutex Barrier RWlock Semaphore
SpinLock Loop until a value is set. No OS overhead with thread management Doesn’t sleep thread Handy if you will never wait for long. Very bad if you need to wait for a long time Can embed sleep() or Yield() But these can be perilous
Mutex Mutual Exclusion A simple lock/unlock primitive Otherwise known as a CriticalSection Used to serialise access to code. Often overused. More than just a spinlock  can release thread Be aware of overhead
Barrier Will block until ‘n’ threads signal it Useful for ensuring that all threads have finished a particular task.
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Done
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Signal
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Do other stuff
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Done Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Signal Done
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(1) Use results Do stuff More code Signal
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff Calc pi
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff
RWLock Allows many readers But exclusive writing Writing blocks writers and readers. Writing waits until all readers have finished.
Semaphore Generalisation of mutex Allows  ‘c’ threads access  to critical code at once. Basically an atomic integer Wait() will block if value == 0; then dec & cont. Signal() increments value (allows a waiting thread to unblock) Conceptually,  Mutexes stop other threads from running code Semaphores tell other threads to run code
Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
Problem Decomposition Problem From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event-Based Coordination From “Patterns for Parallel Programming”
Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
Divide and Conquer Task dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally it’s easy to take a sequential Divide and Conquer implementation and parallelise it.
Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of the problem, i.e. operate on all tree elements in parallel More work, but distributed across more cores
Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
Event-Based Coordination Data flow, recursive Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks which send tasks… Can be highly complex Tricky to load balance
Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
Master/Worker Dominant force is the need to dynamically load balance Tasks are highly variable, i.e. duration/cost Program structure doesn’t map onto loops Cores vary in performance. “Bag of Tasks” Master sets up tasks and waits for completion Workers grab a task from the queue, execute it and then grab the next one.
Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks Either Direct task/core mapping Thread pool
Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
Shared Data Required when At least one data structure is accessed by multiple tasks At least one task modifies the shared data The tasks potentially need to use the modified value. Solutions Serialise execution – mutual exclusion Noninterfering sets of operations RWlocks
Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
Shared Queue Extremely valuable construct  Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
Lock free programming Locks  Simple, easy to use and implement But serialise code execution Lock Free Tricky to implement and debug
Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
Adding a node to a list head a c tail b
Adding a node: Step 1 head a c tail b Find where to insert
Adding a node: Step 2 head a c tail b newNode->Next = prev->Next;
Adding a node: Step 3 head a c tail b prev->Next = newNode;
Extending to multiple threads What could go wrong?
Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
Add ‘b’ and ‘c’ concurrently head a d tail b c
Extending to multiple threads What could go wrong? Add another node between a & c a or c could be deleted A concurrent read could reach a dangling pointer. Any combination of the above If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions
Coarse Grained Locking Lock the list for each add or remove Also lock for reads (find, iterators) Will effectively serialise the list Only one thread at a time can access the list. Correctness at the expense of performance.
A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
Coarse Grain head a c tail b
Step 1: Lock list b head a c tail
Step 2 & 3:Find then Insert  b head a c tail
Step 4:Unlock head a c tail b
Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
Fine Grained Locking Add and Remove only affects neighbours Give each Node a lock (So, creating a node creates a mutex) Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
Fine Grained Locking head a c tail b
Fine Grained Locking a c tail b head
Fine Grained Locking c tail b head a
Fine Grained Locking head tail b a c
Fine Grained Locking head tail b a c
Fine Grained Locking head a c tail b
Fine Grained Locking Blocking is much longer – due to the overhead of creating a mutex per node Very slow > 1200ms A better solution would have been a pool of mutexes that could be reused
Optimistic Locking Search without locking Lock nodes once found, then validate them Valid if you can navigate to it from head. If invalid, search from head again.
Optimistic: Add(“g”) head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 2: Lock head a c d tail m g f k
Step 3: Validate head a c d tail m g f k
Step 3: Validate head a c d tail m g f k
Step 3: Validate - FAIL head a tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (success) head a e tail m g d f k
Step 4: Add head a e tail m g d f k
Step 5: Unlock head a e tail f k m g d
Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use intrusive lists (supply your own nodes) Find() requires validation (Locks).
Delete Caveat: Validate head a e tail m g d f k
Delete Caveat: Validate head a e tail m g d f k
Delete Caveat: delete ‘d’ head a e tail m g f k d
Delete Caveat: Validate head a e tail m g f k d
Delete Caveat: Validate head a e tail m g f k d
Delete Caveat: Valid! head a e tail m g f k d
Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
Lazy: Add(“g”) head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1b: Search (lock) head d tail f k m g a c
Step 1c: Search (mark) head d tail f k m g a c
Step 1d: lock (skip/unlock) head a c d tail m g f k
Step 3: Add/Validate head a d tail m g c f k
Step 4: Unlock head a d tail f k m g c
Lazy Synchronisation Still need to keep the deleted nodes. Faster than Optimistic Still serialises.
Lazy Synchronisation ~330ms
Lock free (Non-Blocking) Can’t we just modify Lazy Sync to use CAS?
Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
Introducing the AtomicMarkedPtr<> Wrapper on uint32 Encapsulates an atomic pointer and a flag Allows testing of a flag and updating of a pointer atomically. Use LSB for the flag AtomicMarkedPtr<Node> next; next->CompareAndSet(eValue, nValue, eFlag, nFlag);
AtomicMarkedPtr<> We can now use CAS to set a pointer  and  check a flag in a single atomic action. ie. check deleted status and change pointer at same time. class Node { public: Node(); AtomicMarkedPtr<Node> m_Next; T m_Data; int32 m_Key; };
Lock Free: Remove ‘d’ head a c d tail f k m Start loop:
Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
LockFree: InternalFind() Finds pred and curr Skips marked nodes. Consider the list at Step 2 in previous example and, lets introduce a second thread calling InternalFind();
Second InternalFind() head a c tail f k m pred curr succ pred curr succ d
If succ is marked… head a c tail f k m pred curr succ pred curr succ d
… Skip it head a c tail f k m pred curr succ pred curr succ d
Lock Free Synchronisation No blocking at all List is always in a consistent state. Faster threads help out slower ones.
Lock free Full thread usage ~60ms High thread coherency
Performance comparison
Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and N. Shavit http://www.top500.org/ Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores - http://www.greenteapress.com/semaphores/ My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/

More Related Content

PPT
Parallel Programming Primer
PDF
Parallel Computing - Lec 5
PDF
Introduction to OpenMP (Performance)
PDF
Introduction to OpenMP
PDF
interfacing matlab with embedded systems
PPSX
Task Parallel Library Data Flows
PDF
A Domain-Specific Embedded Language for Programming Parallel Architectures.
PDF
Serial comm matlab
Parallel Programming Primer
Parallel Computing - Lec 5
Introduction to OpenMP (Performance)
Introduction to OpenMP
interfacing matlab with embedded systems
Task Parallel Library Data Flows
A Domain-Specific Embedded Language for Programming Parallel Architectures.
Serial comm matlab

What's hot (20)

PDF
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
PDF
Concurrent Programming OpenMP @ Distributed System Discussion
PDF
Open mp directives
PDF
The Challenges facing Libraries and Imperative Languages from Massively Paral...
PPTX
Coding For Cores - C# Way
PPT
Introduction to data structures and Algorithm
PPTX
Intro to OpenMP
PDF
Matlab Serial Port
PDF
Open mp
PDF
Introduction to OpenMP
PDF
Parallel computation
PPT
OpenMP And C++
PDF
XML / JSON Data Exchange with PLC
DOCX
Class notes(week 5) on command line arguments
PPTX
Unit v memory &amp; programmable logic devices
PPT
Nbvtalkataitamimageprocessingconf
PPTX
KEY
OpenMP
PDF
Parallel Programming in .NET
PPTX
C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see ho...
Concurrent Programming OpenMP @ Distributed System Discussion
Open mp directives
The Challenges facing Libraries and Imperative Languages from Massively Paral...
Coding For Cores - C# Way
Introduction to data structures and Algorithm
Intro to OpenMP
Matlab Serial Port
Open mp
Introduction to OpenMP
Parallel computation
OpenMP And C++
XML / JSON Data Exchange with PLC
Class notes(week 5) on command line arguments
Unit v memory &amp; programmable logic devices
Nbvtalkataitamimageprocessingconf
OpenMP
Parallel Programming in .NET
Ad

Similar to Parallel Programming: Beyond the Critical Section (20)

PPT
what every web and app developer should know about multithreading
PDF
Peyton jones-2011-parallel haskell-the_future
PDF
Simon Peyton Jones: Managing parallelism
PDF
A Survey of Concurrency Constructs
PDF
The Need for Async @ ScalaWorld
PPTX
Concurrency Constructs Overview
PDF
Need for Async: Hot pursuit for scalable applications
PPTX
Inferno Scalable Deep Learning on Spark
PPT
10 Multicore 07
PDF
Here comes the Loom - Ya!vaConf.pdf
PDF
Performance and Predictability - Richard Warburton
PDF
Performance and predictability (1)
PDF
Atmosphere Conference 2015: Need for Async: In pursuit of scalable internet-s...
PDF
Lock free algorithms
PPTX
Lessons learnt on a 2000-core cluster
PPT
Migration To Multi Core - Parallel Programming Models
PPTX
Data oriented design and c++
PPTX
Medical Image Processing Strategies for multi-core CPUs
PPT
Java Core | Modern Java Concurrency | Martijn Verburg & Ben Evans
PDF
Towards a Scalable Non-Blocking Coding Style
what every web and app developer should know about multithreading
Peyton jones-2011-parallel haskell-the_future
Simon Peyton Jones: Managing parallelism
A Survey of Concurrency Constructs
The Need for Async @ ScalaWorld
Concurrency Constructs Overview
Need for Async: Hot pursuit for scalable applications
Inferno Scalable Deep Learning on Spark
10 Multicore 07
Here comes the Loom - Ya!vaConf.pdf
Performance and Predictability - Richard Warburton
Performance and predictability (1)
Atmosphere Conference 2015: Need for Async: In pursuit of scalable internet-s...
Lock free algorithms
Lessons learnt on a 2000-core cluster
Migration To Multi Core - Parallel Programming Models
Data oriented design and c++
Medical Image Processing Strategies for multi-core CPUs
Java Core | Modern Java Concurrency | Martijn Verburg & Ben Evans
Towards a Scalable Non-Blocking Coding Style
Ad

Recently uploaded (20)

PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Machine Learning_overview_presentation.pptx
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Encapsulation theory and applications.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Big Data Technologies - Introduction.pptx
PPTX
A Presentation on Artificial Intelligence
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Encapsulation_ Review paper, used for researhc scholars
Chapter 3 Spatial Domain Image Processing.pdf
Machine Learning_overview_presentation.pptx
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Machine learning based COVID-19 study performance prediction
Encapsulation theory and applications.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Per capita expenditure prediction using model stacking based on satellite ima...
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Empathic Computing: Creating Shared Understanding
Big Data Technologies - Introduction.pptx
A Presentation on Artificial Intelligence
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
sap open course for s4hana steps from ECC to s4
Review of recent advances in non-invasive hemoglobin estimation
Spectral efficient network and resource selection model in 5G networks
Mobile App Security Testing_ A Comprehensive Guide.pdf
20250228 LYD VKU AI Blended-Learning.pptx

Parallel Programming: Beyond the Critical Section

  • 2. Introduction Tony Albrecht Senior Programmer for Pandemic Studios Brisbane Email: Tony.Albrecht0(at)gmail(dot)com
  • 3. Overview Justify myself Start at the bottom Continue from the top Quick look in the middle
  • 4. Parallel Programming: Why? Moore’s Law Limits to sequential CPUs – parallel processing is how we avoid those limits. Programs must be parallel to get Moore level speedups. Applies to programming in general.
  • 6. “ Waaaah!” “ Parallel programming is hard.” “ My code already runs incredibly fast – it doesn’t need to go any faster.” “ It’s impossible to parallelise this algorithm.” “ Only the rendering pipeline needs to be parallel.” “ that’s only for super computers.”
  • 8. So? ~2011 ~6TFlop machine Next console will have between 64 and 128 processors 4 to 8GB of memory 128 processors!!!!
  • 9. How can we utilise 100+ CPUS? Start now Design Implement Iterate Learn
  • 10. The Problems Race conditions
  • 11. Race Condition Example x++ x++ x=0 x=? Thread A Thread B
  • 12. Race Condition Example R1 = 0 x=0 Thread A Thread B
  • 13. Race Condition Example R1 = 0+1 x=0 Thread A Thread B
  • 14. Race Condition Example R1 = 1 R1 = 0 x=0 Thread A Thread B
  • 15. Race Condition Example R1 = 1 R1 = 0+1 x=1 Thread A Thread B
  • 16. Race Condition Example Solution requires atomics or locking. R1 = 1 R1 = 1 x=1 Thread A Thread B
  • 17. Atomics Atomic operations are uninterruptable, singular operations Get/Set Inc/Dec (Add/Sub) Compare And Swap Plus other variations
  • 18. Compare And Swap CAS(memory, oldValue, newValue) If(memory==oldValue) memory=newValue; Surprisingly useful. Simple locking primitive while(CAS(&lock,0,1)!=0) ;
  • 19. Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B
  • 20. Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1
  • 21. Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1 x=2
  • 22. Locking Used to serialise access to code. Like a key to a coffee shop toilet one key, one toilet, queue for access. Lock()/Unlock() … Code… Lock(); // protected region Unlock(); ...more code…
  • 23. Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 24. Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 25. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 26. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 27. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 28. Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 29. Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unl0ck A Lock A x++ Unl0ck A
  • 30. The Problems Race conditions Deadlocks
  • 31. Deadlock “ When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone. ”   — Kansas Legislature Deadlock can occur when 2 or more processes require resource(s) from another.
  • 32. Deadlock Thread 1 Thread 2 Generally can be considered to be a logic error Can be painfully subtle and rare. Lock A Lock B Lock B Lock A Unl0ck A Unlock B
  • 33. The Problems Race conditions Deadlocks Read/write tearing
  • 34. Read/write tearing More that one thread writing to the same memory at the same time. The more data, the more likely Solve with synchronisation primitives. “ AAAAAAAA” “ BBBBBBBB” “ AAAABBBB”
  • 35. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
  • 36. Priority Inversion Consider threads with different priorities Low priority thread holds a shared resource High priority thread tries to acquire that resource High priority thread is blocked by the low Medium priority threads will execute at the expense of the low and the high threads.
  • 37. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
  • 38. The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS) Often associated with dynamically allocated memory
  • 39. Consider a list and a thread pool… head a c b … ..
  • 40. Thread A about to CAS head from a to b head a c b … .. CAS(&head->next,a,b);
  • 41. Threads B: deq a & b head c … .. a b A & B are released into thread local pools
  • 42. Thread B enq A - reused head a c … .. b A is added back
  • 43. Thread A executes CAS head a c … .. b CAS(&head->next,a,b);
  • 44. Thread A executes CAS successfully! head a c … .. b CAS(&head->next,a,b);
  • 45. ABA Solution Tag each pointer with a count Each time you use the ptr, inc the tag Must do it atomically
  • 46. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem Thread scheduling problems
  • 47. Convoy/Stampede Convoy Multiple threads restricted by a bottleneck. Stampede Multiple threads being started at once.
  • 48. Higher Level Locking Primitives SpinLock Mutex Barrier RWlock Semaphore
  • 49. SpinLock Loop until a value is set. No OS overhead with thread management Doesn’t sleep thread Handy if you will never wait for long. Very bad if you need to wait for a long time Can embed sleep() or Yield() But these can be perilous
  • 50. Mutex Mutual Exclusion A simple lock/unlock primitive Otherwise known as a CriticalSection Used to serialise access to code. Often overused. More than just a spinlock can release thread Be aware of overhead
  • 51. Barrier Will block until ‘n’ threads signal it Useful for ensuring that all threads have finished a particular task.
  • 52. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff
  • 53. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Calculating
  • 54. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Done
  • 55. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Signal
  • 56. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Do other stuff
  • 57. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Calculating
  • 58. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Done Calculating
  • 59. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Signal Done
  • 60. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(1) Use results Do stuff More code Signal
  • 61. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff Calc pi
  • 62. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff
  • 63. RWLock Allows many readers But exclusive writing Writing blocks writers and readers. Writing waits until all readers have finished.
  • 64. Semaphore Generalisation of mutex Allows ‘c’ threads access to critical code at once. Basically an atomic integer Wait() will block if value == 0; then dec & cont. Signal() increments value (allows a waiting thread to unblock) Conceptually, Mutexes stop other threads from running code Semaphores tell other threads to run code
  • 65. Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
  • 66. So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
  • 67. Problem Decomposition Problem From “Patterns for Parallel Programming”
  • 68. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow
  • 69. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive
  • 70. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event-Based Coordination
  • 71. Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
  • 72. Divide and Conquer Task dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally it's easy to take a sequential Divide and Conquer implementation and parallelise it.
  • 73. Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
  • 74. Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of problem i.e. operate on all tree elements in parallel More work, but distributed across more cores
  • 75. Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
  • 76. Event-Based Coordination Data flow, recursive Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks which send tasks… Can be highly complex Tricky to load balance
  • 77. Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 78. Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 79. SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
  • 80. Master/Worker Dominant force is the need to dynamically load balance Tasks are highly variable i.e. duration/cost Program structure doesn’t map onto loops Cores vary in performance. “Bag of Tasks” Master sets up tasks and waits for completion Workers grab a task from the queue, execute it and then grab the next one.
  • 81. Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
  • 82. Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks Either Direct task/core mapping Thread pool
  • 83. Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 84. Shared Data Required when At least one data structure is accessed by multiple tasks At least one task modifies the shared data The tasks potentially need to use the modified value. Solutions Serialise execution – mutual exclusion Noninterfering sets of operations RWlocks
  • 85. Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
  • 86. Shared Queue Extremely valuable construct Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
  • 87. Lock free programming Locks Simple, easy to use and implement But serialise code execution Lock Free Tricky to implement and debug
  • 88. Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
  • 89. Adding a node to a list head a c tail b
  • 90. Adding a node: Step 1 head a c tail b Find where to insert
  • 91. Adding a node: Step 2 head a c tail b newNode->Next = prev->Next;
  • 92. Adding a node: Step 3 head a c tail b prev->Next = newNode;
  • 93. Extending to multiple threads What could go wrong?
  • 94. Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
  • 95. Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
  • 96. Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
  • 97. Add ‘b’ and ‘c’ concurrently head a d tail b c
  • 98. Extending to multiple threads What could go wrong? Add another node between ‘a’ and ‘c’ ‘a’ or ‘c’ could be deleted A concurrent read could reach a dangling pointer Any combination of the above If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions
  • 99. Coarse Grained Locking Lock the list for each add or remove Also lock for reads (find, iterators) Will effectively serialise the list Only one thread at a time can access the list. Correctness at the expense of performance .
  • 100. A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
  • 101. Coarse Grain head a c tail b
  • 102. Step 1: Lock list b head a c tail
  • 103. Step 2 & 3:Find then Insert b head a c tail
  • 104. Step 4:Unlock head a c tail b
  • 105. Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
  • 106. Fine Grained Locking Add and Remove only affects neighbours Give each Node a lock (So, creating a node creates a mutex) Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
  • 107. Fine Grained Locking head a c tail b
  • 108. Fine Grained Locking a c tail b head
  • 109. Fine Grained Locking c tail b head a
  • 110. Fine Grained Locking head tail b a c
  • 111. Fine Grained Locking head tail b a c
  • 112. Fine Grained Locking head a c tail b
  • 113. Fine Grained Locking Blocking is much longer – due to the overhead of creating a mutex per node Very slow > 1200ms A better solution would have been a pool of mutexes that could be reused
  • 114. Optimistic Locking Search without locking Lock nodes once found, then validate them Valid if you can navigate to it from head. If invalid, search from head again.
  • 115. Optimistic: Add(“g”) head a c d tail f k m g
  • 116. Step 1: Search head a c d tail f k m g
  • 117. Step 1: Search head a c d tail f k m g
  • 118. Step 1: Search head a c d tail f k m g
  • 119. Step 1: Search head a c d tail f k m g
  • 120. Step 1: Search head a c d tail f k m g
  • 121. Step 1: Search head a c d tail f k m g
  • 122. Step 2: Lock head a c d tail m g f k
  • 123. Step 3: Validate head a c d tail m g f k
  • 124. Step 3: Validate head a c d tail m g f k
  • 125. Step 3: Validate - FAIL head a tail m g d f k
  • 126. Step 3a: Validate (retry) head a e tail m g d f k
  • 127. Step 3a: Validate (retry) head a e tail m g d f k
  • 128. Step 3a: Validate (retry) head a e tail m g d f k
  • 129. Step 3a: Validate (retry) head a e tail m g d f k
  • 130. Step 3a: Validate (success) head a e tail m g d f k
  • 131. Step 4: Add head a e tail m g d f k
  • 132. Step 5: Unlock head a e tail f k m g d
  • 133. Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use invasive lists (supply own nodes) Find() requires validation (Locks).
  • 134. Delete Caveat: Validate head a e tail m g d f k
  • 135. Delete Caveat: Validate head a e tail m g d f k
  • 136. Delete Caveat: delete ‘d’ head a e tail m g f k d
  • 137. Delete Caveat: Validate head a e tail m g f k d
  • 138. Delete Caveat: Validate head a e tail m g f k d
  • 139. Delete Caveat: Valid! head a e tail m g f k d
  • 140. Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
  • 141. Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
  • 142. Lazy: Add(“g”) head a c d tail f k m g
  • 143. Step 1: Search head a c d tail f k m g
  • 144. Step 1: Search head a c d tail f k m g
  • 145. Step 1: Search head a c d tail f k m g
  • 146. Step 1a: Search (delete c) head a c d tail f k m g
  • 147. Step 1a: Search (delete c) head a c d tail f k m g
  • 148. Step 1a: Search (delete c) head a c d tail f k m g
  • 149. Step 1a: Search (delete c) head a c d tail f k m g
  • 150. Step 1b: Search (lock) head d tail f k m g a c
  • 151. Step 1c: Search (mark) head d tail f k m g a c
  • 152. Step 2d: lock (skip/unlock) head a c d tail m g f k
  • 153. Step 3: Add/Validate head a d tail m g c f k
  • 154. Step 4: Unlock head a d tail f k m g c
  • 155. Lazy Synchronisation Still need to keep the deleted nodes. Faster than Optimistic Still serialises.
  • 157. Lock free (Non-Blocking) Can’t we just modify Lazy Sync to use CAS?
  • 158. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
  • 159. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
  • 160. Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next; | prev->next=b;
  • 161. Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
  • 162. Introducing the AtomicMarkedPtr<> Wrapper on uint32 Encapsulates an atomic pointer and a flag Allows testing of a flag and updating of a pointer atomically. Use LSB for the flag AtomicMarkedPtr<Node> next; next->CompareAndSet(eValue, nValue, eFlag, nFlag);
  • 163. AtomicMarkedPtr<> We can now use CAS to set a pointer and check a flag in a single atomic action. ie. check deleted status and change pointer at same time. class Node { public: Node(); AtomicMarkedPtr<Node> m_Next; T m_Data; int32 m_Key; };
  • 164. Lock Free: Remove ‘d’ head a c d tail f k m Start loop:
  • 165. Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
  • 166. Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
  • 167. Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
  • 168. LockFree: InternalFind() Finds pred and curr Skips marked nodes. Consider the list at Step 2 in the previous example, and let’s introduce a second thread calling InternalFind();
  • 169. Second InternalFind() head a c tail f k m pred curr succ pred curr succ d
  • 170. If succ is marked… head a c tail f k m pred curr succ pred curr succ d
  • 171. … Skip it head a c tail f k m pred curr succ pred curr succ d
  • 172. Lock Free Synchronisation No blocking at all List is always in a consistent state. Faster threads help out slower ones.
  • 173. Lock free Full thread usage ~60ms High thread coherency
  • 175. Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
  • 176. Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
  • 177. References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and N. Shavit http://www.top500.org/ Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores - http://www.greenteapress.com/semaphores/ My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/