Solution Patterns for Parallel Programming
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
Outline
• Designing parallel algorithms
• Solution patterns for parallelism
  • Loop Parallel
  • Fork/Join
  • Divide & Conquer
  • Pipeline
  • Asynchronous Agents
  • Producer/Consumer
• Load balancing
2
Building a Solution by Composition
• We often solve problems by reducing the problem to a composition of known problems
  • Finding the way to Habarana?
  • Sorting 1 million integers
• Can we solve this with Mutex & Semaphores?
  • Mutex for mutual exclusion
  • Semaphores for signaling
• There is another, higher level of abstraction
3
Designing Parallel Algorithms
• Parallel algorithm design is not easily reduced to simple recipes
  • A parallel version of a serial algorithm is not necessarily optimal
  • Good algorithms require creativity
• Goal
  • Suggest a framework within which parallel algorithm design can be explored
  • Develop intuition as to what constitutes a good parallel algorithm
4
Methodical Design
• Partitioning & communication focus on concurrency & scalability
• Agglomeration & mapping focus on locality & other performance issues
5
Source: www.drdobbs.com/parallel/designing-parallel-algorithms-part-1/223100878
Methodical Design (Cont.)
1. Partitioning
  • Decompose computation/data into small tasks/chunks
  • Focus on recognizing opportunities for parallel execution
  • Practical issues such as the number of CPUs are ignored
2. Communication
  • Determine the communication required to coordinate task execution
  • Define communication structures & algorithms
6
Methodical Design (Cont.)
3. Agglomeration
  • Defined task & communication structures are evaluated with respect to
    • Performance requirements
    • Implementation costs
  • If necessary, tasks are combined into larger tasks to improve performance & reduce development costs
7
Source: www.drdobbs.com/architecture-and-design/designing-parallel-algorithms-part-3/223500075
Methodical Design (Cont.)
4. Mapping
  • Each task is assigned to a processor while attempting to satisfy competing goals of
    • Maximizing processor utilization
    • Minimizing communication costs
  • Static mapping
    • At design/compile time
  • Dynamic mapping
    • At runtime by load-balancing algorithms
8
Parallel Algorithm Design Issues
• Efficiency
• Scalability
• Partitioning computations
  • Domain decomposition – based on data
  • Functional decomposition – based on computation
• Locality
  • Spatial & temporal
• Synchronous & asynchronous communication
• Agglomeration to reduce communication
• Load balancing
9
3 Ways to Parallelize
1. By Data
  • Partition the data & give it to different threads
2. By Task
  • Partition the task into smaller tasks & give them to different threads
3. By Order
  • Partition the task into steps & give them to different threads
10
By Data
• Use SPMD model
  • When data can be processed locally with few dependencies on other data
• Patterns
  • Loop parallel, embarrassingly parallel
• Large data units – underutilization
• Small data units – thrashing
• Chunk layout
  • Based on dependencies & caching
• Example – Processing geographical data
11
By Task
• Task Parallel, Divide & Conquer
  • Too many tasks – thrashing
  • Too few tasks – underutilization
• Dependencies among tasks
  • Removable
    • Code transformations
  • Separable
    • Accumulation operations (average, sum, count) – see the reduction sketch below
    • Extrema (max, min)
  • Read only, Read/Write
12
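A separable accumulation in practice – a minimal OpenMP sketch, assuming a C compiler with OpenMP enabled; the array and its size are placeholders. Each thread accumulates a private partial sum, and OpenMP combines the partial sums when the loop ends.

/* Sum is separable: each thread keeps a private partial sum,
   and the reduction clause combines them at the end of the loop. */
double sum_array(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}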
By Order
• Pipeline & Asynchronous Agents
• Dependencies
  • Temporal – before/after
  • Same time
  • None
13
Load Balancing
• Some threads will be busy while others are idle
  • Counter this by distributing the load equally
• When the cost of the problem is well understood, this is possible
  • e.g., matrix multiplication, known tree walk
• Other problems are not that simple
  • Hard to predict how the workload will be distributed → use dynamic load balancing (see the OpenMP scheduling sketch below)
  • But this requires communication between threads/tasks
• 2 methods for dynamic load balancing
  • Task queues
  • Work stealing
14
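The static/dynamic contrast in one place – a minimal OpenMP sketch, assuming iteration costs vary; work() is a hypothetical function standing in for uneven work:

/* Hypothetical task whose cost varies with i. */
extern double work(int i);

void process(double *result, int n) {
    /* schedule(static) hands each thread a fixed block up front;
       schedule(dynamic, 4) lets idle threads grab the next chunk
       of 4 iterations, balancing load at the cost of coordination. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        result[i] = work(i);
}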
Task Queues
• Multiple instances of a task queue (producer–consumer)
• Threads come back to the task queue after finishing a task & grab the next one
• Typically run with a thread pool with a fixed number of threads (a minimal sketch follows)
15
Source: http://blog.zenika.com
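A minimal pthreads sketch of a shared task queue feeding a fixed pool of worker threads; the ring-buffer capacity, the task type, and the lack of a shutdown path are all simplifications, not part of the slides:

#include <pthread.h>

#define QCAP 128

typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } task_t;

static task_t task_buf[QCAP];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

/* Called by any thread that discovers new work. */
void submit(task_fn fn, void *arg) {
    pthread_mutex_lock(&q_lock);
    while (count == QCAP)                    /* queue full: wait */
        pthread_cond_wait(&not_full, &q_lock);
    task_buf[tail] = (task_t){ fn, arg };
    tail = (tail + 1) % QCAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&q_lock);
}

/* Each thread in the fixed-size pool runs this loop forever. */
void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (count == 0)                   /* queue empty: wait */
            pthread_cond_wait(&not_empty, &q_lock);
        task_t t = task_buf[head];
        head = (head + 1) % QCAP;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&q_lock);
        t.fn(t.arg);                         /* run the task outside the lock */
    }
    return NULL;
}

Each pool thread runs worker(); submit() can be called from any thread, including a worker that generates follow-up tasks.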
Work Stealing
• Every thread has its own work/task queue
• When a thread runs out of work, it goes to another thread's queue & "steals" work (see the sketch below)
16
Source: http://karlsenchoi.blogspot.com
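A simplified sketch of the stealing step, with one mutex-protected queue per thread. Real schedulers (e.g., Cilk, Java's ForkJoinPool) use lock-free deques where the owner pops from one end and thieves steal from the other; this sketch deliberately ignores that refinement:

#include <pthread.h>
#include <stdbool.h>

#define NTHREADS 4
#define QCAP 128

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    task_t buf[QCAP];
    int count;
    pthread_mutex_t lock;   /* initialize each with pthread_mutex_init() at startup */
} local_queue_t;

static local_queue_t queues[NTHREADS];   /* one queue per worker thread */

/* Get the next task: own queue first (i == 0), then try to steal. */
static bool next_task(int self, task_t *out) {
    for (int i = 0; i < NTHREADS; i++) {
        local_queue_t *q = &queues[(self + i) % NTHREADS];
        pthread_mutex_lock(&q->lock);
        if (q->count > 0) {
            *out = q->buf[--q->count];   /* pop own work, or steal a victim's */
            pthread_mutex_unlock(&q->lock);
            return true;
        }
        pthread_mutex_unlock(&q->lock);
    }
    return false;                        /* all queues empty right now */
}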
Efficiency = Maximizing Parallelism?
• Usually it means 2 things
  • Run the algorithm on the maximum number of threads with minimal communication/waiting
  • When the size of the problem grows, the algorithm can handle it by adding new resources
• It's achieved through the right architecture + tuning
  • There is no clear recipe for doing it
• Just like "design patterns" for OOP, people have identified parallel programming patterns
17
Solution Patterns for Parallelism
• Loop Parallel
• Fork/Join
• Divide & Conquer
• Pipeline
• Asynchronous Agents
• Producer/Consumer
18
Loop Parallel
• If each iteration in a loop depends only on that iteration's results + read-only data, each iteration can run in a different thread
• As it's based on data, this is also called data parallelism
int[] A = .. int[] B = .. int[] C = ..
for (int i = 0; i < N; i++) {
    C[i] = F(A[i], B[i]);
}
19
Which of These are Loop Parallel?
int[] A = .. int[] B = .. int[] C = ..
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], B[i-1]);
}
int[] A = .. int[] B = .. int[] C = ..
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], C[i-1]);
}
(Answer: the first loop is parallelizable, since B is only read inside the loop and B[i-1] creates no loop-carried dependence; the second is not, since each iteration reads the C[i-1] written by the previous iteration.)
20
Implementing Loop Parallel
• OpenMP example (sketched below)
21
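A minimal OpenMP version of the loop from slide 19 – a sketch; F stands in for the slide's placeholder function:

extern int F(int a, int b);   /* placeholder from slide 19 */

/* Each iteration writes only C[i] and reads only A[i] and B[i],
   so OpenMP can safely split the iterations across threads. */
void loop_parallel(const int *A, const int *B, int *C, int N) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = F(A[i], B[i]);
}

Compile with an OpenMP flag (e.g., gcc -fopenmp); without it the pragma is ignored and the loop simply runs serially.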
Fork/Join
• Fork a job into smaller tasks (independent if possible), perform them, & join the results
• Examples
  • Calculate the mean across an array
  • Tree walk
• How to partition?
  • By Data, e.g., SPMD
  • By Task, e.g., MPSD
22
Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model
Fork/Join (Cont.)
• Size of work unit
  • Small units – thrashing
  • Big units – imbalance
• Balancing load among threads
  • Static allocation
    • If data/task is completely known
    • e.g., matrix addition
  • Dynamic allocation (e.g., tree walks)
    • Task queues
    • Work stealing
23
Implementing Fork/Join
• Pthreads (a sketch follows)
• OpenMP
24
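A pthreads sketch of the "mean across an array" example from slide 22: fork one thread per chunk, let each compute a partial sum, then join & combine. The thread count and even chunking are illustrative choices:

#include <pthread.h>

#define NTHREADS 4   /* illustrative pool size */

typedef struct { const double *a; int lo, hi; double sum; } chunk_t;

/* Forked task: sum one chunk of the array. */
static void *partial_sum(void *p) {
    chunk_t *c = p;
    c->sum = 0.0;
    for (int i = c->lo; i < c->hi; i++)
        c->sum += c->a[i];
    return NULL;
}

double mean(const double *a, int n) {
    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    int per = n / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {     /* fork */
        chunk[t] = (chunk_t){ a, t * per,
                              t == NTHREADS - 1 ? n : (t + 1) * per, 0.0 };
        pthread_create(&tid[t], NULL, partial_sum, &chunk[t]);
    }
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {     /* join & combine */
        pthread_join(tid[t], NULL);
        total += chunk[t].sum;
    }
    return total / n;
}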
Divide & Conquer
• Break the problem into recursive sub-problems & assign them to different threads
• Examples
  • Quick sort
  • Search for a value in a tree
  • Calculating the Fibonacci sequence
• Often forks again, leading to an execution tree
  • Recursion
  • May or may not have a join step
• Deep tree – thrashing
• Shallow tree – underutilization
25
Divide & Conquer – Fibonacci Sequence
Source: Introduction to Algorithms (3rd Edition) by Cormen, Leiserson, Rivest & Stein
26
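The same recursion expressed with OpenMP tasks – a sketch; the serial cutoff is an illustrative tuning knob for the "deep tree – thrashing" issue noted on slide 25:

/* Below the cutoff, recurse serially to avoid a too-deep task tree;
   20 is an illustrative value, not a recommendation. */
#define CUTOFF 20

long fib(int n) {
    if (n < 2)
        return n;
    if (n < CUTOFF)                      /* small subproblem: stay serial */
        return fib(n - 1) + fib(n - 2);
    long x, y;
    #pragma omp task shared(x)           /* fork */
    x = fib(n - 1);
    #pragma omp task shared(y)           /* fork */
    y = fib(n - 2);
    #pragma omp taskwait                 /* join */
    return x + y;
}

long fib_parallel(int n) {
    long r;
    #pragma omp parallel                 /* create the thread team once */
    #pragma omp single                   /* one thread starts the recursion */
    r = fib(n);
    return r;
}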
Producer Consumer
• This pattern is often used, as it helps dynamically balance the workload
• E.g., crawling the Web
  • Place new links in a queue so other threads can pick them up (a semaphore-based sketch follows)
27
Source: http://vichargrave.com/multithreaded-work-queue-in-c/
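A minimal bounded-buffer sketch for the crawler example, using a mutex for mutual exclusion & semaphores for signaling (as on slide 3); the fixed capacity and missing shutdown logic are simplifications:

#include <pthread.h>
#include <semaphore.h>

#define CAP 64

static const char *links[CAP];   /* URLs waiting to be crawled */
static int in = 0, out = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t slots, items;       /* call init_queue() once before use */

void init_queue(void) {
    sem_init(&slots, 0, CAP);    /* CAP free slots initially */
    sem_init(&items, 0, 0);      /* no items yet */
}

void put_link(const char *url) { /* producer: found a new link */
    sem_wait(&slots);            /* wait for a free slot */
    pthread_mutex_lock(&lock);
    links[in] = url;
    in = (in + 1) % CAP;
    pthread_mutex_unlock(&lock);
    sem_post(&items);            /* signal: one more item to consume */
}

const char *get_link(void) {     /* consumer: grab the next link */
    sem_wait(&items);            /* wait for an item */
    pthread_mutex_lock(&lock);
    const char *url = links[out];
    out = (out + 1) % CAP;
    pthread_mutex_unlock(&lock);
    sem_post(&slots);            /* signal: one slot freed */
    return url;
}

In the crawler, the same worker threads act as both consumers (get_link) and producers (put_link for each newly discovered URL), which is what makes the load balancing dynamic.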
Pipeline
• Break a task into small steps (which may have dependencies) & assign the execution of each step to a different thread
• Example
  • Read file, sort file, & write to file
• Work is handed off from step to step
• An individual task doesn't get faster, but with many instances of the task we get better throughput
• Gains come from tuning
  • Example – if read/write are slow but sort is fast, add more threads to read/write & fewer threads to sort
28
Pipeline (Cont.)
• Long pipeline – high throughput
• Short pipeline – low latency
• Passing data from one stage to another
  • Message passing (see the pipe-based sketch below)
  • Shared queues
29
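A sketch of message passing between stages using POSIX pipes, one thread per stage; the stage bodies (squaring an int) are placeholders for real read/sort/write work:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int p1[2], p2[2];   /* pipe 1: stage1 -> stage2, pipe 2: stage2 -> stage3 */

static void *stage1(void *arg) {             /* e.g., read records */
    (void)arg;
    for (int i = 0; i < 10; i++)
        write(p1[1], &i, sizeof i);          /* hand off to stage 2 */
    close(p1[1]);                            /* EOF for stage 2 */
    return NULL;
}

static void *stage2(void *arg) {             /* e.g., transform/sort */
    (void)arg;
    int v;
    while (read(p1[0], &v, sizeof v) > 0) {
        v = v * v;                           /* placeholder for real work */
        write(p2[1], &v, sizeof v);
    }
    close(p2[1]);                            /* EOF for stage 3 */
    return NULL;
}

static void *stage3(void *arg) {             /* e.g., write results */
    (void)arg;
    int v;
    while (read(p2[0], &v, sizeof v) > 0)
        printf("%d\n", v);
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pipe(p1);
    pipe(p2);
    pthread_create(&t[0], NULL, stage1, NULL);
    pthread_create(&t[1], NULL, stage2, NULL);
    pthread_create(&t[2], NULL, stage3, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Shared in-memory queues (as in the producer–consumer sketch) would serve the same role with less copying; pipes are used here only to show the message-passing variant.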
Asynchronous Agents
• Here the task is done by a set of agents
  • Working in P2P fashion
  • No clear structure
  • They talk to each other via asynchronous messages (a mailbox sketch follows)
• Example – Detecting storms using weather data
  • Many agents, each knowing some aspect of storms
  • Weather events are sent to them, which in turn fire other events, leading to detection
30
Source: http://blogs.msdn.com/
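A minimal sketch of the mailbox machinery behind such agents – one condition-variable-protected mailbox per agent; the event type and the fixed capacity are illustrative:

#include <pthread.h>

#define BOX_CAP 32

typedef struct { int type; double value; } event_t;   /* placeholder event */

typedef struct {
    event_t box[BOX_CAP];
    int count;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} mailbox_t;

/* Asynchronous send: enqueue the event & return immediately. */
void agent_send(mailbox_t *m, event_t e) {
    pthread_mutex_lock(&m->lock);
    if (m->count < BOX_CAP)                  /* this sketch drops on overflow */
        m->box[m->count++] = e;
    pthread_cond_signal(&m->nonempty);
    pthread_mutex_unlock(&m->lock);
}

/* Each agent thread blocks on its own mailbox; on receiving an event
   it may agent_send() derived events to other agents' mailboxes,
   e.g., a "possible storm" event built from raw weather events. */
event_t agent_receive(mailbox_t *m) {
    pthread_mutex_lock(&m->lock);
    while (m->count == 0)
        pthread_cond_wait(&m->nonempty, &m->lock);
    event_t e = m->box[--m->count];          /* LIFO for brevity */
    pthread_mutex_unlock(&m->lock);
    return e;
}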

Editor's Notes

  • #2: Shovel example
  • #4: Along A6 after Dambulla