Lecture 2
More about Parallel
Computing
Vajira Thambawita
Parallel Computer Memory Architectures
- Shared Memory
• Multiple processors can work independently but share the same
memory resources
• Shared memory machines can be divided into two groups based upon
memory access time:
UMA: Uniform Memory Access
NUMA: Non-Uniform Memory Access
Parallel Computer Memory Architectures
- Shared Memory
Uniform Memory Access (UMA)
• Equal access and access times to memory for all processors
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Parallel Computer Memory Architectures
- Shared Memory
Non-Uniform Memory Access (NUMA)
• Not all processors have equal memory
access time
Parallel Computer Memory Architectures
- Distributed Memory
• Processors have their own local memory (there is no concept of a global address space)
• Each processor operates independently
• Communication in message-passing systems is performed via send and receive operations
Parallel Computer Memory Architectures
– Hybrid Distributed-Shared Memory
• Used in the largest and fastest computers in the world today
Parallel Programming Models
Shared Memory Model (without threads)
• In this programming model, processes/tasks share a common address
space, which they read and write to asynchronously.
Parallel Programming Models
Threads Model
• This programming model is a type of
shared memory programming.
• In the threads model of parallel
programming, a single "heavy weight"
process can have multiple "light
weight", concurrent execution paths.
• Ex: POSIX Threads, OpenMP, Microsoft threads, Java and Python threads, CUDA threads for GPUs
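To make the model concrete, here is a minimal sketch using POSIX Threads (one of the examples above). The thread count and the per-thread computation are illustrative: all threads run inside one "heavyweight" process and read and write the same shared array.

```c
/* Minimal threads-model sketch with POSIX Threads; the work is a placeholder. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
static double partial[NUM_THREADS];   /* shared by all threads of the process */

static void *worker(void *arg) {
    long id = (long)arg;
    /* Each lightweight thread follows its own execution path but reads
       and writes the same address space as the others. */
    partial[id] = id * 10.0;          /* placeholder computation */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    double sum = 0.0;
    for (int t = 0; t < NUM_THREADS; t++)
        sum += partial[t];
    printf("sum = %f\n", sum);
    return 0;
}
```

Compile with a pthreads-aware toolchain (for example, gcc -pthread). OpenMP or CUDA express the same single-process, multiple-execution-path idea through directives or kernels rather than explicit thread management.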
Parallel Programming Models
Distributed Memory / Message
Passing Model
• A set of tasks that use their
own local memory during
computation. Multiple tasks
can reside on the same physical
machine and/or across an
arbitrary number of machines.
• Tasks exchange data through
communications by sending
and receiving messages.
• Ex:
• Message Passing Interface
(MPI)
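A minimal sketch of this model with MPI, the example named above: two tasks, each with its own local memory, exchange one value via explicit send and receive calls. The message tag and payload are illustrative.

```c
/* Minimal message-passing sketch with MPI: task 0 sends one integer to task 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data = 42;                      /* lives only in task 0's local memory */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int data;
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", data);
    }

    MPI_Finalize();
    return 0;
}
```

With a typical launcher this is run as at least two tasks, for example mpirun -np 2 ./a.out.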
Parallel Programming Models
Data Parallel Model
• May also be referred to as the Partitioned Global Address Space (PGAS)
model.
• Ex: Coarray Fortran, Unified Parallel C (UPC), X10
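As a hedged sketch of the PGAS idea, the following Unified Parallel C fragment declares an array in the partitioned global address space and lets each UPC thread update the elements it has affinity to. The array size and the work are illustrative, and a UPC compiler is required.

```c
/* Hedged PGAS sketch in UPC; array size and work are illustrative. */
#include <upc_relaxed.h>
#include <stdio.h>

#define N 64
shared int a[N];              /* distributed across all UPC threads */

int main(void) {
    int i;

    /* The affinity expression &a[i] runs iteration i on the thread
       that owns element i, so most accesses stay local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i * MYTHREAD;

    upc_barrier;              /* wait for all threads before reading */
    if (MYTHREAD == 0)
        printf("a[1] = %d\n", a[1]);
    return 0;
}
```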
Parallel Programming Models
Hybrid Model
• A hybrid model combines more than one of the previously described
programming models.
Parallel Programming Models
SPMD and MPMD
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
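SPMD is how most MPI programs are written: every task executes the same program and branches on its rank, so different tasks may take different paths through the code. A minimal sketch, assuming MPI; the printed roles are illustrative.

```c
/* Minimal SPMD sketch: one program, different behaviour per task rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("task 0 of %d: coordinating\n", size);   /* illustrative role */
    else
        printf("task %d of %d: computing\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

MPMD, by contrast, launches different executables as cooperating tasks of the same job; many MPI launchers accept a colon-separated specification for this.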
Designing Parallel Programs
Automatic vs. Manual Parallelization
• Fully Automatic
• The compiler analyzes the source code and identifies opportunities for
parallelism.
• The analysis includes identifying inhibitors to parallelism and possibly a
cost weighting on whether or not the parallelism would actually improve
performance.
• Loops (do, for) are the most frequent target for automatic
parallelization.
• Programmer Directed
• Using "compiler directives" or possibly compiler flags, the programmer
explicitly tells the compiler how to parallelize the code.
• May be used in conjunction with some degree of automatic parallelization (see the directive sketch after this list)
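As a minimal sketch of programmer-directed parallelization, the loop below is parallelized with an OpenMP compiler directive; the loop body is a placeholder. Built without OpenMP support (for example, without -fopenmp on GCC), the directive is ignored and the loop simply runs serially.

```c
/* Programmer-directed parallelization via an OpenMP compiler directive. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* The directive tells the compiler this loop is safe to parallelize. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```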
Designing Parallel Programs
Understand the Problem and the Program
• An easy-to-parallelize problem (a sketch contrasting both cases follows this list)
• Calculate the potential energy for each of several thousand independent
conformations of a molecule. When done, find the minimum energy
conformation.
• A problem with little-to-no parallelism
• Calculation of the Fibonacci series (0,1,1,2,3,5,8,13,21,...) by use of the
formula:
• F(n) = F(n-1) + F(n-2)
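The contrast can be made concrete in code. Below is a hedged sketch, assuming OpenMP, with a hypothetical placeholder energy() function standing in for the per-conformation calculation: the conformation loop has fully independent iterations and reduces to a minimum, whereas each Fibonacci term depends on the two before it, so that loop cannot be reordered or split across tasks.

```c
/* Hedged sketch; energy() is an illustrative stand-in, not a real model. */
#include <float.h>
#include <stdio.h>

#define N 5000

static double energy(int conformation) {       /* hypothetical placeholder */
    return (conformation % 97) * 0.1;
}

int main(void) {
    /* Easy to parallelize: independent iterations, combined by a reduction. */
    double min_e = DBL_MAX;
    #pragma omp parallel for reduction(min:min_e)
    for (int i = 0; i < N; i++) {
        double e = energy(i);
        if (e < min_e) min_e = e;
    }
    printf("minimum energy = %f\n", min_e);

    /* Little to no parallelism: F(n) needs F(n-1) and F(n-2), so the
       iterations must execute in order. */
    long f0 = 0, f1 = 1;
    for (int n = 2; n <= 20; n++) {
        long fn = f0 + f1;
        f0 = f1;
        f1 = fn;
    }
    printf("F(20) = %ld\n", f1);
    return 0;
}
```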
Designing Parallel Programs
Partitioning
• One of the first steps in designing a parallel program is to break the
problem into discrete "chunks" of work that can be distributed to
multiple tasks. This is known as decomposition or partitioning.
Two ways:
• Domain decomposition
• Functional decomposition
Designing Parallel Programs
Domain Decomposition
The data associated with the problem is decomposed. There are different ways to partition the data (a block-partitioning sketch follows).
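A common realization of domain decomposition is block partitioning: each task owns a contiguous slice of the data and works only on that slice. A minimal sketch assuming MPI; the array contents and per-element work are illustrative.

```c
/* Block domain decomposition of a 1-D index range across MPI tasks. */
#include <mpi.h>
#include <stdio.h>

#define N 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Task `rank` owns indices [lo, hi); the remainder when N is not
       divisible by size goes to the last task. */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    double local_sum = 0.0;
    for (int i = lo; i < hi; i++)
        local_sum += i * 0.5;                   /* placeholder computation */

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```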
Designing Parallel Programs
Functional Decomposition
The problem is decomposed according to the work that must be done
Designing Parallel Programs
You DON'T need communications
• Some types of problems can be
decomposed and executed in parallel with
virtually no need for tasks to share data.
• Ex: Every pixel in a black and white image
needs to have its color reversed
You DO need communications
• These problems require tasks to share data with each other
• A 2-D heat diffusion problem requires a
task to know the temperatures calculated
by the tasks that have neighboring data
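For the heat-diffusion case above, each task typically stores its strip of the grid plus "ghost" rows that mirror the neighbouring tasks' boundary rows, and refreshes them every time step. A hedged sketch of that exchange, assuming MPI and a 1-D strip decomposition; the grid dimensions are illustrative.

```c
/* Hedged halo-exchange sketch for a 2-D heat grid split into row strips. */
#include <mpi.h>
#include <string.h>

#define NX 128           /* columns per row */
#define NROWS 16         /* interior rows owned by each task */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rows 1..NROWS are local; rows 0 and NROWS+1 are ghost rows. */
    double grid[NROWS + 2][NX];
    memset(grid, 0, sizeof grid);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* no-op at edges */
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my top interior row up; receive my lower ghost row from below. */
    MPI_Sendrecv(grid[1],         NX, MPI_DOUBLE, up,   0,
                 grid[NROWS + 1], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my bottom interior row down; receive my upper ghost row from above. */
    MPI_Sendrecv(grid[NROWS],     NX, MPI_DOUBLE, down, 1,
                 grid[0],         NX, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ...the stencil update on rows 1..NROWS would then use the ghost rows... */

    MPI_Finalize();
    return 0;
}
```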
Designing Parallel Programs
Factors to Consider (designing your program's inter-task
communications)
• Communication overhead
• Latency vs. Bandwidth
• Visibility of communications
• Synchronous vs. asynchronous communications
• Scope of communications
• Efficiency of communications
Designing Parallel Programs
Granularity
• In parallel computing, granularity is a qualitative measure of the ratio of
computation to communication. (Computation / Communication)
• Periods of computation are typically separated from periods of
communication by synchronization events.
• Fine-grain Parallelism
• Coarse-grain Parallelism
Designing Parallel Programs
• Fine-grain Parallelism
• Relatively small amounts of computational work
are done between communication events
• Low computation to communication ratio
• Facilitates load balancing
• Implies high communication overhead and less
opportunity for performance enhancement
• If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer than
the computation.
• Coarse-grain Parallelism
• Relatively large amounts of computational work
are done between communication/synchronization
events
• High computation to communication ratio
• Implies more opportunity for performance increase
• Harder to load balance efficiently
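The difference shows up in where communication sits relative to computation. A hedged sketch assuming MPI, with a placeholder summation as the work; the fine-grain variant is only described in a comment, since synchronizing after every iteration is exactly the overhead one wants to avoid.

```c
/* Hedged granularity sketch: all local work first, one communication at the end. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, total = 0.0;

    /* Fine-grain (illustration only): placing a collective such as
       MPI_Allreduce inside the loop would synchronize every iteration,
       so communication overhead would dominate the tiny work per step. */

    /* Coarse-grain: large amount of computation between communication
       events, then a single reduction combines the results. */
    for (int i = rank; i < N; i += size)
        local += 1.0 / (i + 1.0);               /* placeholder work */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", total);
    MPI_Finalize();
    return 0;
}
```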
Designing Parallel Programs
I/O
• Rule #1: Reduce overall I/O as much as possible
• If you have access to a parallel file system, use it.
• Writing large chunks of data rather than small chunks is usually
significantly more efficient.
• Fewer, larger files perform better than many small files.
• Confine I/O to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks. For example, Task 1 could read an input file and then communicate the required data to the other tasks. Likewise, Task 1 could perform the write operation after receiving the required data from all other tasks (a sketch of this pattern follows this list).
• Aggregate I/O operations across tasks - rather than having many tasks
perform I/O, have a subset of tasks perform it.
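A hedged sketch of the pattern above, assuming MPI: only one task touches the file system, and a single collective broadcast distributes the value to every other task. The file name input.dat and its format (one integer) are illustrative.

```c
/* Confine I/O to task 0, then distribute the data with one collective call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 0;
    if (rank == 0) {
        /* Only task 0 reads from the file system. */
        FILE *f = fopen("input.dat", "r");      /* illustrative file name */
        if (f) {
            fscanf(f, "%d", &n);
            fclose(f);
        }
    }

    /* All tasks receive the value via a single broadcast. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("task %d sees n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}
```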
Designing Parallel Programs
Debugging
• TotalView from RogueWave Software
• DDT from Allinea
• Inspector from Intel
Performance Analysis and Tuning
• LC's web pages at https://hpc.llnl.gov/software/development-environment-software
• TAU: http://www.cs.uoregon.edu/research/tau/docs.php
• HPCToolkit: http://hpctoolkit.org/documentation.html
• Open|Speedshop: http://www.openspeedshop.org/
• Vampir / Vampirtrace: http://vampir.eu/
• Valgrind: http://valgrind.org/
• PAPI: http://icl.cs.utk.edu/papi/
• mpitrace: https://computing.llnl.gov/tutorials/bgq/index.html#mpitrace
• mpiP: http://mpip.sourceforge.net/
• memP: http://memp.sourceforge.net/
Summary