Lecture 2
More about Parallel
Computing
Vajira Thambawita
Parallel Computer Memory Architectures
- Shared Memory
• Multiple processors can work independently but share the same
memory resources
• Shared memory machines can be divided into two groups based upon
memory access time:
UMA: Uniform Memory Access
NUMA: Non-Uniform Memory Access
Parallel Computer Memory Architectures
- Shared Memory
Uniform Memory Access (UMA)
• Equal access and access times to memory for all processors
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Parallel Computer Memory Architectures
- Shared Memory
Non-Uniform Memory Access (NUMA)
• Not all processors have equal memory
access time
Parallel Computer Memory Architectures
- Distributed Memory
• Processors have their own local memory (there is no concept of a global address space)
• Each processor operates independently
• Communication in message-passing systems is performed via send and receive operations
Parallel Computer Memory Architectures
– Hybrid Distributed-Shared Memory
• Used in the largest and fastest computers in the world today
Parallel Programming Models
Shared Memory Model (without threads)
• In this programming model, processes/tasks share a common address
space, which they read and write to asynchronously.
Parallel Programming Models
Threads Model
• This programming model is a type of
shared memory programming.
• In the threads model of parallel
programming, a single "heavy weight"
process can have multiple "light
weight", concurrent execution paths.
• Ex: POSIX Threads, OpenMP, Microsoft threads, Java and Python threads, CUDA threads for GPUs
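To make the model concrete, here is a minimal sketch using POSIX Threads (one of the examples above). The thread count and the per-thread computation are illustrative: all threads run inside one "heavyweight" process and read and write the same shared array.

```c
/* Minimal threads-model sketch with POSIX Threads; the work is a placeholder. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
static double partial[NUM_THREADS];   /* shared by all threads of the process */

static void *worker(void *arg) {
    long id = (long)arg;
    /* Each lightweight thread follows its own execution path but reads
       and writes the same address space as the others. */
    partial[id] = id * 10.0;          /* placeholder computation */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    double sum = 0.0;
    for (int t = 0; t < NUM_THREADS; t++)
        sum += partial[t];
    printf("sum = %f\n", sum);
    return 0;
}
```

Compile with a pthreads-aware toolchain (for example, gcc -pthread). OpenMP or CUDA express the same single-process, multiple-execution-path idea through directives or kernels rather than explicit thread management.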
Parallel Programming Models
Distributed Memory / Message
Passing Model
• A set of tasks that use their
own local memory during
computation. Multiple tasks
can reside on the same physical
machine and/or across an
arbitrary number of machines.
• Tasks exchange data through
communications by sending
and receiving messages.
• Ex:
• Message Passing Interface
(MPI)
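A minimal sketch of this model with MPI, the example named above: two tasks, each with its own local memory, exchange one value via explicit send and receive calls. The message tag and payload are illustrative.

```c
/* Minimal message-passing sketch with MPI: task 0 sends one integer to task 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data = 42;                      /* lives only in task 0's local memory */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int data;
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d from task 0\n", data);
    }

    MPI_Finalize();
    return 0;
}
```

With a typical launcher this is run as at least two tasks, for example mpirun -np 2 ./a.out.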
Parallel Programming Models
Data Parallel Model
• May also be referred to as the Partitioned Global Address Space (PGAS)
model.
• Ex: Coarray Fortran, Unified Parallel C (UPC), X10
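As a hedged sketch of the PGAS idea, the following Unified Parallel C fragment declares an array in the partitioned global address space and lets each UPC thread update the elements it has affinity to. The array size and the work are illustrative, and a UPC compiler is required.

```c
/* Hedged PGAS sketch in UPC; array size and work are illustrative. */
#include <upc_relaxed.h>
#include <stdio.h>

#define N 64
shared int a[N];              /* distributed across all UPC threads */

int main(void) {
    int i;

    /* The affinity expression &a[i] runs iteration i on the thread
       that owns element i, so most accesses stay local. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = i * MYTHREAD;

    upc_barrier;              /* wait for all threads before reading */
    if (MYTHREAD == 0)
        printf("a[1] = %d\n", a[1]);
    return 0;
}
```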
Parallel Programming Models
Hybrid Model
• A hybrid model combines more than one of the previously described
programming models.
Parallel Programming Models
SPMD and MPMD
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
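SPMD is how most MPI programs are written: every task executes the same program and branches on its rank, so different tasks may take different paths through the code. A minimal sketch, assuming MPI; the printed roles are illustrative.

```c
/* Minimal SPMD sketch: one program, different behaviour per task rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("task 0 of %d: coordinating\n", size);   /* illustrative role */
    else
        printf("task %d of %d: computing\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

MPMD, by contrast, launches different executables as cooperating tasks of the same job; many MPI launchers accept a colon-separated specification for this.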
Designing Parallel Programs
Automatic vs. Manual Parallelization
• Fully Automatic
• The compiler analyzes the source code and identifies opportunities for
parallelism.
• The analysis includes identifying inhibitors to parallelism and possibly a
cost weighting on whether or not the parallelism would actually improve
performance.
• Loops (do, for) are the most frequent target for automatic
parallelization.
• Programmer Directed
• Using "compiler directives" or possibly compiler flags, the programmer
explicitly tells the compiler how to parallelize the code.
• May be used in conjunction with some degree of automatic parallelization (see the directive sketch after this list)
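As a minimal sketch of programmer-directed parallelization, the loop below is parallelized with an OpenMP compiler directive; the loop body is a placeholder. Built without OpenMP support (for example, without -fopenmp on GCC), the directive is ignored and the loop simply runs serially.

```c
/* Programmer-directed parallelization via an OpenMP compiler directive. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* The directive tells the compiler this loop is safe to parallelize. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```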
Designing Parallel Programs
Understand the Problem and the Program
• An easy-to-parallelize problem (a sketch contrasting both cases follows this list)
• Calculate the potential energy for each of several thousand independent
conformations of a molecule. When done, find the minimum energy
conformation.
• A problem with little-to-no parallelism
• Calculation of the Fibonacci series (0,1,1,2,3,5,8,13,21,...) by use of the
formula:
• F(n) = F(n-1) + F(n-2)
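The contrast can be made concrete in code. Below is a hedged sketch, assuming OpenMP, with a hypothetical placeholder energy() function standing in for the per-conformation calculation: the conformation loop has fully independent iterations and reduces to a minimum, whereas each Fibonacci term depends on the two before it, so that loop cannot be reordered or split across tasks.

```c
/* Hedged sketch; energy() is an illustrative stand-in, not a real model. */
#include <float.h>
#include <stdio.h>

#define N 5000

static double energy(int conformation) {       /* hypothetical placeholder */
    return (conformation % 97) * 0.1;
}

int main(void) {
    /* Easy to parallelize: independent iterations, combined by a reduction. */
    double min_e = DBL_MAX;
    #pragma omp parallel for reduction(min:min_e)
    for (int i = 0; i < N; i++) {
        double e = energy(i);
        if (e < min_e) min_e = e;
    }
    printf("minimum energy = %f\n", min_e);

    /* Little to no parallelism: F(n) needs F(n-1) and F(n-2), so the
       iterations must execute in order. */
    long f0 = 0, f1 = 1;
    for (int n = 2; n <= 20; n++) {
        long fn = f0 + f1;
        f0 = f1;
        f1 = fn;
    }
    printf("F(20) = %ld\n", f1);
    return 0;
}
```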
Designing Parallel Programs
Partitioning
• One of the first steps in designing a parallel program is to break the
problem into discrete "chunks" of work that can be distributed to
multiple tasks. This is known as decomposition or partitioning.
Two ways:
• Domain decomposition
• Functional decomposition
Designing Parallel Programs
Domain Decomposition
The data associated with the problem is decomposed. There are different ways to partition the data (a block-partitioning sketch follows).
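A common realization of domain decomposition is block partitioning: each task owns a contiguous slice of the data and works only on that slice. A minimal sketch assuming MPI; the array contents and per-element work are illustrative.

```c
/* Block domain decomposition of a 1-D index range across MPI tasks. */
#include <mpi.h>
#include <stdio.h>

#define N 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Task `rank` owns indices [lo, hi); the remainder when N is not
       divisible by size goes to the last task. */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    double local_sum = 0.0;
    for (int i = lo; i < hi; i++)
        local_sum += i * 0.5;                   /* placeholder computation */

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```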
Designing Parallel Programs
Functional Decomposition
The problem is decomposed according to the work that must be done
Designing Parallel Programs
You DON'T need communications
• Some types of problems can be
decomposed and executed in parallel with
virtually no need for tasks to share data.
• Ex: Every pixel in a black and white image
needs to have its color reversed
You DO need communications
• These problems require tasks to share data with each other
• A 2-D heat diffusion problem requires a
task to know the temperatures calculated
by the tasks that have neighboring data
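For the heat-diffusion case above, each task typically stores its strip of the grid plus "ghost" rows that mirror the neighbouring tasks' boundary rows, and refreshes them every time step. A hedged sketch of that exchange, assuming MPI and a 1-D strip decomposition; the grid dimensions are illustrative.

```c
/* Hedged halo-exchange sketch for a 2-D heat grid split into row strips. */
#include <mpi.h>
#include <string.h>

#define NX 128           /* columns per row */
#define NROWS 16         /* interior rows owned by each task */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rows 1..NROWS are local; rows 0 and NROWS+1 are ghost rows. */
    double grid[NROWS + 2][NX];
    memset(grid, 0, sizeof grid);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* no-op at edges */
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my top interior row up; receive my lower ghost row from below. */
    MPI_Sendrecv(grid[1],         NX, MPI_DOUBLE, up,   0,
                 grid[NROWS + 1], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my bottom interior row down; receive my upper ghost row from above. */
    MPI_Sendrecv(grid[NROWS],     NX, MPI_DOUBLE, down, 1,
                 grid[0],         NX, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ...the stencil update on rows 1..NROWS would then use the ghost rows... */

    MPI_Finalize();
    return 0;
}
```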
Designing Parallel Programs
Factors to Consider (designing your program's inter-task
communications)
• Communication overhead
• Latency vs. Bandwidth
• Visibility of communications
• Synchronous vs. asynchronous communications
• Scope of communications
• Efficiency of communications
Designing Parallel Programs
Granularity
• In parallel computing, granularity is a qualitative measure of the ratio of
computation to communication. (Computation / Communication)
• Periods of computation are typically separated from periods of
communication by synchronization events.
• Fine-grain Parallelism
• Coarse-grain Parallelism
Designing Parallel Programs
• Fine-grain Parallelism
• Relatively small amounts of computational work
are done between communication events
• Low computation to communication ratio
• Facilitates load balancing
• Implies high communication overhead and less
opportunity for performance enhancement
• If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer than
the computation.
• Coarse-grain Parallelism
• Relatively large amounts of computational work
are done between communication/synchronization
events
• High computation to communication ratio
• Implies more opportunity for performance increase
• Harder to load balance efficiently
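The difference shows up in where communication sits relative to computation. A hedged sketch assuming MPI, with a placeholder summation as the work; the fine-grain variant is only described in a comment, since synchronizing after every iteration is exactly the overhead one wants to avoid.

```c
/* Hedged granularity sketch: all local work first, one communication at the end. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, total = 0.0;

    /* Fine-grain (illustration only): placing a collective such as
       MPI_Allreduce inside the loop would synchronize every iteration,
       so communication overhead would dominate the tiny work per step. */

    /* Coarse-grain: large amount of computation between communication
       events, then a single reduction combines the results. */
    for (int i = rank; i < N; i += size)
        local += 1.0 / (i + 1.0);               /* placeholder work */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", total);
    MPI_Finalize();
    return 0;
}
```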
Designing Parallel Programs
I/O
• Rule #1: Reduce overall I/O as much as possible
• If you have access to a parallel file system, use it.
• Writing large chunks of data rather than small chunks is usually
significantly more efficient.
• Fewer, larger files perform better than many small files.
• Confine I/O to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks. For example, Task 1 could read an input file and then communicate the required data to the other tasks. Likewise, Task 1 could perform the write operation after receiving the required data from all other tasks (a sketch of this pattern follows this list).
• Aggregate I/O operations across tasks - rather than having many tasks
perform I/O, have a subset of tasks perform it.
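A hedged sketch of the pattern above, assuming MPI: only one task touches the file system, and a single collective broadcast distributes the value to every other task. The file name input.dat and its format (one integer) are illustrative.

```c
/* Confine I/O to task 0, then distribute the data with one collective call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 0;
    if (rank == 0) {
        /* Only task 0 reads from the file system. */
        FILE *f = fopen("input.dat", "r");      /* illustrative file name */
        if (f) {
            fscanf(f, "%d", &n);
            fclose(f);
        }
    }

    /* All tasks receive the value via a single broadcast. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("task %d sees n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}
```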
Designing Parallel Programs
Debugging
• TotalView from RogueWave Software
• DDT from Allinea
• Inspector from Intel
Performance Analysis and Tuning
• LC's web pages at https://hpc.llnl.gov/software/development-environment-software
• TAU: http://www.cs.uoregon.edu/research/tau/docs.php
• HPCToolkit: http://hpctoolkit.org/documentation.html
• Open|Speedshop: http://www.openspeedshop.org/
• Vampir / Vampirtrace: http://vampir.eu/
• Valgrind: http://valgrind.org/
• PAPI: http://icl.cs.utk.edu/papi/
• mpitrace: https://computing.llnl.gov/tutorials/bgq/index.html#mpitrace
• mpiP: http://mpip.sourceforge.net/
• memP: http://memp.sourceforge.net/
Summary