Parallel and Distributed Computing
OUTLINE
GPU Architecture
INTRODUCTION
A GPU is a microprocessor for the processing of graphics, capable of
handling millions of instructions per second.
The GPU sits on the graphics card and relieves the CPU of much of the
graphics-processing load.
INTRODUCTION
A GPU is built for performing parallel operations, which is why it has
many parallel execution units.
Many supercomputers use GPUs for high processing throughput and
performance.
CPU Vs GPU
A CPU consists of four to eight cores, while a GPU consists of
hundreds of smaller cores.
This massively parallel architecture is what gives the GPU its
high compute performance.
SINGLE-CORE VS MULTICORE PROCESSOR
INTRODUCTION
Today's NVIDIA GPUs have reached 128 cores on a single chip, consuming
less power while delivering high performance.
Each core handles 8 threads (128 × 8 = 1024 threads in total).
Modern GPUs are used not only for graphics and video coding but also
in HPC.
GPUs are designed to handle large numbers of floating-point operations
in parallel.
COMPONENTS OF GPU
Graphic processor
Graphic co-processor
Graphic accelerator
Frame buffer
Memory
Graphic BIOS
Digital-to-analog converter
Display connector
Computer connector
Components of CPU:
Control unit (CU)
Arithmetic logic unit (ALU)
Registers
Cache
Clock
ALUs and fetch/decode logic run at high speed, consume little power,
and require little hardware to build.
By contrast, the execution unit requires a huge number of transistors
to build its cache, which may occupy 50% of the total die area and is
therefore expensive.
The cache may also be the main energy-consuming element.
Introduction to GPU Evolution
A GPU comprises tens to thousands of cores.
To build a GPU we need a slimmer CPU design: all complex and large
units are removed from the general-purpose CPU.
The basic idea of a GPU is to have many (hundreds or thousands of)
simpler, weaker processing units that execute the same instructions
simultaneously on different data.
Vector addition of two floating-point vectors, each containing 128 elements
Introduction to GPU Evolution
Instead of using one CPU core, we can use two such cores.
We are then able to execute two instructions in parallel, so
throughput is increased.
Two instructions executing in parallel on two CPU cores
We can achieve even higher performance by further replicating the
ALUs, instead of replicating the complete CPU core.
The fetch/decode logic remains shared among the ALUs, and all ALUs
execute the same operation on different input data.
One such core can add eight vector elements in parallel, as sketched below.
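As a rough sketch of this idea in C (assuming GCC/Clang vector extensions, which are a compiler feature rather than standard C), one fetched and decoded add drives eight lanes at once:

```c
#include <stddef.h>
#include <string.h>

/* Eight floats handled by one instruction stream (a 32-byte vector). */
typedef float v8f __attribute__((vector_size(32)));

void vec_add_8wide(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v8f va, vb, vc;
        memcpy(&va, a + i, sizeof va);  /* unaligned-safe 8-lane load */
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                   /* eight additions per instruction */
        memcpy(c + i, &vc, sizeof vc);
    }
    for (; i < n; i++)                  /* scalar tail for leftover elements */
        c[i] = a[i] + b[i];
}
```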
Modern GPU
A modern GPU contains hundreds of simple processing elements suited to
computations that can be run in parallel.
This typically involves arithmetic on large data sets such as vectors,
matrices, and images, where the same operations can be performed
across millions of data elements at the same time.
Modern GPU
To exploit GPU parallelism, the programmer should partition the
program into thousands of threads and schedule them among the compute
units.
Memory Hierarchy on GPU
A GPU has five memory regions accessible from a single work-item:
1. Registers
2. Local Memory
3. Texture Memory
4. Constant Memory
5. Global Memory
Memory Hierarchy on GPU
Registers:
Registers are the first and most preferable level of the hierarchy.
Each work-item has dedicated registers.
There may be 16K, 32K, or 64K registers available to work-items.
Memory Hierarchy on GPU
Global Memory:
Also called graphics dynamic memory; it achieves high bandwidth but
has high latency compared to the other memories.
It is called global because it is accessible from both the GPU and the CPU.
The GTX 780 GPU has 3 GB of global memory implemented in GDDR5.
It is used for transfers between host and device: large capacity, high
latency.
Host and Device
The main CPU is the host, while all other processors, such as GPUs,
are named devices.
Memory Hierarchy on GPU
There are also two additional memories accessible by all work-items:
constant memory and texture memory.
Constant Memory: resides in device memory (cached) and stores
constants and program arguments.
Constant memory has two special properties: first, it is cached, and
second, it supports broadcasting a single value to all work-items.
This broadcast takes place in just a single cycle.
Work-items have only read access to this region; the host, however, is
permitted both read and write access.
Memory Hierarchy on GPU
Texture Memory: when all reads in a work-group are physically
adjacent, using texture memory can reduce memory traffic and increase
performance compared to global memory.
Texture memory is, however, much slower than constant memory.
Work-Item / Work-Group
Work-items (WI) are essentially threads.
A work-group is the unit of work scheduled onto a compute unit.
Work-items are organized into work-groups; a work-group can thus be
defined as a set of work-items.
All work-items in a work-group are able to share local memory.
Work-groups execute independently of each other.
A work-group executes on a compute unit, and its work-items are mapped
to the compute unit's processing elements (PEs).
OpenCL
A framework for writing programs that execute across heterogeneous
platforms consisting of CPUs, GPUs, or other accelerator hardware.
It defines a C-based programming language used to write compute
kernels, which look much like C functions.
One significant drawback: it is not easy to learn.
OpenCL Kernel
Code that gets executed on a GPU device is called a kernel in OpenCL.
The body of a kernel function implements the computation to be
completed by all work-items.
When writing kernels in OpenCL, we must declare memory with a specific
address-space qualifier to state which memory the data will reside in:
__global, __constant, __local, or, by default, private memory within a
kernel. A sketch follows.
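A minimal sketch of these qualifiers in OpenCL C (the kernel and argument names are illustrative, not from the slides):

```c
__kernel void scale_and_shift(__global float *data,     /* global memory   */
                              __constant float *coeffs, /* constant memory */
                              __local float *scratch)   /* local memory    */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = data[gid];          /* 'x' lives in private memory (default) */
    scratch[lid] = x * coeffs[0]; /* staged in work-group local memory */
    barrier(CLK_LOCAL_MEM_FENCE); /* synchronize the work-group */
    data[gid] = scratch[lid] + coeffs[1];
}
```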
Heterogeneous system
Also called the platform model.
It consists of a single host connected to one or more OpenCL devices,
e.g. FPGA accelerators, DSPs, GPUs, or even CPUs.
OpenCL Devices
An OpenCL device comprises several compute units. Each compute unit
comprises tens or hundreds of processing elements.
Execution Model
The OpenCL execution model defines how kernels execute: the NDRange
(N-Dimensional Range) execution model.
The host program invokes a kernel over an index space called the
NDRange.
The NDRange defines the total number of work-items that execute in
parallel.
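As an illustrative kernel (the names are assumptions), each work-item can query its own position within the NDRange index space:

```c
__kernel void where_am_i(__global int *out)
{
    size_t gid = get_global_id(0);   /* unique index within the NDRange  */
    size_t lid = get_local_id(0);    /* index within the work-group      */
    size_t grp = get_group_id(0);    /* which work-group this item is in */
    size_t lsz = get_local_size(0);  /* work-group size                  */
    out[gid] = (int)(grp * lsz + lid);  /* equals gid by definition */
}
```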
Programming in OpenCL
Sample C code for vector addition on a single-core CPU is sketched
below.
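The original listing is an image in the deck; a minimal reconstruction of a single-core vector addition in C would look like this:

```c
/* One loop iteration per element, executed sequentially on one core. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```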
Kernel Function
To execute the vector-addition function on a GPU device, we must
rewrite it as a kernel function.
Each thread on the GPU device will then execute the same kernel
function, as shown below.
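A sketch of the same computation as an OpenCL kernel: the loop disappears, and each work-item adds the one element selected by its global ID.

```c
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);  /* this work-item's element index */
    c[i] = a[i] + b[i];
}
```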
HOST CODE
The first step is to code the host application, which runs on the
user's computer and dispatches kernels to connected devices.
The host application can be coded in C/C++.
OpenCL supports a wide range of heterogeneous platforms.
Prior to executing a kernel function, the host program for a
heterogeneous system must carry out the following steps.
Steps to execute Kernel Function
1. Discover the OpenCL devices: the platform consists of one or more
devices capable of executing OpenCL kernels.
2. Probe the characteristics of these devices so that kernel functions
can adapt to specific features.
3. Read the source program and compile the kernels that will run on
the selected devices.
4. Set up the memory objects that will hold the data for the
computation.
5. Run the kernels on the selected devices.
6. Collect the final results from the devices.
Steps to execute Kernel Function
These steps are accomplished through the following series of calls,
illustrated in the sections that follow.
1. Prepare and initialize data on the host.
2. Discover and initialize the devices.
3. Create a context.
4. Create a command queue.
5. Create the program object for a context.
6. Build the OpenCL program.
7. Create device buffers.
8. Write host data to device buffers.
9. Create and compile the kernel.
10. Set the kernel arguments.
11. Set the execution model and enqueue the kernel for execution.
12. Read the output buffer back to the host.
Discover & Initialize the devices
The cl_int clGetDeviceIDs() function is used to discover and
initialize the devices.
This function returns the number of devices found in num_devices.
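A minimal host-side fragment (error checking omitted; the snippets in the following sections continue from these variables):

```c
#include <CL/cl.h>

cl_platform_id platform;
cl_device_id devices[8];
cl_uint num_devices;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);
/* num_devices now holds how many GPU devices this platform exposes. */
```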
Discover & Initialize the devices
To get device info we use cl_int clGetDeviceInfo().
This function returns, among other things, the maximum number of
compute units, the maximum work-group size, the device type, and the
sizes of the memories.
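Continuing the fragment, a few representative queries (the parameter names are from the standard OpenCL headers):

```c
cl_uint cu;     /* number of compute units */
size_t wg;      /* maximum work-group size */
cl_ulong gmem;  /* global memory size in bytes */

clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cu, &cu, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof wg, &wg, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof gmem, &gmem, NULL);
```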
Create a context
The clCreateContext() function is used to create a context, i.e. the
environment that manages objects such as command queues, programs, and
kernel objects.
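Continuing the fragment; the command queue from step 4 of the call list is created here as well:

```c
cl_int err;
/* The context owns the queues, buffers, programs, and kernels below. */
cl_context ctx = clCreateContext(NULL, 1, devices, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(ctx, devices[0], 0, &err);
```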
Build a Program Executable
The clBuildProgram() function is used to build (compile and link) a
program from source code.
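Continuing the fragment, with the vec_add kernel source embedded as a string for brevity:

```c
const char *source =
    "__kernel void vec_add(__global const float *a,"
    "                      __global const float *b,"
    "                      __global float *c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
err = clBuildProgram(program, 1, devices, NULL, NULL, NULL); /* compile+link */
cl_kernel kernel = clCreateKernel(program, "vec_add", &err); /* step 9 */
```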
Write host data to device buffer
The clEnqueueWriteBuffer() function is used to write data from host
memory to a device buffer.
This function provides the data to be processed on the device.
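Continuing the fragment: device buffers are created first (step 7), then blocking writes copy the host arrays over (step 8):

```c
#define N 1024
float a[N], b[N], c[N];   /* host data, assumed already initialized */

cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof a, NULL, &err);
cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof b, NULL, &err);
cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);

/* Blocking writes: return once the host data is safely on the device. */
clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, buf_b, CL_TRUE, 0, sizeof b, b, 0, NULL, NULL);
```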
Enqueue the kernel for execution
OpenCL always executes kernels in parallel, i.e. the same kernel
executes on different parts of the data set.
Each kernel execution in OpenCL is called a work-item.
Each work-item is responsible for executing the kernel once on its
assigned portion of the data set.
Thus it is the programmer's responsibility to tell OpenCL how many
work-items are needed to process all the data, as in the sketch below.
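Continuing the fragment: with one work-item per element, N work-items process the whole data set (steps 10 to 12):

```c
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);

size_t global = N;  /* N work-items: one kernel execution per element */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

/* Blocking read: collect the result once the kernel has finished. */
clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
```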
Occupancy
Occupancy is the ratio of active work-groups per compute unit to the
maximum it supports.
We should always keep occupancy high in order to hide latency when
executing instructions.
A compute unit should have a ready work-group to execute in every
cycle, as this is the only way to keep the hardware busy.