A beginner’s guide to programming GPUs with CUDA

                               Mike Peardon

                            School of Mathematics
                            Trinity College Dublin


                               April 24, 2009




 Mike Peardon (TCD)   A beginner’s guide to programming GPUs with CUDA   April 24, 2009   1 / 20
What is a GPU?



                         Graphics Processing Unit
   Processor dedicated to rapid rendering of polygons - texturing,
   shading
   They are mass-produced, so very cheap: 1 Tflop peak ≈ EUR 1,000
   They have many compute cores, but a simpler architecture compared
   with a standard CPU
   The “shader pipeline” can be used to do floating point calculations
   −→ cheap scientific/technical computing




What is a GPU? (2)




What is CUDA?



                        Compute Unified Device Architecture
   An extension to the C programming language
   Adds library functions to access the GPU
   Adds directives to translate C into instructions that run on the host
   CPU or on the GPU, as needed
   Allows easy multi-threading - parallel execution on all thread
   processors on the GPU




Will CUDA work on my PC/laptop?




   CUDA works on modern nVidia cards (Quadro, GeForce, Tesla)
   See http://www.nvidia.com/object/cuda_learn_products.html




nVidia’s compiler - nvcc



    CUDA code must be compiled using nvcc
    nvcc generates instructions for both the host CPU and the GPU (PTX
    instruction set), as well as instructions to send data back and forth
    between them
    Standard CUDA install: /usr/local/cuda/bin/nvcc
    The shell executing the compiled code needs the dynamic linker path:
    the LD_LIBRARY_PATH environment variable must include
    /usr/local/cuda/lib
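With those defaults, a compile-and-run session might look like the following sketch (the source file name adder.cu is hypothetical; paths assume a standard install as above):

```shell
# Make nvcc visible, then compile a CUDA source file
export PATH="$PATH:/usr/local/cuda/bin"
nvcc -o adder adder.cu

# Point the dynamic linker at the CUDA runtime library, then run
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib"
./adder
```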




Simple overview
   [Diagram: the PC motherboard holds the CPU, its main memory and the
   network/disk interfaces; the GPU, with its multiprocessors and its own
   memory, is attached over the PCI bus.]
   GPU can’t directly access main memory
   CPU can’t directly access GPU memory
   Need to explicitly copy data
   No printf!
Writing some code (1) - specifying where code runs


   CUDA provides function type qualifiers (not part of standard C/C++) to
   let the programmer specify where a function should run:
   __host__ : the code runs on the host CPU (redundant
   on its own - it is the default)
   __device__ : the code runs on the GPU, and the
   function can only be called by code running on the GPU
   __global__ : the code runs on the GPU, but is called
   from the host - this is the access point to start multi-threaded codes
   running on the GPU
   The device can’t execute code on the host!
   CUDA imposes some restrictions, such as: device code is C-only (host
   code can be C++), device code can’t be called recursively, ...
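A minimal sketch of the three qualifiers side by side (the function names here are hypothetical):

```cuda
// Sketch: the three function type qualifiers
__host__ void on_cpu(void) { }     // runs on the CPU, called from the CPU (the default)

__device__ float on_gpu(float x)   // runs on the GPU, callable only from GPU code
{
  return 2.0f * x;
}

__global__ void kernel(float *a)   // runs on the GPU, launched from the host
{
  a[threadIdx.x] = on_gpu(a[threadIdx.x]);
}
```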


Code execution




Writing some code (2) - launching a __global__ function
   All calls to a __global__ function must specify how many threaded
   copies to launch and in what configuration.
   CUDA syntax: <<< >>>
   Threads are grouped into thread blocks, then into a grid of blocks
   This defines a memory hierarchy (important for performance)




The thread/block/grid model




Writing some code (3) - launching a __global__ function


   Inside the <<< >>>, at least two arguments are needed (there can be
   two more, which have default values)
   A call looks like, e.g., my_func<<<bg, tb>>>(arg1, arg2)
   bg specifies the dimensions of the grid of blocks and tb specifies the
   dimensions of each thread block
   bg and tb are both of type dim3 (a new datatype defined by CUDA:
   three unsigned ints, where any unspecified component defaults to 1)
   dim3 has struct-like access - members are x, y and z
   CUDA provides a constructor: dim3 mygrid(2,2); sets mygrid.x=2,
   mygrid.y=2 and mygrid.z=1
   1d syntax is allowed: myfunc<<<5, 6>>>() makes 5 blocks (in a linear
   array) with 6 threads each and runs myfunc on them all.
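Putting these ingredients together, a launch might look like this sketch (my_func and its arguments are hypothetical):

```cuda
// Sketch: launching a kernel over a 2x2 grid of 8x8 thread blocks
dim3 mygrid(2, 2);    // mygrid.x=2, mygrid.y=2, mygrid.z=1
dim3 myblock(8, 8);   // 64 threads per block
my_func<<<mygrid, myblock>>>(arg1, arg2);  // 4 blocks of 64 threads = 256 threads

// 1d shorthand: 5 blocks of 6 threads each
myfunc<<<5, 6>>>();
```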


Writing some code (4) - built-in variables on the GPU



    For code running on the GPU (__device__ and __global__), some
    variables are predefined, allowing threads to locate themselves within
    their blocks and grids:
    dim3 gridDim - dimensions of the grid
    uint3 blockIdx - location of this block in the grid
    dim3 blockDim - dimensions of each block
    uint3 threadIdx - location of this thread in the block
    int warpSize - number of threads in a warp
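The usual idiom combines these to give every thread a unique global index; a minimal sketch (assuming a 1d launch):

```cuda
// Sketch: each thread computes its global index from the built-in variables
__global__ void fill(int *out)
{
  // offset of this block, plus this thread's position inside the block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = i;
}
```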




Writing some code (5) - where variables are stored



    For code running on the GPU (__device__ and __global__), the
    memory used to hold a variable can be specified:
    __device__ : the variable resides in the GPU’s global memory and is
    defined while the code runs.
    __constant__ : the variable resides in the constant memory space of
    the GPU and is defined while the code runs.
    __shared__ : the variable resides in the shared memory of the thread
    block and has the same lifespan as the block.
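A sketch of the __shared__ qualifier in use (hypothetical kernel, assuming 64-thread blocks):

```cuda
// Sketch: a __shared__ array visible to all threads of one block
__global__ void reverse_block(float *d)
{
  __shared__ float s[64];               // one copy per block, lifetime of the block
  int t = threadIdx.x;
  s[t] = d[blockIdx.x * blockDim.x + t];
  __syncthreads();                      // wait until every thread has written its element
  d[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];
}
```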




Example - vector adder
Start:
#include <stdlib.h>
#include <stdio.h>

#define N 1000
#define NBLOCK 10
#define NTHREAD 10


      Define the kernel to execute on the GPU
__global__
void adder(int n, float* a, float *b)
// a=a+b - thread code - add n numbers per thread
{
  int i,off = (N * blockIdx.x ) / NBLOCK +
    (threadIdx.x * N) / (NBLOCK * NTHREAD);

    for (i=off;i<off+n;i++)
    {
      a[i] = a[i] + b[i];
    }
}
Example - vector adder (2)


    Call using

  cudaMemcpy(gpu_a, host_a, sizeof(float) * n,
      cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_b, host_b, sizeof(float) * n,
      cudaMemcpyHostToDevice);

  adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

  cudaMemcpy(host_c, gpu_a, sizeof(float) * n,
      cudaMemcpyDeviceToHost);


    The cudaMemcpy calls are needed to push/pull the data on/off the GPU.
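The surrounding host code is not shown on the slide; a minimal sketch of the setup (using N, NBLOCK, NTHREAD from the kernel listing; error checking omitted):

```cuda
// Sketch: host-side allocation and launch for the adder kernel
int main(void)
{
  int n = N;
  float *host_a, *host_b, *host_c, *gpu_a, *gpu_b;

  // Host buffers in main memory
  host_a = (float*) malloc(sizeof(float) * n);
  host_b = (float*) malloc(sizeof(float) * n);
  host_c = (float*) malloc(sizeof(float) * n);

  // Device buffers in GPU memory
  cudaMalloc((void**) &gpu_a, sizeof(float) * n);
  cudaMalloc((void**) &gpu_b, sizeof(float) * n);

  /* ... fill host_a and host_b ... */

  cudaMemcpy(gpu_a, host_a, sizeof(float) * n, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_b, host_b, sizeof(float) * n, cudaMemcpyHostToDevice);

  adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

  cudaMemcpy(host_c, gpu_a, sizeof(float) * n, cudaMemcpyDeviceToHost);

  cudaFree(gpu_a); cudaFree(gpu_b);
  free(host_a); free(host_b); free(host_c);
  return 0;
}
```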



arXiv:0810.5365, Barros et al.

             “Blasting through lattice calculations using CUDA”
    An implementation of an important compute kernel for lattice QCD -
    the Wilson-Dirac operator - this is a sparse linear operator that
    represents the kinetic energy operator in a discrete version of the
    quantum field theory of relativistic quarks (interacting with gluons).
    Usually, performance is limited by memory bandwidth (and
    inter-processor communications).
    Data is stored in the GPU’s memory
    The “atom” of data is the spinor of the field on one site. This is 12 complex
    numbers (3 colours for 4 spins).
    They use the float4 CUDA data primitive, which packs four floating
    point numbers efficiently. An array of 6 float4 types then holds one
    lattice site of the quark field.


arXiv:0810.5365, Barros et al. (2)

Performance issues:
  1   16 threads can read 16 contiguous memory elements very efficiently -
      their implementation of 6 arrays for the spinor allows this contiguous
      access
  2   GPUs do not have caches; rather they have a small but fast shared
      memory. Access is managed by software instructions.
  3   The GPU has a very efficient thread manager which can schedule
      multiple threads to run within the cores of a multiprocessor. Best
      performance comes when the number of threads is (much) larger than
      the number of cores.
  4   The local shared memory space is only 16 kB - not enough! Barros et al.
      also use the registers on the multiprocessors (8,192 of them).
      Unfortunately, this means they have to hand-unroll all their loops!


arXiv:0810.5365, Barros et al. (3)

Performance: (even-odd) Wilson operator




arXiv:0810.5365, Barros et al. (4)

Performance: Conjugate Gradient solver:




Conclusions


   The GPU offers a very impressive architecture for scientific computing
   on a single chip.
   Peak performance is now close to 1 TFlop for less than EUR 1,000
   CUDA is an extension to C that allows multi-threaded software to
   execute on modern nVidia GPUs. There are alternatives for other
   manufacturers’ hardware, and proposed architecture-independent
   schemes (like OpenCL)
   Efficient use of the hardware is challenging; threads must be
   scheduled efficiently and synchronisation is slow. Memory access must
   be defined very carefully.
   The (near) future will be very interesting...



