A beginner’s guide to programming GPUs with CUDA

                               Mike Peardon

                            School of Mathematics
                            Trinity College Dublin


                               April 24, 2009




 Mike Peardon (TCD)   A beginner’s guide to programming GPUs with CUDA   April 24, 2009   1 / 20
What is a GPU?



                         Graphics Processing Unit
   Processor dedicated to rapid rendering of polygons - texturing,
   shading
   They are mass-produced, so very cheap: 1 Tflop peak ≈ EUR 1,000
   They have many compute cores, but a simpler architecture compared
   with a standard CPU
   The “shader pipeline” can be used to do floating point calculations
   −→ cheap scientific/technical computing




What is a GPU? (2)




What is CUDA?



                        Compute Unified Device Architecture
   An extension to the C programming language
   Adds library functions to access the GPU
   Adds directives to translate C into instructions that run on the host
   CPU or on the GPU, as needed
   Allows easy multi-threading - parallel execution on all thread
   processors on the GPU




Will CUDA work on my PC/laptop?




   CUDA works on modern nVidia cards (Quadro, GeForce, Tesla)
   See http://www.nvidia.com/object/cuda_learn_products.html




nVidia’s compiler - nvcc



    CUDA code must be compiled using nvcc
    nvcc generates instructions for both the host CPU and the GPU (PTX
    instruction set), as well as instructions to send data back and forth
    between them
    Standard CUDA install: /usr/local/cuda/bin/nvcc
    The shell executing the compiled code needs the dynamic linker path:
    the LD_LIBRARY_PATH environment variable must include
    /usr/local/cuda/lib
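With those defaults, a compile-and-run session might look like the following sketch (the source file name adder.cu is hypothetical; paths assume a standard install as above):

```shell
# Make nvcc visible, then compile a CUDA source file
export PATH="$PATH:/usr/local/cuda/bin"
nvcc -o adder adder.cu

# Point the dynamic linker at the CUDA runtime library, then run
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib"
./adder
```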




Simple overview
   [Diagram: the PC motherboard holds the CPU, its main memory and the
   network/disk interfaces; the GPU, with its multiprocessors and its own
   memory, is attached over the PCI bus.]
   GPU can’t directly access main memory
   CPU can’t directly access GPU memory
   Need to explicitly copy data
   No printf!
Writing some code (1) - specifying where code runs


   CUDA provides function type qualifiers (not part of standard C/C++) to
   let the programmer specify where a function should run:
   __host__ : the code runs on the host CPU (redundant
   on its own - it is the default)
   __device__ : the code runs on the GPU, and the
   function can only be called by code running on the GPU
   __global__ : the code runs on the GPU, but is called
   from the host - this is the access point to start multi-threaded codes
   running on the GPU
   The device can’t execute code on the host!
   CUDA imposes some restrictions, such as: device code is C-only (host
   code can be C++), device code can’t be called recursively, ...
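A minimal sketch of the three qualifiers side by side (the function names here are hypothetical):

```cuda
// Sketch: the three function type qualifiers
__host__ void on_cpu(void) { }     // runs on the CPU, called from the CPU (the default)

__device__ float on_gpu(float x)   // runs on the GPU, callable only from GPU code
{
  return 2.0f * x;
}

__global__ void kernel(float *a)   // runs on the GPU, launched from the host
{
  a[threadIdx.x] = on_gpu(a[threadIdx.x]);
}
```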


Code execution




Writing some code (2) - launching a __global__ function
   All calls to a __global__ function must specify how many threaded
   copies to launch and in what configuration.
   CUDA syntax: <<< >>>
   Threads are grouped into thread blocks, then into a grid of blocks
   This defines a memory hierarchy (important for performance)




The thread/block/grid model




Writing some code (3) - launching a __global__ function


   Inside the <<< >>>, at least two arguments are needed (there can be
   two more, which have default values)
   A call looks like, e.g., my_func<<<bg, tb>>>(arg1, arg2)
   bg specifies the dimensions of the grid of blocks and tb specifies the
   dimensions of each thread block
   bg and tb are both of type dim3 (a new datatype defined by CUDA:
   three unsigned ints, where any unspecified component defaults to 1)
   dim3 has struct-like access - members are x, y and z
   CUDA provides a constructor: dim3 mygrid(2,2); sets mygrid.x=2,
   mygrid.y=2 and mygrid.z=1
   1d syntax is allowed: myfunc<<<5, 6>>>() makes 5 blocks (in a linear
   array) with 6 threads each and runs myfunc on them all.
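Putting these ingredients together, a launch might look like this sketch (my_func and its arguments are hypothetical):

```cuda
// Sketch: launching a kernel over a 2x2 grid of 8x8 thread blocks
dim3 mygrid(2, 2);    // mygrid.x=2, mygrid.y=2, mygrid.z=1
dim3 myblock(8, 8);   // 64 threads per block
my_func<<<mygrid, myblock>>>(arg1, arg2);  // 4 blocks of 64 threads = 256 threads

// 1d shorthand: 5 blocks of 6 threads each
myfunc<<<5, 6>>>();
```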


Writing some code (4) - built-in variables on the GPU



    For code running on the GPU (__device__ and __global__), some
    variables are predefined, allowing threads to locate themselves within
    their blocks and grids:
    dim3 gridDim - dimensions of the grid
    uint3 blockIdx - location of this block in the grid
    dim3 blockDim - dimensions of each block
    uint3 threadIdx - location of this thread in the block
    int warpSize - number of threads in a warp
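The usual idiom combines these to give every thread a unique global index; a minimal sketch (assuming a 1d launch):

```cuda
// Sketch: each thread computes its global index from the built-in variables
__global__ void fill(int *out)
{
  // offset of this block, plus this thread's position inside the block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = i;
}
```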




Writing some code (5) - where variables are stored



    For code running on the GPU (__device__ and __global__), the
    memory used to hold a variable can be specified:
    __device__ : the variable resides in the GPU’s global memory and is
    defined while the code runs.
    __constant__ : the variable resides in the constant memory space of
    the GPU and is defined while the code runs.
    __shared__ : the variable resides in the shared memory of the thread
    block and has the same lifespan as the block.
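A sketch of the __shared__ qualifier in use (hypothetical kernel, assuming 64-thread blocks):

```cuda
// Sketch: a __shared__ array visible to all threads of one block
__global__ void reverse_block(float *d)
{
  __shared__ float s[64];               // one copy per block, lifetime of the block
  int t = threadIdx.x;
  s[t] = d[blockIdx.x * blockDim.x + t];
  __syncthreads();                      // wait until every thread has written its element
  d[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];
}
```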




Example - vector adder
Start:
#include <stdlib.h>
#include <stdio.h>

#define N 1000
#define NBLOCK 10
#define NTHREAD 10


      Define the kernel to execute on the GPU
__global__
void adder(int n, float* a, float *b)
// a=a+b - thread code - add n numbers per thread
{
  int i,off = (N * blockIdx.x ) / NBLOCK +
    (threadIdx.x * N) / (NBLOCK * NTHREAD);

    for (i=off;i<off+n;i++)
    {
      a[i] = a[i] + b[i];
    }
}
Example - vector adder (2)


    Call using

  cudaMemcpy(gpu_a, host_a, sizeof(float) * n,
      cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_b, host_b, sizeof(float) * n,
      cudaMemcpyHostToDevice);

  adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

  cudaMemcpy(host_c, gpu_a, sizeof(float) * n,
      cudaMemcpyDeviceToHost);


    The cudaMemcpy calls are needed to push/pull the data on/off the GPU.
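The surrounding host code is not shown on the slide; a minimal sketch of the setup (using N, NBLOCK, NTHREAD from the kernel listing; error checking omitted):

```cuda
// Sketch: host-side allocation and launch for the adder kernel
int main(void)
{
  int n = N;
  float *host_a, *host_b, *host_c, *gpu_a, *gpu_b;

  // Host buffers in main memory
  host_a = (float*) malloc(sizeof(float) * n);
  host_b = (float*) malloc(sizeof(float) * n);
  host_c = (float*) malloc(sizeof(float) * n);

  // Device buffers in GPU memory
  cudaMalloc((void**) &gpu_a, sizeof(float) * n);
  cudaMalloc((void**) &gpu_b, sizeof(float) * n);

  /* ... fill host_a and host_b ... */

  cudaMemcpy(gpu_a, host_a, sizeof(float) * n, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_b, host_b, sizeof(float) * n, cudaMemcpyHostToDevice);

  adder<<<NBLOCK, NTHREAD>>>(n / (NBLOCK * NTHREAD), gpu_a, gpu_b);

  cudaMemcpy(host_c, gpu_a, sizeof(float) * n, cudaMemcpyDeviceToHost);

  cudaFree(gpu_a); cudaFree(gpu_b);
  free(host_a); free(host_b); free(host_c);
  return 0;
}
```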



arXiv:0810.5365, Barros et al.

             “Blasting through lattice calculations using CUDA”
    An implementation of an important compute kernel for lattice QCD -
    the Wilson-Dirac operator - this is a sparse linear operator that
    represents the kinetic energy operator in a discrete version of the
    quantum field theory of relativistic quarks (interacting with gluons).
    Usually, performance is limited by memory bandwidth (and
    inter-processor communications).
    Data is stored in the GPU’s memory
    The “atom” of data is the spinor of the field on one site. This is 12 complex
    numbers (3 colours for 4 spins).
    They use the float4 CUDA data primitive, which packs four floating
    point numbers efficiently. An array of 6 float4 types then holds one
    lattice site of the quark field.


arXiv:0810.5365, Barros et al. (2)

Performance issues:
  1   16 threads can read 16 contiguous memory elements very efficiently -
      their implementation of 6 arrays for the spinor allows this contiguous
      access
  2   GPUs do not have caches; rather they have a small but fast shared
      memory. Access is managed by software instructions.
  3   The GPU has a very efficient thread manager which can schedule
      multiple threads to run within the cores of a multiprocessor. Best
      performance comes when the number of threads is (much) larger than
      the number of cores.
  4   The local shared memory space is only 16 kB - not enough! Barros et al.
      also use the registers on the multiprocessors (8,192 of them).
      Unfortunately, this means they have to hand-unroll all their loops!


arXiv:0810.5365, Barros et al. (3)

Performance: (even-odd) Wilson operator




arXiv:0810.5365, Barros et al. (4)

Performance: Conjugate Gradient solver:




Conclusions


   The GPU offers a very impressive architecture for scientific computing
   on a single chip.
   Peak performance is now close to 1 TFlop for less than EUR 1,000
   CUDA is an extension to C that allows multi-threaded software to
   execute on modern nVidia GPUs. There are alternatives for other
   manufacturers’ hardware, and proposed architecture-independent
   schemes (like OpenCL)
   Efficient use of the hardware is challenging; threads must be
   scheduled efficiently and synchronisation is slow. Memory access must
   be defined very carefully.
   The (near) future will be very interesting...



