HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

Efficient Scheduling of OpenMP and OpenCL Workloads
Getting the most out of your APU

Objective
! software has a long life span, exceeding that of the hardware it runs on
! software is very expensive to write and maintain
! next-generation hardware also needs to run legacy software
! Example: IWAVE
! procedural C-code
! no object orientation
! tight integration between data structures and functions
! What do I mean by efficient scheduling?
! find ways to utilize GPU cores for code blocks
! find ways to utilize all CPU cores and GPU units at the same time

!2

| OpenCL and OpenMP Workloads on Accelerated Processing Units |

Historical Context
GPU Compute Timeline
[Figure: timeline with axis labels 2002, 2008, 2010, and 2012; entries include CUDA, Aparapi, and C++ AMP]
!3

Accelerator Challenges
Technology Accessibility and Performance
[Figure: chart with axes Performance and Ease-of-Use, placing OpenCL & CUDA, CPU Multithread, and CPU Single Thread]
!4


APU Opportunities
One Die - Two Computational Devices

Metric                 CPU                      APU
Memory Size            large                    small
Memory Bandwidth       small                    large
Parallelism            small                    large
General Purpose        yes                      no
Performance            application dependent
Performance-per-Watt
Programming            Traditional              OpenCL
!5


APU Opportunities

Performance and Performance-per-Watt
! Example: Luxmark OpenCL Benchmark

! Similar CPU and GPU performance
! Best performance by using the APU
! GPU has best performance-per-Watt
! APU provides outstanding value

Metric               CPU     GPU     APU
Performance [Pts]    170     197     316
Power [W]             50      37      58
PPW [Pts/W]          3.4     5.3     5.4
Combined [Pts²/W]    578    1049    1722

Luxmark OpenCL Benchmark
Ubuntu 12.10 x86_64
4 Piledriver CPU cores @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
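The derived rows of the Luxmark figures follow from the measured score and power. The sketch below shows that arithmetic, assuming PPW is the score divided by the power draw and the combined metric is the score multiplied by PPW; the function names are hypothetical and not part of the benchmark.

```c
#include <assert.h>
#include <stdio.h>

/* Performance-per-Watt in Pts/W: benchmark score divided by power draw. */
double perf_per_watt(double points, double watts) {
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric in Pts^2/W: score multiplied by performance-per-Watt. */
double combined_metric(double points, double watts) {
    return points * perf_per_watt(points, watts);
}

/* Print one row for a device; the name is only a label. */
void print_row(const char *name, double points, double watts) {
    printf("%-4s %6.0f Pts %4.0f W %5.1f Pts/W %6.0f Pts^2/W\n",
           name, points, watts,
           perf_per_watt(points, watts), combined_metric(points, watts));
}
```

With the slide's figures of 170 Pts at 50 W, this gives 3.4 Pts/W and 578 Pts²/W.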
!6


Example: Luxmark Renderer

Performance and Performance-per-Watt

[Figure: bar chart of the Luxmark results, annotated with +64% and +81%]

!7


Luxmark OpenCL Benchmark
Render “Sala” Scene
Ubuntu 12.10 x86_64
4 Piledriver cores @ 2.5GHz
6 GPU CUs @ 720MHz
16GB DDR3 1600MHz

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Know the problem you are trying to solve.
! staggered rectangular grid in 3D
! coupled first order PDE
! scalar pressure field p
! vector velocity field v = {vx, vy, vz}
! source term g
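For reference, a common first-order form of the acoustic system that matches the fields listed above is sketched below. The bulk modulus κ and the density ρ are assumptions added for illustration; the slide names only p, v, and g, and the exact placement of the source term in IWAVE may differ.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa \, \nabla \cdot \mathbf{v} + g, \\
\rho \, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p, \qquad \mathbf{v} = (v_x, v_y, v_z).
\end{aligned}
```

The coupling is what orders the time step: the pressure update reads the velocities, and each velocity update reads the pressure.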

!8




while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field
    sgn_ts3d_210_v0_OpenMP(dom, pars);      // calculate velocity x-axis
    …                                       // calculate velocity y-axis
}

[Figure: timeline; the OpenMP row runs p, vx, vy, and vz one after another]
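To make the structure of such a time step concrete, here is a minimal 1D sketch of a staggered-grid update parallelized with OpenMP. It is not IWAVE's code: the second-order stencil, the constant coefficients `kdt_dx` and `dt_rdx`, and all function names are assumptions for illustration. Without OpenMP support the pragmas are ignored and the loops run serially.

```c
#include <assert.h>
#include <stddef.h>

/* Pressure update on a 1D staggered grid: p[i] sits between v[i-1] and v[i].
 * kdt_dx = kappa * dt / dx (assumed constant, for illustration). */
void update_pressure(double *p, const double *v, size_t n, double kdt_dx) {
    assert(n >= 2);
    #pragma omp parallel for
    for (long i = 1; i < (long)n; ++i) {
        p[i] -= kdt_dx * (v[i] - v[i - 1]);
    }
}

/* Velocity update: v[i] sits between p[i] and p[i+1].
 * dt_rdx = dt / (rho * dx) (assumed constant, for illustration). */
void update_velocity(double *v, const double *p, size_t n, double dt_rdx) {
    assert(n >= 2);
    #pragma omp parallel for
    for (long i = 0; i < (long)n - 1; ++i) {
        v[i] -= dt_rdx * (p[i + 1] - p[i]);
    }
}

/* One time step: pressure first, then velocity. */
void time_step(double *p, double *v, size_t n, double kdt_dx, double dt_rdx) {
    update_pressure(p, v, n, kdt_dx);
    update_velocity(v, p, n, dt_rdx);
}
```

Each loop iteration writes only its own element and reads only the other field, so the iterations of one update are independent, while the two updates must run in order.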

!9




! Measure the initial performance.
! pressure and velocity fields simulated using OpenMP
! average time T[ms] per iteration
! OpenMP scales linearly with the number of threads
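One way to check the scaling claim from measured per-iteration times is to compute speedup and parallel efficiency. This is a small sketch with hypothetical function names; the timings passed in are whatever the measurement produced.

```c
#include <assert.h>

/* Speedup of an n-thread run over the single-thread run, computed from the
 * average time per iteration of each run (same unit, e.g. ms). */
double speedup(double t_one_thread, double t_n_threads) {
    assert(t_one_thread > 0.0 && t_n_threads > 0.0);
    return t_one_thread / t_n_threads;
}

/* Parallel efficiency: 1.0 means perfectly linear scaling. */
double parallel_efficiency(double t_one_thread, double t_n_threads, int n_threads) {
    assert(n_threads > 0);
    return speedup(t_one_thread, t_n_threads) / n_threads;
}
```

An efficiency close to 1.0 at every thread count corresponds to linear scaling.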

!10




! find computational blocks
! understand dependencies between blocks
! identify sequential and parallel parts

[Figure: causality diagram of the blocks OpenMP p, OpenMP vx, OpenMP vy, and OpenMP vz, above a timeline whose OpenMP row runs p, vx, vy, and vz in sequence]

!11




while(…) {
    …                                       // calculate pressure field p
    sgn_ts3d_210_v0_OpenCL(dom, pars);
    …
}

[Figure: timeline; the OpenCL row runs vx; the OpenMP row runs p, is IDLE while vx runs, then runs vy and vz]

!12




! use the GPU to compute vx
! the CPU is idle while the GPU is running
! 42% improvement for 1 thread
! 25% improvement for 2 threads

!13



while(…) {                                                  // main simulation loop
    …                                                       // calculate pressure field p

    const char* env = getenv("OMP_NUM_THREADS");            // save the current number of OpenMP threads
    int num_threads = env ? atoi(env) : omp_get_max_threads();
    omp_set_num_threads(2);                                 // restrict the number of OpenMP threads to 2
    omp_set_nested(1);                                      // allow nested OpenMP threads

    #pragma omp parallel shared(…) private(…)               // start 2 OpenMP threads
    {
        switch ( omp_get_thread_num() ) {
        case 0:
            sgn_ts3d_210_v0_OpenCL(dom, pars);              // calculate velocity x-axis using OpenCL
            break;
        case 1:
            omp_set_num_threads(num_threads);               // increase number of OpenMP threads back
            …                                               // calculate velocity y-axis
            …                                               // calculate velocity z-axis
            break;
        default:
            break;
        }
    }                                                       // close OpenMP pragma
}                                                           // close simulation while

[Figure: timeline; the OpenCL row runs vx while the OpenMP row runs p, vy, and vz]
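The same overlap can also be expressed with OpenMP `sections` instead of a switch on the thread number. The sketch below is an alternative formulation, not the code from the slide; the three `compute_*` functions are hypothetical stand-ins that only record that they ran. Without OpenMP support the pragmas are ignored and the sections run one after another.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real kernels: each just records that it ran. */
void compute_vx_on_gpu(int *done) { *done = 1; }   /* would enqueue OpenCL work */
void compute_vy_on_cpu(int *done) { *done = 1; }   /* would run an OpenMP loop  */
void compute_vz_on_cpu(int *done) { *done = 1; }   /* would run an OpenMP loop  */

/* Run the GPU task and the CPU tasks side by side.
 * Returns the number of tasks completed (3 on success). */
int overlapped_velocity_step(void) {
    int vx_done = 0, vy_done = 0, vz_done = 0;

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            compute_vx_on_gpu(&vx_done);   /* one thread drives the GPU */
        }
        #pragma omp section
        {
            compute_vy_on_cpu(&vy_done);   /* the other keeps the CPU busy */
            compute_vz_on_cpu(&vz_done);
        }
    }   /* implicit barrier: all three tasks have finished here */

    return vx_done + vy_done + vz_done;
}
```

The implicit barrier at the end of the `sections` construct plays the role of the closing brace of the parallel region: the next time step starts only after both the GPU and the CPU work are done.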

!14




! overlap vx and vy
! CPU not idle anymore
! 50% improvement for 1 thread

!15




while(…) {
    sgn_ts3d_210_p012_OpenCL(dom, pars);                        // calculate pressure field
    …
}

bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                     // copy data from host to device
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);        // execute OpenCL kernel on device
    clEnqueueReadBuffer(queue, buffer, …);                      // copy data from device to host
    …
}

[Figure: timeline; the OpenCL row runs p, vx, vy, and vz]

!16




! understand where performance gets lost
! 98% of time spent on I/O
! 2% of time spent on compute
! reduce I/O

OpenCL Upload: 188ms
Kernel Execution: 4ms
OpenCL Download: 54ms

[Figure: timeline; the OpenCL row runs vx, the OpenMP row runs p, vy, and vz]
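The percentages follow from the three timings on this slide, assuming upload, kernel execution, and download are the only contributions to the OpenCL call. Writing the fraction of time spent on OpenCL I/O as F_{I/O}:

```latex
F_{I/O} = \frac{188\,\mathrm{ms} + 54\,\mathrm{ms}}{188\,\mathrm{ms} + 4\,\mathrm{ms} + 54\,\mathrm{ms}}
        = \frac{242}{246} \approx 0.98,
\qquad
1 - F_{I/O} = \frac{4}{246} \approx 0.02
```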

!17



Example: High Throughput Computer Vision with OpenCV
! How does the speedup of an OpenCL application ($S_{OpenCL}$) depend on the speedup of the OpenCL kernel ($S_{Kernel}$) when the OpenCL I/O time is fixed?
! Fraction of OpenCL I/O time: $F_{I/O}$
! 50% I/O time limits the maximal possible speedup to 2
! Minimize OpenCL I/O, only then increase OpenCL kernel performance

$$S_{OpenCL} = \frac{S_{Kernel}}{(S_{Kernel} - 1)\, F_{I/O} + 1}$$

!18
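The speedup relation on this slide is easy to evaluate directly. The sketch below assumes that the I/O fraction is measured before the kernel is sped up; the function names are hypothetical.

```c
#include <assert.h>

/* Application speedup when only the kernel gets faster.
 * s_kernel: kernel speedup (>= 1); f_io: fraction of the original time
 * spent on OpenCL I/O, in [0, 1]. */
double opencl_speedup(double s_kernel, double f_io) {
    assert(s_kernel >= 1.0);
    assert(f_io >= 0.0 && f_io <= 1.0);
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}

/* Upper bound on the speedup as the kernel becomes infinitely fast: 1 / f_io. */
double max_opencl_speedup(double f_io) {
    assert(f_io > 0.0 && f_io <= 1.0);
    return 1.0 / f_io;
}
```

For an I/O fraction of 0.5, a kernel that is 3 times faster gives an application speedup of 1.5, and no kernel speedup can push the application past 2.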




while(…) {                                                      // main simulation loop
    sgn_ts3d_210_ALL_OpenCL(dom, pars);                         // combine all OpenCL calculations
    …
}

bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                     // copy data from host to device
    while(…) {
        clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel for pressure
        clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);      // execute OpenCL kernel for velocity x
        …                                                       // execute OpenCL kernel for velocity y
        …                                                       // execute OpenCL kernel for velocity z
    }
    clEnqueueReadBuffer(queue, buffer, …);                      // copy data from device to host
    …
}

[Figure: timeline; the OpenCL row runs p, vx, vy, and vz]
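A simple cost model shows why moving the transfers out of the iteration loop matters. The sketch below assumes constant upload, kernel, and download times per call and ignores any overlap between them; the function names are hypothetical.

```c
#include <assert.h>

/* Total time when every iteration uploads, runs the kernel, and downloads. */
double time_with_io_each_iteration(int iterations, double t_up,
                                   double t_kernel, double t_down) {
    assert(iterations >= 0);
    return iterations * (t_up + t_kernel + t_down);
}

/* Total time when data is uploaded once, stays on the device for all
 * iterations, and is downloaded once at the end. */
double time_with_resident_data(int iterations, double t_up,
                               double t_kernel, double t_down) {
    assert(iterations >= 0);
    return t_up + iterations * t_kernel + t_down;
}
```

As an illustration, with 188 ms upload, 4 ms kernel time, and 54 ms download, 10 iterations cost 2460 ms when transferring every iteration and 282 ms when the data stays on the device.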

!19




! eliminate all but essential I/O
! significant speedup over simple OpenCL

!20




! measure real application performance
! 3000 iterations using a 97x405x389 simulation grid
! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; axis ticks from 0 to 14]

!21


! initial OpenCL performance measurements
! 89 Algorithms tested for image size of 4MP
! compare OpenCL I/O and execution time
! 28% of all algorithms are compute bound
! 72% of all algorithms are I/O bound

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
16GB DDR3 1600MHz
!22



! compare OpenCL and single-threaded performance
! 89 Algorithms tested for image size of 4MP
! realistic timing that includes I/O over PCIe
! 59% of all algorithms execute faster on the GPU
! 41% of all algorithms execute faster on the CPU(1)
! significant speedup for only 15% of all algorithms

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
16GB DDR3 1600MHz
!23



! Task: Batch process a large number of images using a single algorithm.
! OpenCL performance depends on the algorithm and the image size
! Either the CPU or the GPU processes the data, but not both
! How to choose which algorithm and device to use depending on image size?

!24




!25



! Better: create an input image queue that the CPU and the GPU query for new image tasks until the queue is empty.
! all CPU cores are fully utilized at all times, even for single-threaded algorithms
! all GPU compute units are fully utilized at all times
! combined performance for a single algorithm is the sum of the GPU and CPU performance for that algorithm
! combined performance for multiple algorithms is better than the sum of the device performances

$$P^{i}_{APU} = P^{i}_{CPU} + P^{i}_{GPU}$$

$$P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$$

!26
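The shared-queue idea can be illustrated with a small single-threaded simulation: whichever device becomes free first takes the next image. This is a sketch of the scheduling only, not a concurrent implementation; the constant per-image times and all names are assumptions.

```c
#include <assert.h>

/* Result of draining one shared image queue with two devices. */
typedef struct {
    int    cpu_images;   /* images processed by the CPU */
    int    gpu_images;   /* images processed by the GPU */
    double makespan;     /* time at which the last image finishes */
} QueueResult;

/* Simulate a shared queue: whichever device becomes free first takes the
 * next image. t_cpu and t_gpu are the (assumed constant) per-image times. */
QueueResult drain_shared_queue(int n_images, double t_cpu, double t_gpu) {
    assert(n_images >= 0 && t_cpu > 0.0 && t_gpu > 0.0);
    QueueResult r = {0, 0, 0.0};
    double cpu_free = 0.0, gpu_free = 0.0;   /* time each device is next idle */

    for (int i = 0; i < n_images; ++i) {
        if (gpu_free <= cpu_free) {          /* GPU is idle first (ties go to the GPU) */
            gpu_free += t_gpu;
            r.gpu_images++;
        } else {                             /* CPU is idle first */
            cpu_free += t_cpu;
            r.cpu_images++;
        }
    }
    r.makespan = cpu_free > gpu_free ? cpu_free : gpu_free;
    return r;
}
```

With 10 images, 3 time units per image on the CPU and 2 on the GPU, the simulation finishes at time 12 with 4 images on the CPU and 6 on the GPU. That is 10/12 images per time unit, which equals the sum of the device rates, 1/3 + 1/2.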



!27



Summary

! next-generation hardware and legacy code require compromises
! OpenCL performance is tied to Amdahl’s Law through the ratio of OpenCL I/O time to OpenCL execution time
! application performance can be increased by overlapping OpenCL and OpenMP workloads
! removing all but necessary OpenCL I/O can have a dramatic influence on performance
! for loosely coupled high-throughput applications, the OpenCL and OpenMP performances add up for a single algorithm
! for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
! APUs may provide greatest performance per Watt
! GPUs may provide greatest performance

!28


DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors. 
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes. 
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. 
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation
Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

!29

