Efficient Scheduling of OpenMP and OpenCL Workloads
Getting the most out of your APU
Objective
! software has a long life-span that exceeds the life-span of hardware
! software is very expensive to be written and maintained
! next generation hardware also needs to run legacy software
! Example: IWAVE
! procedural C-code
! no object orientation
! tight integration between data structures and functions
! What do I mean by efficient scheduling?
! find ways to utilize GPU cores for code blocks
! find ways to utilize all CPU cores and GPU units at the same time

| OpenCL and OpenMP Workloads on Accelerated Processing Units |
Historical Context
GPU Compute Timeline

[Figure: timeline marked 2002, 2008, 2010, and 2012; labels include CUDA, Aparapi, and AMP C++]

Accelerator Challenges
Technology Accessibility and Performance
[Figure: performance plotted against ease-of-use for CPU Single Thread, CPU Multithread, and OpenCL & CUDA]

APU Opportunities
One Die - Two Computational Devices

Metric                 CPU                     APU
Memory Size            large                   small
Memory Bandwidth       small                   large
Parallelism            small                   large
General Purpose        yes                     no
Performance            application dependent   application dependent
Performance-per-Watt   application dependent   application dependent
Programming            Traditional             OpenCL

APU Opportunities

Performance and Performance-per-Watt
! Example: Luxmark OpenCL Benchmark
! GPU has best performance-per-Watt
! Best performance by using the APU
! Similar CPU and GPU performance
! APU provides outstanding value

Metric                CPU    GPU    APU
Performance[Pts]      170    197    316
Power[W]               50     37     58
PPW[Pts/W]            3.4    5.3    5.4
Combined[Pts^2/W]     578   1049   1722

Luxmark OpenCL Benchmark
Ubuntu 12.10 x86_64
4 Piledriver CPU cores @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz

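The two derived rows of the Luxmark table follow from the measured points and power. The sketch below assumes that PPW is points divided by Watts and that the "Combined" metric is points squared divided by Watts, as the unit Pts^2/W suggests. The function names are illustrative and do not come from the slides.

```c
#include <stdio.h>

/* Performance-per-Watt in Pts/W. */
double perf_per_watt(double points, double watts) {
    return points / watts;
}

/* "Combined" metric in Pts^2/W (assumed: performance times performance-per-Watt). */
double combined_metric(double points, double watts) {
    return points * points / watts;
}

/* Print the derived rows for the three devices listed on the slide. */
void print_luxmark_rows(void) {
    const char *device[] = {"CPU", "GPU", "APU"};
    const double points[] = {170.0, 197.0, 316.0};  /* Performance[Pts] */
    const double watts[]  = {50.0, 37.0, 58.0};     /* Power[W] */
    for (int i = 0; i < 3; ++i) {
        printf("%s: PPW = %.1f Pts/W, Combined = %.0f Pts^2/W\n",
               device[i],
               perf_per_watt(points[i], watts[i]),
               combined_metric(points[i], watts[i]));
    }
}
```

Calling `print_luxmark_rows()` from a `main` function recomputes the PPW and Combined rows from the Performance and Power rows.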
Example: Luxmark Renderer

Performance and Performance-per-Watt
[Figure: Luxmark results for rendering the “Sala” scene; chart annotations: +64%, +81%]

Luxmark OpenCL Benchmark
Render “Sala” Scene
Ubuntu 12.10 x86_64
4 Piledriver cores @ 2.5GHz
6 GPU CUs @ 720MHz
16GB DDR3 1600MHz

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Know the problem you are trying to solve.
! staggered rectangular grid in 3D
! coupled first order PDE
! scalar pressure field p
! vector velocity field v = {vx, vy, vz}
! source term g
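
For orientation, one common way to write a coupled first-order acoustic system in pressure and velocity is sketched below. The bulk modulus \kappa and the density \rho are assumptions introduced here; they do not appear on the slide, and the exact form and sign conventions used by IWAVE may differ.

```latex
\begin{aligned}
\frac{1}{\kappa}\,\frac{\partial p}{\partial t} &= -\nabla \cdot \mathbf{v} + g, \\
\rho\,\frac{\partial \mathbf{v}}{\partial t} &= -\nabla p,
\qquad \mathbf{v} = (v_x, v_y, v_z).
\end{aligned}
```

In this form the pressure update needs all three velocity components, while each velocity component needs only the pressure, which is what makes separate update blocks for p, vx, vy, and vz natural.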

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field
  sgn_ts3d_210_v0_OpenMP(dom, pars);      // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
  …
}

[Timeline: OpenMP p, OpenMP vx, OpenMP vy, OpenMP vz run one after another under OpenMP]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! Measure the initial performance.
! pressure and velocity field simulated using OpenMP
! average time T[ms] per iteration
! OpenMP scales linearly with the number of threads
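
A minimal sketch of how the average time T[ms] per iteration could be measured. It assumes a C11 library that provides `timespec_get`; `simulation_step` is a hypothetical stand-in for one pass through the simulation loop and is not an IWAVE function.

```c
#include <time.h>

/* Hypothetical stand-in for one iteration of the simulation loop. */
void simulation_step(void) {
    volatile double sink = 0.0;
    for (int i = 0; i < 100000; ++i) sink += i * 0.5;
}

/* Wall-clock time in milliseconds (C11 timespec_get). */
double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Average wall-clock time in ms over `iterations` calls of `step`. */
double average_ms_per_iteration(void (*step)(void), int iterations) {
    double start = now_ms();
    for (int i = 0; i < iterations; ++i) step();
    return (now_ms() - start) / iterations;
}
```

Wall-clock time is used rather than `clock()`, because `clock()` reports CPU time summed over all threads and would hide any gain from running on more OpenMP threads.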

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! find computational blocks
! understand dependencies between blocks
! identify sequential and parallel parts

[Diagram, causality: OpenMP p, followed by OpenMP vx, OpenMP vy, and OpenMP vz]
[Timeline: OpenMP p, OpenMP vx, OpenMP vy, OpenMP vz run one after another under OpenMP]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field p
  sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
  …
}

[Timeline: OpenMP p, then OpenCL vx while the OpenMP threads are IDLE, then OpenMP vy and OpenMP vz]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! use the GPU to compute vx
! the CPU is idle while the GPU is running
! 42% improvement for 1 thread
! 25% improvement for 2 threads
! 9% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                              // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);                  // calculate pressure field p

  // note: getenv() returns NULL if OMP_NUM_THREADS is unset
  int num_threads = atoi(getenv("OMP_NUM_THREADS"));    // save the current number of OpenMP threads
  omp_set_num_threads(2);                               // restrict the number of OpenMP threads to 2
  omp_set_nested(1);                                    // allow nested OpenMP threads

  #pragma omp parallel shared(…) private(…)             // start 2 OpenMP threads
  {
    switch ( omp_get_thread_num() ) {
      case 0:
        sgn_ts3d_210_v0_OpenCL(dom, pars);              // calculate velocity x-axis using OpenCL
        break;
      case 1:
        omp_set_num_threads(num_threads);               // increase number of OpenMP threads back
        sgn_ts3d_210_v1_OpenMP(dom, pars);              // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);              // calculate velocity z-axis
        break;
      default:
        break;
    }
  }                                                     // close OpenMP pragma
  …
}                                                       // close simulation while

[Timeline: OpenMP p, then OpenCL vx overlapped with OpenMP vy and OpenMP vz]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! overlap vx and vy
! CPU not idle anymore
! 50% improvement for 1 thread
! 40% improvement for 2 threads
! 38% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_p012_OpenCL(dom, pars);    // calculate pressure field
  sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenCL(dom, pars);      // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenCL(dom, pars);      // calculate velocity z-axis
  …
}

bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
  …
  clEnqueueWriteBuffer(queue, buffer, …);                 // copy data from host to device
  clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel on device
  clEnqueueReadBuffer(queue, buffer, …);                  // copy data from device to host
  …
}

[Timeline: OpenCL p, OpenCL vx, OpenCL vy, OpenCL vz run one after another under OpenCL]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! understand where performance gets lost
! 98% of time spent on I/O
! 2% of time spent on compute
! reduce I/O

OpenCL Upload       188ms
Kernel Execution      4ms
OpenCL Download      54ms

[Timeline: OpenCL vx shown alongside OpenMP p, OpenMP vy, and OpenMP vz]

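The 98% figure can be checked against the three timings on this slide (188 ms upload, 4 ms kernel execution, 54 ms download). The helper below is an illustrative sketch; its name is not taken from the slides.

```c
/* Fraction of the total OpenCL time spent on host-device transfers. */
double io_fraction(double upload_ms, double kernel_ms, double download_ms) {
    double io_ms = upload_ms + download_ms;
    return io_ms / (io_ms + kernel_ms);
}
```

With the slide's numbers, `io_fraction(188.0, 4.0, 54.0)` evaluates 242/246, which rounds to the 98% quoted above and leaves about 2% for compute.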
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! How does the speedup of an OpenCL application (S_OpenCL) depend on the speedup of the OpenCL kernel (S_Kernel) when the OpenCL I/O time is fixed?
! Fraction of OpenCL I/O time: F_I/O
! An I/O time fraction of 50% limits the maximum possible speedup to 2
! Minimize OpenCL I/O first; only then increase OpenCL kernel performance

$$S_{\mathrm{OpenCL}} = \frac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\,F_{\mathrm{I/O}} + 1}$$

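The relation on this slide, S_OpenCL = S_Kernel / ((S_Kernel - 1) * F_I/O + 1), can be evaluated directly. The sketch below uses illustrative function names and assumes F_I/O is given as a fraction between 0 and 1 (exclusive of 0 for the limit).

```c
/* Application speedup for a kernel speedup s_kernel when a fixed
   fraction f_io of the original run time is OpenCL I/O. */
double opencl_speedup(double s_kernel, double f_io) {
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}

/* Limit of opencl_speedup as s_kernel grows without bound: 1 / f_io. */
double max_opencl_speedup(double f_io) {
    return 1.0 / f_io;
}
```

For f_io = 0.5 the limit is 2, which matches the bullet above: no amount of kernel tuning can more than double the application speed while half of the time goes to I/O.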
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                // main simulation loop
  sgn_ts3d_210_ALL_OpenCL(dom, pars);     // combine all OpenCL calculations
  …
}

bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
  …
  clEnqueueWriteBuffer(queue, buffer, …);                   // copy data from host to device

  while(…) {
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel for pressure
    clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);      // execute OpenCL kernel for velocity x
    clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);      // execute OpenCL kernel for velocity y
    clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);      // execute OpenCL kernel for velocity z
  }
  clEnqueueReadBuffer(queue, buffer, …);                    // copy data from device to host
  …
}

[Timeline: OpenCL p, OpenCL vx, OpenCL vy, OpenCL vz run one after another under OpenCL]

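A rough cost model shows why moving the transfers out of the loop matters. This is a sketch under simplifying assumptions: every iteration costs the same, transfers and kernels do not overlap, and the function names are illustrative rather than part of IWAVE or OpenCL.

```c
/* Total time when every iteration uploads, runs the kernel work, and downloads. */
double per_iteration_transfer_ms(int iterations, double upload_ms,
                                 double kernel_ms, double download_ms) {
    return iterations * (upload_ms + kernel_ms + download_ms);
}

/* Total time when data is uploaded once, all iterations run on the device,
   and the result is downloaded once. */
double single_transfer_ms(int iterations, double upload_ms,
                          double kernel_ms, double download_ms) {
    return upload_ms + iterations * kernel_ms + download_ms;
}
```

As an illustration only, reusing the 188 ms / 4 ms / 54 ms timings quoted earlier in the deck for 100 iterations gives 24600 ms for the first variant and 642 ms for the second. These are model values, not measurements from the talk.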
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! eliminate all but essential I/O
! significant speedup over simple OpenCL

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! measure real application performance
! 3000 iterations using a 97x405x389 simulation grid
! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; axis ticks from 0 to 14]

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! initial OpenCL performance measurements
! 89 Algorithms tested for image size of 4MP
! compare OpenCL I/O and execution time
! 28% of all algorithms are compute bound
! 72% of all algorithms are I/O bound

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! compare OpenCL and single-threaded performance
! 89 Algorithms tested for image size of 4MP
! realistic timing that includes I/O over PCIe
! 59% of all algorithms execute faster on the GPU
! 41% of all algorithms execute faster on the CPU(1)
! significant speedup for only 15% of all algorithms

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Task: Batch process a large number of images using a single algorithm.
! OpenCL performance is algorithm and image size dependent
! Either the CPU will process data or the GPU, but not both
! How to choose which algorithm and device to use depending on image size?

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Better: create an input image queue that the CPU and GPU query for new image tasks until the queue is empty.
! all CPU cores are fully utilized at all times, even for single-threaded algorithms
! all GPU compute units are fully utilized at all times
! combined performance for a single algorithm is the sum of the GPU and CPU performance for that algorithm
! combined performance for multiple algorithms is better than the sum of the device performances

$$P_i^{\mathrm{APU}} = P_i^{\mathrm{CPU}} + P_i^{\mathrm{GPU}}$$

$$P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$$

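The shared-queue idea can be illustrated with a small deterministic simulation. This is a sketch, not the scheduler from the talk: it assumes a fixed processing time per image on each device and ignores queue overhead, and the function name is hypothetical.

```c
/* Simulate two devices pulling images from one shared queue.
   Whichever device becomes free first takes the next image.
   Returns the time in ms at which the last image is finished. */
double shared_queue_makespan_ms(int images, double cpu_ms_per_image,
                                double gpu_ms_per_image) {
    double cpu_free = 0.0;  /* time at which the CPU can take another image */
    double gpu_free = 0.0;  /* time at which the GPU can take another image */
    for (int i = 0; i < images; ++i) {
        if (cpu_free <= gpu_free)
            cpu_free += cpu_ms_per_image;  /* CPU is free first and takes image i */
        else
            gpu_free += gpu_ms_per_image;  /* GPU is free first and takes image i */
    }
    return cpu_free > gpu_free ? cpu_free : gpu_free;
}
```

With made-up per-image times of 4 ms on the CPU and 2 ms on the GPU, the individual throughputs are 0.25 and 0.5 images per ms. In the simulation both devices stay busy until the queue is empty, so the throughput for a single algorithm approaches their sum.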
Programming Strategies

Summary

! next-generation hardware and legacy code require compromises
! OpenCL performance is governed by Amdahl’s Law applied to OpenCL I/O time and OpenCL execution time
! application performance can be increased by overlapping OpenCL and OpenMP workloads
! removing all but necessary OpenCL I/O can have a dramatic influence on performance
! for loosely coupled high-throughput applications, the OpenCL and OpenMP performances add up for single algorithms
! for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
! APUs may provide greatest performance per Watt
! GPUs may provide greatest performance

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation
Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.


HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

  • 1. Efficient Scheduling of OpenMP and OpenCL Workloads Getting the most out of your APU
  • 2. Objective ! software has a long life-span that exceeds the life-span of hardware ! software is very expensive to be written and maintained ! next generation hardware also needs to run legacy software ! Example: IWAVE ! procedural C-code ! no object orientation ! tight integration between data structures and functions ! What do I mean by efficient scheduling? ! find ways to utilize GPU cores for code blocks ! find ways to utilize all CPU cores and GPU units at the same time !2 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 3. Historical Context GPU Compute Timeline Aparapi CUDA 2002 !3 | OpenCL and OpenMP Workloads on Accelerated Processing Units | 2008 AMP C++ 2010 2012
  • 4. Accelerator Challenges Technology Accessibility and Performance Performance OpenCL & CUDA CPU Multithread CPU Single Thread Ease-of-Use !4 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 5. APU Opportunities One Die - Two Computational Devices Metric CPU APU Memory Size large small Memory Bandwidth small large Parallelism small large yes no Performance application dependent application dependent Performance-per-Watt application dependent application dependent Traditional OpenCL General Purpose Programming !5 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 6. APU Opportunities Performance and Performance-per-Watt ! Example: Luxmark OpenCL Benchmark APU Performance[Pts] 170 197 316 50 37 58 3.4 5.3 5.4 Combined[Pts2/W] ! GPU has best performance-per-Watt GPU PPW[Pts/W] ! Best performance by using the APU CPU Power[W] ! Similar CPU and GPU performance Metric 578 1049 1722 ! APU provides outstanding value Luxmark OpenCL Benchmark Ubuntu 12.10 x86_64 4 Piledriver CPU cores @ 2.5GHz 6 GPU Compute Units @ 720MHz 16GB DDR3 1600MHz !6 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 7. Example: Luxmark Renderer Performance and Performance-per-Watt +64% +81% !7 | OpenCL and OpenMP Workloads on Accelerated Processing Units | Luxmark OpenCL Benchmark Render “Sala” Scene Ubuntu 12.10 x86_64 4 Piledriver cores @ 2.5GHz 6 GPU CUs @ 720MHz 16GB DDR3 1600MHz
  • 8. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE ! Know the problem you are trying to solve. ! staggered rectangular grid in 3D ! coupled first order PDE ! scalar pressure field p ! vector velocity field v = {vx, vy, vz} ! source term g !8 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 9. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE while(…) { sgn_ts3d_210_p012_OpenMP(dom, pars); sgn_ts3d_210_v0_OpenMP(dom, pars); sgn_ts3d_210_v1_OpenMP(dom, pars); sgn_ts3d_210_v2_OpenMP(dom, pars); … } OpenMP p OpenMP vx // // // // // main simulation loop calculate pressure field calculate velocity x-axis calculate velocity y-axis calculate velocity x-axis OpenMP vy OpenMP vz OpenMP Time !9 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 10. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE ! Measure the initial performance. ! pressure and velocity field simulated using OpenMP ! average time T[ms] per iteration ! OpenMP linear scaling with threads !10 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 11. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE ! find computational blocks ! understand dependencies between blocks OpenMP vx OpenMP p OpenMP vy ! identify sequential and parallel parts OpenMP OpenMP vz Causality OpenMP p OpenMP vx OpenMP vy OpenMP vz OpenMP Time !11 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 12. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE while(…) { sgn_ts3d_210_p012_OpenMP(dom, pars); sgn_ts3d_210_v0_OpenCL(dom, pars); sgn_ts3d_210_v1_OpenMP(dom, pars); sgn_ts3d_210_v2_OpenMP(dom, pars); … } // // // // // main simulation loop calculate pressure field p calculate velocity x-axis calculate velocity y-axis calculate velocity x-axis OpenCL vx OpenMP p IDLE OpenMP vy OpenMP vz OpenMP Time !12 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 13. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE ! use the GPU to compute vx ! the CPU is idle while the GPU is running ! 42% improvement for 1 thread ! 25% improvement for 2 threads ! 9% improvement for 4 threads !13 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 14. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE while(…) { sgn_ts3d_210_p012_OpenMP(dom, pars); ! ! // main simulation loop // calculate pressure field p int num_threads = atoi(getenv("OMP_NUM_THREADS")); omp_set_num_threads(2); omp_set_nested(1); #pragma omp parallel shared(…) private(…) { switch ( omp_get_thread_num() ) { case 0: sgn_ts3d_210_v0_OpenCL(dom, pars) break; case 1: omp_set_num_threads(num_threads); sgn_ts3d_210_v1_OpenMP(dom, pars); sgn_ts3d_210_v2_OpenMP(dom, pars); break; default: break; } } x } OpenCL v OpenMP p OpenMP vy OpenMP vz // save the current number of OpenMP threads // restrict the number of OpenMP threads to 2 // allow nested OpenMP threads // start 2 OpenMP threads // calculate velocity x-axis using OpenCL // increase number of OpenMP threads back // calculate velocity y-axis // calculate velocity z-axis // close OpenMP pragma // close simulation while OpenMP Time !14 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 15. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE ! overlap vx and vy ! CPU not idle anymore ! 50% improvement for 1 thread ! 40% improvement for 2 threads ! 38% improvement for 4 threads !15 | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 16. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                  // main simulation loop
          sgn_ts3d_210_p012_OpenCL(dom, pars);    // calculate pressure field
          sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenCL(dom, pars);      // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenCL(dom, pars);      // calculate velocity z-axis
          …
      }

      bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                 // copy data from host to device
          clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel on device
          clEnqueueReadBuffer(queue, buffer, …);                  // copy data from device to host
          …
      }
      [Timeline figure: OpenCL row: p, vx, vy, vz; axis: Time]
  • 17. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - understand where performance gets lost
      - 98% of time spent on I/O
      - 2% of time spent on compute
      - reduce I/O

      OpenCL Upload | Kernel Execution | OpenCL Download
      188 ms        | 4 ms             | 54 ms

      [Timeline figure: OpenCL row: vx; OpenMP row: p, vy, vz; axis: Time]
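The two percentages follow from the three timings on this slide, counting upload and download together as I/O:

```latex
F_{I/O} = \frac{188\,\mathrm{ms} + 54\,\mathrm{ms}}{188\,\mathrm{ms} + 4\,\mathrm{ms} + 54\,\mathrm{ms}}
        = \frac{242}{246} \approx 0.98,
\qquad
F_{compute} = \frac{4}{246} \approx 0.02
```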
  • 18. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      - How does the speedup of an OpenCL application (S_OpenCL) depend on the speedup of the OpenCL kernel (S_Kernel) when the OpenCL I/O time is fixed?
      - Fraction of OpenCL I/O time: F_I/O
      $$S_{OpenCL} = \frac{S_{Kernel}}{(S_{Kernel} - 1)\,F_{I/O} + 1}$$
      - An I/O fraction of 50% limits the maximum possible speedup to 2
      - Minimize OpenCL I/O first, only then increase OpenCL kernel performance
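The speedup relation on this slide is easy to evaluate numerically. The helper below is a small sketch; the function name `opencl_speedup` is made up for illustration and does not come from the slides.

```c
#include <assert.h>

/* Overall OpenCL speedup when only the kernel gets faster and the I/O
 * time stays fixed.
 *   s_kernel: speedup of the kernel alone (> 0)
 *   f_io:     fraction of the original OpenCL time spent on I/O (0..1)
 * Implements S_OpenCL = S_Kernel / ((S_Kernel - 1) * F_IO + 1). */
double opencl_speedup(double s_kernel, double f_io)
{
    assert(s_kernel > 0.0 && f_io >= 0.0 && f_io <= 1.0);
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}
```

For a very large `s_kernel` the result approaches `1 / f_io`, which gives the limit of 2 for 50% I/O. With the I/O fraction of about 0.98 measured on slide 17, even an arbitrarily fast kernel would yield only about 1.02.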
  • 19. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                  // main simulation loop
          sgn_ts3d_210_ALL_OpenCL(dom, pars);     // combine all OpenCL calculations
          …
      }

      bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                     // copy data from host to device

          while(…) {
              clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel for pressure
              clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);      // execute OpenCL kernel for velocity x
              clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);      // execute OpenCL kernel for velocity y
              clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);      // execute OpenCL kernel for velocity z
          }

          clEnqueueReadBuffer(queue, buffer, …);                      // copy data from device to host
          …
      }
      [Timeline figure: OpenCL row: p, vx, vy, vz; axis: Time]
  • 20. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - eliminate all but essential I/O
      - significant speedup over the simple OpenCL version, which uploads and downloads data around every kernel call
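A rough cost model shows why hoisting the transfers out of the loop matters. It is only a sketch under two assumptions that are not from the slides: every kernel launch costs the same as the vx call timed on slide 17, and a single upload and a single download are enough for the whole run. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-call costs in milliseconds (hypothetical container type). */
typedef struct {
    double upload_ms;
    double kernel_ms;
    double download_ms;
} transfer_cost;

/* Slide 16 style: every kernel launch is wrapped in its own upload
 * and download. */
double time_io_per_launch(transfer_cost c, long steps, int kernels_per_step)
{
    assert(steps >= 0 && kernels_per_step >= 0);
    return (double)steps * kernels_per_step
         * (c.upload_ms + c.kernel_ms + c.download_ms);
}

/* Slide 19 style: one upload, all kernel launches, one download. */
double time_io_hoisted(transfer_cost c, long steps, int kernels_per_step)
{
    assert(steps >= 0 && kernels_per_step >= 0);
    return c.upload_ms
         + (double)steps * kernels_per_step * c.kernel_ms
         + c.download_ms;
}
```

Calling both functions with `{188.0, 4.0, 54.0}` and four kernels per step compares the two layouts. A real run that has to read back intermediate results would land somewhere between the two estimates.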
  • 21. Programming Strategies Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - measure real application performance
      - 3000 iterations using a 97x405x389 simulation grid
      - 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads
      [Bar chart: CPU (8T) "Piledriver" vs. GPU (8CU) AMD S9000; vertical axis from 0 to 14]
  • 22. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      - initial OpenCL performance measurements
      - 89 algorithms tested for an image size of 4MP
      - compare OpenCL I/O and execution time
      - 28% of all algorithms are compute bound
      - 72% of all algorithms are I/O bound
      OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5GHz; 6 GPU Compute Units @ 720MHz; 16GB DDR3 1600MHz
  • 23. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      - compare OpenCL and single-threaded performance
      - 89 algorithms tested for an image size of 4MP
      - realistic timing that includes I/O over PCIe
      - 59% of all algorithms execute faster on the GPU
      - 41% of all algorithms execute faster on the CPU (1 core)
      - significant speedup for only 15% of all algorithms
      OpenCV Computer Vision Library Performance Tests v2.4; Ubuntu 12.10 x86_64; 1 Piledriver CPU core @ 2.5GHz; 6 GPU Compute Units @ 720MHz; 16GB DDR3 1600MHz
  • 24. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      - Task: batch process a large number of images using a single algorithm.
      - OpenCL performance depends on the algorithm and the image size
      - Either the CPU or the GPU processes the data, but not both
      - How to choose which algorithm and device to use depending on the image size?
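One simple way to answer the device question is to time each algorithm once per image size and pick the faster device, counting the transfers on the GPU side as slide 23 does. The sketch below assumes such measurements are available; the `algo_timing` struct and both functions are hypothetical and not part of OpenCV.

```c
#include <assert.h>

typedef enum { DEVICE_CPU, DEVICE_GPU } device_choice;

/* Measured times in milliseconds for one algorithm at one image size
 * (hypothetical container type). */
typedef struct {
    double cpu_ms;
    double gpu_upload_ms;
    double gpu_kernel_ms;
    double gpu_download_ms;
} algo_timing;

/* Pick the GPU only if upload + kernel + download beats the CPU time. */
device_choice choose_device(algo_timing t)
{
    double gpu_total = t.gpu_upload_ms + t.gpu_kernel_ms + t.gpu_download_ms;
    return gpu_total < t.cpu_ms ? DEVICE_GPU : DEVICE_CPU;
}

/* I/O bound in the sense of slide 22: the transfers take longer than
 * the kernel itself. Returns 1 if I/O bound, 0 if compute bound. */
int is_io_bound(algo_timing t)
{
    return t.gpu_upload_ms + t.gpu_download_ms > t.gpu_kernel_ms;
}
```

Because the answer changes with image size, the timings would have to be collected for each size class of interest rather than once per algorithm.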
  • 25. Programming Strategies Example: High Throughput Computer Vision with OpenCV
  • 26. Programming Strategies Example: High Throughput Computer Vision with OpenCV
      - Better: create an input image queue that the CPU and the GPU query for new image tasks until the queue is empty.
      - all CPU cores are fully utilized at all times, even for single-threaded algorithms
      - all GPU compute units are fully utilized at all times
      - combined performance for a single algorithm is the sum of GPU and CPU performance for that algorithm
      $$P^{i}_{APU} = P^{i}_{CPU} + P^{i}_{GPU}$$
      - combined performance for multiple algorithms is better than the sum of the device performances
      $$P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P^{i}}}$$
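A shared image queue of this kind can be as simple as an atomic counter over the list of input images. The sketch below uses C11 atomics; the type and function names are hypothetical, and `process_image_stub` stands in for the real CPU or GPU processing routine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Queue over image indices 0 .. count-1 (hypothetical type). */
typedef struct {
    atomic_size_t next;   /* index of the next unclaimed image */
    size_t count;         /* total number of images */
} image_queue;

void image_queue_init(image_queue *q, size_t count)
{
    atomic_init(&q->next, 0);
    q->count = count;
}

/* Claim the next image. Returns 1 and stores its index, or returns 0
 * once the queue is empty. Safe to call from several workers at once,
 * because the fetch-and-add hands every index out exactly once. */
int image_queue_claim(image_queue *q, size_t *index)
{
    size_t i = atomic_fetch_add(&q->next, 1);
    if (i >= q->count) return 0;
    *index = i;
    return 1;
}

/* Placeholder for the real per-image work on the CPU or the GPU. */
static void process_image_stub(size_t image_index)
{
    (void)image_index;
}

/* Worker loop shared by CPU and GPU workers; returns how many images
 * this worker processed. */
size_t run_worker(image_queue *q, void (*process)(size_t image_index))
{
    size_t done = 0, i;
    while (image_queue_claim(q, &i)) {
        process(i);
        ++done;
    }
    return done;
}
```

Each CPU thread and the thread driving the GPU would call `run_worker` with its own processing function, so a faster device naturally claims more images.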
  • 27. Programming Strategies Example: High Throughput Computer Vision with OpenCV
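The two performance statements from slide 26 can be checked with a few lines of code. This sketch assumes that the multi-algorithm performance is the harmonic combination P = 1 / Σ(1 / P_i) of the per-algorithm rates and that CPU and GPU rates add per algorithm; the function names and the sample rates in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput of one device over n algorithms run back to back:
 * P = 1 / sum(1 / p[i]). */
double combined_throughput(const double *p, size_t n)
{
    double inverse_sum = 0.0;
    assert(p && n > 0);
    for (size_t i = 0; i < n; ++i) {
        assert(p[i] > 0.0);
        inverse_sum += 1.0 / p[i];
    }
    return 1.0 / inverse_sum;
}

/* Shared-queue throughput: per algorithm the CPU and GPU rates add,
 * then the per-algorithm rates combine as above. */
double apu_throughput(const double *cpu, const double *gpu, size_t n)
{
    double inverse_sum = 0.0;
    assert(cpu && gpu && n > 0);
    for (size_t i = 0; i < n; ++i) {
        assert(cpu[i] > 0.0 && gpu[i] > 0.0);
        inverse_sum += 1.0 / (cpu[i] + gpu[i]);
    }
    return 1.0 / inverse_sum;
}
```

With made-up rates `cpu = {1, 4}` and `gpu = {4, 1}`, each device alone reaches 0.8, while the shared queue reaches 2.5, more than the 1.6 obtained by adding the two device results. The gain comes from each algorithm being dominated by the device that suits it.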
  • 28. Programming Strategies Summary
      - next generation hardware and legacy code require compromises
      - OpenCL performance is tied to Amdahl’s Law regarding OpenCL I/O and OpenCL execution time
      - application performance can be increased by overlapping OpenCL and OpenMP workloads
      - removing all but necessary OpenCL I/O can have a dramatic influence on performance
      - for loosely coupled high-throughput applications, the OpenCL and OpenMP performances add up for single algorithms
      - for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
      - APUs may provide the greatest performance per Watt
      - GPUs may provide the greatest performance
  • 29. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
 The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
 AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
 AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
 ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.