Parallel and Distributed Computing
OUTLINE
GPU Architecture
INTRODUCTION
A GPU is a microprocessor for the processing of graphics, capable of
handling millions of instructions per second.
The GPU sits on the graphics card and relieves the CPU of much of the
graphics-processing load.
INTRODUCTION
A GPU is built for performing parallel operations, which is why it has
many parallel execution units.
Many supercomputers use GPUs for high processing throughput and
performance.
CPU Vs GPU
A CPU consists of four to eight cores, while a GPU consists of
hundreds of smaller cores.
This massively parallel architecture is what gives the GPU its
high compute performance.
SINGLE-CORE VS MULTICORE PROCESSOR
INTRODUCTION
Today's NVIDIA GPUs have reached 128 cores on a single chip, consuming
less power while delivering high performance.
Each core handles 8 threads (128 × 8 = 1024 threads in total).
Modern GPUs are used not only for graphics and video coding but also
in HPC.
GPUs are designed to handle large numbers of floating-point operations
in parallel.
COMPONENTS OF GPU
Graphic processor
Graphic co-processor
Graphic accelerator
Frame buffer
Memory
Graphic BIOS
Digital-to-analog converter
Display connector
Computer connector
Components of CPU:
Control unit (CU)
Arithmetic logic unit (ALU)
Registers
Cache
Clock
ALUs and fetch/decode logic run at high speed, consume little power,
and require little hardware to build.
By contrast, the execution unit requires a huge number of transistors
to build its cache, which may occupy 50% of the total die area and is
therefore expensive.
The cache may also be the main energy-consuming element.
Introduction to GPU Evolution
A GPU comprises tens to thousands of cores.
To build a GPU we need a slimmer CPU design: all complex and large
units are removed from the general-purpose CPU.
The basic idea of a GPU is to have many (hundreds or thousands of)
simpler, weaker processing units that execute the same instructions
simultaneously on different data.
Vector addition of two floating-point vectors, each containing 128 elements
Introduction to GPU Evolution
Instead of using one CPU core, we can use two such cores.
We are then able to execute two instructions in parallel, so
throughput is increased.
Two instructions executing in parallel on two CPU cores
We can achieve even higher performance by further replicating the
ALUs, instead of replicating the complete CPU core.
The fetch/decode logic remains shared among the ALUs, and all ALUs
execute the same operation on different input data.
One such core can add eight vector elements in parallel, as sketched below.
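As a rough sketch of this idea in C (assuming GCC/Clang vector extensions, which are a compiler feature rather than standard C), one fetched and decoded add drives eight lanes at once:

```c
#include <stddef.h>
#include <string.h>

/* Eight floats handled by one instruction stream (a 32-byte vector). */
typedef float v8f __attribute__((vector_size(32)));

void vec_add_8wide(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v8f va, vb, vc;
        memcpy(&va, a + i, sizeof va);  /* unaligned-safe 8-lane load */
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                   /* eight additions per instruction */
        memcpy(c + i, &vc, sizeof vc);
    }
    for (; i < n; i++)                  /* scalar tail for leftover elements */
        c[i] = a[i] + b[i];
}
```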
Modern GPU
A modern GPU contains hundreds of simple processing elements suited to
computations that can be run in parallel.
This typically involves arithmetic on large data sets such as vectors,
matrices, and images, where the same operations can be performed
across millions of data elements at the same time.
Modern GPU
To exploit GPU parallelism, the programmer should partition the
program into thousands of threads and schedule them among the compute
units.
Memory Hierarchy on GPU
A GPU has five memory regions accessible from a single work-item:
1. Registers
2. Local Memory
3. Texture Memory
4. Constant Memory
5. Global Memory
Memory Hierarchy on GPU
Registers:
Registers are the first and most preferable level of the hierarchy.
Each work-item has dedicated registers.
There may be 16K, 32K, or 64K registers available to work-items.
Memory Hierarchy on GPU
Global Memory:
Also called graphics dynamic memory; it achieves high bandwidth but
has high latency compared to the other memories.
It is called global because it is accessible from both the GPU and the CPU.
The GTX 780 GPU has 3 GB of global memory implemented in GDDR5.
It is used for transfers between host and device: large capacity, high
latency.
Host and Device
The main CPU is the host, while all other processors, such as GPUs,
are named devices.
Memory Hierarchy on GPU
There are also two additional memories accessible by all work-items:
constant memory and texture memory.
Constant Memory: resides in device memory (cached) and stores
constants and program arguments.
Constant memory has two special properties: first, it is cached, and
second, it supports broadcasting a single value to all work-items.
This broadcast takes place in just a single cycle.
Work-items have only read access to this region; the host, however, is
permitted both read and write access.
Memory Hierarchy on GPU
Texture Memory: when all reads in a work-group are physically
adjacent, using texture memory can reduce memory traffic and increase
performance compared to global memory.
Texture memory is, however, much slower than constant memory.
Work-Item / Work-Group
Work-items (WI) are essentially threads.
A work-group is the unit of work scheduled onto a compute unit.
Work-items are organized into work-groups; a work-group can thus be
defined as a set of work-items.
All work-items in a work-group are able to share local memory.
Work-groups execute independently of each other.
A work-group executes on a compute unit, and its work-items are mapped
to the compute unit's processing elements (PEs).
OpenCL
A framework for writing programs that execute across heterogeneous
platforms consisting of CPUs, GPUs, or other accelerator hardware.
It defines a C-based programming language used to write compute
kernels, which look much like C functions.
One significant drawback: it is not easy to learn.
OpenCL Kernel
Code that gets executed on a GPU device is called a kernel in OpenCL.
The body of a kernel function implements the computation to be
completed by all work-items.
When writing kernels in OpenCL, we must declare memory with a specific
address-space qualifier to state which memory the data will reside in:
__global, __constant, __local, or, by default, private memory within a
kernel. A sketch follows.
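A minimal sketch of these qualifiers in OpenCL C (the kernel and argument names are illustrative, not from the slides):

```c
__kernel void scale_and_shift(__global float *data,     /* global memory   */
                              __constant float *coeffs, /* constant memory */
                              __local float *scratch)   /* local memory    */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = data[gid];          /* 'x' lives in private memory (default) */
    scratch[lid] = x * coeffs[0]; /* staged in work-group local memory */
    barrier(CLK_LOCAL_MEM_FENCE); /* synchronize the work-group */
    data[gid] = scratch[lid] + coeffs[1];
}
```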
Heterogeneous system
Also called the platform model.
It consists of a single host connected to one or more OpenCL devices,
e.g. FPGA accelerators, DSPs, GPUs, or even CPUs.
OpenCL Devices
An OpenCL device comprises several compute units. Each compute unit
comprises tens or hundreds of processing elements.
Execution Model
The OpenCL execution model defines how kernels execute: the NDRange
(N-Dimensional Range) execution model.
The host program invokes a kernel over an index space called the
NDRange.
The NDRange defines the total number of work-items that execute in
parallel.
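As an illustrative kernel (the names are assumptions), each work-item can query its own position within the NDRange index space:

```c
__kernel void where_am_i(__global int *out)
{
    size_t gid = get_global_id(0);   /* unique index within the NDRange  */
    size_t lid = get_local_id(0);    /* index within the work-group      */
    size_t grp = get_group_id(0);    /* which work-group this item is in */
    size_t lsz = get_local_size(0);  /* work-group size                  */
    out[gid] = (int)(grp * lsz + lid);  /* equals gid by definition */
}
```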
Programming in OpenCL
Sample C code for vector addition on a single-core CPU is sketched
below.
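The original listing is an image in the deck; a minimal reconstruction of a single-core vector addition in C would look like this:

```c
/* One loop iteration per element, executed sequentially on one core. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```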
Kernel Function
To execute the vector-addition function on a GPU device, we must
rewrite it as a kernel function.
Each thread on the GPU device will then execute the same kernel
function, as shown below.
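A sketch of the same computation as an OpenCL kernel: the loop disappears, and each work-item adds the one element selected by its global ID.

```c
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);  /* this work-item's element index */
    c[i] = a[i] + b[i];
}
```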
HOST CODE
The first step is to code the host application, which runs on the
user's computer and dispatches kernels to connected devices.
The host application can be coded in C/C++.
OpenCL supports a wide range of heterogeneous platforms.
Prior to executing a kernel function, the host program for a
heterogeneous system must carry out the following steps.
Steps to execute Kernel Function
1. Discover the OpenCL devices: the platform consists of one or more
devices capable of executing OpenCL kernels.
2. Probe the characteristics of these devices so that kernel functions
can adapt to specific features.
3. Read the source program and compile the kernels that will run on
the selected devices.
4. Set up the memory objects that will hold the data for the
computation.
5. Run the kernels on the selected devices.
6. Collect the final results from the devices.
Steps to execute Kernel Function
These steps are accomplished through the following series of calls,
illustrated in the sections that follow.
1. Prepare and initialize data on the host.
2. Discover and initialize the devices.
3. Create a context.
4. Create a command queue.
5. Create the program object for a context.
6. Build the OpenCL program.
7. Create device buffers.
8. Write host data to device buffers.
9. Create and compile the kernel.
10. Set the kernel arguments.
11. Set the execution model and enqueue the kernel for execution.
12. Read the output buffer back to the host.
Discover & Initialize the devices
The cl_int clGetDeviceIDs() function is used to discover and
initialize the devices.
This function returns the number of devices found in num_devices.
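A minimal host-side fragment (error checking omitted; the snippets in the following sections continue from these variables):

```c
#include <CL/cl.h>

cl_platform_id platform;
cl_device_id devices[8];
cl_uint num_devices;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);
/* num_devices now holds how many GPU devices this platform exposes. */
```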
Discover & Initialize the devices
To get device info we use cl_int clGetDeviceInfo().
This function returns, among other things, the maximum number of
compute units, the maximum work-group size, the device type, and the
sizes of the memories.
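Continuing the fragment, a few representative queries (the parameter names are from the standard OpenCL headers):

```c
cl_uint cu;     /* number of compute units */
size_t wg;      /* maximum work-group size */
cl_ulong gmem;  /* global memory size in bytes */

clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cu, &cu, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof wg, &wg, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof gmem, &gmem, NULL);
```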
Create a context
The clCreateContext() function is used to create a context, i.e. the
environment that manages objects such as command queues, programs, and
kernel objects.
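Continuing the fragment; the command queue from step 4 of the call list is created here as well:

```c
cl_int err;
/* The context owns the queues, buffers, programs, and kernels below. */
cl_context ctx = clCreateContext(NULL, 1, devices, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(ctx, devices[0], 0, &err);
```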
Build a Program Executable
The clBuildProgram() function is used to build (compile and link) a
program from source code.
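Continuing the fragment, with the vec_add kernel source embedded as a string for brevity:

```c
const char *source =
    "__kernel void vec_add(__global const float *a,"
    "                      __global const float *b,"
    "                      __global float *c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
err = clBuildProgram(program, 1, devices, NULL, NULL, NULL); /* compile+link */
cl_kernel kernel = clCreateKernel(program, "vec_add", &err); /* step 9 */
```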
Write host data to device buffer
The clEnqueueWriteBuffer() function is used to write data from host
memory to a device buffer.
This function provides the data to be processed on the device.
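Continuing the fragment: device buffers are created first (step 7), then blocking writes copy the host arrays over (step 8):

```c
#define N 1024
float a[N], b[N], c[N];   /* host data, assumed already initialized */

cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof a, NULL, &err);
cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof b, NULL, &err);
cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);

/* Blocking writes: return once the host data is safely on the device. */
clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, buf_b, CL_TRUE, 0, sizeof b, b, 0, NULL, NULL);
```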
Enqueue the kernel for execution
OpenCL always executes kernels in parallel, i.e. the same kernel
executes on different parts of the data set.
Each kernel execution in OpenCL is called a work-item.
Each work-item is responsible for executing the kernel once on its
assigned portion of the data set.
Thus it is the programmer's responsibility to tell OpenCL how many
work-items are needed to process all the data, as in the sketch below.
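Continuing the fragment: with one work-item per element, N work-items process the whole data set (steps 10 to 12):

```c
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);

size_t global = N;  /* N work-items: one kernel execution per element */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

/* Blocking read: collect the result once the kernel has finished. */
clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
```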
Occupancy
Occupancy is the ratio of active work-groups per compute unit to the
maximum it supports.
We should always keep occupancy high in order to hide latency when
executing instructions.
A compute unit should have a ready work-group to execute in every
cycle, as this is the only way to keep the hardware busy.