High Performance Medical Reconstruction using Stream Programming Paradigms
High Performance Computing Group, NeST Software [www.nestsoftware.com]
June 2008
Abstract— This paper describes the implementation and
results of CT reconstruction using Filtered Back Projection
on various stream programming paradigms.
Key words— Filtered Back Projection, CUDA, Cell.
I. INTRODUCTION
The complexity of medical image reconstruction
demands tens to hundreds of billions of computations per
second. Until a few years ago, special-purpose processors
designed especially for such applications were used.
Such processors require significant design effort, are
difficult to change as new reconstruction algorithms
evolve, and have limited parallelism. Hence the demand
for flexibility in medical applications has motivated the
use of stream processors with massively parallel
architectures.
Stream processing architectures offer data parallelism:
a set of operations is performed on a large amount of
data in a parallel fashion. Many reconstruction
algorithms are highly suitable for this data-parallel
paradigm. The central idea behind stream processing is
to organize the application into streams (sets of data)
and kernels (series of operations applied to each element
in a stream).
There are several stream processors commercially
available. In this paper, we consider two emerging
processors, the Graphics Processing Unit (GPU) and the
Cell processor, for benchmarking our reconstruction
application. Both provide specific instructions that allow
application developers to tag kernels and/or streams.
GPUs are designed for high-end gaming and 3D simulation
applications. They are massively parallel architectures
with a large number of arithmetic processing units.
Programming a GPU traditionally requires knowledge of
the graphics pipeline architecture, but of late the advent
of General-Purpose computation on GPUs (GPGPU) [4] has
led to the design of tools based on common programming
languages. Cell processors are designed to bridge the gap
between conventional desktop processors and more
specialized high-performance processors, such as GPUs.
The Cell processor can be split into four components:
external input and output structures, the main processor
called the Power Processing Element (PPE), eight
fully functional co-processors called the Synergistic
Processing Elements (SPEs), and a high-bandwidth
element interconnect bus connecting them.
A GPU can be programmed using the shader languages of
graphics APIs such as OpenGL [2] and DirectX [3],
which requires knowledge of the graphics architecture.
There are also C-style programming languages such as
CUDA [5], which require minimal or no background
in graphics. As the Cell processor is a general-purpose
processor, the languages currently available for it are
similar to C programming.
In this paper we consider X-ray computed
tomography (CT) reconstruction of 2D/3D images
from 2D projections as our image reconstruction example.
The 2D/3D image reconstruction is a very demanding
computational task. The reconstruction from projections
is done using Filtered Back Projection [1], with time
complexity O(N⁴), where N is the number of detector
pixels in one detector row.
In the rest of the paper we give an overview of the
programming tools used for the GPU and the Cell
processor, describe the implementation of Filtered Back
Projection using these programming paradigms, and
conclude by comparing the benchmark results against
conventional processors.
II. FILTERED BACK PROJECTION
Filtered Back Projection is a conventional technique used
for CT reconstruction from 2D fan-beam data and 3D
cone-beam data. Filtered Back Projection can be divided
into two steps: (1) filtering and (2) back projection.
Algorithm of Discrete FBP
The following is the pseudo-algorithm of the FBP:

// Filter the projection images along horizontal lines
For each projection
    For each row of the projection
        For each column of the projection
            Perform the convolution (filter) row-wise

// Perform the back projection along the projection rays
For each filtered image
    Get the angle of projection for this filtered image
    For each output image of the reconstructed volume
        Calculate the increment factor U using the angle, both for
        indexing (look-up) into the filtered image and for the
        weight calculation
        For each row of the output image
            For each column of the output image
                Look up the filtered image at the calculated index
                Accumulate the output, weighted by 1/U²
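For concreteness, the following serial C sketch shows how these loops
translate into code for the 2D fan-beam case. The centered detector, the
source-to-center distance D and the array layout are simplifying
assumptions made for illustration; they are not the exact geometry
handling of our implementation.

/* Minimal serial sketch of fan-beam back projection (illustrative only). */
#include <math.h>

void backproject(const float *filtered, /* nProj x nCols filtered sinogram */
                 float *output,         /* nOut x nOut reconstructed image */
                 int nProj, int nCols, int nOut,
                 float D)               /* assumed source-to-center distance */
{
    for (int p = 0; p < nProj; ++p) {
        float angle = 2.0f * (float)M_PI * p / (float)nProj;
        float s = sinf(angle), c = cosf(angle);
        for (int y = 0; y < nOut; ++y) {
            for (int x = 0; x < nOut; ++x) {
                /* Pixel position relative to the rotation center. */
                float px = x - nOut / 2.0f;
                float py = y - nOut / 2.0f;
                /* Increment factor U, used for the look-up and the weight. */
                float U = (D + px * s - py * c) / D;
                /* Detector column this pixel projects onto. */
                int col = (int)((px * c + py * s) / U + nCols / 2.0f);
                if (U <= 0.0f || col < 0 || col >= nCols)
                    continue;
                /* Accumulate the output, weighted by 1/U^2. */
                output[y * nOut + x] += filtered[p * nCols + col] / (U * U);
            }
        }
    }
}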
The computational complexity of the Filtered Back
Projection (FBP) is biquadratic. Reconstructing a volume
of 400 x 400 x 240 voxels from 360 projections
requires floating-point operations of the order of 10⁹,
more than a normal processor can sustain in a reasonable
time. The most computation-intensive step of the Filtered
Back Projection is the back projection: roughly 97% of the
operations involve the back projection when
reconstructing the images from the projections.
III. CUDA
Traditionally, GPGPU applications were developed
using graphics APIs and their shading languages, which
requires knowledge of the graphics pipeline. Moreover,
these APIs exposed very little of the underlying
hardware. CUDA, designed from the ground up, enables
efficient general-purpose computation on GPUs. It gives
computationally intensive applications access to the
tremendous processing power of the latest NVIDIA
GeForce 8 GPUs through a C-like programming
interface. The GeForce 8 series GPUs have up to 128
processors running at 1.5 GHz and up to 1.5GB of
on-board memory. CUDA uses a C-like programming
language and does not require remapping algorithms to
graphics concepts. CUDA exposes several hardware
features that are not available via the graphics API. The
most significant of these is shared memory, which is a
small (currently 16KB per multiprocessor) area of
on-chip memory which can be accessed in parallel by
blocks of threads. This allows caching of frequently used
data and can provide large speedups over using textures
to access data. Combined with thread synchronization
primitives, this allows cooperative parallel processing of
on-chip data, greatly reducing the expensive off-chip
bandwidth requirements of many parallel algorithms.
This benefits a number of common applications such as
linear algebra, Fast Fourier Transforms, and image
processing filters. The current generation of GPUs from
NVIDIA supports IEEE single precision; double
precision support will be available in the next generation
towards the end of the year. There are, however, several
fields in which significant results can be obtained with
single precision: image and signal processing and some
numerical methods such as pseudo-spectral approximation
are just a few examples.
CUDA Programming Model
The programmer writes a serial program that calls
parallel kernels, which may be simple functions or full
programs. A kernel executes in parallel across a set of
parallel threads. The programmer organizes these
threads into a hierarchy of grids of thread blocks. A
thread block is a set of concurrent threads that can
cooperate among themselves through barrier
synchronization and shared access to a memory space
private to the block. A grid is a set of thread blocks that
may each be executed independently and thus may
execute in parallel.
When invoking a kernel, the programmer specifies the
number of threads per block and the number of blocks
making up the grid. Each thread is given a unique thread
ID number threadIdx within its thread block, numbered
0, 1, 2, ..., blockDim - 1, and each thread block is given a
unique block ID number blockIdx within its grid. CUDA
supports thread blocks containing up to 512 threads. For
convenience, thread blocks and grids may have one, two,
or three dimensions, accessed via .x, .y, and .z index
fields.
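The following minimal kernel and launch illustrate this indexing scheme.
It is a generic vector-addition example rather than part of the FBP
application, and the choice of 256 threads per block is arbitrary.

// Each thread derives a unique global index from its block and thread IDs.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

// Host side: the programmer picks the threads per block and derives the
// number of blocks making up the grid.
void launchVecAdd(const float *a, const float *b, float *c, int n)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
}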
FBP Implementation
The CPU side of the FBP implementation consists of
reading the image files and their configuration files,
which contain information such as the angle of projection
and the geometric parameters of the Region of Interest
(ROI), and of computing the filter mask and the geometric
shape of the output image. The compute-intensive
portions, filtering and back projection, are implemented
on the GPU and packed into two different kernels. Prior
to kernel execution, the input data is copied to the GPU's
device memory. As this data is required by both the filter
kernel and the back projection kernel, it is allocated in
the global memory of the device, which avoids copying it
multiple times. In order to achieve a high level of
parallelism, the number of blocks and of threads per
block is selected appropriately; we divide the data into
several 1D blocks of threads.
The input image data is stored in and accessed via global
memory. The 1D convolution mask, as well as the row of
input data that is used repeatedly while convolving each
row of the image, is stored in and accessed from shared
memory rather than global memory. This significantly
reduces the memory bandwidth demand that would
otherwise limit performance. Data reuse is high within a
thread block, and hence the block of data used by a
thread block is loaded into shared memory. Data stored
in shared memory is declared volatile in order to prevent
the compiler from optimizing away reads and writes to
shared memory [6]. This is yet another optimization
strategy to reduce the memory bandwidth.
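A simplified sketch of such a filter kernel is shown below. The mask
length, the maximum row width and the one-block-per-row mapping are
illustrative assumptions; the point is the cooperative staging of reused
data into shared memory behind a barrier.

#define MASK_SIZE 64   /* assumed filter length */
#define MAX_ROW   960  /* assumed maximum row width (under the 16 KB limit) */

// One thread block filters one image row; the mask and the row are loaded
// into shared memory once and then reused by every convolution sample.
__global__ void filterRow(const float *image, const float *mask,
                          float *filtered, int width)
{
    __shared__ float sMask[MASK_SIZE];
    __shared__ float sRow[MAX_ROW];

    int row = blockIdx.x;
    int tid = threadIdx.x;

    for (int i = tid; i < MASK_SIZE; i += blockDim.x)
        sMask[i] = mask[i];
    for (int i = tid; i < width; i += blockDim.x)   /* width <= MAX_ROW */
        sRow[i] = image[row * width + i];
    __syncthreads();   // barrier: no thread reads before all loads finish

    for (int x = tid; x < width; x += blockDim.x) {
        float acc = 0.0f;
        for (int k = 0; k < MASK_SIZE; ++k) {
            int j = x + k - MASK_SIZE / 2;
            if (j >= 0 && j < width)
                acc += sRow[j] * sMask[k];
        }
        filtered[row * width + x] = acc;
    }
}

Each input sample is read from global memory once but reused up to
MASK_SIZE times from shared memory, which is what cuts the off-chip
bandwidth.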
The back projection algorithm contains computations
involving several array look-ups. Unlike on a CPU, array
look-ups on a GPU are costlier, so they are replaced
with index-based computations in order to exploit the
arithmetic efficiency of the GPU. Compared to the
filtering operation, data reuse in back projection is not as
high, which makes shared memory usage inefficient in
this situation. Here a texture look-up of the filtered
image is found to be a better choice than accessing data
stored in shared memory, because texture memory is
spatially cached. Moreover, the back projection
computation is performed on an amount of data too
large to fit in shared memory, whose size is limited to
16 KB per multiprocessor.
 
Figure 1: FBP implementation in CUDA
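A sketch of such a back projection kernel, written against the texture
reference API of that CUDA generation, is given below. The geometry and
the one-launch-per-projection structure are simplifying assumptions, and
the filtered projection is assumed to have been copied to a CUDA array
and bound to the texture on the host (cudaBindTextureToArray) before the
launch.

// The filtered projection is read through a 2D texture: fetches are
// spatially cached, which suits the scattered look-ups of back projection.
texture<float, 2, cudaReadModeElementType> texFiltered;

__global__ void backprojectKernel(float *output, int nOut,
                                  float sinA, float cosA,
                                  float D, float halfCols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nOut || y >= nOut) return;

    float px = x - nOut / 2.0f;
    float py = y - nOut / 2.0f;
    float U  = (D + px * sinA - py * cosA) / D;          // weight factor
    float t  = (px * cosA + py * sinA) / U + halfCols;   // detector index

    // The index arithmetic above replaces explicit array look-ups, and the
    // texture unit handles out-of-range coordinates via its addressing mode.
    output[y * nOut + x] += tex2D(texFiltered, t, 0.5f) / (U * U);
}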
Results
The Filtered Back Projection algorithm was tested on an
NVIDIA GeForce 8800 GTS and on an Intel Pentium D
processor running at 3.2 GHz. Table 1 below shows the
results of the filtered back projection for an input dataset
of size 960 x 768 x 360.
Processor                    Filtering & Back Projection (s)   Speedup
Intel Pentium D (3.2 GHz)    1800                               --
NVIDIA GeForce 8800 GTS      16                                 112.5x

Table 1: Comparison of the time taken for execution of FBP on the
Intel Pentium D (3.2 GHz, 2 GB RAM) and the NVIDIA GeForce 8800 GTS.
IV. CELL PROGRAMMING
System Overview
The Cell Broadband Engine is a heterogeneous
multiprocessor containing 1 PowerPC Processing Unit
(PPU), 6 Synergistic Processing Units (SPUs) and a high
bandwidth element interconnect bus. There is support
for performing data transfers concurrent to instruction
execution. The Cell processor achieves efficiency in
power consumption because of the distinction between
the PPU (which is specialized for control operations and
runs the OS) and the SPUs (which are specialized for
data-intensive operations). There is support for
vectorization on the PPU and the SPUs that can yield up
to 4 single-precision floating-point operations in one
cycle. The Cell-based machine chosen for the
implementation of our application is the PlayStation 3
(PS3), running the Yellow Dog Linux distribution.
Programming is done in C, using C extensions for the
Cell platform; the GNU C compilers for the SPU and
PPU platforms are used.
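As a flavour of this 4-wide SIMD capability, the following minimal SPU
sketch computes y = a*x + y four floats per instruction. The 16-byte
alignment of the arrays and a length that is a multiple of 4 are
assumptions of the example.

/* SPU SIMD sketch: fused multiply-add over 4-float vectors. */
#include <spu_intrinsics.h>

void saxpy4(float a, const vector float *x, vector float *y, int nVec)
{
    vector float va = spu_splats(a);        /* replicate a into all 4 lanes */
    for (int i = 0; i < nVec; ++i)
        y[i] = spu_madd(va, x[i], y[i]);    /* 4 single-precision FMAs */
}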
Cell Programming Model
The Cell processor supports a stream programming
model, which involves the identification of an
appropriate kernel function and the execution of the
chosen kernel function on all the data elements. The
latency of execution is reduced by exploiting the single
program multiple data (SPMD) paradigm for executing
the same kernel function concurrently on multiple data
elements. In this case, the PPU spawns multiple threads
and schedules these threads for execution in the 6 SPEs.
The synchronization of threads and transfer of data are
controlled from the PPU program. Double buffering is
used to allow data transfers from the PPU to SPU(s)
concurrent to program execution. In order to minimize
the latency of memory access, the software cache in SPUs
is utilized. Data transfers from the PPU to SPUs and vice
versa are accomplished by DMA and message passing.
Vectorization of the PPU and SPU code is performed to
further increase data-parallel execution. Wherever
possible, optimized library functions for the Cell
processor are invoked for increased execution efficiency.
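The double-buffering pattern can be sketched as follows. The chunk size
and the processing function (process_chunk) are illustrative assumptions;
the essential point is that the DMA for the next chunk is issued before
waiting on the current one, so transfer and computation overlap.

/* SPU-side double buffering with DMA (illustrative sketch). */
#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per transfer; DMA sizes must be aligned */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(char *data, int size);  /* hypothetical kernel function */

void stream_in(unsigned long long ea, int nChunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);   /* kick off first transfer */
    for (int i = 0; i < nChunks; ++i) {
        int nxt = cur ^ 1;
        if (i + 1 < nChunks)                    /* prefetch the next chunk */
            mfc_get(buf[nxt], ea + (i + 1) * (unsigned long long)CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);           /* wait only on the current */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);         /* compute overlaps the DMA */
        cur = nxt;
    }
}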
The code for execution on the PPU and SPUs is
developed separately. The executables for the SPUs are
embedded within the main PPU executable and can be
invoked from within PPU functions as separate threads,
specifying the name of the executable, the SPU on which
it is to be executed, and other relevant parameters.
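A sketch of this launch sequence using libspe2 is given below. The
embedded program handle name (fbp_spu) and the per-SPE argument blocks
are illustrative assumptions; since spe_context_run blocks, one PPU
thread is created per SPE so that all six run concurrently.

/* PPU-side launch of the SPU kernel via libspe2 (illustrative sketch). */
#include <libspe2.h>
#include <pthread.h>

extern spe_program_handle_t fbp_spu;  /* hypothetical embedded SPU binary */

static void *spe_thread(void *argp)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &fbp_spu);
    spe_context_run(ctx, &entry, 0, argp, NULL, NULL); /* blocks until done */
    spe_context_destroy(ctx);
    return NULL;
}

void run_on_spes(void *args[], int nspe)  /* one argument block per SPE */
{
    pthread_t tid[6];
    for (int i = 0; i < nspe; ++i)
        pthread_create(&tid[i], NULL, spe_thread, args[i]);
    for (int i = 0; i < nspe; ++i)
        pthread_join(tid[i], NULL);
}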
Implementation of FBP
The filter and backprojection functions that have to be
evaluated corresponding to every input and output
voxel respectively, are considered as the kernel
functions. Filtering involves the convolution of a
one-dimensional filter with each row of input slices.
Back projection involves evaluation of weights for each
pixel, evaluation of the offset in the filtered image and
determining the contribution of the corresponding value
towards the back projection of the corresponding pixel.
Due to limitations in the RAM available on the Cell
machine, the input images are processed one at a time,
accumulating the weighted contribution of each input
image to the output voxels.
The issues encountered on the Cell processor include
limitations in RAM and SPU local memory as well
as issues in restructuring the code for optimal
performance on the Cell BE processor. There is an
increased programming overhead because the
programmer is responsible for the management of data
transfer and buffering between the PPU and SPUs and
for the synchronization of PPU and SPU execution, and
also because of constraints on data alignment, limits on
the size of data that can be transferred by DMA, and
endianness issues.
The block diagram depicting the implementation of FBP
on the Cell is shown in Figure 2. The
Init block corresponds to a one-time initialization
function for the system. Following initialization, the
image is loaded onto the main memory of the PPU. The
filter kernel is executed on the 6 SPUs. Each SPU
operates on a different region of the image. The filter
kernel function is executed once for each pixel and is
invoked multiple times to complete the processing of a
block corresponding to an SPU. Data transfers
accomplished using DMA are initiated as required by
the application. Constraints on SPU local memory
require that the filtered image be transferred back to the
PPU memory. Since double buffering is used, there is no
overhead for data transfers.
After completion of filtering for the input image,
backprojection is performed. All necessary output
voxels are updated with weighted values of
corresponding input pixels from the current image.
After completion of processing of the current image, the
filtering and backprojection operations are performed
for all remaining input images, accumulating the output
voxels for every input image. Following completion of
FBP for all input images, the output is written onto raw
image files.
Figure 2: FBP as implemented on the Cell BE
Results
The FBP code is implemented on the PlayStation 3 (PS3),
which contains a single Cell processor, 256 MB of RAM
and a 3.2 GHz clock. The data used for testing the
system is artificial CT data corresponding to the 3
Cylinders dataset. The performance on the Cell
processor is compared to that on an Intel Pentium D
processor running at the same clock speed and having a
RAM of 2 GB. The number of input images is 360 and the
dimension of each input image is 768 x 960. Following
back projection, 240 output images of dimension
400 x 400 voxels are obtained. Table 2 summarizes the
results obtained.
Processor          Filter (s)   Backproject (s)   Total (s)   Speedup
Intel Pentium D    600          1200              1800        --
Cell BE            47.5         212.9             260.4       6.9x

Table 2: Comparison of the time taken for execution of FBP on the
Intel Pentium D (3.2 GHz, 2 GB RAM) and the Cell BE
(3.2 GHz, 256 MB RAM) processors.
This is the first implementation of FBP on the PS3
machine. Other Cell-based implementations have used
the Cell Blade, which contains 2 Cell BE processors
with 2 PPEs, 16 SPEs and 2 GB RAM. The speedup
obtained with such a system is reported to be 14.16x [8].
REFERENCES
[1]. F. Natterer, The Mathematics of Computerized
Tomography, John Wiley & Sons, 2nd edition, 1986.
[2]. http://www.opengl.org
[3]. http://www.microsoft.com/directx
[4]. http://www.gpgpu.org
[5]. http://www.nvidia.com/object/cuda_home.html
[6]. NVIDIA CUDA Programming Guide.
[7]. Programmer's Guide - Cell BE Programming Tutorial,
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/FC857AE550F7EB83872571A80061F788/$file/CBE_Programming_Tutorial_v3.0.pdf
[8]. M. Knaup, S. Steckmann, O. Bockenbach and M.
Kachelrieß, "Tomographic image reconstruction using the
Cell broadband engine (CBE) general purpose hardware,"
Proceedings of Electronic Imaging, Computational
Imaging V, SPIE Vol. 6498, 64980P, pp. 1-10, January 2007.