Performance boosting of discrete cosine transform using parallel programming methodology

Proceedings of the International Conference on Emerging Trends in Engineering and Management (ICETEM14)
30 – 31, December 2014, Ernakulam, India
Aparna M.P
Final Year MTech, Dept of Computer Science & Engineering, Sree Narayana Gurukulam College of Engineering,
Kerala, India
Smitha Suresh
Associate Professor, Dept of Computer Science & Engineering, Sree Narayana Gurukulam College of Engineering,
Kerala, India
Anoop M.P
Software Engineer, Intel Corporation, Hillsboro, United States
ABSTRACT
The Discrete Cosine Transform (DCT) is one of the most widely used transforms in JPEG compression. The DCT transforms
an image (a 2D signal) from the spatial domain to the frequency domain. DCT and quantization are the first two steps in
the JPEG compression standard, where the inter-pixel redundancy and psycho-visual redundancy of the image are removed.
However, these operations involve complex and time-consuming mathematical calculations such as matrix multiplications.
In this paper we demonstrate how the DCT algorithm can execute faster on a given processor architecture by utilizing
multiple processing cores and by using each core efficiently through SIMD instructions. DCT is a classic example of a
data-parallel algorithm, and its performance can be improved on a multi-core machine using thread-level parallelism
across cores and vector-level parallelism within each core. Each processing core has vector registers which enable
vector operations. The programming methodology used in this paper to enable thread-level and vector-level parallelism
is Cilk Plus. This paper focuses on demonstrating the speedup in the DCT/Inverse DCT (IDCT) and
quantization/de-quantization algorithms.
Keywords: DCT, Parallel Programming.
1. INTRODUCTION
An image is a real-world 3D scene captured on a 2-dimensional plane of pixels. In this paper we consider .bmp
image files in which each pixel is represented in 24-bit RGB bitmap format. The number of bits used to represent
each pixel determines the quality of the image: the greater the number of bits used to represent each color, the
higher the quality of the image.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 5, Issue 12, December (2014), pp. 105-108
© IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI)
Neighboring pixels in an image exhibit a certain level of correlation. A transformation maps the correlated data
to uncorrelated coefficients, thereby reducing the inter-pixel redundancy. The discrete cosine transform [1] is used in
lossy JPEG compression: it represents a finite sequence of data points in the image as a sum of cosine functions, so
that the small high-frequency components of the image can be discarded. The image is split into blocks of size 8x8.
The higher frequency components, to which the human eye is not sensitive, are eliminated through quantization.
Applying DCT followed by quantization reduces the quality of the image. The inverse operation, namely
de-quantization followed by the inverse Discrete Cosine Transform, is then performed on the image in order to recover
it and to increase the load on the processors by increasing the number of calculations.
Parallelism at two levels, thread level and vector level, is implemented on the multi-core machine to improve
performance. Thread-level parallelism lets each core process independent 8x8 blocks of pixels obtained by splitting
the .bmp image. The speed of operation on each core is increased further by using vector registers. In this paper we
use the Advanced Vector Extensions (AVX) architecture, where each vector register is 32 bytes long and stores multiple
array elements for processing. Each core implements vector-level parallelism [2]: data of the same type are stored in
an array and one operation is applied simultaneously to all the elements of that array. This data-level parallelism is
achieved using Single Instruction Multiple Data (SIMD) [6] instructions, which increase computing speed by applying the
same operation to multiple data elements held in the vector registers. Cilk Plus [8] is the programming methodology
used in this paper to enable thread-level and vector-level parallelism.
2. DISCRETE COSINE TRANSFORM ALGORITHM
The DCT-based algorithm used in this paper has the following steps:
1. The image that needs to be compressed is broken down into 8x8 blocks of pixels.
2. The DCT algorithm is applied to each image block.
3. The quantization algorithm is applied to each block to eliminate the higher frequency components.
4. The quantized image block is then de-quantized.
5. Finally, the Inverse DCT is applied to each de-quantized block.
The quality of the output image depends on the degree of compression, which varies with the quantization
matrix chosen. The quality level varies from 1 to 100. A value of 100 denotes the best image quality with the lowest
compression, and a value of 1 represents the highest compression with poor image quality. In this paper we use the
quant90 matrix for the compression.
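The paper does not state how the quant90 matrix is derived. The following is a minimal sketch, assuming the commonly
used convention of scaling the standard JPEG quality-50 luminance table: entries are multiplied by (100 - quality)/50
for quality levels above 50 and by 50/quality otherwise. The function name quant_matrix and the clamping range 1..255
are illustrative choices, not taken from the paper.

```python
import math

# Standard JPEG luminance quantization table, conventionally treated as quality level 50.
Q50 = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quant_matrix(quality):
    """Scale Q50 to a quality level in 1..100 (assumed scaling convention)."""
    if not 1 <= quality <= 100:
        raise ValueError("quality must be in 1..100")
    # Above 50 the step sizes shrink (finer quantization); at or below 50 they grow.
    scale = (100 - quality) / 50 if quality > 50 else 50 / quality
    # Round to the nearest integer and clamp to 1..255 so no step size is zero.
    return [[min(255, max(1, math.floor(q * scale + 0.5))) for q in row]
            for row in Q50]


quant90 = quant_matrix(90)
print(quant90[0])  # first row of the scaled table
```

Smaller entries mean finer quantization steps, which is why a higher quality level gives lower compression.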
3. CILK PLUS PROGRAMMING METHODOLOGY
The traditional C/C++ programming languages were not designed to express potential parallelism in an application.
This called for language extensions which enable the programmer to express that parallelism. Cilk Plus [9] is a
parallel programming model which provides tools both for enabling multi-threading and for enabling SIMD in an
application. The threading solution is offered through three keywords: cilk_for, cilk_spawn and cilk_sync [8]. The
SIMD solution is offered through three explicit vectorization tools: array notations, pragma simd [4] and
SIMD-enabled functions. The Cilk Plus specification is supported by the Intel® C++ Compiler 13.0.
4. PROPOSED METHODOLOGY
In this paper we assume the following:
1. An image of resolution 3264 x 2448 (24-bit RGB bitmap format).
2. A machine with 4 processing cores, each core supporting the AVX architecture.
3. If serial and scalar processing of the image takes “n” units of time, then on a 4-core machine with multi-threading
enabled the theoretical time taken to process the image is reduced by a factor of 4, to “n/4” units of time.
4. Each operation involves single-precision floating-point data. A 32-byte AVX register holds 8 such 4-byte values, so
vector operations targeting the AVX architecture offer a theoretical potential speedup of 8x in comparison to the
serial implementation. The theoretical time taken is “n/8” units of time.
5. Combining the implementations of both threading and vectorization targeting the AVX architecture, the theoretical
potential speedup is 32x (theoretical time taken is “n/32” units of time).
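The arithmetic behind these assumptions can be checked with a short sketch. The constants come from the list above;
the variable names are illustrative, and the figures are upper bounds that ignore memory traffic and threading overhead.

```python
# Figures taken from the assumptions listed above.
WIDTH, HEIGHT = 3264, 2448   # image resolution in pixels
BLOCK = 8                    # edge length of one image block
CORES = 4                    # processing cores in the machine
REGISTER_BYTES = 32          # width of one AVX vector register
FLOAT_BYTES = 4              # size of one single-precision value

blocks = (WIDTH // BLOCK) * (HEIGHT // BLOCK)   # number of 8x8 blocks in the image
blocks_per_core = blocks // CORES               # share of each core under threading
lanes = REGISTER_BYTES // FLOAT_BYTES           # values processed per vector instruction
combined_speedup = CORES * lanes                # threading and SIMD combined

print("blocks:", blocks)
print("blocks per core:", blocks_per_core)
print("SIMD lanes:", lanes)
print("combined theoretical speedup:", combined_speedup)
```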

4.1 Serial Implementation with scalar operation
A DCT matrix of size 8x8 is generated using equation 1, and its transpose, the inverse DCT matrix (IDCT), is
generated from it. The transform is applied by multiplying the DCT matrix with the 8x8 image block and then with the
IDCT matrix. The quant90 matrix is the quantization matrix, and the quantized matrix is obtained by dividing the
transformed image block element-wise by quant90. The quantized matrix is then de-quantized and the inverse DCT is
applied to the block to get the final block. In the serial implementation only 1 core processes the image blocks, one
after another. The full payload of the for loop is executed serially (single thread) in scalar mode.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an Inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create quantization matrix (quant90).
4. Divide the image into 8x8 blocks
5. Serial loop with scalar operations:
6. for i = 1 to n (number of image blocks) do
7. Compute DCT of block[i] => Transform = (DCT * block[i] * IDCT)
8. Quantize the transformed image block => Quantized matrix = (Transform/quant90)
9. De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
10. Compute Inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)
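The serial, scalar loop body above can be sketched as follows. This is an illustration under stated assumptions, not
the authors' implementation: the DCT matrix is assumed to be the orthonormal DCT-II basis (equation 1 is not
reproduced in this text), quantization is assumed to round to the nearest integer, and PLACEHOLDER_QUANT is a
hypothetical flat matrix standing in for quant90.

```python
import math

N = 8  # edge length of one image block


def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix (assumed form of the paper's DCT matrix)."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for i in range(1, n):
        rows.append([math.sqrt(2.0 / n) * math.cos((2 * j + 1) * i * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain scalar matrix product, one multiply-add at a time."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def process_block(block, dct, idct, quant):
    transform = matmul(matmul(dct, block), idct)           # step 7: DCT * block * IDCT
    quantized = [[round(t / q) for t, q in zip(tr, qr)]    # step 8: element-wise divide, round
                 for tr, qr in zip(transform, quant)]
    dequantized = [[v * q for v, q in zip(vr, qr)]         # step 9: element-wise multiply
                   for vr, qr in zip(quantized, quant)]
    return matmul(matmul(idct, dequantized), dct)          # step 10: IDCT * block * DCT


DCT = dct_matrix()
IDCT = transpose(DCT)
PLACEHOLDER_QUANT = [[4] * N for _ in range(N)]  # hypothetical stand-in for quant90

flat_block = [[100.0] * N for _ in range(N)]     # a uniform block has only a DC coefficient
final_block = process_block(flat_block, DCT, IDCT, PLACEHOLDER_QUANT)
print(final_block[0])
```

Because the assumed DCT matrix is orthonormal, its transpose is also its inverse, which is what allows step 2 of the
algorithm to obtain the IDCT matrix by transposition.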
4.2 Thread level parallelism implementation with scalar operation
In the thread-level parallel implementation, the 124848 blocks ((3264*2448)/(8x8)) are divided among the 4
cores. The 4 cores of the machine execute the same code on 4 different image blocks simultaneously.
The theoretical speedup possible from this threading solution is 4x.
Algorithm
3. Create quantization matrix (quant90).
5. Thread level parallelism with scalar operations:
7. Divide n by 4 (the number of cores available in the machine)
8. Assign n/4 blocks to each core for processing in scalar mode
9. Assign each image block to a thread available on the core
10. Compute DCT of block[i] => Transform = (DCT * block[i] * IDCT)
11. Quantize the transformed image block => Quantized matrix = (Transform/quant90)
12. De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
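The division of blocks among cores can be sketched as below. This only illustrates the decomposition: the paper uses
Cilk Plus for threading, whereas this sketch uses Python's standard ThreadPoolExecutor, which in CPython does not give
a real speedup for CPU-bound work. The helper names and the simplified per-block function roundtrip are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk, process_block):
    # Scalar mode: the blocks of one chunk are handled one after another.
    return [process_block(b) for b in chunk]


def parallel_process(blocks, process_block, cores=4):
    chunks = split_into_chunks(blocks, cores)
    with ThreadPoolExecutor(max_workers=cores) as pool:
        results = pool.map(lambda c: process_chunk(c, process_block), chunks)
        # pool.map yields results in chunk order, so the block order is preserved.
        return [out for chunk_out in results for out in chunk_out]


def roundtrip(block, step=4):
    """Hypothetical stand-in for the per-block work: quantize, then de-quantize."""
    return [[round(v / step) * step for v in row] for row in block]


blocks = [[[float(i + r + c) for c in range(8)] for r in range(8)] for i in range(10)]
parallel_result = parallel_process(blocks, roundtrip)
serial_result = [roundtrip(b) for b in blocks]
print(parallel_result == serial_result)
```

The blocks are independent of each other, so no synchronization is needed beyond collecting the results in order.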
4.3 Vector level parallelism implementation with array operation
In vector-level parallelism, the vector registers (targeting the AVX architecture) are used to execute the
operations in vector mode. In scalar mode each instruction operates on a single 4-byte value, whereas an AVX register
is 8 times larger (32 bytes). This means each instruction can operate on 8 times more data than in scalar operation
mode. The theoretical potential speedup here is 8x.
Algorithm:
2. Create an Inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create quantization matrix (quant).
5. Vector level parallelism with single thread (serial mode)
7. Divide n by 4 (the number of arrays available in the machine)
8. Assign n/4 blocks to each array on the core for vectorized processing using SIMD
9. Compute DCT for array of block[i] => Transform = (DCT * block[i] * IDCT)
11. De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant)
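One row of an 8x8 single-precision block occupies exactly one 32-byte AVX register, so the matrix product can be
restructured so that its innermost operation works on a whole row at once. The sketch below shows only that loop
structure: Python still executes the 8 lanes one by one, and the function names are illustrative rather than taken
from the paper.

```python
LANES = 32 // 4  # 8 single-precision values fit in one 32-byte AVX register


def axpy_row(acc, scalar, row):
    """Return acc + scalar * row over all 8 lanes.

    Models a vector multiply followed by a vector add on one register-wide row.
    """
    return [a + scalar * x for a, x in zip(acc, row)]


def matmul_rowwise(a, b):
    """8x8 product built from whole-row operations: C[i] += A[i][k] * B[k]."""
    n = len(a)
    c = []
    for i in range(n):
        acc = [0.0] * n
        for k in range(n):
            acc = axpy_row(acc, a[i][k], b[k])
        c.append(acc)
    return c


def matmul_scalar(a, b):
    """Reference scalar product, one multiply-add per element pair."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a = [[float(i + j) for j in range(LANES)] for i in range(LANES)]
b = [[float(i * j) for j in range(LANES)] for i in range(LANES)]
print(matmul_rowwise(a, b) == matmul_scalar(a, b))
```

In this form an 8x8 product needs 64 row-wide operations instead of 512 scalar multiply-adds, which is where the
theoretical 8x figure comes from.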
4.4 Thread level parallelism implementation with vector operation
Algorithm
3. Create quantization matrix (quant)
6. Divide n by 4 (the number of cores available in the machine)
7. for j = 1 to n/4 do
8. //This loop body executes in multi-threaded SIMD mode
9. Assign n/z blocks to each array on the core for vectorized processing using SIMD
10. Compute DCT for array of block[i] => Transform = (DCT * block[i] * IDCT)
12. De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant)
This implementation combines the approaches of Sections 4.2 and 4.3 (threading + SIMD solution). The theoretical
speedup possible is 32x (the theoretical speedup possible using multi-threading on a 4-core machine multiplied by the
theoretical speedup possible using SIMD targeting Intel® AVX).
5. CONCLUSION
Irrespective of the engineering or science domain, many algorithms are used to simulate and solve practical
problems. Most practical applications exhibit either task parallelism or data parallelism. In either case, the
potential parallelism in the algorithm can be expressed using parallel programming models such as Intel® Cilk™ Plus.
Making use of these parallel programming models helps utilize the hardware resources better, thereby increasing the
speed of execution of the algorithm.
REFERENCES
[1] Ken Cabeen and Peter Gent, “Image Compression and the Discrete Cosine Transform”, Math 45, College of the
Redwoods.
[2] Autovectorization Using the Intel® C++ Compiler - https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf.
[3] Loop Vectorization - https://software.intel.com/en-us/articles/requirements-for-vectorizable-loops.
[4] Pragma SIMD for loop vectorization - https://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd.
[5] Intel® Cilk™ Plus - https://software.intel.com/sites/default/files/article/185163/introduction-to-array-notation.pdf.
[6] SIMD parallelism - https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/?wapkw=array+notation.
[7] Data parallelism - https://software.intel.com/sites/default/files/article/181418/whitepaperonelementalfunctions.Pdf.
[8] Intel® Cilk™ Plus to Achieve Data and Thread Parallelism - https://software.intel.com/en-us/articles/data-and-thread-parallelism.
[9] P. Prasanth Babu, L.Rangaiah and D.Maruthi Kumar, “Comparison and Improvement of Image Compression
using DCT, DWT & Huffman Encoding Techniques”, International Journal of Computer Engineering &
Technology (IJCET), Volume 4, Issue 1, 2013, pp. 54 - 60, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[10] Neetu Rathi and Dr. Anil Kumar Sharma, “Secure Hybrid Watermarking using Discrete Wavelet Transform
(DWT) & Discrete Cosine Transform (DCT)”, International Journal of Computer Engineering & Technology
(IJCET), Volume 5, Issue 4, 2014, pp. 186 - 193, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
