Proceedings of the International Conference on Emerging Trends in Engineering and Management (ICETEM14)
30 – 31, December 2014, Ernakulam, India
PERFORMANCE BOOSTING OF DISCRETE COSINE
TRANSFORM USING PARALLEL PROGRAMMING
METHODOLOGY
Aparna M.P
Final Year MTech, Dept of Computer Science & Engineering, Sree Narayana Gurukulam College of Engineering,
Kerala, India
Smitha Suresh
Associate Professor, Dept of Computer Science & Engineering, Sree Narayana Gurukulam College of Engineering,
Kerala, India
Anoop M.P
Software Engineer, Intel Corporation, Hillsboro, United States
ABSTRACT
The Discrete Cosine Transform (DCT) is one of the most widely used transforms in JPEG compression. The DCT
transforms an image (a 2D signal) from the spatial domain to the frequency domain. DCT and quantization are the first
two steps of the JPEG compression standard, in which the inter-pixel redundancy and psycho-visual redundancy of the
image are removed. However, these operations involve complex and time-consuming mathematical calculations such as
matrix multiplications. In this paper we demonstrate how the DCT algorithm can execute faster on a given processor
architecture by utilizing multiple processing cores and by using each core efficiently through SIMD instructions. DCT is
a classic example of a data-parallel algorithm, and its performance can be improved on a multi-core machine using
thread-level parallelism across the cores and vector-level parallelism within each core. Each processing core has vector
registers, which enable vector operations. The programming methodology used in this paper to enable thread-level and
vector-level parallelism is Cilk Plus. This paper focuses on demonstrating the speedup in the DCT/inverse DCT (IDCT)
and quantization/de-quantization algorithms.
Keywords: DCT, Parallel Programming.
1. INTRODUCTION
Images are real-world 3D scenes captured on a two-dimensional plane of pixels. In this paper we consider .bmp
image files in which each pixel is represented in 24-bit RGB bitmap format. The number of bits used to represent each
pixel determines the quality of the image: the greater the number of bits used to represent each color, the higher the
quality of the image.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING &
TECHNOLOGY (IJCET)
ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)
Volume 5, Issue 12, December (2014), pp. 105-108
© IAEME: www.iaeme.com/IJCET.asp
Journal Impact Factor (2014): 8.5328 (Calculated by GISI)
www.jifactor.com
Neighboring pixels in an image exhibit a certain level of correlation. A transformation maps the correlated data
to uncorrelated coefficients, thereby reducing the inter-pixel redundancy. The discrete cosine transform [1] is used in
lossy JPEG compression: a finite sequence of data points in the image is represented as a sum of cosine functions, and
the small high-frequency components are discarded. The image is split into blocks of size 8x8. The elimination of the
higher-frequency components, to which the human eye is not sensitive, is done through quantization.
The quality of an image is reduced when DCT followed by quantization is applied in image compression. The
inverse operations, namely de-quantization followed by the inverse Discrete Cosine Transform, are then applied to the
image in order to recover it as closely as possible. They also increase the load on the processors by increasing the
number of calculations.
Both thread-level and vector-level parallelism are implemented on multicore machines to improve performance.
Thread-level parallelism is applied across the cores, each of which processes independent 8x8 blocks of pixels obtained
by splitting the .bmp image. The speed of operation on each core is increased by using vector registers. In this paper we
use the Advanced Vector Extensions (AVX) architecture, in which each vector register is 32 bytes long and stores
multiple array elements for processing. Vector-level parallelism [2] is implemented on each core: data of the same type
are stored in an array and the operation is applied simultaneously to all the elements stored in that array. Data-level
parallelism is achieved using Single Instruction Multiple Data (SIMD) [6], which increases the computing speed by
applying the same operation to multiple data elements held in the vector registers. Cilk Plus [8] is the programming
methodology used in this paper to enable thread-level and vector-level parallelism.
2. DISCRETE COSINE TRANSFORM ALGORITHM
The DCT algorithm has the following steps:
1. The image that needs to be compressed is broken down into 8x8 blocks of pixels.
2. The DCT algorithm is applied to each image block.
3. The quantization algorithm is applied to each block to eliminate the higher-frequency components.
4. The quantized image block is then de-quantized.
5. Finally, the inverse DCT is applied to each de-quantized block.
The quality of the output image depends on the degree of compression, which varies with the quantization
matrix chosen. The quality level ranges from 1 to 100. A value of 100 denotes the best image quality with the lowest
compression, and a value of 1 denotes the highest compression with the poorest image quality. In this paper we use the
quant90 matrix for the compression.
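Steps 3 and 4 can be sketched in standard C++ as follows. This is a minimal illustration rather than the implementation used in this paper: the type name Block, the function names, the row-by-row storage and the rounding to the nearest integer are assumptions, and the quantization table is passed in as a parameter because the values of quant90 are not listed here.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// One 8x8 block stored row by row (hypothetical layout for this sketch).
using Block = std::array<float, 64>;

// Step 3: divide each coefficient by the matching table entry.
// Rounding to the nearest integer is an assumption of this sketch.
Block quantize(const Block& coeffs, const Block& table) {
    Block out{};
    for (int k = 0; k < 64; ++k) {
        assert(table[k] != 0.0f);  // a table entry of zero would be invalid
        out[k] = std::round(coeffs[k] / table[k]);
    }
    return out;
}

// Step 4: multiply each quantized value by the same table entry.
Block dequantize(const Block& q, const Block& table) {
    Block out{};
    for (int k = 0; k < 64; ++k) out[k] = q[k] * table[k];
    return out;
}
```

With a table entry of 10, a coefficient of 104 is quantized to 10 and de-quantized to 100, while a coefficient of 4 becomes 0. The larger the table entries, the more of the small coefficients are eliminated.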
3. CILK PLUS PROGRAMMING METHODOLOGY
The traditional C/C++ programming languages were not designed to express the potential parallelism in an
application. This called for language extensions that enable the programmer to express that parallelism. Cilk Plus [9] is
a parallel programming model which provides tools for enabling both multi-threading and SIMD in an application. The
threading solution is offered through three keywords: cilk_for, cilk_spawn and cilk_sync [8]. The SIMD solution is
offered through three explicit vectorization tools: array notations, pragma simd [4] and SIMD-enabled functions. The
Cilk Plus specification is supported by the Intel® C++ Compiler 13.0.
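The three threading keywords can be illustrated with a small sketch. The functions range_sum and double_all are hypothetical and are not part of this paper's DCT code. The sketch assumes that a compiler with Cilk Plus enabled predefines the macro __cilk; on any other compiler the keywords are replaced by their serial equivalents, so the same source still builds and runs as ordinary C++ on a single thread.

```cpp
#include <cassert>
#include <vector>

// Assumption: __cilk is predefined when Cilk Plus is enabled.
// Otherwise fall back to a serial version of the three keywords.
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#define cilk_for for
#endif

// Sum v[lo, hi) by splitting the range in two (hypothetical helper).
long range_sum(const std::vector<int>& v, int lo, int hi) {
    assert(lo <= hi);
    if (hi - lo <= 4) {
        long s = 0;
        for (int i = lo; i < hi; ++i) s += v[i];
        return s;
    }
    const int mid = lo + (hi - lo) / 2;
    long left = cilk_spawn range_sum(v, lo, mid);  // may run in parallel
    long right = range_sum(v, mid, hi);            // runs in the caller
    cilk_sync;                                     // wait for the spawned call
    return left + right;
}

// Double every element; the iterations are independent of each other.
void double_all(std::vector<int>& v) {
    cilk_for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        v[i] *= 2;
    }
}
```

cilk_spawn and cilk_sync suit recursive, task-style work, whereas cilk_for suits loops with independent iterations, which is the shape of the per-block loop used for the DCT.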
4. PROPOSED METHODOLOGY
In this paper we assume the following:
1. An image of resolution 3264 x 2448 (24-bit RGB bitmap format).
2. A machine with 4 processing cores, each supporting the AVX architecture.
3. If serial, scalar processing of the image takes “n” units of time, then enabling multi-threading on a 4-core machine
reduces the theoretical processing time by a factor of 4, to “n/4” units of time.
4. Each operation involves single-precision floating-point data. Vector operations targeting the AVX architecture offer
a theoretical potential speedup of 8x in comparison to the serial implementation, since a 32-byte register holds
eight 4-byte values. The theoretical time taken is “n/8” units of time.
5. Combining threading and vectorization targeting the AVX architecture gives a theoretical potential speedup of 32x
(the theoretical time taken is “n/32” units of time).
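The arithmetic behind these assumptions can be written out as a short sketch. The names below are hypothetical, and the figures are ideal upper bounds that ignore memory traffic, scheduling overhead and any code that does not parallelize.

```cpp
#include <cassert>

// Width of one AVX register in bytes, as assumed in this paper.
constexpr int kAvxRegisterBytes = 32;

// Single-precision values that fit in one AVX register.
constexpr int kLanes = kAvxRegisterBytes / static_cast<int>(sizeof(float));
static_assert(kLanes == 8, "this sketch assumes a 4-byte float");

// Ideal speedup when threading and vectorization both scale perfectly.
constexpr int ideal_speedup(int cores, int lanes) { return cores * lanes; }

// Ideal run time for a serial, scalar run time of n units.
constexpr double ideal_time(double n, int cores, int lanes) {
    return n / ideal_speedup(cores, lanes);
}
```

For example, a serial, scalar run time of 64 units gives ideal_time(64.0, 4, 8) = 2 units on the assumed 4-core AVX machine.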
4.1. Serial implementation with scalar operation
A DCT matrix of size 8x8 is generated using equation 1, and its transpose, which serves as the inverse DCT
(IDCT) matrix, is also generated. The transform is applied by multiplying the DCT matrix by the 8x8 image block and
then by the IDCT matrix. The quantization matrix is quant90, and the quantized matrix is obtained by dividing the
transformed image block, element by element, by quant90. The quantized matrix is then de-quantized, and the inverse
DCT is applied to the block to obtain the final block. In the serial implementation only one core processes the image
blocks, one after another. The full payload of the for loop is executed serially (single thread) in scalar mode.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create the quantization matrix (quant90)
4. Divide the image into 8x8 blocks
5. Serial loop with scalar operations:
6. for i = 1 to n (number of image blocks) do
7.     Compute DCT of block[i] => Transform = (DCT * block[i] * IDCT)
8.     Quantize the transformed image block => Quantized matrix = (Transform / quant90)
9.     De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
10.    Compute inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)
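A minimal standard C++ sketch of the serial algorithm above is given here for illustration. It is not the code measured in this paper. The DCT matrix uses the standard orthonormal DCT-II form, which is an assumption because equation 1 is not reproduced in this text. Rounding during quantization, the type name Mat and all function names are likewise assumptions, and the quantization matrix is a parameter because the quant90 values are not listed.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int N = 8;
using Mat = std::array<std::array<float, N>, N>;

// Step 1: orthonormal DCT-II matrix (assumed form of equation 1).
Mat make_dct() {
    const double pi = std::acos(-1.0);
    Mat t{};
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            const double v = (i == 0)
                ? 1.0 / std::sqrt(static_cast<double>(N))
                : std::sqrt(2.0 / N) * std::cos((2 * j + 1) * i * pi / (2.0 * N));
            t[i][j] = static_cast<float>(v);
        }
    }
    return t;
}

// Step 2: the IDCT matrix is the transpose of the DCT matrix.
Mat transpose(const Mat& a) {
    Mat r{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) r[j][i] = a[i][j];
    return r;
}

// Plain scalar 8x8 matrix product.
Mat multiply(const Mat& a, const Mat& b) {
    Mat r{};  // zero-initialized accumulator
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k) r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Steps 7 to 10 for a single block.
Mat process_block(const Mat& block, const Mat& dct, const Mat& idct,
                  const Mat& quant) {
    Mat coeffs = multiply(multiply(dct, block), idct);  // step 7
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            assert(quant[i][j] != 0.0f);
            // Step 8 (rounding is an assumption of this sketch).
            const float q = std::round(coeffs[i][j] / quant[i][j]);
            coeffs[i][j] = q * quant[i][j];             // step 9
        }
    }
    return multiply(multiply(idct, coeffs), dct);       // step 10
}
```

The serial loop of step 6 then calls process_block once per block. For a block in which every pixel has the value 16 and a quantization matrix filled with 10, the only non-zero coefficient is 128, which is quantized to 13 and de-quantized to 130, so the reconstructed pixels are about 16.25 rather than 16.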
4.2. Thread level parallelism implementation with scalar operation
In the thread-level parallel implementation, the 124848 blocks ((3264 x 2448)/(8 x 8)) are divided among the
4 cores. The 4 cores of the machine execute the same code on 4 different image blocks simultaneously. The theoretical
speedup possible from this threading solution is 4x.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create the quantization matrix (quant90)
4. Divide the image into 8x8 blocks
5. Thread level parallelism with scalar operations:
6. Divide the n image blocks by 4 (number of cores available in the machine)
7. Assign n/4 blocks to each core for processing in scalar mode
8. Assign each image block to a thread available on the core
9. for each block[i] assigned to a thread do
10.    Compute DCT of block[i] => Transform = (DCT * block[i] * IDCT)
11.    Quantize the transformed image block => Quantized matrix = (Transform / quant90)
12.    De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
13.    Compute inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)
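The division of work among the cores can be sketched with cilk_for. This is an illustration under stated assumptions, not this paper's code: the blocks are assumed to be stored contiguously, process_block is a placeholder for steps 10 to 13, and the macro __cilk is assumed to be predefined when Cilk Plus is enabled. Without Cilk Plus the loop falls back to an ordinary serial for loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef __cilk            // assumption: predefined when Cilk Plus is enabled
#include <cilk/cilk.h>
#else
#define cilk_for for     // serial fallback so this builds as plain C++
#endif

constexpr int kBlockSize = 8 * 8;  // values per 8x8 block

// Number of 8x8 blocks in an image whose dimensions are multiples of 8.
constexpr int block_count(int width, int height) {
    return (width / 8) * (height / 8);
}

// Placeholder for steps 10 to 13. A real version would run the DCT,
// quantization, de-quantization and inverse DCT on the block.
void process_block(float* block) {
    for (int k = 0; k < kBlockSize; ++k) block[k] += 1.0f;
}

// Every iteration touches only its own block, so the iterations can
// be distributed over the available cores.
void process_image(std::vector<float>& blocks) {
    assert(blocks.size() % kBlockSize == 0);
    const int n = static_cast<int>(blocks.size()) / kBlockSize;
    cilk_for (int i = 0; i < n; ++i) {
        process_block(&blocks[static_cast<std::size_t>(i) * kBlockSize]);
    }
}
```

For the 3264 x 2448 image, block_count returns 124848, so an even split over 4 cores gives 31212 blocks per core.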
4.3. Vector level parallelism implementation with array operation
In the vector-level parallel implementation, the vector registers (targeting the AVX architecture) are used to
execute the operations in vector mode. A scalar single-precision operation works on just 4 bytes of data at a time,
whereas an AVX register is 8 times larger (32 bytes). Each instruction can therefore operate on 8 times more data than
in scalar mode. The theoretical potential speedup here is 8x.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create the quantization matrix (quant90)
4. Divide the image into 8x8 blocks
5. Vector level parallelism with single thread (serial mode):
6. for i = 1 to n (number of image blocks) do
7.     Divide the data of block[i] into arrays of 8 single-precision elements (one 32-byte vector register each)
8.     Assign each array to a vector register of the core for processing in a vectorized way using SIMD
9.     Compute DCT for array of block[i] => Transform = (DCT * block[i] * IDCT)
10.    Quantize the transformed image block => Quantized matrix = (Transform / quant90)
11.    De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
12.    Compute inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)
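The de-quantization step of this algorithm shows how one row of a block maps onto one vector register. The sketch below is plain C++ with hypothetical function names. Whether the compiler actually emits AVX instructions for the loop depends on the compiler and its options, so the vectorization is an expectation rather than a guarantee.

```cpp
#include <cassert>

// Single-precision values per 32-byte AVX register.
constexpr int kLanes = 32 / static_cast<int>(sizeof(float));
static_assert(kLanes == 8, "this sketch assumes a 4-byte float");

// De-quantize one row of an 8x8 block: out = q * table, element-wise.
// The trip count of 8 matches one AVX register, so a vectorizing
// compiler may turn the loop into a single vector multiply.
// In Cilk Plus array notation the loop would read:
//     out[0:8] = q[0:8] * table[0:8];
void dequantize_row(const float* q, const float* table, float* out) {
    for (int k = 0; k < kLanes; ++k) out[k] = q[k] * table[k];
}

// Apply the row operation to all 8 rows of a block stored row by row.
void dequantize_block(const float* q, const float* table, float* out) {
    for (int r = 0; r < 8; ++r) {
        dequantize_row(q + r * kLanes, table + r * kLanes, out + r * kLanes);
    }
}
```

Because every row of an 8x8 single-precision block is exactly 32 bytes, a block occupies eight register-sized rows, which is where the theoretical 8x figure comes from.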
4.4. Thread level parallelism implementation with vector operation
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create the quantization matrix (quant90)
4. Divide the image into 8x8 blocks
5. Divide the n image blocks by 4 (number of cores available in the machine)
6. for each of the 4 cores do (in parallel)
7.     for j = 1 to n/4 do
8.         //This loop body executes in multi-threaded SIMD mode
9.         Assign the data of block[j] to arrays on the core for processing in a vectorized way using SIMD
10.        Compute DCT for array of block[j] => Transform = (DCT * block[j] * IDCT)
11.        Quantize the transformed image block => Quantized matrix = (Transform / quant90)
12.        De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
13.        Compute inverse DCT of block[j] => Final block = (IDCT * (de-quantized image block) * DCT)
This implementation combines the threading and SIMD solutions discussed in Sections 4.2 and 4.3. The
theoretical speedup possible is 32x (theoretical speedup possible using multi-threading on a 4-core machine *
theoretical speedup possible using SIMD targeting Intel® AVX).
5. CONCLUSION
Irrespective of the engineering or science domain, we deal with many algorithms to simulate and solve
practical problems. Most practical applications fall under either task parallelism or data parallelism. Whichever kind of
parallelism a problem exhibits, there are ways to express the potential parallelism in the algorithm using parallel
programming models such as Intel® Cilk™ Plus. Making use of these parallel programming models helps utilize the
hardware resources better, thereby increasing the speed of execution of the algorithm.
REFERENCES
[1] Ken Cabeen and Peter Gent, “Image Compression and the Discrete Cosine Transform”, Math 45, College of the
Redwoods.
[2] Autovectorization Using the Intel® C++ Compiler -
https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf.
[3] Loop Vectorization - https://software.intel.com/en-us/articles/requirements-for-vectorizable-loops.
[4] Pragma SIMD for loop vectorization -
https://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd.
[5] Intel® Cilk™ Plus - https://software.intel.com/sites/default/files/article/185163/introduction-to-array-notation.pdf.
[6] SIMD parallelism -
https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/?wapkw=array+notation.
[7] Data parallelism - https://software.intel.com/sites/default/files/article/181418/whitepaperonelementalfunctions.Pdf.
[8] Intel® Cilk™ Plus to Achieve Data and Thread Parallelism -
https://software.intel.com/en-us/articles/data-and-thread-parallelism.
[9] P. Prasanth Babu, L. Rangaiah and D. Maruthi Kumar, “Comparison and Improvement of Image Compression
using DCT, DWT & Huffman Encoding Techniques”, International Journal of Computer Engineering &
Technology (IJCET), Volume 4, Issue 1, 2013, pp. 54 - 60, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[10] Neetu Rathi and Dr. Anil Kumar Sharma, “Secure Hybrid Watermarking using Discrete Wavelet Transform
(DWT) & Discrete Cosine Transform (DCT)”, International Journal of Computer Engineering & Technology
(IJCET), Volume 5, Issue 4, 2014, pp. 186 - 193, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
Performance boosting of discrete cosine transform using parallel programming methodology

  • 1. Proceedings of the International Conference on Emerging Trends in Engineering and Management (ICETEM14) 30 – 31, December 2014, Ernakulam, India 105 PERFORMANCE BOOSTING OF DISCRETE COSINE TRANSFORM USING PARALLEL PROGRAMMING METHODOLOGY Aparna M.P Final Year MTech, Dept of Computer Science & Engineering, Sree Narayana Gurukulam College of Engineering, Kerala, India Smitha Suresh Associate Professor, Dept of Computer Science & Engineering, Sree Narayana Gurukulam College of Engineering, Kerala, India Anoop M.P Software Engineer, Intel Corporation, Hillsboro, United States ABSTRACT Discrete Cosine Transform (DCT) is a most widely used transform in JPEG compression. DCT transforms an image (2D-signal) from time domain to frequency domain. DCT and Quantization are the first two steps in JPEG compression standard where inter-pixel redundancy and psycho-visual redundancy of the image are removed. However such operations involve complex and time consuming mathematical calculations such as the matrix multiplications. In this paper we demonstrate how DCT algorithm can execute faster on a given processor architecture by utilizing multiple processing cores and efficiently utilizing each processing core by generating SIMD instructions. DCT is a classic example of data parallel algorithm and the performance of this algorithm can be improved on a multi-core machine using the thread level parallelism and vector level parallelism within each processing core. Each processing core has vector registers which enables vector operations. The programming methodology used in this paper to enable thread level and vector level parallelism is Cilk Plus. This paper focuses on demonstrating the speedup in DCT/Inverse DCT (IDCT) and quantization/de-quantization algorithms. Keywords: DCT, Parallel Programming. 1. INTRODUCTION Images are the real world 3D scene captured on a 2 dimensional plane of pixels. 
In this paper we consider .bmp image files in which each pixel is represented in 24-bit RGB bitmap format. The number of bits used to represent each pixel determines the quality of the image: the greater the number of bits used to represent each color, the higher the quality of the image.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Volume 5, Issue 12, December (2014), pp. 105-108, © IAEME: www.iaeme.com/IJCET.asp. Journal Impact Factor (2014): 8.5328 (Calculated by GISI), www.jifactor.com
  • 2. Neighboring pixels in an image exhibit a certain level of correlation. A transformation maps the correlated data to uncorrelated coefficients, thereby reducing the inter-pixel redundancy. The discrete cosine transform [1] is used in lossy JPEG compression: a finite sequence of data points in the image is represented as a sum of cosine functions, and the small high-frequency components are discarded. The image is split into blocks of size 8x8. Quantization eliminates the higher-frequency components to which the human eye is not sensitive. Applying DCT followed by quantization therefore reduces the quality of the image. The inverse operation, namely de-quantization followed by the inverse discrete cosine transform, is then applied to the image, both to maintain its quality and to increase the load on the processors by increasing the number of calculations.

Both thread-level and vector-level parallelism are implemented on multicore machines to improve performance. Thread-level parallelism is applied across the cores, each of which processes independent 8x8 blocks of pixels obtained by splitting the .bmp image. The speed of operation on each core is increased by using vector registers. In this paper we use the Advanced Vector Extensions (AVX) architecture, in which each vector register is 32 bytes long and stores multiple elements of an array for processing. Vector-level parallelism [2] is implemented on each core: data of the same type are stored in an array, and the operation is applied simultaneously to all the elements of that array. Data-level parallelism is achieved using Single Instruction Multiple Data (SIMD) [6].
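As a rough illustration of the two levels of parallelism described above, the following Python sketch counts how many single-precision values fit in one 32-byte AVX register and shows one way a set of independent 8x8 blocks could be divided among cores. The function names (`lanes_per_register`, `split_blocks`), the 10-block example and the contiguous chunking scheme are assumptions made for illustration; the 4-core figure follows the setup assumed in Section 4. This is only a counting model: it is not Cilk Plus code and it does not run anything in parallel.

```python
import struct

AVX_REGISTER_BYTES = 32  # AVX vector register width stated in the text


def lanes_per_register(register_bytes=AVX_REGISTER_BYTES, fmt="=f"):
    """Number of values of the given struct format that fit in one register."""
    return register_bytes // struct.calcsize(fmt)


def split_blocks(num_blocks, num_cores):
    """Divide block indices into contiguous, near-equal chunks, one per core."""
    base, extra = divmod(num_blocks, num_cores)
    chunks, start = [], 0
    for core in range(num_cores):
        # The first `extra` cores take one additional block each.
        size = base + (1 if core < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Vector level: a 4-byte float in a 32-byte register gives 8 lanes.
lanes = lanes_per_register()

# An element-wise operation over one 8x8 block (64 values) then needs
# 64 / lanes vector operations instead of 64 scalar ones.
vector_ops_per_block = (8 * 8) // lanes

# Thread level: hypothetical example with 10 blocks shared by 4 cores.
chunks = split_blocks(10, 4)
```

With 8 lanes, one row of an 8x8 single-precision block fills exactly one AVX register, which is why the block size and the register width fit together conveniently in this model.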
SIMD increases computing speed by applying the same operation to multiple data elements stored in the vector registers. Cilk Plus [8] is the programming methodology used in this paper to enable thread-level and vector-level parallelism.

2. DISCRETE COSINE TRANSFORM ALGORITHM
The DCT algorithm has the following steps:
1. The image to be compressed is broken down into 8x8 blocks of pixels.
2. The DCT algorithm is applied to each image block.
3. The quantization algorithm is applied to each block to eliminate the higher-frequency components.
4. The quantized image block is then de-quantized.
5. Finally, the inverse DCT is applied to each de-quantized block.
The quality of the output image depends on the degree of compression, which varies with the quantization matrix chosen. The quality level ranges from 1 to 100. A value of 100 denotes the best image quality with the lowest compression, and a value of 1 denotes the highest compression with the poorest image quality. In this paper we use the quant90 matrix for compression.

3. CILK PLUS PROGRAMMING METHODOLOGY
The traditional C/C++ programming languages are not designed to express the potential parallelism in an application. This demanded extensions to the language that enable the programmer to express that parallelism. Cilk Plus [9] is a parallel programming model that provides tools for enabling both multi-threading and SIMD in an application. The threading solution is offered through three keywords: cilk_for, cilk_spawn and cilk_sync [8]. The SIMD solution is offered through three explicit vectorization tools: array notations, pragma simd [4] and SIMD-enabled functions. The Cilk Plus specification is supported by the Intel C++ Compiler 13.0.

4. PROPOSED METHODOLOGY
In this paper we assume the following:
1. An image of resolution 3264 x 2448 (24-bit RGB bitmap format).
2. A machine with 4 processing cores, each supporting the AVX architecture.
3. If serial and scalar processing of the image takes “n” units of time, then on a 4-core machine with multi-threading enabled the theoretical time taken to process the image is reduced by a factor of 4 (“n/4” units of time).
4. Each operation involves single-precision floating-point data. Vector operations targeting the AVX architecture have a theoretical potential speedup of 8x over the serial implementation, because a 32-byte register holds eight 4-byte values. The theoretical time taken is “n/8” units of time.
5. Combining threading and vectorization targeting the AVX architecture, the theoretical potential speedup is 32x (the theoretical time taken is “n/32” units of time).
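The five steps of Section 2 can be sketched for a single 8x8 block as follows. This is a plain serial Python sketch, not the Cilk Plus code used in the paper. Several things are assumed rather than taken from the text: the DCT matrix uses the standard orthonormal DCT-II definition (the paper's Equation 1 is not reproduced in this transcript), the quantization matrix is derived from the standard JPEG luminance table with the common quality-scaling rule and may differ from the authors' quant90 matrix, and the quantized values are rounded to the nearest integer. All function and variable names are illustrative.

```python
import math

N = 8  # block size used throughout the paper


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix (assumed form of the paper's Equation 1)."""
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == 0:
                t[i][j] = 1.0 / math.sqrt(n)
            else:
                angle = (2 * j + 1) * i * math.pi / (2 * n)
                t[i][j] = math.sqrt(2.0 / n) * math.cos(angle)
    return t


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain scalar matrix product, as in the serial implementation."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Standard JPEG luminance quantization table (quality 50).
Q50 = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quant_matrix(quality):
    """Scale Q50 to another quality level (assumed scaling rule)."""
    scale = (100 - quality) / 50.0 if quality > 50 else 50.0 / quality
    return [[max(1, round(q * scale)) for q in row] for row in Q50]


def process_block(block, dct, idct, quant):
    # Step 2: Transform = DCT * block * IDCT
    transform = matmul(matmul(dct, block), idct)
    # Step 3: quantize (element-wise division, rounding assumed)
    quantized = [[round(transform[i][j] / quant[i][j]) for j in range(N)]
                 for i in range(N)]
    # Step 4: de-quantize (element-wise multiplication)
    dequantized = [[quantized[i][j] * quant[i][j] for j in range(N)]
                   for i in range(N)]
    # Step 5: Final block = IDCT * de-quantized block * DCT
    return matmul(matmul(idct, dequantized), dct)


dct = dct_matrix()
idct = transpose(dct)
quant90 = quant_matrix(90)
flat = [[100.0] * N for _ in range(N)]  # hypothetical constant test block
restored = process_block(flat, dct, idct, quant90)
```

A constant block is a convenient check because only its DC coefficient is nonzero, so the restored block differs from the input only by the rounding error introduced when that one coefficient is quantized.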
  • 3.
4.1. Serial Implementation with scalar operation
A DCT matrix of size 8x8 is generated using Equation 1, and its transpose (IDCT) is also generated. The transform is applied by multiplying the DCT matrix with the 8x8 image block and the IDCT. The quant90 matrix is the quantization matrix, and the quantized matrix is obtained by dividing the transformed image block by quant90. The quantized matrix is de-quantized, and the inverse DCT is applied to the block to obtain the final block. In the serial implementation only 1 core processes the image blocks, one after another. The full payload of the for loop is executed serially (single thread) in scalar mode.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an Inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create quantization matrix (quant90).
4. Divide the image into 8x8 blocks
5. Serial loop with scalar operations:
6. for i = 1 to n (number of image blocks) do
7.    Compute DCT of block[i] => Transform = (DCT * block[i] * IDCT)
8.    Quantize the transformed image block => Quantized matrix = (Transform / quant90)
9.    De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
10.   Compute Inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)

4.2. Thread level parallelism implementation with scalar operation
In the thread-level parallelism implementation, the 124848 blocks ((3264 x 2448)/(8 x 8)) are divided among the 4 cores. The 4 cores of the machine execute the same code on 4 different image blocks simultaneously. The theoretical speedup possible from this threading solution is 4x.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an Inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create quantization matrix (quant90).
4. Divide the image into 8x8 blocks
5. Thread level parallelism with scalar operations:
6. for i = 1 to n (number of image blocks) do
7.    Divide n by 4 (the number of cores available in the machine)
8.    Assign n/4 blocks to each core for processing in scalar mode
9.    Assign each image block to a thread available on the core
10.   Compute DCT of block[i] => Transform = (DCT * block[i] * IDCT)
11.   Quantize the transformed image block => Quantized matrix = (Transform / quant90)
12.   De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant90)
13.   Compute Inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)

4.3. Vector level parallelism implementation with array operation
In the vector-level parallelism implementation, the vector registers (targeting the AVX architecture) are used to execute the operations in vector mode. A scalar single-precision operand occupies just 4 bytes, whereas an AVX vector register is 8 times wider (32 bytes). Each instruction can therefore operate on 8 times more data than in scalar mode. The theoretical potential speedup here is 8x.
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an Inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create quantization matrix (quant).
4. Divide the image into 8x8 blocks
5. Vector level parallelism with single thread (serial mode)
6. for i = 1 to n (number of image blocks) do
7.    Divide n by 4 (the number of arrays available in the machine)
8.    Assign n/4 blocks to each array of the core for vectorized processing using SIMD
9.    Compute DCT for array of block[i] => Transform = (DCT * block[i] * IDCT)
10.   Quantize the transformed image block => Quantized matrix = (Transform / quant)
  • 4.
11.   De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant)
12.   Compute Inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)

4.4. Thread level parallelism implementation with vector operation
Algorithm
1. Create a DCT matrix of size 8x8
2. Create an Inverse DCT matrix (IDCT) of size 8x8 => IDCT = transpose (DCT)
3. Create quantization matrix (quant)
4. Divide the image into 8x8 blocks
5. for i = 1 to n (number of image blocks) do
6.    Divide n by 4 (the number of cores available in the machine)
7.    for j = 1 to n/4 do
8.       //This loop body executes in multi-threaded SIMD mode
9.       Assign n/z blocks to each array of the core for vectorized processing using SIMD
10.      Compute DCT for array of block[i] => Transform = (DCT * block[i] * IDCT)
11.      Quantize the transformed image block => Quantized matrix = (Transform / quant)
12.      De-quantize the quantized image block => De-quantized matrix = (Quantized matrix * quant)
13.      Compute Inverse DCT of block[i] => Final block = (IDCT * (de-quantized image block) * DCT)
This implementation combines the second and third implementations discussed previously (threading + SIMD solution). The theoretical speedup possible is 32x (the theoretical speedup possible using multi-threading on a 4-core machine multiplied by the theoretical speedup possible using SIMD targeting Intel® AVX).

5. CONCLUSION
Irrespective of the engineering or science domain, we deal with many algorithms that simulate and solve practical problems. Most practical applications fall under either task parallelism or data parallelism. Whichever kind of parallelism a problem falls under, there are ways to express the potential parallelism in the algorithm using parallel programming models such as Intel® Cilk™ Plus.
Making use of these parallel programming models helps utilize the hardware resources better, thereby increasing the speed of execution of the algorithm.

REFERENCES
[1] Ken Cabeen and Peter Gent, “Image Compression and the Discrete Cosine Transform”, Math 45, College of the Redwoods.
[2] Autovectorization Using the Intel® C++ Compiler - https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf.
[3] Loop Vectorization - https://software.intel.com/en-us/articles/requirements-for-vectorizable-loops.
[4] Pragma SIMD for loop vectorization - https://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd.
[5] Intel® Cilk™ Plus - https://software.intel.com/sites/default/files/article/185163/introduction-to-array-notation.pdf.
[6] SIMD parallelism - https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-notation/?wapkw=array+notation.
[7] Data parallelism - https://software.intel.com/sites/default/files/article/181418/whitepaperonelementalfunctions.Pdf.
[8] Intel® Cilk™ Plus to Achieve Data and Thread Parallelism - https://software.intel.com/en-us/articles/data-and-thread-parallelism.
[9] P. Prasanth Babu, L. Rangaiah and D. Maruthi Kumar, “Comparison and Improvement of Image Compression using DCT, DWT & Huffman Encoding Techniques”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 54 - 60, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[10] Neetu Rathi and Dr. Anil Kumar Sharma, “Secure Hybrid Watermarking using Discrete Wavelet Transform (DWT) & Discrete Cosine Transform (DCT)”, International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 4, 2014, pp. 186 - 193, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.