SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6304
Performance Analysis of RSA Algorithm with CUDA Parallel Computing
Rishikesh Kadam1, Vishakha Vidhani2
1,2 Student, Department Of Computer Engineering, V.E.S.I.T., Mumbai University, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - In today’s data driven and digitally connected
world, computer system security is of paramount
importance. The computer security relies on cryptographic
security techniques for protection of data. RSA is one of the
most used cryptographic algorithms which can be sped up
using parallel computing techniques. The advances in
computing capability of Graphics Processing Unit as a co-
processor of the system CPU allows parallelization of RSA
encryption and decryption. This paper focuses on parallel
implementation of RSA using Nvidia’s Compute Unified
Device Architecture (CUDA). First, the RSA cryptographic
algorithm is studied. Then, it is designed for implementation
in CUDA framework. In the last step, the RSA function is
called parallely by all nodes in a Hadoop cluster. For this
step, JCUDA is used. The results are analyzed mainly for the
speed up introduced by GPU as compared to the CPU-only
implementation.
Key Words: Cryptography, RSA, CUDA, GPU, Hadoop.
1. INTRODUCTION
RSA algorithm is named after its inventors Ron Rivest, Adi
Shamir and Leonard Adleman. It is a public key
cryptographic algorithm whose security relies on the
difficulty of finding prime factors of large numbers.[1] But
the modular power used in this algorithm acts as a
bottleneck for its performance which makes its large scale
implementations inefficient. Thus, the optimization of RSA
has been a research focus. The advancements in multi-
core processing and parallel computing is huge boost for
RSA optimization.
In last decade, the use of Graphical Processing Unit in
training of machine learning models and neural networks
has been a major trend, but along with that, GPU turns out
to be exclusively beneficial for general purpose parallel
computing [2] after the advent of Compute Unified Device
Architecture (CUDA) technology. Nvidia’s CUDA platform
makes thousands of stream processors available to
provide parallel speed up. This paper explores the
implementation of CUDA for parallelization of RSA
algorithm and presents the performance analysis. First,
the RSA algorithm is encapsulated in Java Native Interface
(JNI). Then, JCUDA is used to implement the
parallelization.
In section 2 of the paper, traditional principle of RSA
algorithm is studied. Section 3 gives an architectural
overview of the system hardware used for the
implementation. Section 4 presents the design and
implementation of RSA parallelization and section 5 shows
the results of performance analysis.
2. RSA ALGORITHM
RSA algorithm is the most commonly used asymmetric
cryptography algorithm. It works on two different keys i.e.
public key and private key. The public key is known to
everyone and is used for encrypting messages. Encrypted
Messages can only be decrypted using the private key.
The keys for the RSA algorithm are generated in the
following way:
1. Choose two distinct large prime numbers, p and q.
2. Calculate n = pq, n is used as the modulus for both the
public and private keys.
3. Compute φ(pq) = (p − 1)(q − 1). (φ is totient function).
4. Pick an integer e such that 1 < e < φ(pq), and e and
φ(pq) are coprime (e and φ(pq) share no divisors other
than 1). The public key consists of e (often called public
exponent) and the modulus n.
5. Find d which satisfies the relation d*e = 1(mod φ(pq)).
The private key consists of d (private exponent) and the
modulus n.
The message M is encrypted using formula C ≡ Me mod n,
where C is the encrypted message. And it is decrypted
using the formula M ≡ Cd mod n.
Example :
1. The plaintext M to be encrypted: 1314
2. Let p=397 and q=401
3. n= pq = 159,197
4. φ(pq) = 396 * 400 = 158,400
5. The public key(e,n) is (343, 159197) and private key d
is 12,007 (d*e = 1 mod φ(pq))
6. C ≡ Me mod n, C ≡ 1314343 (mod 159,197) = 33,677
7. M ≡ Cd mod n, M ≡ 33,67712,007 (mod 159,197) = 1314
3. CUDA ARCHITECTURE
NVIDIA has designed a special C-based language CUDA to
utilize the massively parallel nature of GPU.[3] Parallel
execution is expressed by the kernel function that is
executed on a set of threads in parallel on GPU; GPU is also
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6305
called the device. This kernel code is a C code for only one
thread. The number of thread blocks, and the number of
threads within those blocks that execute this kernel in
parallel are given explicitly when this function is called.
NVIDIA’s CUDA architecture includes a number of
multiprocessors, each multiprocessor consists of:
● 8 scalar processor cores
● 2 special function units for transcendentals
● 1 multithreaded instruction unit
● On-chip shared memory
Fig -1: CUDA Hardware
Each multiprocessor has SIMD architecture (single
instruction multiple data). Each processor of the
multiprocessor executes a different thread but all the
threads run the same instruction but operate on different
data.
Several threads (up to 512) may execute concurrently
within a multiprocessor and communicate through a small
shared memory bank (16KB) which can be seen in Fig -1.
Shared memory is local to each multiprocessor unlike
device memory and allows local synchronization to be
more efficient. It is divided into many parts. Each thread
block within multiprocessor accesses its own part of
shared memory. This part of shared memory is not
accessible by any other thread block of this multiprocessor
or of some other multiprocessor.
4. PARALLELIZATION
This paper presents a two-level parallelization of RSA
algorithm, node level and thread level as shown in Fig -2.
Fig -2: Two Level Parallelization
CUDA framework is used for parallel implementation on
thread level. RSA algorithm makes packets with the same
length by dividing the plaintext or ciphertext. The same
function will be executed for all the packets. As mentioned
in Section 2, the text is considered as integers. All the
packets are considered as arrays with same number of
integer elements. The encryption and decryption
processes are carried out on these packets. We can get the
thread and block index and then, they are used to assign
packets (elements) to each thread. Thus, on thread level,
CUDA multi-thread programming improves the
performance of RSA algorithm. [4]
The large size of plaintext or ciphertext can be
troublesome to handle on a single computer. We used a
distributed file system to store the text to be processed.
Each node (computer) in the cluster executes the parallel
RSA algorithm function. This is node level or computer
level parallelization. In such clusters, each node gets a
small part of data. In this case, a part of text is processed
by each node. Then, each node performs thread level
multi-processing.
This paper uses Hadoop Distributed File System (HDFS)
[5] for distribution of the text to all the nodes in the
cluster. HDFS is block-structured file system where each
file is divided into multiple blocks. The blocks are stored
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6306
across the cluster of nodes. It is a Java based software
framework. Google’s MapReduce [6] and Google File
System (GFS) [7] papers built the foundation for HDFS. As
Hadoop is based on pure Java, we need to find a method to
implement our C language code of RSA on Hadoop. JCUDA
[7] provides an interface for invoking CUDA kernels using
Java code or framework such as Hadoop. The syntax and
usage of JCUDA are beyond the scope of this paper. A
sample of kernel code is as follows:
__global__ static void RSACUDA(int* arr)
{
long p,q,e,d,m,n,t,c,i;
int x = blockDimension.x * blockId.x + threadId.x;
p=191;
q=223;
n=p*q;
t=(p-1)*(q-1);
e=t-1;
d=1;
while(((e*d)%t)!=1)
d++;
for(i=0;i<50;i++)
{
c = encrypt(*(arr+i+x*50),e,n);
m = decrypt(c,d,n);
*(arr+i+x*50) = m;
}
}
In next section, performance of parallel implementation is
discussed.
5. VERIFICATION
For verification, we divide the tests into two parts. First,
we run the RSA program using a CPU-only processing.
Then, in the next step, we implement the RSA algorithm on
CUDA framework.
5.1 TEST ENVIRONMENT
The computers used for the testing have following
configuration:
● CPU: Intel Core i5-5200U 2.20GHz
● GPU: Nvidia GeForce 920M (2048MB Memory)
● Memory: 2.20GHz DDR3 (8192MB)
For development and testing, Visual Studio 2017 is used.
For GPU processing, CUDA Toolkit 10.1 is used. Using the
Hadoop framework, the text was distributed on a cluster
of 3 nodes. JCUDA was involved to invoke CUDA kernels
from Java code on each HDFS node.
5.2 RESULTS
Table -1: Time taken for execution (in milliseconds) for
different text sizes.
No. of
chars
CPU
10
Threads
100
Threads
500
Threads
10000 10176.45 12023.84 1256.73 276.49
20000 20312.82 24096.33 2303.34 530.18
30000 31034.47 36201.93 3678.01 760.48
40000 40583.05 47936.17 4907.38 1015.74
50000 51396.11 61034.28 6179.57 1346.05
60000 61564.26 71124.05 7389.18 1592.38
70000 71891.49 83456.61 8599.06 1904.75
80000 82392.73 96345.34 9803.49 2267.11
Table -1 displays the relationship between the size of
input text and the time taken for execution (in
milliseconds) in traditional CPU-only mode and GPU
implemented multithreading mode.
Chart -1: Time taken for execution by CPU-only and GPU
multithreading mode.
Chart 1 represents this output graphically. The
relationship between size of input text and execution time
is linear. As we can see, the execution time is very less
when we implement RSA algorithm in 500-thread mode. It
is almost 40 to 50 times faster than the CPU-only mode.
This output proves that higher the degree of parallelism,
better is the performance.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6307
Chart -2: Elements processed per millisecond in CPU-only
and multithreading mode
If number of threads are less, about 10, RSA algorithm
takes longer takes longer than CPU-only mode. This is the
result of co-operation processing and communication
overhead between CPU and GPU. It majorly consists of
copying of data from the host computer (CPU) to the
device (GPU). This can be verified in Chart -2.
6. CONCLUSIONS
This paper presented the CUDA based implementation of
RSA algorithm. We described the traditional RSA
cryptosystem. The large size of text and modular power
function act as the bottleneck for performance which is a
drawback for large scale implementation.
We used Nvidia’s Compute Unified Device Architecture
(CUDA) for parallel implementation of RSA. First, RSA was
executed in CPU-only mode. Then, it was implemented in
CUDA framework in 10-thread, 100-thread and 500-
thread modes. The results were compared which made it
clear that higher the degree parallelism, better us the
performance. The 10-thread mode gave a poor
performance as compared to CPU-only mode. It was a
result of communication and data transfer overhead
between CPU and GPU.
Apache Hadoop framework can be used for distributed
storage of large text files. All the nodes in the cluster store
a part of the text file and execute RSA in parallel on the
data.
REFERENCES
[1] R.L. Rivest, A. Shamir, L. Adleman, “A Method for
Obtaining Digital Signatures and Public-Key
Cryptosystems,” Communications of the ACM, vol. 21,
iss. 2, pp. 120-126, February 1978.
[2] John D. Owens, David Luebke, Naga Govindaraju,
Mark Harris, Jens Krüger, Aaron E. Lefohn, Tim
Purcell, “A survey of general-purpose computation on
graphics hardware,” Computer Graphics Forum, vol.
26, pp. 80-113, March 2007.
[3] NVIDIA Corporation, NVIDIA Compute Unified Device
Architecture Programming Guide, Version 10.1,
February 2019.
[4] Daniel Page, Nigel P. Smart, “Parallel cryptographic
arithmetic using a redundant montgomery
representation,” IEEE Transactions on Computers, vol.
53, no. 11, pp. 1474-1482, November 2004.
[5] Konstantin Shvachko, Hairong Kuang, Sanjay Radia,
Robert Chansler, “The Hadoop Distributed File
System,” Proceedings of the 2010 IEEE 26th
Symposium on Mass Storage Systems and
Technologies, pp. 1-10, May 2010.
[6] Jeffrey Dean, Sanjay Ghemawat, “MapReduce:
Simplified Data Processing on Large Clusters,” Sixth
Symposium on Operating System Design and
Implementation, San Francisco, pp. 137-150,
December 2004.
[7] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung,
“The Google File System,” Proceedings of the 19th
ACM Symposium on Operating Systems Principles, NY,
pp. 20-43, October 2003.
[8] Yonghong Yan, Max Grossman, Vivek Sarkar, “JCUDA:
A Programmer-Friendly Interface for Accelerating
Java Programs with CUDA,” Proceedings of the 15th
International Euro-Par Conference on Parallel
Processing, pp. 887–899, August 2009.

More Related Content

What's hot (19)

PDF
Analysis of rsa algorithm using gpu
IJNSA Journal
 
PDF
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
CSCJournals
 
PDF
Efficient Broadcast Authentication with Highest Life Span in Wireless Sensor ...
IRJET Journal
 
PPTX
Lec08 optimizations
Taras Zakharchenko
 
PDF
LITTLE DRAGON TWO: AN EFFICIENT MULTIVARIATE PUBLIC KEY CRYPTOSYSTEM
IJNSA Journal
 
PDF
Implementation of RSA Algorithm with Chinese Remainder Theorem for Modulus N ...
CSCJournals
 
PDF
A New Key Agreement Protocol Using BDP and CSP in Non Commutative Groups
Eswar Publications
 
PDF
A Novel Method for Preventing Selective Jamming Attacks in Wireless Networks
IJMER
 
PDF
Full Communication in a Wireless Sensor Network by Merging Blocks of a Key Pr...
cscpconf
 
PDF
Performance evaluation of ecc in single and multi( eliptic curve)
Danilo Calle
 
PDF
EXPERIMENTS ON DIFFERENT RECURRENT NEURAL NETWORKS FOR ENGLISH-HINDI MACHINE ...
csandit
 
PDF
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
CSCJournals
 
PDF
Solution(1)
Gopi Saiteja
 
PDF
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
IJNSA Journal
 
PDF
Secure multipath routing scheme using key
ijfcstjournal
 
PDF
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
ijsrd.com
 
PDF
Zenoh Tutorial
Angelo Corsaro
 
PDF
Geo distributed parallelization pacts in map reduce
eSAT Publishing House
 
PPTX
Lec07 threading hw
Taras Zakharchenko
 
Analysis of rsa algorithm using gpu
IJNSA Journal
 
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
CSCJournals
 
Efficient Broadcast Authentication with Highest Life Span in Wireless Sensor ...
IRJET Journal
 
Lec08 optimizations
Taras Zakharchenko
 
LITTLE DRAGON TWO: AN EFFICIENT MULTIVARIATE PUBLIC KEY CRYPTOSYSTEM
IJNSA Journal
 
Implementation of RSA Algorithm with Chinese Remainder Theorem for Modulus N ...
CSCJournals
 
A New Key Agreement Protocol Using BDP and CSP in Non Commutative Groups
Eswar Publications
 
A Novel Method for Preventing Selective Jamming Attacks in Wireless Networks
IJMER
 
Full Communication in a Wireless Sensor Network by Merging Blocks of a Key Pr...
cscpconf
 
Performance evaluation of ecc in single and multi( eliptic curve)
Danilo Calle
 
EXPERIMENTS ON DIFFERENT RECURRENT NEURAL NETWORKS FOR ENGLISH-HINDI MACHINE ...
csandit
 
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
CSCJournals
 
Solution(1)
Gopi Saiteja
 
SECURITY ENHANCED KEY PREDISTRIBUTION SCHEME USING TRANSVERSAL DESIGNS AND RE...
IJNSA Journal
 
Secure multipath routing scheme using key
ijfcstjournal
 
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
ijsrd.com
 
Zenoh Tutorial
Angelo Corsaro
 
Geo distributed parallelization pacts in map reduce
eSAT Publishing House
 
Lec07 threading hw
Taras Zakharchenko
 

Similar to IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing (20)

PDF
Efficient algorithm for rsa text encryption using cuda c
csandit
 
PPT
Introduction to parallel computing using CUDA
Martin Peniak
 
PDF
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
International Research Journal of Modernization in Engineering Technology and Science
 
PPTX
Intro to GPGPU Programming with Cuda
Rob Gillen
 
PPTX
Intro to GPGPU with CUDA (DevLink)
Rob Gillen
 
PDF
Ef35745749
IJERA Editor
 
PDF
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
PDF
Design and Implementation of Quintuple Processor Architecture Using FPGA
IJERA Editor
 
PDF
Ultra Fast SOM using CUDA
QuEST Global (erstwhile NeST Software)
 
PDF
Secure Text Transfer Using Diffie-Hellman Key Exchange Based On Cloud
IRJET Journal
 
PDF
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
RSIS International
 
PDF
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
IJECEIAES
 
PDF
IRJET - Implementation of Neural Network on FPGA
IRJET Journal
 
PPTX
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Stefano Di Carlo
 
PDF
Effective Sparse Matrix Representation for the GPU Architectures
IJCSEA Journal
 
PDF
Effective Sparse Matrix Representation for the GPU Architectures
IJCSEA Journal
 
PPT
Vpu technology &gpgpu computing
Arka Ghosh
 
PPT
Vpu technology &gpgpu computing
Arka Ghosh
 
PPT
Vpu technology &gpgpu computing
Arka Ghosh
 
PPT
Vpu technology &gpgpu computing
Arka Ghosh
 
Efficient algorithm for rsa text encryption using cuda c
csandit
 
Introduction to parallel computing using CUDA
Martin Peniak
 
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
International Research Journal of Modernization in Engineering Technology and Science
 
Intro to GPGPU Programming with Cuda
Rob Gillen
 
Intro to GPGPU with CUDA (DevLink)
Rob Gillen
 
Ef35745749
IJERA Editor
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
Design and Implementation of Quintuple Processor Architecture Using FPGA
IJERA Editor
 
Ultra Fast SOM using CUDA
QuEST Global (erstwhile NeST Software)
 
Secure Text Transfer Using Diffie-Hellman Key Exchange Based On Cloud
IRJET Journal
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
RSIS International
 
Optimizing Apple Lossless Audio Codec Algorithm using NVIDIA CUDA Architecture
IJECEIAES
 
IRJET - Implementation of Neural Network on FPGA
IRJET Journal
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Stefano Di Carlo
 
Effective Sparse Matrix Representation for the GPU Architectures
IJCSEA Journal
 
Effective Sparse Matrix Representation for the GPU Architectures
IJCSEA Journal
 
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Arka Ghosh
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Ad

Recently uploaded (20)

PDF
CLIP_Internals_and_Architecture.pdf sdvsdv sdv
JoseLuisCahuanaRamos3
 
PPTX
Stability of IBR Dominated Grids - IEEE PEDG 2025 - short.pptx
ssuser307730
 
PDF
May 2025: Top 10 Read Articles in Data Mining & Knowledge Management Process
IJDKP
 
PDF
13th International Conference of Security, Privacy and Trust Management (SPTM...
ijcisjournal
 
PPTX
FSE_LLM4SE1_A Tool for In-depth Analysis of Code Execution Reasoning of Large...
cl144
 
PDF
NFPA 10 - Estandar para extintores de incendios portatiles (ed.22 ENG).pdf
Oscar Orozco
 
PDF
01-introduction to the ProcessDesign.pdf
StiveBrack
 
PDF
Rapid Prototyping for XR: Lecture 6 - AI for Prototyping and Research Directi...
Mark Billinghurst
 
PDF
Validating a Citizen Observatories enabling Platform by completing a Citizen ...
Diego López-de-Ipiña González-de-Artaza
 
PDF
Rapid Prototyping for XR: Lecture 2 - Low Fidelity Prototyping.
Mark Billinghurst
 
PPT
دراسة حاله لقرية تقع في جنوب غرب السودان
محمد قصص فتوتة
 
PDF
FSE-Journal-First-Automated code editing with search-generate-modify.pdf
cl144
 
PPT
FINAL plumbing code for board exam passer
MattKristopherDiaz
 
PPTX
Introduction to Python Programming Language
merlinjohnsy
 
PPTX
Mobile database systems 20254545645.pptx
herosh1968
 
PPTX
CST413 KTU S7 CSE Machine Learning Clustering K Means Hierarchical Agglomerat...
resming1
 
PPT
SF 9_Unit 1.ppt software engineering ppt
AmarrKannthh
 
PDF
Decision support system in machine learning models for a face recognition-bas...
TELKOMNIKA JOURNAL
 
PPTX
Precooling and Refrigerated storage.pptx
ThongamSunita
 
PPTX
Computer network Computer network Computer network Computer network
Shrikant317689
 
CLIP_Internals_and_Architecture.pdf sdvsdv sdv
JoseLuisCahuanaRamos3
 
Stability of IBR Dominated Grids - IEEE PEDG 2025 - short.pptx
ssuser307730
 
May 2025: Top 10 Read Articles in Data Mining & Knowledge Management Process
IJDKP
 
13th International Conference of Security, Privacy and Trust Management (SPTM...
ijcisjournal
 
FSE_LLM4SE1_A Tool for In-depth Analysis of Code Execution Reasoning of Large...
cl144
 
NFPA 10 - Estandar para extintores de incendios portatiles (ed.22 ENG).pdf
Oscar Orozco
 
01-introduction to the ProcessDesign.pdf
StiveBrack
 
Rapid Prototyping for XR: Lecture 6 - AI for Prototyping and Research Directi...
Mark Billinghurst
 
Validating a Citizen Observatories enabling Platform by completing a Citizen ...
Diego López-de-Ipiña González-de-Artaza
 
Rapid Prototyping for XR: Lecture 2 - Low Fidelity Prototyping.
Mark Billinghurst
 
دراسة حاله لقرية تقع في جنوب غرب السودان
محمد قصص فتوتة
 
FSE-Journal-First-Automated code editing with search-generate-modify.pdf
cl144
 
FINAL plumbing code for board exam passer
MattKristopherDiaz
 
Introduction to Python Programming Language
merlinjohnsy
 
Mobile database systems 20254545645.pptx
herosh1968
 
CST413 KTU S7 CSE Machine Learning Clustering K Means Hierarchical Agglomerat...
resming1
 
SF 9_Unit 1.ppt software engineering ppt
AmarrKannthh
 
Decision support system in machine learning models for a face recognition-bas...
TELKOMNIKA JOURNAL
 
Precooling and Refrigerated storage.pptx
ThongamSunita
 
Computer network Computer network Computer network Computer network
Shrikant317689
 

IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6304 Performance Analysis of RSA Algorithm with CUDA Parallel Computing Rishikesh Kadam1, Vishakha Vidhani2 1,2 Student, Department Of Computer Engineering, V.E.S.I.T., Mumbai University, India. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - In today’s data driven and digitally connected world, computer system security is of paramount importance. The computer security relies on cryptographic security techniques for protection of data. RSA is one of the most used cryptographic algorithms which can be sped up using parallel computing techniques. The advances in computing capability of Graphics Processing Unit as a co- processor of the system CPU allows parallelization of RSA encryption and decryption. This paper focuses on parallel implementation of RSA using Nvidia’s Compute Unified Device Architecture (CUDA). First, the RSA cryptographic algorithm is studied. Then, it is designed for implementation in CUDA framework. In the last step, the RSA function is called parallely by all nodes in a Hadoop cluster. For this step, JCUDA is used. The results are analyzed mainly for the speed up introduced by GPU as compared to the CPU-only implementation. Key Words: Cryptography, RSA, CUDA, GPU, Hadoop. 1. INTRODUCTION RSA algorithm is named after its inventors Ron Rivest, Adi Shamir and Leonard Adleman. It is a public key cryptographic algorithm whose security relies on the difficulty of finding prime factors of large numbers.[1] But the modular power used in this algorithm acts as a bottleneck for its performance which makes its large scale implementations inefficient. Thus, the optimization of RSA has been a research focus. The advancements in multi- core processing and parallel computing is huge boost for RSA optimization. In last decade, the use of Graphical Processing Unit in training of machine learning models and neural networks has been a major trend, but along with that, GPU turns out to be exclusively beneficial for general purpose parallel computing [2] after the advent of Compute Unified Device Architecture (CUDA) technology. Nvidia’s CUDA platform makes thousands of stream processors available to provide parallel speed up. This paper explores the implementation of CUDA for parallelization of RSA algorithm and presents the performance analysis. First, the RSA algorithm is encapsulated in Java Native Interface (JNI). Then, JCUDA is used to implement the parallelization. In section 2 of the paper, traditional principle of RSA algorithm is studied. Section 3 gives an architectural overview of the system hardware used for the implementation. Section 4 presents the design and implementation of RSA parallelization and section 5 shows the results of performance analysis. 2. RSA ALGORITHM RSA algorithm is the most commonly used asymmetric cryptography algorithm. It works on two different keys i.e. public key and private key. The public key is known to everyone and is used for encrypting messages. Encrypted Messages can only be decrypted using the private key. The keys for the RSA algorithm are generated in the following way: 1. Choose two distinct large prime numbers, p and q. 2. Calculate n = pq, n is used as the modulus for both the public and private keys. 3. Compute φ(pq) = (p − 1)(q − 1). (φ is totient function). 4. Pick an integer e such that 1 < e < φ(pq), and e and φ(pq) are coprime (e and φ(pq) share no divisors other than 1). The public key consists of e (often called public exponent) and the modulus n. 5. Find d which satisfies the relation d*e = 1(mod φ(pq)). The private key consists of d (private exponent) and the modulus n. The message M is encrypted using formula C ≡ Me mod n, where C is the encrypted message. And it is decrypted using the formula M ≡ Cd mod n. Example : 1. The plaintext M to be encrypted: 1314 2. Let p=397 and q=401 3. n= pq = 159,197 4. φ(pq) = 396 * 400 = 158,400 5. The public key(e,n) is (343, 159197) and private key d is 12,007 (d*e = 1 mod φ(pq)) 6. C ≡ Me mod n, C ≡ 1314343 (mod 159,197) = 33,677 7. M ≡ Cd mod n, M ≡ 33,67712,007 (mod 159,197) = 1314 3. CUDA ARCHITECTURE NVIDIA has designed a special C-based language CUDA to utilize the massively parallel nature of GPU.[3] Parallel execution is expressed by the kernel function that is executed on a set of threads in parallel on GPU; GPU is also
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6305 called the device. This kernel code is a C code for only one thread. The number of thread blocks, and the number of threads within those blocks that execute this kernel in parallel are given explicitly when this function is called. NVIDIA’s CUDA architecture includes a number of multiprocessors, each multiprocessor consists of: ● 8 scalar processor cores ● 2 special function units for transcendentals ● 1 multithreaded instruction unit ● On-chip shared memory Fig -1: CUDA Hardware Each multiprocessor has SIMD architecture (single instruction multiple data). Each processor of the multiprocessor executes a different thread but all the threads run the same instruction but operate on different data. Several threads (up to 512) may execute concurrently within a multiprocessor and communicate through a small shared memory bank (16KB) which can be seen in Fig -1. Shared memory is local to each multiprocessor unlike device memory and allows local synchronization to be more efficient. It is divided into many parts. Each thread block within multiprocessor accesses its own part of shared memory. This part of shared memory is not accessible by any other thread block of this multiprocessor or of some other multiprocessor. 4. PARALLELIZATION This paper presents a two-level parallelization of RSA algorithm, node level and thread level as shown in Fig -2. Fig -2: Two Level Parallelization CUDA framework is used for parallel implementation on thread level. RSA algorithm makes packets with the same length by dividing the plaintext or ciphertext. The same function will be executed for all the packets. As mentioned in Section 2, the text is considered as integers. All the packets are considered as arrays with same number of integer elements. The encryption and decryption processes are carried out on these packets. We can get the thread and block index and then, they are used to assign packets (elements) to each thread. Thus, on thread level, CUDA multi-thread programming improves the performance of RSA algorithm. [4] The large size of plaintext or ciphertext can be troublesome to handle on a single computer. We used a distributed file system to store the text to be processed. Each node (computer) in the cluster executes the parallel RSA algorithm function. This is node level or computer level parallelization. In such clusters, each node gets a small part of data. In this case, a part of text is processed by each node. Then, each node performs thread level multi-processing. This paper uses Hadoop Distributed File System (HDFS) [5] for distribution of the text to all the nodes in the cluster. HDFS is block-structured file system where each file is divided into multiple blocks. The blocks are stored
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6306 across the cluster of nodes. It is a Java based software framework. Google’s MapReduce [6] and Google File System (GFS) [7] papers built the foundation for HDFS. As Hadoop is based on pure Java, we need to find a method to implement our C language code of RSA on Hadoop. JCUDA [7] provides an interface for invoking CUDA kernels using Java code or framework such as Hadoop. The syntax and usage of JCUDA are beyond the scope of this paper. A sample of kernel code is as follows: __global__ static void RSACUDA(int* arr) { long p,q,e,d,m,n,t,c,i; int x = blockDimension.x * blockId.x + threadId.x; p=191; q=223; n=p*q; t=(p-1)*(q-1); e=t-1; d=1; while(((e*d)%t)!=1) d++; for(i=0;i<50;i++) { c = encrypt(*(arr+i+x*50),e,n); m = decrypt(c,d,n); *(arr+i+x*50) = m; } } In next section, performance of parallel implementation is discussed. 5. VERIFICATION For verification, we divide the tests into two parts. First, we run the RSA program using a CPU-only processing. Then, in the next step, we implement the RSA algorithm on CUDA framework. 5.1 TEST ENVIRONMENT The computers used for the testing have following configuration: ● CPU: Intel Core i5-5200U 2.20GHz ● GPU: Nvidia GeForce 920M (2048MB Memory) ● Memory: 2.20GHz DDR3 (8192MB) For development and testing, Visual Studio 2017 is used. For GPU processing, CUDA Toolkit 10.1 is used. Using the Hadoop framework, the text was distributed on a cluster of 3 nodes. JCUDA was involved to invoke CUDA kernels from Java code on each HDFS node. 5.2 RESULTS Table -1: Time taken for execution (in milliseconds) for different text sizes. No. of chars CPU 10 Threads 100 Threads 500 Threads 10000 10176.45 12023.84 1256.73 276.49 20000 20312.82 24096.33 2303.34 530.18 30000 31034.47 36201.93 3678.01 760.48 40000 40583.05 47936.17 4907.38 1015.74 50000 51396.11 61034.28 6179.57 1346.05 60000 61564.26 71124.05 7389.18 1592.38 70000 71891.49 83456.61 8599.06 1904.75 80000 82392.73 96345.34 9803.49 2267.11 Table -1 displays the relationship between the size of input text and the time taken for execution (in milliseconds) in traditional CPU-only mode and GPU implemented multithreading mode. Chart -1: Time taken for execution by CPU-only and GPU multithreading mode. Chart 1 represents this output graphically. The relationship between size of input text and execution time is linear. As we can see, the execution time is very less when we implement RSA algorithm in 500-thread mode. It is almost 40 to 50 times faster than the CPU-only mode. This output proves that higher the degree of parallelism, better is the performance.
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 6307 Chart -2: Elements processed per millisecond in CPU-only and multithreading mode If number of threads are less, about 10, RSA algorithm takes longer takes longer than CPU-only mode. This is the result of co-operation processing and communication overhead between CPU and GPU. It majorly consists of copying of data from the host computer (CPU) to the device (GPU). This can be verified in Chart -2. 6. CONCLUSIONS This paper presented the CUDA based implementation of RSA algorithm. We described the traditional RSA cryptosystem. The large size of text and modular power function act as the bottleneck for performance which is a drawback for large scale implementation. We used Nvidia’s Compute Unified Device Architecture (CUDA) for parallel implementation of RSA. First, RSA was executed in CPU-only mode. Then, it was implemented in CUDA framework in 10-thread, 100-thread and 500- thread modes. The results were compared which made it clear that higher the degree parallelism, better us the performance. The 10-thread mode gave a poor performance as compared to CPU-only mode. It was a result of communication and data transfer overhead between CPU and GPU. Apache Hadoop framework can be used for distributed storage of large text files. All the nodes in the cluster store a part of the text file and execute RSA in parallel on the data. REFERENCES [1] R.L. Rivest, A. Shamir, L. Adleman, “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems,” Communications of the ACM, vol. 21, iss. 2, pp. 120-126, February 1978. [2] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, Tim Purcell, “A survey of general-purpose computation on graphics hardware,” Computer Graphics Forum, vol. 26, pp. 80-113, March 2007. [3] NVIDIA Corporation, NVIDIA Compute Unified Device Architecture Programming Guide, Version 10.1, February 2019. [4] Daniel Page, Nigel P. Smart, “Parallel cryptographic arithmetic using a redundant montgomery representation,” IEEE Transactions on Computers, vol. 53, no. 11, pp. 1474-1482, November 2004. [5] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System,” Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1-10, May 2010. [6] Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Sixth Symposium on Operating System Design and Implementation, San Francisco, pp. 137-150, December 2004. [7] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, “The Google File System,” Proceedings of the 19th ACM Symposium on Operating Systems Principles, NY, pp. 20-43, October 2003. [8] Yonghong Yan, Max Grossman, Vivek Sarkar, “JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA,” Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp. 887–899, August 2009.