International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 07 Issue: 04 | Apr 2020 | www.irjet.net
© 2020, IRJET
Implementation of Neural Network on FPGA
Ranjith M S1, Sampanna T2, Niranjan Ganapati Hegde3, Shruthi R4
1,2,3Student, Dept. of ECE, The National Institute of Engineering, Mysuru, India
4Assistant Professor, Dept. of ECE, The National Institute of Engineering, Mysuru, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - Deep neural networks are rapidly changing many fields by bringing a different way of solving problems; they are applied in areas ranging from signal processing to communications. Although they solve many problems well, training and deploying them is limited for several reasons. Deep neural networks require powerful computational devices such as GPUs for training, and their operation demands large amounts of RAM and storage, which restricts their use in many applications. Even if these resources are provided through the cloud, deployment faces problems such as higher latency (every request makes a round trip to the server), dependence on connectivity, and reduced privacy, since the data must leave the device. These problems can be overcome by deploying the neural network on FPGAs, which are low cost and reconfigurable. In many cases FPGAs are faster than GPUs during inference, making them suitable for deployment in real-time applications. FPGAs are faster because the neural network is implemented directly as hardware, introducing only hardware delays, whereas a CPU or GPU must execute a large number of floating-point operations on its ALUs. We implement a neural network trained in TensorFlow on an FPGA using Verilog HDL to perform a regression task.
Key Words: Deep learning, FPGA, Regression, Verilog HDL, Artificial Intelligence, Neural Networks, TensorFlow
1. INTRODUCTION
Currently a variety of applications, such as image processing, are deployed on FPGAs because of their advantages: cost-optimized system integration, differentiated designs for a wide range of embedded applications, and a significant speed advantage over processor-based solutions. Even complicated calculations can be carried out in an extremely short time on an FPGA.
Deep neural networks are employed for tasks in many fields, such as object detection, speech recognition, natural language processing, and segmentation; most of these models are deployed on GPUs. We use an FPGA to deploy a deep neural network for a regression task. Hardware is generated for the trained network, which requires many floating-point operations to be performed before a prediction, i.e. the output, is obtained. The network is first implemented and trained in a Python environment using TensorFlow; the weights are extracted as arrays from the trained model, and these floating-point numbers are converted into the IEEE 754 double precision format. The whole network is then decomposed into floating-point multiplications and additions, which are implemented in Verilog HDL. Since Verilog HDL does not provide floating-point arithmetic by default, we implement modules that perform floating-point addition and multiplication on IEEE 754 double precision operands based on their sign, exponent, and mantissa parts.
Since we use only dense layers, we use He initialization [1] rather than Xavier initialization [2] for initializing the weights while training the network.
“"overfitting" is greatly reduced by randomly omittinghalf ofthefeaturedetectorsoneachtrainingcase. Thispreventscomplexco-
adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each
neuron learns to detect a feature that is generally helpful for producing the correctanswergiventhecombinatoriallylargevariety
of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks”[3].
Optimizers such as Adam [4], RMSProp [5], and LARS [6] lead to faster convergence and better optimization without getting stuck at saddle points or local minima. “One of the key elements of super-convergence is training with one learning rate cycle and a large maximum learning rate” [7]. To speed up training we use batch normalization [8].
2. EXPERIMENTAL SETUP
2.1 Dataset
The dataset contains the sold prices of many houses along with information such as the overall area in square feet. The maximum value of the area column is 1,306,800, whereas the maximum value of the price column is 3,600. If these raw values are passed into the network for training, they lead to overshooting and also slow the rate of convergence, since the loss surface is not smooth. The loss surface will have a large number of saddle points, making training difficult. These problems can be overcome by applying mean normalization to the data before training; the same normalization must be applied during inference.
Mean normalization is given by

x(i) ← (x(i) − μ(i)) / s(i)

where μ(i) is the average of all the values for a feature and s(i) is its standard deviation. After mean normalization the mean of a feature is nearly 0 and its standard deviation is nearly 1, which counters the vanishing gradient problem.
Fig-1 shows how the price varies with area before mean normalization is applied.
fig-1: Dataset
2.2 Training the Neural Network
We implemented two architectures for the same dataset: a perceptron and a neural network (multi-layer perceptron). The perceptron is implemented in a Python environment, and the gradient descent algorithm is used to find the optimal values of the weight and bias. Mean squared error is used as the cost function, and the parameters are optimized based on it; a minimal sketch of this loop appears below.
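The following sketch trains such a single-input perceptron with batch gradient descent on mean squared error; the learning rate and epoch count are assumptions, not values from the paper:

import numpy as np

def train_perceptron(x, y, lr=0.01, epochs=1000):
    # Single-input perceptron y_hat = w*x + b fitted by gradient descent.
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(epochs):
        err = w * x + b - y
        dw = (2.0 / n) * np.dot(err, x)   # d(MSE)/dw
        db = (2.0 / n) * err.sum()        # d(MSE)/db
        w -= lr * dw
        b -= lr * db
    return w, b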
The neural network is implemented and trained using the TensorFlow library in a Python environment. The network has 2 hidden layers and an output layer. The first hidden layer has 8 hidden units to which ReLU activation is applied; the purpose of the activation function
is to introduce non-linearities into the network. The output of this activation is fed to the second dense layer with 4 hidden units, whose output is again passed through a ReLU activation, and these values are used to predict the final output. The network architecture is shown in fig-2, and the structure, parameters, and their dimensions are given in table-1.
fig -2: Neural network architecture
Table -1: Network structure
Model structure and parameters

Layer     Output shape   No. of parameters
Input     (None, 1)      0
Dense_1   (None, 8)      16
Dense_2   (None, 4)      36
Output    (None, 1)      5

Total params: 57
Trainable params: 57
Non-trainable params: 0
The network is trained using the Adam optimizer with a batch size of 32, and the data is split into 90% for training and 10% for validation. Mean squared error is used as the loss function.
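A hedged sketch of this setup in TensorFlow/Keras is given below; the layer sizes, loss, batch size, and split match the description above, while the epoch count and the variable names x_norm and y_norm (the mean-normalized inputs and targets) are assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(1,)),  # 16 params
    tf.keras.layers.Dense(4, activation='relu'),                    # 36 params
    tf.keras.layers.Dense(1),                                       # 5 params
])
model.compile(optimizer='adam', loss='mse')

history = model.fit(x_norm, y_norm, batch_size=32, epochs=500,
                    validation_split=0.1)
weights = model.get_weights()   # arrays later converted to IEEE 754 bit patterns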
Table -2: Network results after training
              Final loss (mean squared error)
Training      0.2753
Validation    0.2088
fig -3: loss during training for the neural network
fig -4: loss during training for the perceptron
Table-2 shows the mean squared error values after training the neural network. Since the training error and validation error are close, the model has not overfitted the training data.

Fig-3 shows how the loss (mean squared error) varied during training of the neural network, and fig-4 shows how the loss varied during training (training data only) of the perceptron. In both cases the loss reached a very low value by the end of training, indicating that the models have learnt to predict the target values properly.
The models were trained on a Tesla K80 GPU, which can deliver up to 8.73 teraflops of single-precision and up to 2.91 teraflops of double-precision performance with NVIDIA GPU Boost.
3. IMPLEMENTATION OF TRAINED MODEL IN VERILOG HDL
The trained network is implemented in Verilog HDL to perform inference. Since the network computation involves floating-point arithmetic and Verilog does not provide functions for floating-point operations, the operations have to be implemented in terms of the sign, exponent, and mantissa. The optimized weights are converted into the 64-bit IEEE 754 double precision format.
fig-5: IEEE 754 double precision format
Fig-5 shows the IEEE 754 double precision format. Bit 63 is the sign bit: 1 indicates a negative number and 0 indicates a positive number. Bits 62 to 52 hold the exponent, and the remaining bits (51 to 0) hold the fractional part (mantissa).
In floating-point multiplication the result is negative if exactly one of the operands is negative and positive if both are negative or both are positive, which is the behaviour of an XOR gate. Hence the sign bit of the result is the XOR of the two operands' sign bits.

The exponent of the result is determined by the formula

Er = E1 + E2 − 1023
where E1 and E2 are the (biased) exponents of the two operands; the bias of 1023 is subtracted once because both stored exponents carry it. Hence an 11-bit adder is required for the exponent of the result.

The fractional part is obtained by ordinary multiplication of the operands' mantissas, and the result is rounded to 52 bits. Hence a multiplier is implemented for the fractional part.
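The sketch below mirrors these three steps in Python for normalized operands; it is a simplified model of the hardware, ignoring subnormals, infinities, NaNs, and rounding (the product is simply truncated to 52 bits):

import struct

def fields(x):
    (b,) = struct.unpack('>Q', struct.pack('>d', x))
    return b >> 63, (b >> 52) & 0x7FF, b & ((1 << 52) - 1)

def fp_mul(a, b):
    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    sign = sa ^ sb                              # XOR of the sign bits
    exp = ea + eb - 1023                        # Er = E1 + E2 - 1023
    prod = ((1 << 52) | fa) * ((1 << 52) | fb)  # multiply with implicit 1s
    if prod >> 105:                             # mantissa product in [2, 4)
        prod >>= 1                              # renormalize
        exp += 1
    frac = (prod >> 52) & ((1 << 52) - 1)       # truncate back to 52 bits
    bits = (sign << 63) | (exp << 52) | frac
    return struct.unpack('>d', struct.pack('>Q', bits))[0]

print(fp_mul(1.5, -2.25), 1.5 * -2.25)          # both print -3.375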
All the trained network parameters are thus converted into the IEEE 754 double precision format, and the whole network is implemented using these floating-point operation modules. The resulting module takes the input, performs the operations with the optimized weights, and produces the prediction.
Fig-6: Simulation output

Fig-6 shows the output prediction (output_z) of the implemented hardware for a given mean-normalized input area in square feet.
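As a cross-check, the same inference can be reproduced in NumPy and compared against the simulation output; this is a hedged sketch in which weights is assumed to be the list returned by model.get_weights() above and the input value is illustrative:

import numpy as np

def predict(x, weights):
    W1, b1, W2, b2, W3, b3 = weights
    a1 = np.maximum(0.0, x @ W1 + b1)    # Dense_1 + ReLU
    a2 = np.maximum(0.0, a1 @ W2 + b2)   # Dense_2 + ReLU
    return a2 @ W3 + b3                  # linear output layer

x = np.array([[(1500.0 - mu_area) / s_area]])  # mean-normalized square feet
print(predict(x, weights))                     # compare with output_z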
4. RESULTS AND CONCLUSIONS
The hardware implemented in Verilog HDL produced the same results as a GPU or CPU, which compute the results using their ALUs.

The hardware implementation reduces inference latency compared with a GPU or CPU in many cases. However, implementing the neural network in hardware takes more development time, and training the network on the FPGA is not practical because of the many training iterations and the gradient computations and weight updates they involve.
REFERENCES
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [cs.CV]
[2] Xavier Glorot, Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.
[3] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580v1 [cs.NE]
[4] Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980v9 [cs.LG]
[5] Geoffrey Hinton. 2012. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
[6] Yang You, Igor Gitman, Boris Ginsburg. Large Batch Training of Convolutional Networks. arXiv:1708.03888v3
[cs.CV]
[7] Leslie N. Smith, Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning
Rates. arXiv:1708.07120v3 [cs.LG]
[8] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal
Covariate Shift. arXiv:1502.03167v3 [cs.LG]
[9] Alexandre de Brébisson, Étienne Simon, Alex Auvolat, Pascal Vincent, Yoshua Bengio. Artificial Neural Networks
Applied to Taxi Destination Prediction. arXiv:1508.00021v2 [cs.LG]
[10] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S.,Irving, G., Isard, M., Kudlur, M.,
Levenberg, J., Monga, R., Moore, S., Murray, D. G.,Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke,M.,Yu,Y.,and
Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467
[cs.DC]
[11] J. D. Hunter, "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95,
2007.
[12] Plotly Technologies Inc. (2015). Collaborative data science. Montreal, QC: Plotly Technologies Inc.
[13] S. Sahin, Y. Becerikli, S. Yazici, "Neural network implementation in hardware using FPGAs", Neural Information Processing (ICONIP 2006), LNCS vol. 4234, pp. 1105–1112, 2006.
[14] Ranjith M S, S Parameshwara. Optimizing Neural Networks for Embedded Systems. IRJET, Volume 7, Issue 4, April 2020, S.NO: 211.
[15] Yufeng Hao. A General Neural Network Hardware Architecture on FPGA. arXiv:1711.05860 [cs.CV]
[16] S. Coric, I. Latinovic and A. Pavasovic, "A neural network FPGA implementation," Proceedings of the 5th Seminar on
Neural Network Applications in Electrical Engineering. NEUREL 2000 (IEEE Cat. No.00EX287), Belgrade, Yugoslavia,
2000, pp. 117-120.