Towards Neural Processing for General Purpose Approximate Programs
Prasanna Kothalkar1, Mohid Nabil1, Vidhi Agrawal1, Paridha Saxena1
1Department of Computer and Electrical Engineering, University of Texas at Dallas
pxk122030@utdallas.edu, mxn150230@utdallas.edu, vna150130@utdallas.edu, pxs158430@utdallas.edu
Abstract
Modern processor architectures have focused on increasing processor speeds while reducing power consumption. The challenge of keeping energy consumption feasible as transistor density on microprocessor chips increases has led to a new generation of processors that replace program code with alternate, faster implementations at run time. In this report we present one such architecture, which uses neural networks to mimic regions of program code. We train neural networks on data generated by running the target programs, and then invoke the trained networks at run time in place of the original code to produce the program outputs.
Index Terms: computer architecture, neural processing units,
program acceleration
1. Introduction
Due to the limitations of technology scaling for modern processors, recent focus has moved towards computation specialization to run programs at high speed and minimal energy consumption. Recent work provides acceleration by exploiting error tolerance within an approximate computing framework. We implement Neural Processing Units (NPUs), a new class of configurable accelerators for approximate computation. Many application programs in diverse fields such as image processing, speech recognition, computer vision and signal processing can tolerate errors in computation, and thus provide immense opportunity to replace code regions with alternate computing executions. While traditional processors execute programs via an instruction set architecture and programming logic, NPUs are 'trained' to mimic regions of imperative code. We generate the training data by running the functions targeted for NPU acceleration and recording their inputs and outputs, which are then used to train the neural networks.
2. Neural Networks
Neural networks are brain-inspired machine learning models [1] that use neurons as their basic building block and are well suited to learning concepts in a hierarchical fashion. Similar to the brain, which builds up its understanding of a complex topic from simpler ideas, neural networks learn a concept as a function from data by learning to generate outputs from inputs. The learning is encoded in the weights on the edges that connect neurons in adjacent layers, and proceeds by reducing the output error through feedback, realized by the back-propagation algorithm.
Neural networks are a popular class of machine learning models and can in theory approximate any given function using a single hidden layer with a sufficiently large number of nodes. In practice such width is unattainable, so one way to represent richer functions is to add more layers. However, this vertical addition of layers faces the problem of vanishing gradients, where the gradient signal in the backpropagation algorithm loses information as it travels through many layers. A new class of algorithms based on layer-wise pre-training was developed, leading to a resurgence in the neural network literature and to a variety of new learning models. The basic idea is to initialize the neural network edge weights in a more informed manner, so that training is less likely to end up in poor local minima. Such models became known as 'Deep Neural Networks' and have led to impressive results on challenging problems in natural language processing, speech processing and computer vision. We plan to use Deep Belief Networks and Convolutional Neural Networks for our Sobel edge detection NPU in future work.
Figure 1: Neural Networks
2.1. Weight Update Rule
Neural network backpropagation updates the weights on neural network edges using the following update rule.
w_{ji} = w_{ji} + \Delta w_{ji}    (1)

where $w_{ji}$ represents the edge weight between node $j$ and node $i$, and $\Delta w_{ji}$ is the change in weight performed by the backpropagation step.

\Delta w_{ji} = \alpha (t_j - y_j) x_i    (2)

where $\alpha$ is the learning rate of the training algorithm, $t_j$ and $y_j$ are the target and output values for the $j$th output node, and $x_i$ is the input value at the $i$th input node.
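As a concrete illustration, the following minimal C++ sketch applies this delta rule to the weight vector of a single output neuron; the function and variable names are ours, chosen for illustration, and are not part of the FANN implementation.

#include <vector>

// Minimal sketch of the delta-rule update (Equations 1 and 2) for one
// output neuron j. 'weights[i]' plays the role of w_ji; names are illustrative.
void updateWeights(std::vector<double>& weights,
                   const std::vector<double>& inputs,  // x_i
                   double target,                      // t_j
                   double output,                      // y_j
                   double alpha)                       // learning rate
{
    for (std::size_t i = 0; i < weights.size(); ++i) {
        double delta = alpha * (target - output) * inputs[i];  // Equation (2)
        weights[i] += delta;                                    // Equation (1)
    }
}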
3. Data generation
Data extraction was carried out for training and implementation of the neural networks. For the k-means dataset, we used random inputs to our code and extracted the generated outputs. For the Sobel edge detection program, input images were provided and the Sobel edge detection computation was learned by the neural network, which was then fine-tuned on a separate validation set of images and tested on another set of images.

Dataset for Sobel edge detection. Sample size: 200 training, 200 validation, 100 testing. Input: features extracted for each pixel, so that each sample consists of an individual pixel value and its neighbors. Output: the Sobel edge detection sum value.

Dataset for k-means. Sample size: 850 training, 850 validation, 850 test. Input: a randomly generated vector of size 10. Output: a vector of size 10 giving the cluster membership of each of the 10 input values.
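For illustration, the following C++ sketch generates one such k-means training pair: ten random values in the range 0-100 and, for each value, a cluster label. In our experiments the labels were produced by running the actual k-means program; the two fixed cluster centers used below are purely an assumption for the sketch.

#include <cstdlib>
#include <cmath>
#include <vector>

// Sketch of generating one k-means training sample: 10 random inputs in
// [0, 100] and a 10-element label vector (cluster 0 or 1 per value). The
// fixed centers c0 and c1 are hypothetical; the real labels came from our
// k-means code.
void makeKmeansSample(std::vector<double>& inputs, std::vector<int>& labels)
{
    const double c0 = 25.0, c1 = 75.0;  // hypothetical cluster centers
    inputs.assign(10, 0.0);
    labels.assign(10, 0);
    for (int i = 0; i < 10; ++i) {
        inputs[i] = 100.0 * std::rand() / RAND_MAX;
        labels[i] = (std::fabs(inputs[i] - c0) <= std::fabs(inputs[i] - c1)) ? 0 : 1;
    }
}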
The output files were created in a fixed format so that FANN could read the inputs and outputs from a text file. The format selected was as follows: 1) the first row contains the total number of samples; 2) this is followed by the total number of inputs; 3) the total number of outputs appears at the end of that line. The remaining lines of the text file contain the inputs followed by the corresponding output results. The same file format was used for both training and testing.
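The sketch below writes data in FANN's documented training-file layout, in which the header line gives the number of samples, inputs and outputs, and each sample then contributes one line of inputs followed by one line of outputs. The file path and helper name are illustrative.

#include <cstdio>
#include <vector>

// Minimal sketch of writing a FANN-readable training data file.
// Header: "<num_samples> <num_inputs> <num_outputs>"; then, per sample,
// a line of inputs and a line of outputs.
void writeFannData(const char* path,
                   const std::vector<std::vector<double>>& inputs,
                   const std::vector<std::vector<double>>& outputs)
{
    FILE* f = std::fopen(path, "w");
    if (!f) return;
    std::fprintf(f, "%u %u %u\n",
                 (unsigned)inputs.size(),
                 (unsigned)inputs[0].size(),
                 (unsigned)outputs[0].size());
    for (std::size_t s = 0; s < inputs.size(); ++s) {
        for (std::size_t i = 0; i < inputs[s].size(); ++i)
            std::fprintf(f, "%g ", inputs[s][i]);
        std::fprintf(f, "\n");
        for (std::size_t o = 0; o < outputs[s].size(); ++o)
            std::fprintf(f, "%g ", outputs[s][o]);
        std::fprintf(f, "\n");
    }
    std::fclose(f);
}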
4. Software Neural Acceleration
As mentioned previously, we have used the FANN toolkit to learn neural networks on the fly for our programs. The code below shows the neural network output function being called to generate the output instead of obtaining it from the Sobel program.
void edgeDetection(Mat src, Mat dst, bool NPU, int prev_i, string name)
{
    int gx, gy, sum;
    vector<int> output;
    if (NPU) {
        // Clear the destination image.
        for (int y = 0; y < src.rows; y++) {
            for (int x = 0; x < src.cols; x++) {
                dst.at<uchar>(y, x) = 0;
            }
        }
        int i = prev_i;
        int length = (dst.rows - 2) * (dst.cols - 2);
        // The neural network produces the per-pixel outputs in place of Sobel.
        output = testSobel(i, length);
        int idx = 0;
        for (int y = 1; y < src.rows - 1; y++) {
            for (int x = 1; x < src.cols - 1; x++) {
                if (output[idx++] < 1.0)
                    dst.at<uchar>(y, x) = 255;
                else
                    dst.at<uchar>(y, x) = 0;
            }
        }
        name = name.replace(name.find('.'), 4, "-npu.png");
    }
    else {
        for (int y = 0; y < src.rows; y++)
            for (int x = 0; x < src.cols; x++)
                dst.at<uchar>(y, x) = 0;
        // Conventional Sobel filter: gradient magnitude thresholded at 127.
        for (int y = 1; y < src.rows - 1; y++) {
            for (int x = 1; x < src.cols - 1; x++) {
                gx = xGradient(src, x, y);
                gy = yGradient(src, x, y);
                sum = abs(gx) + abs(gy);
                int output = sum > 127 ? 1 : 0;
                if (output == 1)
                    dst.at<uchar>(y, x) = 255;
                else
                    dst.at<uchar>(y, x) = 0;
            }
        }
        name = name.replace(name.find('.'), 4, "-sob.png");
    }
    imwrite(name, dst);
}
In the case of the Sobel edge detector program, each input image has 154,401 pixels (481 × 321 or 321 × 481). Each pixel value, together with its neighboring pixels, forms one training sample, and the output of this image patch passed through the Sobel filter is the target value. All input and output values are thresholded to a binary image for the edge detection problem; hence our training, validation and testing data are binary for this neural network. Each pixel has 8 neighboring pixels, giving an input layer of 9 nodes, 3 hidden layers with 9 nodes each, and an output layer with a single node. For k-means, 10 input values are generated from the range 0-100 and the output values specify whether each belongs to cluster 0 or cluster 1, so we have 10 inputs and 10 outputs along with 3 hidden layers of 10 nodes each.
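A minimal sketch of setting up and training the Sobel network described above with the FANN C API is shown below. The layer sizes match the 9-9-9-9-1 topology; the training file name, epoch count, report interval and target error are illustrative and are not the exact settings used in our experiments.

#include "fann.h"

// Sketch: build the 9-input / 3x9-hidden / 1-output Sobel network and train
// it on a FANN-format data file. Parameters below are illustrative.
int main()
{
    struct fann *ann = fann_create_standard(5, 9, 9, 9, 9, 1);
    fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output(ann, FANN_SIGMOID_SYMMETRIC);

    fann_train_on_file(ann, "sobel_train.data",
                       500,     /* max epochs */
                       10,      /* epochs between reports */
                       0.01f);  /* desired error */

    fann_save(ann, "sobel.net");  // reload later with fann_create_from_file()
    fann_destroy(ann);
    return 0;
}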
5. FPGA simulation
Implementation of the neural networks on an FPGA (a hardware implementation) is performed to test whether further speedup can be achieved over the software implementation. Neural networks are generally implemented in software, trained and simulated on general-purpose sequential computers, which can emulate a wide range of network models. Software implementations offer flexibility; however, hardware implementations of neural networks provide higher speed and compactness for real-time applications. An FPGA (Field Programmable Gate Array) is used to implement the neural network in order to combine flexibility with speed in a programmable system. A neural network design implemented on an FPGA offers higher speed and smaller size for real-time applications than other implementations. A major advantage is that the programmability of reconfigurable FPGAs yields fast special-purpose hardware for a wide range of applications, and it can also be used to explore new neural network algorithms and problems of a scale that would not be feasible with a conventional processor implementation. Our implementation is written in the Verilog hardware description language.
5.1. Overview
The basic idea is that each neuron takes information as input from other neurons or from an external input; its output is computed as a weighted sum of these inputs passed through a non-linear function. FPGAs consist of three basic blocks: configurable logic blocks, input/output blocks and connection blocks. Logic blocks perform logic functions, and connection blocks connect logic blocks with the input/output blocks. These structures consist of routing channels and programmable switches.
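As a software reference for the per-neuron computation that the FPGA blocks realize, the following C++ sketch computes a weighted sum of the inputs plus a bias and passes it through a non-linear activation; the choice of a sigmoid and the names used are illustrative.

#include <cmath>
#include <vector>

// Reference model of one neuron: multiply-accumulate over the inputs,
// then a non-linear activation (sigmoid here, as an illustrative choice).
double neuronForward(const std::vector<double>& inputs,
                     const std::vector<double>& weights,
                     double bias)
{
    double sum = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += weights[i] * inputs[i];      // weighted sum of inputs
    return 1.0 / (1.0 + std::exp(-sum));    // non-linear function
}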
For this, the training data is first generated in C and saved to a file. The neural network is then implemented in the hardware description language (Verilog) using the Xilinx tools. The inputs are presented at the input nodes and the weights are wired between the different layers, while the output is extracted from the output nodes. The hidden layers are implemented with multiply-and-add logic that is executed in a loop to produce the output. This implementation reads the data from the file generated in C++; in this way the trained data is passed to the FPGA and the neural network is executed in Xilinx. The execution time of this run is recorded and compared with that of the conventional run (the software implementation in C). This shows the speedup obtained for the same neural network.
Figure 2: Neural Network block diagram in Xilinx
5.2. Implementation
Using the FPGA's features, a hardware implementation of fully parallel ANNs is possible. In this architecture the number of multipliers per neuron equals the number of connections to that neuron, and the number of full adders equals the number of connections to the previous layer. Verilog modules were designed for floating-point addition and floating-point multiplication. The inputs from the previous layer enter the layer in parallel and are multiplied serially by their corresponding weights. The multiplication results are stored in the corresponding neuron's area of the network; the per-neuron products are the inputs to the adder. The adder inputs are summed serially, and each sum is passed through a lookup table, whose results are stored for the next layer. This ANN architecture is shown in Figure 2. In this design the number of layers and the number of neurons can be changed easily during the working phase. Our development platform is the Xilinx Spartan-3E FPGA (Xilinx 2007), and the design can then be mapped onto the FPGA. Following is the RTL schematic of the implemented neural network: it consists of the network inputs X1, X2, ..., X10 and a clock, producing the outputs Y1, Y2, ..., Y10.
A test bench in Verilog consists of the same two main parts as a normal design: an entity and an architecture. We simply supply inputs to the design under test and observe its outputs. The architecture of the test bench consists of the design being tested as a component, internal signals for input and output, a port map of the component for the UUT (unit under test), a process to run the clock and, finally, a stimulus process responsible for running the tests written for the design. The stimulus code is then added: first we define the clock and clock period, then we fill in the stimulus process. The total simulated time was 1000 ns. In each cycle a weight is read from the file and fed to the accumulator, and in the next cycle the accumulator output is available.
always @(posedge clk)
begin
    for (stage = 0; stage < 4; stage = stage + 1)
    begin
        for (nod = (N*(stage+1)+1); nod <= ((stage+2)*N); nod = nod + 1)
        begin
            node[nod] = 0; // initialize to zero to clear the previous summation
            for (in = ((N*stage)+1); in <= ((stage+1)*N); in = in + 1)
            begin
                node[nod] = bias[nod] + node[nod] + node[in] * test[testcounter];
            end
            Y1 = node[nod-1];
        end
    end
end
After the test we obtained the results shown in Figure 3, which shows the values of the nodes being updated over time. The simulation and run times were then obtained and compared with those of the network implemented in software (C++).
Figure 3: Timing diagram screenshot using Xilinx development
tool
6. Experiments and Results
We have computed the running time and energy consumption for the software-based version of the neural network and compared them with the running times of the original programs without neural acceleration. Our results indicate a speedup of 10-900% without much loss of accuracy. We used the FANN toolkit in C++ for neural network training and testing. All results are shown in the table below. The Xilinx implementation ran the neural network in 4 microseconds. This is an excellent speedup, and we would like to investigate it further on different programs with larger training and testing data sizes.
7. Discussion
As the results for the software implementation of neural networks, i.e. the Fast Artificial Neural Network (FANN) library, show, the reduction in program running time is clearly apparent. It is most prominent for the k-means algorithm, which is iterative and gains a large speedup from neural processing. The speedup for Sobel edge detection is limited because the per-pixel processing of the entire image involves as many data points as the original Sobel filter. Energy consumption does not show a clear pattern for the Sobel edge detection program and needs to be further investigated with different training and testing set sizes.
Program                            Running time (ms)   Energy consumption (W)   Mean squared error
Sobel - original (40 images)       17547               7.732                    NA
Sobel - transformed (40 images)    16255               8.1316                   0.035139
Sobel - original (80 images)       31567               8.005                    NA
Sobel - transformed (80 images)    26911               4.472                    0.033425
K-means - original                 983                 8.827                    NA
K-means - transformed              180                 2.38                     0.040964
However, as seen in Figure 4 and Figure 5, power consumption and maximum temperature are higher for the original Sobel edge detection code run on the training and testing set of 80 images. The NPU-accelerated k-means program again provides a clear advantage over traditional k-means in power consumption. The accuracy of all the generated images is acceptable, although the target application will have the final say; the mean squared error for each dataset is less than 0.05.
8. Conclusions and Future Work
Neural acceleration can be utilized for many system programs, but its large-scale utility is slowed by the manual effort currently required. Innovative programming-framework support for neural processing, or tight integration with the processor architecture, will be needed for this technique to achieve large-scale acceptance and usability. Future work includes better hardware implementations of the algorithms through smaller and more efficient mappings; FPGAs, ASICs and other hardware accelerators are all potential hosts for further testing of this approach. Another area of future work is the study of other machine learning algorithms alongside neural networks: linear classifiers, principal components analysis and spectral waveform analysis tools have vast potential, especially in electrical engineering and signal processing. Deep neural networks are a natural extension of our current neural processing architecture and should provide significant improvements.
Figure 4: Power consumption for NPU Accelerated Sobel code
9. Acknowledgements
We thank Dr. Bhanu Kapoor for guidance and advice during the development of the project.
Figure 5: Power consumption for original Sobel code
10. References
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2009.
Figure 6: Original image
Figure 7: Edge Detected image using NPU Accelerated Sobel
code
Figure 8: Edge detected image using Sobel filter