Towards Neural Processing for General Purpose Approximate Programs
Prasanna Kothalkar1, Mohid Nabil1, Vidhi Agrawal1, Paridha Saxena1
1Department of Computer and Electrical Engineering, University of Texas at Dallas
pxk122030@utdallas.edu, mxn150230@utdallas.edu, vna150130@utdallas.edu, pxs158430@utdallas.edu
Abstract
Modern processor architectures have focused on increasing processor speeds while reducing power consumption. The challenge of keeping energy consumption feasible as transistor density on microprocessor chips increases has led to a new generation of processors that replace program code with alternate, faster implementations at run time. In this report we present one such architecture, which uses neural networks to mimic regions of program code. We train neural networks on data generated by running the target programs, and then invoke the trained networks at run time in place of the original code to produce the program outputs.
Index Terms: computer architecture, neural processing units,
program acceleration
1. Introduction
Due to the limitations of technology scaling for modern processors, recent focus has moved towards computation specialization to run programs at high speed and minimal energy consumption. Recent work provides acceleration by exploiting error tolerance within an approximate computing framework. We implement Neural Processing Units (NPUs), a new class of configurable accelerators for approximate computation. Many application programs in diverse fields such as image processing, speech recognition, computer vision and signal processing can tolerate errors in computation, and thus provide immense opportunity to replace code regions with alternate computing executions. While traditional processors execute programs via an instruction set architecture and programming logic, NPUs are 'trained' to mimic regions of imperative code. We generate the training data by running the functions targeted for NPU acceleration and recording their inputs and outputs, which are then used to train the neural networks.
2. Neural Networks
Neural networks are brain-inspired machine learning models [1] that use neurons as their basic building block and are well suited to learning concepts in a hierarchical fashion. Similar to the brain, which builds up its understanding of a complex topic from simpler ideas, neural networks learn a concept as a function from data by learning to generate outputs from inputs. The learning is encoded in the weights on the edges that connect neurons in adjacent layers, and proceeds by reducing the output error through feedback, realized by the back-propagation algorithm.
Neural networks are a popular class of machine learning models and can in theory approximate any given function using a single hidden layer with a sufficiently large number of nodes. In practice such width is unattainable, so one way to represent richer functions is to add more layers. However, this vertical addition of layers faces the problem of vanishing gradients, where the gradient signal in the backpropagation algorithm loses information as it travels through many layers. A new class of algorithms based on layer-wise pre-training was developed, leading to a resurgence in the neural network literature and to a variety of new learning models. The basic idea is to initialize the neural network edge weights in a more informed manner, so that training is less likely to end up in poor local minima. Such models became known as 'Deep Neural Networks' and have led to impressive results on challenging problems in natural language processing, speech processing and computer vision. We plan to use Deep Belief Networks and Convolutional Neural Networks for our Sobel edge detection NPU in future work.
Figure 1: Neural Networks
2.1. Weight Update Rule
Neural network backpropagation updates the weights on neural network edges using the following update rule.
w_{ji} = w_{ji} + \Delta w_{ji}    (1)

where $w_{ji}$ represents the edge weight between node $j$ and node $i$, and $\Delta w_{ji}$ is the change in weight performed by the backpropagation step.

\Delta w_{ji} = \alpha (t_j - y_j) x_i    (2)

where $\alpha$ is the learning rate of the training algorithm, $t_j$ and $y_j$ are the target and output values for the $j$th output node, and $x_i$ is the input value at the $i$th input node.
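As a concrete illustration, the following minimal C++ sketch applies this delta rule to the weight vector of a single output neuron; the function and variable names are ours, chosen for illustration, and are not part of the FANN implementation.

#include <vector>

// Minimal sketch of the delta-rule update (Equations 1 and 2) for one
// output neuron j. 'weights[i]' plays the role of w_ji; names are illustrative.
void updateWeights(std::vector<double>& weights,
                   const std::vector<double>& inputs,  // x_i
                   double target,                      // t_j
                   double output,                      // y_j
                   double alpha)                       // learning rate
{
    for (std::size_t i = 0; i < weights.size(); ++i) {
        double delta = alpha * (target - output) * inputs[i];  // Equation (2)
        weights[i] += delta;                                    // Equation (1)
    }
}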
3. Data generation
Data extraction was carried out for training and implementation of the neural networks. For the k-means dataset, we used random inputs to our code and extracted the generated outputs. For the Sobel edge detection program, input images were provided and the Sobel edge detection computation was learned by the neural network, which was then fine-tuned on a separate validation set of images and tested on another set of images.

Dataset for Sobel edge detection. Sample size: 200 training, 200 validation, 100 testing. Input: features extracted for each pixel, so that each sample consists of an individual pixel value and its neighbors. Output: the Sobel edge detection sum value.

Dataset for k-means. Sample size: 850 training, 850 validation, 850 test. Input: a randomly generated vector of size 10. Output: a vector of size 10 giving the cluster membership of each of the 10 input values.
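For illustration, the following C++ sketch generates one such k-means training pair: ten random values in the range 0-100 and, for each value, a cluster label. In our experiments the labels were produced by running the actual k-means program; the two fixed cluster centers used below are purely an assumption for the sketch.

#include <cstdlib>
#include <cmath>
#include <vector>

// Sketch of generating one k-means training sample: 10 random inputs in
// [0, 100] and a 10-element label vector (cluster 0 or 1 per value). The
// fixed centers c0 and c1 are hypothetical; the real labels came from our
// k-means code.
void makeKmeansSample(std::vector<double>& inputs, std::vector<int>& labels)
{
    const double c0 = 25.0, c1 = 75.0;  // hypothetical cluster centers
    inputs.assign(10, 0.0);
    labels.assign(10, 0);
    for (int i = 0; i < 10; ++i) {
        inputs[i] = 100.0 * std::rand() / RAND_MAX;
        labels[i] = (std::fabs(inputs[i] - c0) <= std::fabs(inputs[i] - c1)) ? 0 : 1;
    }
}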
The output files were created in a fixed format so that FANN could read the inputs and outputs from a text file. The format selected was as follows: 1) the first row contains the total number of samples; 2) this is followed by the total number of inputs; 3) the total number of outputs appears at the end of that line. The remaining lines of the text file contain the inputs followed by the corresponding output results. The same file format was used for both training and testing.
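The sketch below writes data in FANN's documented training-file layout, in which the header line gives the number of samples, inputs and outputs, and each sample then contributes one line of inputs followed by one line of outputs. The file path and helper name are illustrative.

#include <cstdio>
#include <vector>

// Minimal sketch of writing a FANN-readable training data file.
// Header: "<num_samples> <num_inputs> <num_outputs>"; then, per sample,
// a line of inputs and a line of outputs.
void writeFannData(const char* path,
                   const std::vector<std::vector<double>>& inputs,
                   const std::vector<std::vector<double>>& outputs)
{
    FILE* f = std::fopen(path, "w");
    if (!f) return;
    std::fprintf(f, "%u %u %u\n",
                 (unsigned)inputs.size(),
                 (unsigned)inputs[0].size(),
                 (unsigned)outputs[0].size());
    for (std::size_t s = 0; s < inputs.size(); ++s) {
        for (std::size_t i = 0; i < inputs[s].size(); ++i)
            std::fprintf(f, "%g ", inputs[s][i]);
        std::fprintf(f, "\n");
        for (std::size_t o = 0; o < outputs[s].size(); ++o)
            std::fprintf(f, "%g ", outputs[s][o]);
        std::fprintf(f, "\n");
    }
    std::fclose(f);
}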
4. Software Neural Acceleration
As mentioned previously, we have used the FANN toolkit to learn neural networks on the fly for our programs. The code below shows the neural network output function being called to generate the output instead of obtaining it from the Sobel program.
void edgeDetection(Mat src, Mat dst, bool NPU, int prev_i, string name)
{
    int gx, gy, sum;
    vector<int> output;
    if (NPU) {
        // Clear the destination image.
        for (int y = 0; y < src.rows; y++) {
            for (int x = 0; x < src.cols; x++) {
                dst.at<uchar>(y, x) = 0;
            }
        }
        int i = prev_i;
        int length = (dst.rows - 2) * (dst.cols - 2);
        // The neural network produces the per-pixel outputs in place of Sobel.
        output = testSobel(i, length);
        int idx = 0;
        for (int y = 1; y < src.rows - 1; y++) {
            for (int x = 1; x < src.cols - 1; x++) {
                if (output[idx++] < 1.0)
                    dst.at<uchar>(y, x) = 255;
                else
                    dst.at<uchar>(y, x) = 0;
            }
        }
        name = name.replace(name.find('.'), 4, "-npu.png");
    }
    else {
        for (int y = 0; y < src.rows; y++)
            for (int x = 0; x < src.cols; x++)
                dst.at<uchar>(y, x) = 0;
        // Conventional Sobel filter: gradient magnitude thresholded at 127.
        for (int y = 1; y < src.rows - 1; y++) {
            for (int x = 1; x < src.cols - 1; x++) {
                gx = xGradient(src, x, y);
                gy = yGradient(src, x, y);
                sum = abs(gx) + abs(gy);
                int output = sum > 127 ? 1 : 0;
                if (output == 1)
                    dst.at<uchar>(y, x) = 255;
                else
                    dst.at<uchar>(y, x) = 0;
            }
        }
        name = name.replace(name.find('.'), 4, "-sob.png");
    }
    imwrite(name, dst);
}
In the case of the Sobel edge detector program, each input image has 154,401 pixels (481 × 321 or 321 × 481). Each pixel value, together with its neighboring pixels, forms one training sample, and the output of this image patch passed through the Sobel filter is the target value. All input and output values are thresholded to a binary image for the edge detection problem; hence our training, validation and testing data are binary for this neural network. Each pixel has 8 neighboring pixels, giving an input layer of 9 nodes, 3 hidden layers with 9 nodes each, and an output layer with a single node. For k-means, 10 input values are generated from the range 0-100 and the output values specify whether each belongs to cluster 0 or cluster 1, so we have 10 inputs and 10 outputs along with 3 hidden layers of 10 nodes each.
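A minimal sketch of setting up and training the Sobel network described above with the FANN C API is shown below. The layer sizes match the 9-9-9-9-1 topology; the training file name, epoch count, report interval and target error are illustrative and are not the exact settings used in our experiments.

#include "fann.h"

// Sketch: build the 9-input / 3x9-hidden / 1-output Sobel network and train
// it on a FANN-format data file. Parameters below are illustrative.
int main()
{
    struct fann *ann = fann_create_standard(5, 9, 9, 9, 9, 1);
    fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output(ann, FANN_SIGMOID_SYMMETRIC);

    fann_train_on_file(ann, "sobel_train.data",
                       500,     /* max epochs */
                       10,      /* epochs between reports */
                       0.01f);  /* desired error */

    fann_save(ann, "sobel.net");  // reload later with fann_create_from_file()
    fann_destroy(ann);
    return 0;
}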
5. FPGA simulation
Implementation of the neural networks on an FPGA (a hardware implementation) is performed to test whether further speedup can be achieved over the software implementation. Neural networks are generally implemented in software, trained and simulated on general-purpose sequential computers, which can emulate a wide range of network models. Software implementations offer flexibility; however, hardware implementations of neural networks provide higher speed and compactness for real-time applications. An FPGA (Field Programmable Gate Array) is used to implement the neural network in order to combine flexibility with speed in a programmable system. A neural network design implemented on an FPGA offers higher speed and smaller size for real-time applications than other implementations. A major advantage is that the programmability of reconfigurable FPGAs yields fast special-purpose hardware for a wide range of applications, and it can also be used to explore new neural network algorithms and problems of a scale that would not be feasible with a conventional processor implementation. Our implementation is written in the Verilog hardware description language.
5.1. Overview
The basic idea is that each neuron takes information as input from other neurons or from an external input; its output is computed as a weighted sum of these inputs passed through a non-linear function. FPGAs consist of three basic blocks: configurable logic blocks, input/output blocks and connection blocks. Logic blocks perform logic functions, and connection blocks connect logic blocks with the input/output blocks. These structures consist of routing channels and programmable switches.
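As a software reference for the per-neuron computation that the FPGA blocks realize, the following C++ sketch computes a weighted sum of the inputs plus a bias and passes it through a non-linear activation; the choice of a sigmoid and the names used are illustrative.

#include <cmath>
#include <vector>

// Reference model of one neuron: multiply-accumulate over the inputs,
// then a non-linear activation (sigmoid here, as an illustrative choice).
double neuronForward(const std::vector<double>& inputs,
                     const std::vector<double>& weights,
                     double bias)
{
    double sum = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += weights[i] * inputs[i];      // weighted sum of inputs
    return 1.0 / (1.0 + std::exp(-sum));    // non-linear function
}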
For this, the training data is first generated in C and saved to a file. The neural network is then implemented in the hardware description language (Verilog) using the Xilinx tools. The inputs are presented at the input nodes and the weights are wired between the different layers, while the output is extracted from the output nodes. The hidden layers are implemented with multiply-and-add logic that is executed in a loop to produce the output. This implementation reads the data from the file generated in C++; in this way the trained data is passed to the FPGA and the neural network is executed in Xilinx. The execution time of this run is recorded and compared with that of the conventional run (the software implementation in C). This shows the speedup obtained for the same neural network.
Figure 2: Neural Network block diagram in Xilinx
5.2. Implementation
Using the FPGA's features, a hardware implementation of fully parallel ANNs is possible. In this architecture the number of multipliers per neuron equals the number of connections to that neuron, and the number of full adders equals the number of connections to the previous layer. Verilog modules were designed for floating-point addition and floating-point multiplication. The inputs from the previous layer enter the layer in parallel and are multiplied serially by their corresponding weights. The multiplication results are stored in the corresponding neuron's area of the network; the per-neuron products are the inputs to the adder. The adder inputs are summed serially, and each sum is passed through a lookup table, whose results are stored for the next layer. This ANN architecture is shown in Figure 2. In this design the number of layers and the number of neurons can be changed easily during the working phase. Our development platform is the Xilinx Spartan-3E FPGA (Xilinx 2007), and the design can then be mapped onto the FPGA. Following is the RTL schematic of the implemented neural network: it consists of the network inputs X1, X2, ..., X10 and a clock, producing the outputs Y1, Y2, ..., Y10.
A test bench in Verilog consists of the same two main parts as a normal design: an entity and an architecture. We simply supply inputs to the design under test and observe its outputs. The architecture of the test bench consists of the design being tested as a component, internal signals for input and output, a port map of the component for the UUT (unit under test), a process to run the clock and, finally, a stimulus process responsible for running the tests written for the design. The stimulus code is then added: first we define the clock and clock period, then we fill in the stimulus process. The total simulated time was 1000 ns. In each cycle a weight is read from the file and fed to the accumulator, and in the next cycle the accumulator output is available.
always @(posedge clk)
begin
    for (stage = 0; stage < 4; stage = stage + 1)
    begin
        for (nod = (N*(stage+1)+1); nod <= ((stage+2)*N); nod = nod + 1)
        begin
            node[nod] = 0; // initialize to zero to clear the previous summation
            for (in = ((N*stage)+1); in <= ((stage+1)*N); in = in + 1)
            begin
                node[nod] = bias[nod] + node[nod] + node[in] * test[testcounter];
            end
            Y1 = node[nod-1];
        end
    end
end
After the test we obtained the results shown in Figure 3, which shows the values of the nodes being updated over time. The simulation and run times were then obtained and compared with those of the network implemented in software (C++).
Figure 3: Timing diagram screenshot using Xilinx development
tool
6. Experiments and Results
We have computed the running time and energy consumption for the software-based version of the neural network and compared them with the running times of the original programs without neural acceleration. Our results indicate a speedup of 10-900% without much loss of accuracy. We used the FANN toolkit in C++ for neural network training and testing. All results are shown in the table below. The Xilinx implementation ran the neural network in 4 microseconds. This is an excellent speedup, and we would like to investigate it further on different programs with larger training and testing data sizes.
7. Discussion
As the results for the software implementation of neural networks, i.e. the Fast Artificial Neural Network (FANN) library, show, the reduction in program running time is clearly apparent. It is most prominent for the k-means algorithm, which is iterative and gains a large speedup from neural processing. The speedup for Sobel edge detection is limited because the per-pixel processing of the entire image involves as many data points as the original Sobel filter. Energy consumption does not show a clear pattern for the Sobel edge detection program and needs to be further investigated with different training and testing set sizes.
Program                            Running time (ms)   Energy consumption (W)   Mean squared error
Sobel - original (40 images)       17547               7.732                    NA
Sobel - transformed (40 images)    16255               8.1316                   0.035139
Sobel - original (80 images)       31567               8.005                    NA
Sobel - transformed (80 images)    26911               4.472                    0.033425
K-means - original                 983                 8.827                    NA
K-means - transformed              180                 2.38                     0.040964
However, as seen in Figure 4 and Figure 5, power consumption and maximum temperature are higher for the original Sobel edge detection code run on the training and testing set of 80 images. The NPU-accelerated k-means program again provides a clear advantage over traditional k-means in power consumption. The accuracy of all the generated images is acceptable, although the target application will have the final say; the mean squared error for each dataset is less than 0.05.
8. Conclusions and Future Work
Neural acceleration can be utilized for many system programs, but its large-scale utility is slowed by the manual effort currently required. Innovative programming-framework support for neural processing, or tight integration with the processor architecture, will be needed for this technique to achieve large-scale acceptance and usability. Future work includes better hardware implementations of the algorithms through smaller and more efficient mappings; FPGAs, ASICs and other hardware accelerators are all potential hosts for further testing of this approach. Another area of future work is the study of other machine learning algorithms alongside neural networks: linear classifiers, principal components analysis and spectral waveform analysis tools have vast potential, especially in electrical engineering and signal processing. Deep neural networks are a natural extension of our current neural processing architecture and should provide significant improvements.
Figure 4: Power consumption for NPU Accelerated Sobel code
9. Acknowledgements
We thank Dr. Bhanu Kapoor for guidance and advice during the development of the project.
Figure 5: Power consumption for original Sobel code
10. References
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2009.
Figure 6: Original image
Figure 7: Edge Detected image using NPU Accelerated Sobel
code
Figure 8: Edge detected image using Sobel filter