SlideShare a Scribd company logo
ISSN (e): 2250 – 3005 || Volume, 06 || Issue, 04||April – 2016 ||
International Journal of Computational Engineering Research (IJCER)
www.ijceronline.com Open Access Journal Page 1
Matlab Based High Level Synthesis Engine for Area And Power
Efficient Arithmetic Operations
Semih Aslan
Ingram School of Engineering Texas State University San Marcos, Texas, 78666, USA
I. Introduction
Today, a significant number of embedded systems are focused on multimedia applications with almost insatiable
demand for low cost, high performance and low power hardware. Designing complex systems such as image
and video processing, compression, face recognition, object tracking, 3G or 4G modems, multi-standard
CODECs, and HD decoding schemes requires the integration of many complex blocks and a long verification
process [1][2]. These complex designs are based on I/O peripherals, one or more processors, bus interfaces,
A/D, D/A, embedded software, memories and sensors. In the past, complete systems were designed with
multiple chips and connected together on PCBs, but with today’s technology, all functions can be incorporated
in a single chip. These complete systems are known as System-on-Chip (SoC) [2].System-on-chip (SoC)
designs are mainly accomplished by using Register Transfer Languages (RTL) such as Verilog and VHDL. RTL
design flow [1] [2] for both FPGA and ASIC is similar and is shown in Figure 1. An algorithm can be converted
to RTL using the behavioral model description method or by using pre-defined IP core blocks. After completing
this RTL code, formal verification must be done before implementation. After implementation of the RTL code,
timing verification needs to be done for proper operation.
RTL design abstracts logic structures, timing and registers [1]. Because of this, every clock change
causes a state change in the design. This timing dependency causes every event to be simulated, which results in
a slower simulation time and longer verification period of the design. The design and verification of an
algorithm in RTL in Figure 1 can take up 50-60% of the “Time to Market” (TTM). The RTL design becomes
impractical for larger systems that have high data flow between the blocks, and it requires millions of gates.
Even though design time may improve by using behavioral modeling and IP cores, the difficulty in synthesis,
poor performance results and rapid changes in the design make IP cores difficult to adapt and change. Therefore,
systems rapidly become obsolete.The limitations of RTL and longer TTM forced designers to think of the
design as a whole system rather than blocks. In addition, software integration in SoC was always done after the
hardware was designed. When the system gets more complex, software integration is desirable during hardware
Abstract
Embedded systems used in real-time applications require low power, less area and a high
computation speed. For digital signal processing (DSP), image processing and communication
applications, data are often received at a continuously high rate. Embedded processors have to
cope with this high data rate and process the incoming data based on specific application
requirements. Even though there are many different application domains, they all require
arithmetic operations that quickly compute the desired values using a larger range of operation,
reconfigurable behavior, low power and high precision. The type of necessary arithmetic
operations may vary greatly among different applications. The RTL-based design and verification
of one or more of these functions may be time-consuming. Some High Level Synthesis tools reduce
this design and verification time but may not be optimal or suitable for low power applications.
The developed MATLAB-based Arithmetic Engine improves design time and reduces the
verification process, but the key point is to use a unified design that combines some of the basic
operations with more complex operations to reduce area and power consumption. The results
indicate that using the Arithmetic Engine from a simple design to more complex systems can
improve design time by reducing the verification time by up to 62%. The MATLAB-based
Arithmetic Engine generates structural RTL code, a testbench, and gives the designers more
control. The MATLAB-based design and verification engine uses optimized algorithms for better
accuracy at a better throughput.
Keywords: FPGA, High Level Synthesis, MATLAB, Optimized Hardware, Power Efficient, RTL.
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 2
implementation. Over the last two decades, designers were forced to find new methods to replace RTL due to
improvements in SoC and shorter TTM. Because of the extensive work done in Electronics System Level
Design (ESLD), HW/SW co-design of a system and High Level Synthesis (HLS)[2][4] are integrated into
FPGA and ASIC design flow.
Algorithm
RTL Timing
Translate
Map
Place&Route
IMPLEMENT
Bit
File
Formal Proof
Logic
Synthesis
Figure 1. FPGA RTL level synthesis flow
The next section will describe the proposed MATLAB HLS Arithmetic (MHA) Engine design and
implementation. Section III and IV will focus on the error analysis and testbench generation respectively and the
conclusion will describe future work and improvements.
II. Mha Engine
RTL description of a system can be implemented from a behavioral description of the system in Perl,
C, Python and MATLAB. This will result in a faster verification process and shorter TTM. It is also possible to
have a hybrid design where RTL blocks can be integrated with HLS [2].The HLS design flow shows that a
group of algorithms that represent the whole system or parts of a system can be implemented using a high level
language such as Perl, C, C++ , Java, MATLAB [2][5]. Each part in the system can be tested independently
before the whole system is tested. During this testing process, the RTL testbenches may also be generated. After
testing is complete, the system can be partitioned into HW and SW. This enables SW designers to join the
design process during HW design; in addition, RTL can be tested by using both HW/SW together. After the
verification process, the design can be implemented using FPGA synthesis tools.The integration of HLS into
FPGA design flow is shown in Figure 2.
MATLAB
Algorithm
High Level
Synthesis
(HLS)
Timing
Translate
Map
Place&Route
IMPLEMENT
Bit
File
Formal Proof
RTL
MHA Engine
Figure 2. FPGA high level synthesis flow with MHA Engine
Since the early days of VLSI design, application-specific hardware has been used for optimal
implementation of algorithms. This approach is considered the fastest design scheme but is also the most area
consuming system due to the inherently redundant nature of a design that only computes one operation.
However, there are other possible designs for DSP implementations that can be used for two or more operations
[1][3]. This design approach consists of processing blocks that can compute multiple operations using dedicated
hardware designed for a particular cluster of operations. An improved design approach should exploit the
redundancy and common elements that exist among the sub-blocks. This would result in shared building blocks
and dramatically reduced hardware requirements.
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 3
The proposed design focuses on designing a large system that will be faster with a design principle
similar to HLS, asexplained above. The main work focuses on multi-purpose, reused hardware structures that
produce a unified, area efficient reconfigurable system. This design can reduce the area by 64% [2][[6]. The
components of the MHA Engineare shown in Figure 3.
Arithmetic Operation
. Addition/Subtraction
. Multiplication
. Division
. Square Root
. Inverse Square Root
. Sin(θ), Cos(θ), Tan(θ), Cot(θ)
. Sinh(θ), Cosh(θ), Tanh(θ)
. ex
ControlI/O
MHE Engine
Figure 3. MATLAB Based HLS Arithmetic Engine (MHA) block
The MHA block has three important principles:
 Compute required arithmetic operations
 Customized range and accuracy
 Generate an area-efficient, fast system for low power applications
The MHA accepts inputs from the user via two GUIs to make it more user friendly and efficient. The
“Main” GUI that is shown in Figure 4 below includes the following sections:
 FPGA or ASIC support
 Vendor based IP Core support
 Project Name (default is c:MHAMHA)
 Top Module Name (default is MHA)
 Language – Verilog or VHDL (design and verifications – current system only supports Verilog HDL)
 Rounding - Truncation or RNE (Rounding cannot be done without a selection)
 Number system – Fixed or Floating Point (Fixed point up to 64-bit - current system only supports Fixed
Point Number system)
Figure 4. Main GUI for MHA Engine
 Signed or unsigned number systems
 Target – Frequency and throughput
 Area or speed based optimization
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 4
 Testbench generation
o Automated testbench with MATLAB
o Modelsim .do file for fast automation
o Automated testbench file for Modelsim
o Error comparison with MATLAB
o User defined test data option
The “Arithmetic” tab shown in Figure 5 has the following sections:
 Basic arithmetic operations
o Addition/Subtraction (Area or speed optimized based on Ripple-Carry Adder (RCA) or Carry-Lookahead
Adder (CLA))
o Multiplication (Array or Booth multiplier)
 Advanced arithmetic operations
o Division (Newton-Raphson, Goldschmidt, or CORDIC)
o Square Root (Newton-Raphson, Goldschmidt, or CORDIC)
o Inverse Square Root (Newton-Raphson, Goldschmidt, or CORDIC)
 Elementary functions
o Trigonometric functions – sine, cosine, tangent, and cotangent (table method, CORDIC or polynomial
based design)
o Hyperbolic functions – sinh, cosh, tanh (table method, CORDIC or polynomial based design)
o Exponential function – exp(x) (table method, CORDIC or polynomial based design)
Figure 5. Arithmetic operations GUI
The MHA uses a bottom-up design process that starts with the elementary functions and then moves to
the simplest arithmetic operations such as multiplication and addition. This is shown in Figure 6 below.
This design flow contains the following procedure: First, selection of elementary functions [7], selection of
basic arithmetic operations, and generation of area efficient hardware for FPGA and VLSI. There are 2-64 bit
selections that are suitable for a vast variety of applications with the requested precision. The section’s addition
and multiplications are used based on the previous designs. Division, inverse square root and square roots are
designed based on the same architecture, and the modified design reduces the area by 64% [8]. Next, the
CORDIC [9] [10] or polynomial methods [4] are used to calculate elementary functions [7] [11]. This area-
efficient design is optimized for speed by implementing a smart control system. For performance evaluation and
synthesis are implemented with Xilinx FPGAs [12] and Microwind [13] VLSI design tool.
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 5
Arithmetic Block
Input
Parameters
Elementary
Functions
Arithmetic
HDL
Testbench
Precision
Figure 6. The MHA design flow
III. Error Analysis
When designing hardware with many arithmetic operations, one of the most important objectives is to
produce results that have a minimal absolute and average computation error. Arithmetic operations in digital
systems generally introduce three types of errors: number representation, rounding, and algorithmic or design
error [14][15].The exact representation of some numbers or events in radix-n may not be possible due to the
limitations in ADCs, the sampling rate, and the number of available bits. In addition, many numbers cannot be
converted from radix-n to radix-m without an error. For example, number 0.1 and 0.2 in radix-10 cannot be
represented in radix-2 without an error. This error can be reduced by increasing the number of bits. The
reduction of this error with respect to the number of bits is shown in Figure 7.
Figure 7. Error representation of number radix conversion
Figure 8 shows the error generated for radix-10 to radix-2 conversion of 128 fractional numbers (fractional part
of 30 bits).
Figure8. Random 128-number conversion error from Radix-10 to Radix-2
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 6
During and after the calculation of certain arithmetic operations, the total number of bits may exceed
the number of bits available; these values need to be rounded. For example, multiplying two n-bit numbers
produces a product of 2n-bit and this result may need to be represented with n-bit.If there were more
multiplications on the design path, the number of bits would increase in a linear fashion. To prevent this, each
multiplier output needs to be rounded. There are a few ways to implement rounding in hardware, with the most
commonly used methods being round to the nearest even, round towards zero (truncation), round down (floor),
round up (ceiling) and round away from zero [14].
In this section, truncation (TRA)[14] and round to the nearest even (RNE)[14] schemes are compared. During
these rounding operations, an error value is introduced. The RNE and TRA and their error values are shown in
Table 1.
Table 1. RNE and TRA
Number RNE TRA
Rounded Value Error Rounded Value Error
X0.00 X0. 0.00 X0. 0.00
X0.01 X0. 0.25 X0. 0.25
X0.10 X0. 0.50 X0. 0.50
X0.11 X0.+ulp -0.25 X0. 0.75
X1.00 X1. 0.00 X1. 0.00
X1.01 X1. 0.25 X1. 0.25
X1.10 X1.+ulp -0.50 X1. 0.50
X1.11 X1.+ulp -0.25 X1. 0.75
Total --- 0 --- 3.00
Figure 9 shows advantage of RNE over TRA when average error is considered. The requested number of
precision and selected rounding scheme can affect the size of the hardware and overall speed and throughput.
Users can change the selected accuracy to see area and throughput estimates simultaneously without increasing
the overall design time. This will make it possible to select the optimal design for synthesis. The MHA Engine
can create hardware for precision based on increase the bit size during the mid-operation and apply rounding
before the output stage.
Figure 9. 32-Bit to 16-Bit Rounding Errors with RNE and TRA
Iv. Testbench Generation
One of the most important and complicated sections of the MHAEngine is generation of the testbench
files and error checking using MATLAB and Modelsim. Before the RTL code is synthesized, it can be tested
using a testbench that is created using MATLAB and Modelsim. The testbench generation and error checking
block diagram is shown in Figure 10 below.
MODELSIM
USER
CONSTRAINTS
MATLAB
VERILOG
TESTBENCH
RTL
DUT
RESULTS
(text file)
Verification and
Error Files
Figure 10. MATLAB and Modelsim flow
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 7
After generation of the design and testbench files,the user defined test vectors need to be generated. The
first step is to generate positive random numbers [0,1]. These numbers must be converted into positive and
negative numbers based on signed numbers. To generate random test values and test results, the following
procedure is followed:
 Get the user defined testvector number n.
 Generate random test vectors (T{n})
 Generate random binary numbers using MATLAB
o For fixed-point signed numbers:
T_Bin=dec2bin(T*2^n,n)
o For fixed-point unsigned numbers:
T_Bin=dec2bin(T*2^(n-i),n)
 Generate the Modelsim testbench file and get results
 Compare results using MATLAB
After generation of the design, testbench, and test vector files, the next step is to generate a Modelsim
tcl .do file that can be transferred into Modelsim all together. The .do file will generate the project file and will
import all design files, including testbench, into Modelsim. This will run all files and generate the results as a
text file. Once Modelsim-generated results are imported into MATLAB, correct operation and error analysis
needs to be performed. An important issue which needs to be addressed during the verification process is
working with negative fixed-point numbers in MATLAB. It is important because it does not convert negative
binary numbers and binary floating numbers to a decimal number. This problem is addressed using the
MATLAB codes given in Figure 11.
Figure 11. Signed binary to decimal conversion
V. Conclusion
An area efficient, MATLAB based HLS engine for arithmeticoperations is designed for low power and
high-speed applications. The MHA Engine decreases design system time and verification by up to 64% without
compromising speed and efficiency. The MHA Engine uses a smart control system that is optimized based on
the desired operations. The MHA Engine is a bridge between RTL and HLS. It uses RTL-based basic blocks to
design most complicated arithmetic operations using structural model design and HLS-style fast and optimized
verification. Any designed system can be reconfigured at any time in any way in MHA Engine without going
through the same design and verification hassle.
MATLAB-based verification makes it possible to use all the features of MATLAB for faster and more efficient
verification. The MHA Engine can be easily reconfigurable to systems available at any level, due to changes in
the computer system and software.
As explained above, this system generates area efficient fast arithmetic and elementary functionsthat
can be used over a wide area of applications in DSP, image processing, and communication systems. It can be
used for FFT, DCT and DWT calculations and Chirplet transforms [16][17][18].
It can also be very important for educational institutions in order to test their systems using the verification
testbench that, when desired, works as an independent design tool. Overall, this work will forge the way for
those who need to make sudden changes in their systems and need fast verification. They can adopt and apply
any changes using the MHA Engine or generate similar systems for faster design and verification. In addition,
because the MHA Engine generated code is designed structurally, code can be changed easily at any level, if so
desired. Future workwill integrate VHDL code and IEEE 754 floating point numbers (both single and double
precision) implementation. Another future goal is to make this platform totally open source by using only
Iverilog and replacing MATLAB with Octave.
Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations
www.ijceronline.com Open Access Journal Page 8
References
[1]. Hendry, D.C., and A.A. Duncan. "Area Efficient DSP Datapath Synthesis." Design Automation Conference (1995): 130-135.
[2]. Aslan, S., Oruklu, E., and Saniie J., “A high-level synthesis and verification tool for fixed to floating point conversion”, IEEE
International Midwest Symposium on Circuits and Systems, 2012, Pages, 908-911.
[3]. Andrieux, J., M. Feix, G. Mourgues, P. Bertrand, B. Izrar, and V. Nguyen. "Optimum Smoothing of the Wigner Ville Distribution."
IEEE Transactions on Acoustics, Speech, and Signal Processing 36.5(1987): 764-769.
[4]. Kilts, S. Advanced FPGA Design Architecture, Implementation, and Optimizations. New York: Wiley Inter-Science, 2007.
[5]. Chen, W. The VLSI Handbook. Boca Raton: CRC Publisher, 2007.
[6]. Dehon, A., and S. Hauck. Reconfigurable Computing The Theory and Practice of FPGA-Based Computing. Burlington,
Massachusetts: Elsevier, 2008.
[7]. Walther, J.S. "A unified Algortihm for Elementary Functions." American Federation of Information Processing Societies Joint
Computer Conferences (1971): 379-385.
[8]. Oruklu, E., J. Saniie, and S. Aslan. "Realization of Area Efficient QR Factorization using Unified Diivision, Square Root and
Inverse Sqaure Root Hardware." IEEE Electro/Information Technology (2009): 245-250.
[9] Volder, J. "The CORDIC Trigonometric Computing Technique." IEEE Transactions Electronic Computers 8.3 (1959): 330-334.
[9]. 10] Striling, W. C., and T. K. Moon. Mathematical Methods and algorithms for Signal Processing. New Jersey: Prentice
Hall, 2000.
[10]. Xilinx. (2016), http://www.xilinx.com/
[11]. Microwind (2106) http://microwind.net/
[12]. Stine, J. E. "Digital Computer Arithmetic Datapath Design using Verilog HDL." Digital Computer Arithmetic Datapath Design
using Verilog HDL. Norwell, Massachusetts: Kluwer Academic Publishing, 2004.
[13]. Teukolsky, S. A., W. T. Vetterling, B. P. Flannery, and W. H. Press. "Numerical Recipes: The Art of Scientific Computing."
Numerical Recipes: The Art of Scientific Computing, 3rd ed. New York, New York: Cambridge University Press, 2007.
[14]. Omar, J., E. E. Swartzlander Jr., and M. J. Schulte. "Optimal Initial Approximations for the Newton-Raphson Division Algorithm."
Springer-Verlag Journal of Computing 53.3-4 (1994): 233-242.
[15]. Seidel, P-M, W. E. Ferguson, and G. Even. "A Parametric Error Analysis of Goldschmidt’s Division Algorithm." Journal of
Computer and System Sciences 70.1 (2005): 118-139.
[16]. Chunduri, K. C. "Implementation of Adaptive Filter Structures on a Fixed Point Signal Processor for Acoustical Noise Reduction"
(2006).

More Related Content

PDF
Implementation of Radix-4 Booth Multiplier by VHDL
PDF
Proposal for google summe of code 2016
PDF
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
PDF
Focus - GSM UMTS LTE Performance and Configuration Management Solution
PDF
Ethercat twincat e
PPT
Background And An Architecture Example
PPT
program flow mechanisms, advanced computer architecture
PPT
advanced computer architesture-conditions of parallelism
Implementation of Radix-4 Booth Multiplier by VHDL
Proposal for google summe of code 2016
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
Focus - GSM UMTS LTE Performance and Configuration Management Solution
Ethercat twincat e
Background And An Architecture Example
program flow mechanisms, advanced computer architecture
advanced computer architesture-conditions of parallelism

What's hot (20)

PPTX
FT Architecture For Cloud Service Computing
PDF
Performance Evaluation of FPGA Based Runtime Dynamic Partial Reconfiguration ...
PDF
[Capella Day 2019] Model execution and system simulation in Capella
PPTX
참여기관_발표자료-국민대학교 201301 정기회의
PPTX
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
PDF
An Approach to Overcome Modeling Inaccuracies for Performance Simulation Sig...
PDF
Altera up1
PPT
3D-DRESD AC
PDF
Capella annual meeting 2021
PDF
Overcoming challenges of_verifying complex mixed signal designs
PPSX
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
PPTX
Instruction level parallelism
PPT
Evaluation of morden computer & system attributes in ACA
PPTX
Provisioning Bandwidth & Logical Circuits Using Telecom-Based GIS .
PDF
Qualifying a high performance memory subsysten for Functional Safety
PPT
UIC Thesis Cancare
PPT
Network Planning & Design: An Art or a Science?
PPT
A Survey of Recent Advances in Network Planning/Traffic Engineering (TE) Tools
PDF
Pipelining and ILP (Instruction Level Parallelism)
PDF
Teklabz schematics generator
FT Architecture For Cloud Service Computing
Performance Evaluation of FPGA Based Runtime Dynamic Partial Reconfiguration ...
[Capella Day 2019] Model execution and system simulation in Capella
참여기관_발표자료-국민대학교 201301 정기회의
Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generat...
An Approach to Overcome Modeling Inaccuracies for Performance Simulation Sig...
Altera up1
3D-DRESD AC
Capella annual meeting 2021
Overcoming challenges of_verifying complex mixed signal designs
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
Instruction level parallelism
Evaluation of morden computer & system attributes in ACA
Provisioning Bandwidth & Logical Circuits Using Telecom-Based GIS .
Qualifying a high performance memory subsysten for Functional Safety
UIC Thesis Cancare
Network Planning & Design: An Art or a Science?
A Survey of Recent Advances in Network Planning/Traffic Engineering (TE) Tools
Pipelining and ILP (Instruction Level Parallelism)
Teklabz schematics generator
Ad

Similar to Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithmetic Operations (20)

PDF
J044084349
PPTX
module nenddhd dhdbdh dehrbdbddnd d 1.pptx
PDF
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
PPTX
HighLevel Synthesis Algorithms for the vlsi
PPTX
VLSI_CAD_Introductionxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.pptx
PDF
24-02-18 Rejender pratap.pdf
PDF
High-Level Synthesis with GAUT
PDF
Basic signal processing system design on fpga using lms based adaptive filter
PPTX
module 1-2 - Design Methods, parameters and examples.pptx
PPT
Short.course.introduction.to.vhdl for beginners
PPT
An Introduction to Field Programmable Gate Arrays
PPT
CASFPGA1.ppt
PPT
FPGA_prototyping proccesing with conclusion
PDF
AI Assisted Digital System Design Lecture 1
PPTX
Unit-V.pptx
PDF
DSD-1 (Introduction).pdf
PPT
Lecture1111111111111111111111_vhdl_Introduction.ppt
PDF
Embedded system design: a modern approach to the electronic design.
PPT
Fmcad08
PDF
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
J044084349
module nenddhd dhdbdh dehrbdbddnd d 1.pptx
IRJET- A Review- FPGA based Architectures for Image Capturing Consequently Pr...
HighLevel Synthesis Algorithms for the vlsi
VLSI_CAD_Introductionxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.pptx
24-02-18 Rejender pratap.pdf
High-Level Synthesis with GAUT
Basic signal processing system design on fpga using lms based adaptive filter
module 1-2 - Design Methods, parameters and examples.pptx
Short.course.introduction.to.vhdl for beginners
An Introduction to Field Programmable Gate Arrays
CASFPGA1.ppt
FPGA_prototyping proccesing with conclusion
AI Assisted Digital System Design Lecture 1
Unit-V.pptx
DSD-1 (Introduction).pdf
Lecture1111111111111111111111_vhdl_Introduction.ppt
Embedded system design: a modern approach to the electronic design.
Fmcad08
IMPLEMENTATION OF IMAGE PROCESSING ALGORITHMS ON FPGA HARDWARE.pdf
Ad

Recently uploaded (20)

PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Digital Logic Computer Design lecture notes
PPT
Project quality management in manufacturing
PDF
composite construction of structures.pdf
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Well-logging-methods_new................
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
web development for engineering and engineering
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Construction Project Organization Group 2.pptx
PPTX
CH1 Production IntroductoryConcepts.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Artificial Intelligence
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPT
introduction to datamining and warehousing
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Digital Logic Computer Design lecture notes
Project quality management in manufacturing
composite construction of structures.pdf
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Foundation to blockchain - A guide to Blockchain Tech
Well-logging-methods_new................
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
web development for engineering and engineering
Mechanical Engineering MATERIALS Selection
Construction Project Organization Group 2.pptx
CH1 Production IntroductoryConcepts.pptx
Internet of Things (IOT) - A guide to understanding
Lecture Notes Electrical Wiring System Components
UNIT 4 Total Quality Management .pptx
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Artificial Intelligence
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
introduction to datamining and warehousing

Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithmetic Operations

  • 1. ISSN (e): 2250 – 3005 || Volume, 06 || Issue, 04||April – 2016 || International Journal of Computational Engineering Research (IJCER) www.ijceronline.com Open Access Journal Page 1 Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithmetic Operations Semih Aslan Ingram School of Engineering Texas State University San Marcos, Texas, 78666, USA I. Introduction Today, a significant number of embedded systems are focused on multimedia applications with almost insatiable demand for low cost, high performance and low power hardware. Designing complex systems such as image and video processing, compression, face recognition, object tracking, 3G or 4G modems, multi-standard CODECs, and HD decoding schemes requires the integration of many complex blocks and a long verification process [1][2]. These complex designs are based on I/O peripherals, one or more processors, bus interfaces, A/D, D/A, embedded software, memories and sensors. In the past, complete systems were designed with multiple chips and connected together on PCBs, but with today’s technology, all functions can be incorporated in a single chip. These complete systems are known as System-on-Chip (SoC) [2].System-on-chip (SoC) designs are mainly accomplished by using Register Transfer Languages (RTL) such as Verilog and VHDL. RTL design flow [1] [2] for both FPGA and ASIC is similar and is shown in Figure 1. An algorithm can be converted to RTL using the behavioral model description method or by using pre-defined IP core blocks. After completing this RTL code, formal verification must be done before implementation. After implementation of the RTL code, timing verification needs to be done for proper operation. RTL design abstracts logic structures, timing and registers [1]. Because of this, every clock change causes a state change in the design. This timing dependency causes every event to be simulated, which results in a slower simulation time and longer verification period of the design. The design and verification of an algorithm in RTL in Figure 1 can take up 50-60% of the “Time to Market” (TTM). The RTL design becomes impractical for larger systems that have high data flow between the blocks, and it requires millions of gates. Even though design time may improve by using behavioral modeling and IP cores, the difficulty in synthesis, poor performance results and rapid changes in the design make IP cores difficult to adapt and change. Therefore, systems rapidly become obsolete.The limitations of RTL and longer TTM forced designers to think of the design as a whole system rather than blocks. In addition, software integration in SoC was always done after the hardware was designed. When the system gets more complex, software integration is desirable during hardware Abstract Embedded systems used in real-time applications require low power, less area and a high computation speed. For digital signal processing (DSP), image processing and communication applications, data are often received at a continuously high rate. Embedded processors have to cope with this high data rate and process the incoming data based on specific application requirements. Even though there are many different application domains, they all require arithmetic operations that quickly compute the desired values using a larger range of operation, reconfigurable behavior, low power and high precision. The type of necessary arithmetic operations may vary greatly among different applications. The RTL-based design and verification of one or more of these functions may be time-consuming. Some High Level Synthesis tools reduce this design and verification time but may not be optimal or suitable for low power applications. The developed MATLAB-based Arithmetic Engine improves design time and reduces the verification process, but the key point is to use a unified design that combines some of the basic operations with more complex operations to reduce area and power consumption. The results indicate that using the Arithmetic Engine from a simple design to more complex systems can improve design time by reducing the verification time by up to 62%. The MATLAB-based Arithmetic Engine generates structural RTL code, a testbench, and gives the designers more control. The MATLAB-based design and verification engine uses optimized algorithms for better accuracy at a better throughput. Keywords: FPGA, High Level Synthesis, MATLAB, Optimized Hardware, Power Efficient, RTL.
  • 2. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 2 implementation. Over the last two decades, designers were forced to find new methods to replace RTL due to improvements in SoC and shorter TTM. Because of the extensive work done in Electronics System Level Design (ESLD), HW/SW co-design of a system and High Level Synthesis (HLS)[2][4] are integrated into FPGA and ASIC design flow. Algorithm RTL Timing Translate Map Place&Route IMPLEMENT Bit File Formal Proof Logic Synthesis Figure 1. FPGA RTL level synthesis flow The next section will describe the proposed MATLAB HLS Arithmetic (MHA) Engine design and implementation. Section III and IV will focus on the error analysis and testbench generation respectively and the conclusion will describe future work and improvements. II. Mha Engine RTL description of a system can be implemented from a behavioral description of the system in Perl, C, Python and MATLAB. This will result in a faster verification process and shorter TTM. It is also possible to have a hybrid design where RTL blocks can be integrated with HLS [2].The HLS design flow shows that a group of algorithms that represent the whole system or parts of a system can be implemented using a high level language such as Perl, C, C++ , Java, MATLAB [2][5]. Each part in the system can be tested independently before the whole system is tested. During this testing process, the RTL testbenches may also be generated. After testing is complete, the system can be partitioned into HW and SW. This enables SW designers to join the design process during HW design; in addition, RTL can be tested by using both HW/SW together. After the verification process, the design can be implemented using FPGA synthesis tools.The integration of HLS into FPGA design flow is shown in Figure 2. MATLAB Algorithm High Level Synthesis (HLS) Timing Translate Map Place&Route IMPLEMENT Bit File Formal Proof RTL MHA Engine Figure 2. FPGA high level synthesis flow with MHA Engine Since the early days of VLSI design, application-specific hardware has been used for optimal implementation of algorithms. This approach is considered the fastest design scheme but is also the most area consuming system due to the inherently redundant nature of a design that only computes one operation. However, there are other possible designs for DSP implementations that can be used for two or more operations [1][3]. This design approach consists of processing blocks that can compute multiple operations using dedicated hardware designed for a particular cluster of operations. An improved design approach should exploit the redundancy and common elements that exist among the sub-blocks. This would result in shared building blocks and dramatically reduced hardware requirements.
  • 3. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 3 The proposed design focuses on designing a large system that will be faster with a design principle similar to HLS, asexplained above. The main work focuses on multi-purpose, reused hardware structures that produce a unified, area efficient reconfigurable system. This design can reduce the area by 64% [2][[6]. The components of the MHA Engineare shown in Figure 3. Arithmetic Operation . Addition/Subtraction . Multiplication . Division . Square Root . Inverse Square Root . Sin(θ), Cos(θ), Tan(θ), Cot(θ) . Sinh(θ), Cosh(θ), Tanh(θ) . ex ControlI/O MHE Engine Figure 3. MATLAB Based HLS Arithmetic Engine (MHA) block The MHA block has three important principles:  Compute required arithmetic operations  Customized range and accuracy  Generate an area-efficient, fast system for low power applications The MHA accepts inputs from the user via two GUIs to make it more user friendly and efficient. The “Main” GUI that is shown in Figure 4 below includes the following sections:  FPGA or ASIC support  Vendor based IP Core support  Project Name (default is c:MHAMHA)  Top Module Name (default is MHA)  Language – Verilog or VHDL (design and verifications – current system only supports Verilog HDL)  Rounding - Truncation or RNE (Rounding cannot be done without a selection)  Number system – Fixed or Floating Point (Fixed point up to 64-bit - current system only supports Fixed Point Number system) Figure 4. Main GUI for MHA Engine  Signed or unsigned number systems  Target – Frequency and throughput  Area or speed based optimization
  • 4. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 4  Testbench generation o Automated testbench with MATLAB o Modelsim .do file for fast automation o Automated testbench file for Modelsim o Error comparison with MATLAB o User defined test data option The “Arithmetic” tab shown in Figure 5 has the following sections:  Basic arithmetic operations o Addition/Subtraction (Area or speed optimized based on Ripple-Carry Adder (RCA) or Carry-Lookahead Adder (CLA)) o Multiplication (Array or Booth multiplier)  Advanced arithmetic operations o Division (Newton-Raphson, Goldschmidt, or CORDIC) o Square Root (Newton-Raphson, Goldschmidt, or CORDIC) o Inverse Square Root (Newton-Raphson, Goldschmidt, or CORDIC)  Elementary functions o Trigonometric functions – sine, cosine, tangent, and cotangent (table method, CORDIC or polynomial based design) o Hyperbolic functions – sinh, cosh, tanh (table method, CORDIC or polynomial based design) o Exponential function – exp(x) (table method, CORDIC or polynomial based design) Figure 5. Arithmetic operations GUI The MHA uses a bottom-up design process that starts with the elementary functions and then moves to the simplest arithmetic operations such as multiplication and addition. This is shown in Figure 6 below. This design flow contains the following procedure: First, selection of elementary functions [7], selection of basic arithmetic operations, and generation of area efficient hardware for FPGA and VLSI. There are 2-64 bit selections that are suitable for a vast variety of applications with the requested precision. The section’s addition and multiplications are used based on the previous designs. Division, inverse square root and square roots are designed based on the same architecture, and the modified design reduces the area by 64% [8]. Next, the CORDIC [9] [10] or polynomial methods [4] are used to calculate elementary functions [7] [11]. This area- efficient design is optimized for speed by implementing a smart control system. For performance evaluation and synthesis are implemented with Xilinx FPGAs [12] and Microwind [13] VLSI design tool.
  • 5. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 5 Arithmetic Block Input Parameters Elementary Functions Arithmetic HDL Testbench Precision Figure 6. The MHA design flow III. Error Analysis When designing hardware with many arithmetic operations, one of the most important objectives is to produce results that have a minimal absolute and average computation error. Arithmetic operations in digital systems generally introduce three types of errors: number representation, rounding, and algorithmic or design error [14][15].The exact representation of some numbers or events in radix-n may not be possible due to the limitations in ADCs, the sampling rate, and the number of available bits. In addition, many numbers cannot be converted from radix-n to radix-m without an error. For example, number 0.1 and 0.2 in radix-10 cannot be represented in radix-2 without an error. This error can be reduced by increasing the number of bits. The reduction of this error with respect to the number of bits is shown in Figure 7. Figure 7. Error representation of number radix conversion Figure 8 shows the error generated for radix-10 to radix-2 conversion of 128 fractional numbers (fractional part of 30 bits). Figure8. Random 128-number conversion error from Radix-10 to Radix-2
  • 6. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 6 During and after the calculation of certain arithmetic operations, the total number of bits may exceed the number of bits available; these values need to be rounded. For example, multiplying two n-bit numbers produces a product of 2n-bit and this result may need to be represented with n-bit.If there were more multiplications on the design path, the number of bits would increase in a linear fashion. To prevent this, each multiplier output needs to be rounded. There are a few ways to implement rounding in hardware, with the most commonly used methods being round to the nearest even, round towards zero (truncation), round down (floor), round up (ceiling) and round away from zero [14]. In this section, truncation (TRA)[14] and round to the nearest even (RNE)[14] schemes are compared. During these rounding operations, an error value is introduced. The RNE and TRA and their error values are shown in Table 1. Table 1. RNE and TRA Number RNE TRA Rounded Value Error Rounded Value Error X0.00 X0. 0.00 X0. 0.00 X0.01 X0. 0.25 X0. 0.25 X0.10 X0. 0.50 X0. 0.50 X0.11 X0.+ulp -0.25 X0. 0.75 X1.00 X1. 0.00 X1. 0.00 X1.01 X1. 0.25 X1. 0.25 X1.10 X1.+ulp -0.50 X1. 0.50 X1.11 X1.+ulp -0.25 X1. 0.75 Total --- 0 --- 3.00 Figure 9 shows advantage of RNE over TRA when average error is considered. The requested number of precision and selected rounding scheme can affect the size of the hardware and overall speed and throughput. Users can change the selected accuracy to see area and throughput estimates simultaneously without increasing the overall design time. This will make it possible to select the optimal design for synthesis. The MHA Engine can create hardware for precision based on increase the bit size during the mid-operation and apply rounding before the output stage. Figure 9. 32-Bit to 16-Bit Rounding Errors with RNE and TRA Iv. Testbench Generation One of the most important and complicated sections of the MHAEngine is generation of the testbench files and error checking using MATLAB and Modelsim. Before the RTL code is synthesized, it can be tested using a testbench that is created using MATLAB and Modelsim. The testbench generation and error checking block diagram is shown in Figure 10 below. MODELSIM USER CONSTRAINTS MATLAB VERILOG TESTBENCH RTL DUT RESULTS (text file) Verification and Error Files Figure 10. MATLAB and Modelsim flow
  • 7. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 7 After generation of the design and testbench files,the user defined test vectors need to be generated. The first step is to generate positive random numbers [0,1]. These numbers must be converted into positive and negative numbers based on signed numbers. To generate random test values and test results, the following procedure is followed:  Get the user defined testvector number n.  Generate random test vectors (T{n})  Generate random binary numbers using MATLAB o For fixed-point signed numbers: T_Bin=dec2bin(T*2^n,n) o For fixed-point unsigned numbers: T_Bin=dec2bin(T*2^(n-i),n)  Generate the Modelsim testbench file and get results  Compare results using MATLAB After generation of the design, testbench, and test vector files, the next step is to generate a Modelsim tcl .do file that can be transferred into Modelsim all together. The .do file will generate the project file and will import all design files, including testbench, into Modelsim. This will run all files and generate the results as a text file. Once Modelsim-generated results are imported into MATLAB, correct operation and error analysis needs to be performed. An important issue which needs to be addressed during the verification process is working with negative fixed-point numbers in MATLAB. It is important because it does not convert negative binary numbers and binary floating numbers to a decimal number. This problem is addressed using the MATLAB codes given in Figure 11. Figure 11. Signed binary to decimal conversion V. Conclusion An area efficient, MATLAB based HLS engine for arithmeticoperations is designed for low power and high-speed applications. The MHA Engine decreases design system time and verification by up to 64% without compromising speed and efficiency. The MHA Engine uses a smart control system that is optimized based on the desired operations. The MHA Engine is a bridge between RTL and HLS. It uses RTL-based basic blocks to design most complicated arithmetic operations using structural model design and HLS-style fast and optimized verification. Any designed system can be reconfigured at any time in any way in MHA Engine without going through the same design and verification hassle. MATLAB-based verification makes it possible to use all the features of MATLAB for faster and more efficient verification. The MHA Engine can be easily reconfigurable to systems available at any level, due to changes in the computer system and software. As explained above, this system generates area efficient fast arithmetic and elementary functionsthat can be used over a wide area of applications in DSP, image processing, and communication systems. It can be used for FFT, DCT and DWT calculations and Chirplet transforms [16][17][18]. It can also be very important for educational institutions in order to test their systems using the verification testbench that, when desired, works as an independent design tool. Overall, this work will forge the way for those who need to make sudden changes in their systems and need fast verification. They can adopt and apply any changes using the MHA Engine or generate similar systems for faster design and verification. In addition, because the MHA Engine generated code is designed structurally, code can be changed easily at any level, if so desired. Future workwill integrate VHDL code and IEEE 754 floating point numbers (both single and double precision) implementation. Another future goal is to make this platform totally open source by using only Iverilog and replacing MATLAB with Octave.
  • 8. Matlab Based High Level Synthesis Engine For Area And Power Efficient Arithmetic Operations www.ijceronline.com Open Access Journal Page 8 References [1]. Hendry, D.C., and A.A. Duncan. "Area Efficient DSP Datapath Synthesis." Design Automation Conference (1995): 130-135. [2]. Aslan, S., Oruklu, E., and Saniie J., “A high-level synthesis and verification tool for fixed to floating point conversion”, IEEE International Midwest Symposium on Circuits and Systems, 2012, Pages, 908-911. [3]. Andrieux, J., M. Feix, G. Mourgues, P. Bertrand, B. Izrar, and V. Nguyen. "Optimum Smoothing of the Wigner Ville Distribution." IEEE Transactions on Acoustics, Speech, and Signal Processing 36.5(1987): 764-769. [4]. Kilts, S. Advanced FPGA Design Architecture, Implementation, and Optimizations. New York: Wiley Inter-Science, 2007. [5]. Chen, W. The VLSI Handbook. Boca Raton: CRC Publisher, 2007. [6]. Dehon, A., and S. Hauck. Reconfigurable Computing The Theory and Practice of FPGA-Based Computing. Burlington, Massachusetts: Elsevier, 2008. [7]. Walther, J.S. "A unified Algortihm for Elementary Functions." American Federation of Information Processing Societies Joint Computer Conferences (1971): 379-385. [8]. Oruklu, E., J. Saniie, and S. Aslan. "Realization of Area Efficient QR Factorization using Unified Diivision, Square Root and Inverse Sqaure Root Hardware." IEEE Electro/Information Technology (2009): 245-250. [9] Volder, J. "The CORDIC Trigonometric Computing Technique." IEEE Transactions Electronic Computers 8.3 (1959): 330-334. [9]. 10] Striling, W. C., and T. K. Moon. Mathematical Methods and algorithms for Signal Processing. New Jersey: Prentice Hall, 2000. [10]. Xilinx. (2016), http://www.xilinx.com/ [11]. Microwind (2106) http://microwind.net/ [12]. Stine, J. E. "Digital Computer Arithmetic Datapath Design using Verilog HDL." Digital Computer Arithmetic Datapath Design using Verilog HDL. Norwell, Massachusetts: Kluwer Academic Publishing, 2004. [13]. Teukolsky, S. A., W. T. Vetterling, B. P. Flannery, and W. H. Press. "Numerical Recipes: The Art of Scientific Computing." Numerical Recipes: The Art of Scientific Computing, 3rd ed. New York, New York: Cambridge University Press, 2007. [14]. Omar, J., E. E. Swartzlander Jr., and M. J. Schulte. "Optimal Initial Approximations for the Newton-Raphson Division Algorithm." Springer-Verlag Journal of Computing 53.3-4 (1994): 233-242. [15]. Seidel, P-M, W. E. Ferguson, and G. Even. "A Parametric Error Analysis of Goldschmidt’s Division Algorithm." Journal of Computer and System Sciences 70.1 (2005): 118-139. [16]. Chunduri, K. C. "Implementation of Adaptive Filter Structures on a Fixed Point Signal Processor for Acoustical Noise Reduction" (2006).