RISC-V & SoC Architectural Exploration for AI and ML Accelerators

December 8-10, 2020 | Virtual Event
Architectural Exploration for AI / ML accelerators
Simon Davidmann, Duncan Graham
Imperas Software
info@imperas.com
#RISCVSUMMIT

Machine Intelligence compute requirement
is growing fast
300,000x increase… https://openai.com/blog/ai-and-compute/

35 Years of microprocessor trend data
Even with more transistors, performance gains have stalled, so the trend is to move to parallel processing with more cores

Computation needed for AI / ML
Summary:
• e.g. 1 billion MACs for AlexNet – image recognition… training
• x86 is not getting faster
• So the trend is to move to specialized processing and run in parallel
=>
• So you need the fastest cores (often with custom extensions / acceleration)
• And, it needs to be the correct parallel…
• And, designers need to know that their algorithms run “well” on the hardware configuration they select

Processor Hardware options for Software acceleration
• Dedicated external accelerator hardware
  • Fast for the limited set of known use cases
  • but inflexible if software needs change
• Processor extension
  • Closely coupled gives efficiency with flexibility
  • but future improvements limited by the end of Moore’s Law
• Processor custom extension
  • Performance advantages with optimized instructions
  • and lightweight inter-processor communications for scale
[Figure: three processor options. Scalar processors with external accelerator (CPU, Accelerator); scalar processors with vector extensions (CPU, Vector Extensions); vector processors with instruction extensions plus micro-arch comms (CPU, Vector Extensions, Custom Instructions, Comms Extensions)]

AI SoC Architecture Exploration
[Figure: processing element options. Scalar processors with vector extensions plus micro-arch comms; blocks shown: CPU, Vector Extensions, DL Extensions, Comms Extensions; Array of Processing Elements (PE)]
AI & Machine Learning Accelerators
• Datacenter: training & inference
• Edge: inference (mostly)
• Compute arrays with processor elements (PE) configured for
  - Scalar
  - Vector
  - Spatial
  - Communications
    - PE <-> PE & PE <-> NoC
[Figure: array of CPU blocks with an Accelerator, labeled "Configurations of Processing Elements (PE)" and "Features of Processing Elements (PE)"]

Imperas works with the leaders for RISC-V Vector Extensions
• Andes certifies Imperas models and simulator as reference for new Andes RISC-V Vectors Core with lead customers and partners
• Imperas code morphing simulation technology, virtual platforms and tools used by lead customers for early software development and high-level architectural exploration
"Andes has announced the new RISC-V family 27-series cores, which in addition to new and advanced features, include the new Vector extensions that are an ideal solution for our customers working on leading edge design for AI and ML. Andes is pleased to certify the Imperas model and simulator as a reference for the new Vector processor NX27V, and is already actively used by our mutual customers."
Charlie Hong-Men Su, CTO and Executive Vice President at Andes Technology Corp
NX27V VPU Overview
VPU: Vector Processing Unit
RVV spec: ongoing 0.8
Data formats:
SEW supported: int8, int16, int32, fp16, fp32
Extension formats: bfloat16 and int4
Support LMUL 1, 2, 4, 8
VPU main configurations:
SIMD width and VLEN (bits): 128, 256, and 512
Functional units chainable, with dedicated IQ, most fully pipelined
Wide system bus for data accesses
Vector Registers as operands for ACE instructions
Usage example: custom vector load/store from a dedicated memory port
Verification: leverage/enhance Google UVM, working with Imperas
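
The VLEN, SEW and LMUL settings listed above interact. Under the RISC-V Vector specification, the maximum number of elements a single vector instruction can process is VLMAX = LMUL × VLEN / SEW. The following is a minimal Python sketch of that relationship using the values from this slide; the function name `vlmax` is illustrative only and is not part of any Andes or Imperas API.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Maximum elements per vector register group: LMUL * VLEN / SEW."""
    if vlen_bits % sew_bits != 0:
        raise ValueError("VLEN must be a multiple of SEW")
    return lmul * vlen_bits // sew_bits


if __name__ == "__main__":
    # VLEN and LMUL options come from the slide; SEW widths cover
    # int8, int16/fp16 and int32/fp32.
    for vlen in (128, 256, 512):
        for sew in (8, 16, 32):
            row = [vlmax(vlen, sew, lmul) for lmul in (1, 2, 4, 8)]
            print(f"VLEN={vlen:3d} SEW={sew:2d} VLMAX for LMUL 1,2,4,8: {row}")
```

For example, a 512-bit VLEN with int8 elements and LMUL 8 gives 512 elements per instruction, while a 128-bit VLEN with 32-bit elements and LMUL 1 gives only 4.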

Example US Customer
• Customer project
• Full AI / ML engine
• 150+ CPU cores
• Over half with RISC-V Vector extension engine
• Imperas Reference Models and Virtual Platform provide an environment for software stack development
• Simulation runs of the software stack in the virtual platform take ~2 hrs @ 500 MIPS
• Cross compiled software running on simulated CPUs
• Allows hardware platform configuration, re-configuration, architectural changes
• Explore performance options
• Runs real software (production binaries) – can see how it interacts with HW configuration
• Running in Imperas more than a year before RTL commit
• Customer has SW and is looking to design HW to make it work the way they want…
• Also a by-product: kick-start SoC process by feeding models into HW DV at start
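
As a rough back-of-the-envelope reading of the run time quoted above: if the simulator sustained 500 MIPS for the full 2 hours, one run of the software stack would be on the order of 3.6 trillion simulated instructions. The sketch below only does that arithmetic. The slide does not say whether 500 MIPS is a peak or sustained figure, or how it is spread over the 150+ cores, so the sustained rate is an assumption and the function name is hypothetical.

```python
def simulated_instructions(hours: float, mips: float) -> int:
    """Instructions executed at a sustained rate of `mips` million instr/s."""
    seconds = hours * 3600
    return int(seconds * mips * 1_000_000)


if __name__ == "__main__":
    total = simulated_instructions(hours=2, mips=500)
    print(f"about {total:.2e} simulated instructions per run")
```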

Example Japanese partner
• Overview
• Platform : ARM Cortex-A57 x 1 + RISC-V RV64GCV x 17
• Application1 : AlexNet image recognition deep neural network

ImageNet with AlexNet deep neural network
• AlexNet (University of Toronto, 2012)
• https://towardsdatascience.com/the-w3h-of-alexnet-vggnet-resnet-and-inception-7baaaecccc96
• Hyperparameters
• Number of parameters : 58 M (float32)
• Computation cost : 1,000 M (number of multiply-adds)
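
The multiply-add count of a layer follows directly from its shape: a convolution layer needs out_h × out_w × out_channels × k_h × k_w × in_channels multiply-adds, and a fully connected layer needs inputs × outputs. The Python sketch below shows the calculation for two layers. The layer shapes used are an assumption taken from the published AlexNet architecture, not from this deck, and the partner's exact network variant is not stated.

```python
def conv_macs(out_h: int, out_w: int, out_c: int,
              k_h: int, k_w: int, in_c: int) -> int:
    """Multiply-adds for one convolution layer (bias and stride ignored)."""
    return out_h * out_w * out_c * k_h * k_w * in_c


def fc_macs(n_in: int, n_out: int) -> int:
    """Multiply-adds for one fully connected layer."""
    return n_in * n_out


if __name__ == "__main__":
    # Assumed shapes: conv1 = 96 filters of 11x11x3 with a 55x55 output,
    # fc8 = 4096 inputs to 1000 class outputs.
    print("conv1:", conv_macs(55, 55, 96, 11, 11, 3))
    print("fc8:  ", fc_macs(4096, 1000))
```

Even with these assumed shapes, a single convolution layer costs roughly 100 M multiply-adds, while the final fully connected layer costs about 4 M.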

Parallelization for multiple cores
[Figure: bar chart of the number of multiply-adds per layer (y-axis 0 to 450,000,000) for conv1 to conv5 (convolution) and fc6 to fc8 (fully connected); legend: mult-add, local response normalization]
The convolution layers account for most of the calculation
These layers were parallelized to use 16 CPU cores
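
The deck does not describe how the layers were divided among the 16 cores. One common approach, shown here purely as an assumption, is to give each core a contiguous slice of a layer's output channels, since each output channel can be computed independently. The function name and the 96-channel example are hypothetical.

```python
def split_channels(num_channels: int, num_cores: int) -> list[range]:
    """Divide output channels into near-equal contiguous slices, one per core."""
    base, extra = divmod(num_channels, num_cores)
    slices = []
    start = 0
    for core in range(num_cores):
        # The first `extra` cores take one additional channel each.
        size = base + (1 if core < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


if __name__ == "__main__":
    # Assumed example: a 96-channel convolution layer on 16 cores.
    for core, chans in enumerate(split_channels(96, 16)):
        print(f"core {core:2d}: channels {chans.start}..{chans.stop - 1}")
```

With this kind of split, each core reads the same input feature map but writes a disjoint part of the output, so no synchronization is needed until the layer completes.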

Simulate a Virtual Platform model
[Figure: virtual platform block diagram. Blocks shown: ARM Cortex-A57 [0] on the ARM bus; RISC-V RV64GC [0] to [16] on the RISC-V bus; RAM blocks; bus bridges to a shared bus; UART0 (for ARM[0]) and UART1 to UART17 (for RV64[0] to RV64[16])]

Executing simulation – different consoles

Single Multi-Processor Debug
Debugging both ARM & RISC-V cores using one debugger at same time.
aarch64 register set
RV64 register set

Example Japanese partner
• Overview
• Platform : ARM Cortex-A57 x 1 + RISC-V RV64GCV x 17
• Application1 : AlexNet image recognition deep neural network
• Key points
• “Imperas simulator can simulate heterogeneous virtual platform”
• “Imperas also provides dedicated debugger which can debug hetero-system (ex. ARM and RISC-V) using one debugger at same time”
• “Very fast. This example runs (at most) 10 times slower than native x64 execution on host PC”

How is Processor Performance Optimized?
• Move to multicore and to different multicore configurations
• Tune accelerators, configuration options (e.g. vector engine sizes)
• Optimize the pipeline
• Improve memory usage/latency
• Custom instructions for application/domain optimization (feature of RISC-V)

Flow to add new custom instructions
Characterize C Application
• Instruction Accurate Simulation
• Trace / Debug
• Timing Simulation
• Function Timing / Profiling
Develop New Custom Instructions
• Design Instructions
• Add to Application
• Add to Model
• Add Timing
Characterize New Instructions in Application
• Instruction Accurate Simulation
• Trace / Debug
• Timing Simulation
• Function Timing / Profiling
Optimize & Document model
• Instruction Coverage
• Line Coverage
• Instruction Performance
• Generate PDF model doc
Release & Deploy
• Check RISC-V Compliance
• Use as reference for RTL Design Verification
• Use in Imperas/OVP Platforms, EPKs
• Heterogeneous / Homogeneous
• Multi-core, Many-core
• Imperas Multi-Processor Debug, VAP tools
• Port OS, RTOS (Linux, FreeRTOS…)
• Use in many simulation envs (inc. SystemC)
• Deliver to end users

Demo walkthrough

Imperas Tools / Environment
[Figure: Imperas tools / environment block diagram. Blocks shown: Testbench; Application Software & Operating System; Virtual Platform (OVP CPUs, Memory, Peripheral, Bus); SlipStreamer API; JIT simulator engine; Multiprocessor / Multicore Debugger; Eclipse IDE]
Verification, Analysis & Profiling (VAP) tools
• Trace
• Profile
• Code coverage
• Memory monitor
• Protocol checker
• Assertion checkers
• OS task tracing
• OS scheduler analysis
• Fault injection
• Function tracing
• Variable tracing
• …

Imperas works with Mellanox on RISC-V Processor Verification
• Imperas Leading RISC-V CPU Reference Model for Hardware Design Verification Selected by Mellanox/NVIDIA
• Verification tools and golden reference model provide support for RISC-V custom instruction extensions and full processor design verification

Summary
• Current AI / ML applications need new / custom configurations of hardware to meet the required performance goals
• Fast simulation allows software to run on virtual platforms many months (maybe a year) before RTL commit
• Imperas allows analysis of performance on different hardware configuration choices
  • including running heterogeneous platforms with a full OS running
  • provides detailed analysis, profiling, performance and debug tooling
• Imperas Reference Model includes all the current RISC-V specification features and enables you to develop custom instructions
  • Is a golden reference for many users validating their silicon
• Imperas provides solutions to enable architectural exploration for AI and ML accelerators

More Information: info@imperas.com
• Stop by the virtual Imperas booth at the December 2020 RISC-V Summit
• www.Imperas.com
• www.OVPworld.org
• www.GitHub.com/riscv-ovpsim
