Key Requirements to Successfully Implement GenAI in Edge Devices — Optimized Mapping to the Enhanced NPX6 Neural Processing Unit IP

Gordon Cooper
Principal Product Manager
Synopsys

© 2025 Synopsys Inc.
The Challenge of Fitting GenAI into an Edge Device SoC

Assumptions:
• Target solution is an AI-enabled SoC
• GenAI (built on transformer models) capabilities are needed
• An NPU is needed for transformer/GenAI performance and power efficiency

→ A GenAI-capable NPU IP is required
Extreme Ironing: Panoptic Segmentation Using CNNs

Model used: Detectron2 COCO-PanopticSegmentation/panoptic_fpn_R_101_3x (Panoptic FPN, ResNet-101 backbone, 3x schedule)
Image source: https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg
Extreme Ironing: Multimodal Transformers Provide Better Contextual Awareness

LLaVA (Large Language and Vision Assistant):

User: What is unusual about this image?
LLaVA: The unusual aspect of the image is that a man is ironing clothes on the back of a yellow minivan while it is on the road. This is an unconventional and unsafe place to perform such an activity, as ironing clothes typically requires a stable surface and appropriate equipment. Ironing clothes in a moving vehicle could lead to potential hazards for both the person doing the ironing and other road users.

Image source: https://arxiv.org/pdf/2304.08485.pdf
Challenge: AI/ML Technology Evolving (Moving Target)

• Up to 2012: DSP-based computer vision — ~50% accuracy*
• 2012 to 2021: Convolutional neural networks — from 65% to 90% accuracy* (SNPS CNN Gen 1 to 5; residual connections, depthwise separable convolutions, Squeeze-and-Excitation layers, Inception, new activation functions)
• 2021 to present: Transformers / GenAI — ~91% accuracy* (SNPS NPU Gen 6)

MoE (Mixture-of-Experts): uses a collection of smaller expert networks, each specialized in different aspects of the input, to improve performance and efficiency. The concept originated in the 1991 paper "Adaptive Mixtures of Local Experts". Used in DeepSeek, Llama 4, etc.

*ImageNet Top-1 accuracy
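The MoE routing idea can be shown in a few lines of NumPy. This is a toy sketch of the general technique, not the implementation used by any of the models named above; all dimensions and class/function names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    """Toy Mixture-of-Experts layer: a router scores all experts, but only
    the top-k experts run per token, so only a fraction of the total
    weights is touched for each input (the source of MoE's efficiency)."""
    def __init__(self, d_model, d_ff, n_experts, top_k=2):
        self.top_k = top_k
        self.router = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
        self.w2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02

    def __call__(self, x):                            # x: (tokens, d_model)
        scores = softmax(x @ self.router)             # (tokens, n_experts)
        out = np.zeros_like(x)
        for t, row in enumerate(scores):
            picked = np.argsort(row)[-self.top_k:]    # top-k expert indices
            gates = row[picked] / row[picked].sum()   # renormalized gates
            for e, g in zip(picked, gates):
                h = np.maximum(x[t] @ self.w1[e], 0)  # expert FFN (ReLU)
                out[t] += g * (h @ self.w2[e])
        return out

layer = MoELayer(d_model=16, d_ff=32, n_experts=8, top_k=2)
y = layer(rng.standard_normal((4, 16)))
print(y.shape)   # (4, 16)
```

With 8 experts and top-2 routing, each token activates only a quarter of the expert weights — the same total-vs-active parameter split quoted for DeepSeek and Llama 4 Scout later in this deck.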
Challenge: AI/ML Requirements for AI SoCs Rising

                              Last 5 years     Ongoing designs              Next 3 years
Algorithms                    CNNs, RNNs       Transformers, GenAI          Transformers, GenAI
                                               (image gen, LLMs)            (LVMs, LMMs, SLMs)
High-end ML performance       100s of TOPS     Up to 1000 TOPS              2000+ TOPS
at the edge
NPU data types                INT8             INT8 / INT4,                 INT4 / INT8,
                                               FP16 / BF16                  FP4, FP8, OCP MX
Multi-die / chiplet           N/A              UCIe v1.1                    UCIe v1.2
Typical process nodes*        16 nm / 12 nm    7 nm / 5 nm / 3 nm           3 nm / 2 nm

*ARC Processor IP (NPX6) is process node agnostic
Challenge: Memory Interface Is a Chokepoint for GenAI (Especially for Edge Devices)

                                HBM4                  LPDDR5/5X
Common use case                 Cloud AI / training   Edge AI inference
Max interface bandwidth         1.5+ TB/s             68 GB/s
Power efficiency (mW/Gbps)      Best                  Good
Availability                    Poor                  Good

• Many customers are avoiding HBM due to cost, limited access to TSMC CoWoS, and DRAM supply issues
Challenge: GenAI Parameters Significantly Larger

• Generative AI produces compelling results, but the parameters required are orders of magnitude larger than for CNNs; this makes GenAI models bandwidth-limited in edge implementations.

GenAI models:
AI model             Type               Parameters
GPT-4                LLM                1.76 T
DeepSeek             LLM                671 B (37 B active)
GPT-3.5              LLM                175 B
LLaVA                LMM                175 B
Llama 4 Scout        LLM                109 B (17 B active)
Llama 3.2            LLM                1 B / 3 B / 11 B / 90 B
Llama 2              LLM                7 B / 13 B / 70 B
GPT-J                LLM                6 B
GPT-3.5              LLM                1.5 B / 6 B
DeepSeek R1 Qwen     LLM                1.5 B / 7 B / 14 B / 32 B
Stable Diffusion     Image generator    1.5 B

For comparison:
AI model             Type                Parameters
ViT                  Vision transformer  86 M–632 M
BERT-Large           Language model      340 M
ResNet-50            CNN                 25 M
MobileViT            Vision transformer  1.7 M
Key Architecture Considerations for NPUs Running GenAI

Key GenAI metrics:
• Time to first token
• Tokens per second

(Diagram: NPU with a programmable computational core — math engine, DMA, L1 memory — connected over an internal interconnect to L2 memory and a NoC interface; the main NoC fabric links the NPU, an STU DMA, the host CPU, and L3 memory, e.g. DDR.)

Requirements:
• A programmable solution designed for the latest transformers
• Multi-level memory management
• Bandwidth reduction in hardware and software
• Software tools that support the hardware features and rapid architecture exploration
• Low-bit-resolution support (INT4, FP4, FP6, etc.) for data transfers to minimize bandwidth
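Both metrics can be bounded with back-of-envelope arithmetic: during autoregressive decode, every generated token must stream roughly all of the model's weights from external memory, so the DRAM interface caps tokens per second. A rough sketch with illustrative numbers (a 7B-parameter model over a 68 GB/s LPDDR interface, as in the earlier tables; real systems also spend bandwidth on activations and KV cache, so these are optimistic ceilings):

```python
def tokens_per_second(params_billion, bytes_per_param, dram_gb_s):
    """Upper bound on decode rate when weight streaming dominates:
    one full pass over the weights per generated token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return dram_gb_s * 1e9 / weight_bytes

# Why low-bit formats matter: halving bytes/parameter doubles the ceiling.
for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    tps = tokens_per_second(7, bits / 8, 68)
    print(f"{label}: ~{tps:.1f} tokens/s ceiling")
# FP16: ~4.9 tokens/s ceiling
# INT8: ~9.7 tokens/s ceiling
# INT4: ~19.4 tokens/s ceiling
```

The same logic explains time to first token: the prefill pass also streams the weights once, so smaller data types shorten it proportionally.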
ARC NPX6 NPU IP Supports Generative AI for Edge Devices

MetaWare MX Development Toolkit: runtimes & libraries, compilers & debugger, NN SDK, simulators, virtual platforms SDK.

Licensable Synopsys ARC NPX6 FS NPU IP, 4K-MAC to 96K-MAC configurations: cores 1 to 24 share L2 memory over a high-bandwidth, low-latency interconnect with DMA broadcast and streaming transfer units. Each core contains a DMA, a 4K-MAC convolution accelerator, a generic tensor accelerator, a tensor FPU, L1 memory, and an L1 controller with MMU.

Scalable NPX6 processor architecture
• 1- to 24-core NPU with multi-NPU support (3000+ TOPS*)

Memory hierarchy
• High-bandwidth L1 and L2 memories
• Powerful data sharing lowers external memory bandwidth requirements and improves latency

Trusted software tools scale
• Rapid hardware exploration

New data compression option
• Supports packing for OCP MX and INT data types

Bandwidth reduction
• Hardware and software compression, etc.

Silicon proven, automotive quality, Synopsys backed

*1.3 GHz, 5 nm FFC worst-case conditions, using sparse EDSR model
NPX6 Designed From the Ground Up for Transformer Support

• Convolution accelerator features
  – Support for matrix-matrix multiplications
  – Feature maps on both operands
• Generic tensor accelerator
  – Efficient support for softmax across channels/feature maps
  – Efficient support for L2 normalization across feature maps
  – GeLU support
• L1 DMA gather support
  – Allows efficient embedding lookups
  – The DMA reads multiple vectors based on a vector of addresses computed by the generic tensor accelerator

(Diagram: a transformer encoder operator graph — attention with FC ×3 for Q/K/V, Transpose, MatMul, Div, SoftMax, MatMul; LayerNorm chains of ReduceMean, Sub, Pow, Sqrt, ReduceMean, Div; and a feed-forward FC → GeLU → FC — mapped onto a single NPX core with L1 DMA, 4K-MAC convolution accelerator, generic tensor accelerator, L1 memory, L1 controller with MMU, and tensor FPU.)
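As a numerical illustration of that operator graph, here is a toy NumPy encoder block, including a gather-style embedding lookup of the kind the L1 DMA accelerates. This is a sketch of the ops the accelerators target, not Synopsys code; all dimensions and weights are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 16, 100                  # model width and vocab size (toy values)

def gelu(x):                    # GeLU (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layernorm(x, eps=1e-5):     # the ReduceMean/Sub/Pow/Sqrt/Div chain
    mu = x.mean(axis=-1, keepdims=True)                  # ReduceMean
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)   # Sub, Pow, ReduceMean
    return (x - mu) / np.sqrt(var + eps)                 # Sqrt, Div

# Embedding lookup as a gather: read table rows at a vector of addresses,
# as the L1 DMA does with addresses from the generic tensor accelerator.
table = rng.standard_normal((V, D))
token_ids = np.array([3, 17, 17, 42])
x = np.take(table, token_ids, axis=0)                    # (tokens, D)

# Single-head attention: FC x3 (Q/K/V), Transpose, MatMul, Div, SoftMax, MatMul
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(D)) @ v

# Feed-forward: FC -> GeLU -> FC, with LayerNorm around each sub-block
W1 = rng.standard_normal((D, 4 * D)) * 0.1
W2 = rng.standard_normal((4 * D, D)) * 0.1
h = layernorm(x + attn)
out = layernorm(h + gelu(h @ W1) @ W2)
print(out.shape)   # (4, 16)
```

Note that softmax and LayerNorm reduce across the feature dimension, which is why the generic tensor accelerator's "across channels/feature maps" support matters for transformers.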
Concurrency: NPX Core (Transformer Optimized)
Enhanced NPX6 NPU IP Supports Many Data Types

Integer formats:
Format name   Bits
INT16         16
INT14*        14
INT12*        12
INT10*        10
INT8          8
MXINT8*       8
INT6*         6
INT4*         4

Floating-point formats:
Format name   Element type              Bits
FP16          FP16 (E5M10)              16
BF16          BF16 (E8M7)               16
MXFP8*        FP8 (E5M2), FP8 (E4M3)    8
MXFP6*        FP6 (E3M2), FP6 (E2M3)    6
MXFP4*        FP4 (E2M1)                4

*Supported in DMA
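The MX formats share one power-of-two scale across a small block of elements, which is what lets 8-bit (or smaller) elements cover a wide dynamic range. A simplified sketch of the block-quantization idea (the actual OCP MX specification uses 32-element blocks with an E8M0 shared scale; this toy version uses MXINT8-style int8 elements):

```python
import numpy as np

def quantize_mx_block(block):
    """Toy MX-style block quantization: 32 elements share one power-of-two
    scale, so a block costs 32*8 + 8 bits instead of 32*16 for FP16
    (roughly a 2x memory and bandwidth reduction)."""
    assert block.size == 32
    amax = np.abs(block).max()
    exp = 0 if amax == 0 else int(np.floor(np.log2(amax)))
    scale = 2.0 ** (exp - 6)          # maps |amax| into int8 range (~128)
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
x = rng.standard_normal(32).astype(np.float32)
q, s = quantize_mx_block(x)
err = np.abs(dequantize(q, s) - x).max()
print(err <= s)   # True by construction: error is at most one scale step
```

Because the scale is a pure power of two, dequantization is a shift rather than a multiply, which is what makes these formats cheap to unpack in a DMA path (hence the "supported in DMA" asterisk above).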
NPX6 Supports Smart Architectural Exploration

IP- and SoC-level architectural exploration with fast iterations (100+); benchmarking results show a <20% margin of error.

IP-level performance analysis (MWMX analytic performance model):
• Throughput, latency
• Bandwidths (L2, DDR)
• Energy/power, area
• Stall analysis

NPX6/model configurations explored: number of MACs, L2 memory size, input image size, DDR bandwidth, DDR latency, batch size, sparsity, quantization, network & timing.

SoC-level performance analysis (NPX6 + host integrated into Platform Architect):
• Memory architecture analysis
• Interconnect metrics: latency, throughput, contention, outstanding transactions
• SoC-level power (roadmap)
NPX6 Performance, Power and Bandwidth Improvements

NEW: enhanced version of the silicon-proven ARC NPX6 NPU IP family of AI accelerators.

• Transformer boost: up to 45% better performance on transformer neural network models, accelerating vision and GenAI applications
• Power reduction: up to 10% lower power extends battery life and minimizes thermal impact for on-device AI applications
• AI data compression: a new option supports input and output of the new microscaling (OCP MX) data types, reducing memory footprint and bandwidth pressure for GenAI and other neural networks
ARC-V Expands on Winning ARC Processor IP Portfolio
Scalable CPU, DSP and AI IP & tools with unrivalled PPA efficiency

Specialty
• NPX family (NPU): scalable neural processor units (1K-96K MACs); supports the latest AI networks (e.g., transformers)
• VPX family (vector DSP): SIMD/VLIW design for parallel processing; multiple vector FP units for high precision
• EV family (vision processor): heterogeneous multicore for vision processing; DNN (deep neural network) engine

ARC-V (RISC-V ISA)
• RMX family (ultra-low-power embedded): 32-bit embedded processor with DSP option; high-efficiency 3- and 5-stage pipeline configs
• RHX family (real-time performance): 32-bit real-time processor, 1-16 cores; high-speed, dual-issue 10-stage pipeline
• RPX family (host processor): 64-bit host processor, 1-16 cores; SMP Linux, L2 cache support

Classic
• EM family (embedded MPU): 3-stage pipeline with high-efficiency DSP; optimized for low-power IoT
• HS family (high-speed CPU): high-performance CPUs, CPU + DSP; single- and multi-core configs
• SEM family (security CPU): protection against HW, SW, and side-channel attacks; SecureShield to create a trusted execution environment

Functional Safety (FS) processors
• Integrated hardware safety features for ASIL compliance across the portfolio (up to ASIL D)
• Accelerates ISO 26262 certification for safety-critical automotive SoCs
Broadest & Most Advanced IP Portfolio

• 25 years of investment & commitment
• #2 IP provider worldwide
• Leader in foundation IP
• Leader in interface IP
• Growing processor and security IP portfolios

Increase productivity and reduce design risk with high-quality Synopsys IP
(Portfolio: interface IP, foundation IP, processor IP, security IP, custom logic, other IP)
Summary

• Transformers lead to state-of-the-art results for vision and speech, and have enabled the rise of generative AI
• Generative AI models can run on NPUs designed for transformers
  – Moving quickly into the embedded space (<10B parameters)
  – Suffer bandwidth bottlenecks due to large parameter size
  – INT4 and MoE-based approaches (like DeepSeek) reduce the memory impact
• The NPX6 NPU was designed for transformers and supports GenAI efficiently
  – Silicon-proven, scalable solution (includes automotive versions)
  – Enhanced NPX6 NPU IP available now

NPX6-64K layout (128 dense TOPS at 1 GHz)
Questions?

• Visit the Synopsys booth: #717
• Check out the demos:
  o Synopsys NN performance model analysis with Platform Architect
  o Visionary.ai real-time video denoiser
  o ADAS NPU algorithm deployment on working silicon

For more information, please visit: www.synopsys.com