© 2017 Arm Limited
Peter Greenhalgh
VP and GM of Central Technology
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing

Technology innovations of 2013
• Drones
• Wearable technology
• Smartwatch
• 3D printing
• Voice recognition
• Social media

Looking ahead from edge to cloud
The future requires a new approach to CPU design
• Safe and autonomous
• Hyper-efficient
• Secure private compute
• Cortex beyond mobile
• Mixed reality

Arm DynamIQ
Rearchitecting the compute experience
• Multi-core redefined for broad market
• Massive system performance uplift
• More intelligent systems

Innovating for the scalable future
Expanding Arm technology processor architecture for broad market
2013
• Up to 8 CPUs
• ‘Octacore’ smartphones
• Dual cluster
• Heterogeneous processing
2017
• DynamIQ cluster
• Dynamic flexibility
• Nearly “Unlimited” design spectrum
• Covers all existing use cases
Key Arm technologies
• Arm AMBA
• Arm big.LITTLE
• Arm CoreLink
• Arm TrustZone
• Arm NEON

DSU – Broadening the reach of technology

DynamIQ: New cluster design for new cores
DynamIQ big.LITTLE systems:
• Greater product differentiation and scalability
• Improved energy efficiency and performance
• SW compatibility with Energy Aware Scheduling (EAS)
Private L2 and shared L3 caches
• Local cache close to processors
• L3 cache shared between all cores
DynamIQ Shared Unit (DSU)
• Contains L3, Snoop Control Unit (SCU) and all cluster interfaces
Example DynamIQ big.LITTLE configurations
• 1b+2L, 1b+3L, 1b+4L, 1b+7L, 2b+6L, 4b+4L
[Figure: Cortex-A75 and Cortex-A55 32b/64b cores, each with a private L2 cache, connected through async bridges to the DynamIQ Shared Unit (DSU), which holds the SCU, shared L3 cache, ACP, peripheral port and AMBA4 ACE interface]
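
The configuration labels on this slide (1b+2L through 4b+4L) follow a simple "big + LITTLE" pattern. The sketch below parses such a label and checks it against a cluster size. The limit of eight cores is an assumption taken from the Core0 to Core7 diagrams later in the deck, and `parse_config` is a hypothetical helper, not an Arm tool.

```python
import re

# Assumption: one DynamIQ cluster offers core slots Core0..Core7.
MAX_CORES = 8

# Labels as they appear on the slide.
SLIDE_CONFIGS = ["1b+2L", "1b+3L", "1b+4L", "1b+7L", "2b+6L", "4b+4L"]


def parse_config(label):
    """Parse a label such as '1b+7L' into (big, little) core counts."""
    match = re.fullmatch(r"(\d+)b\+(\d+)L", label)
    if match is None:
        raise ValueError(f"not a big.LITTLE label: {label!r}")
    big, little = int(match.group(1)), int(match.group(2))
    if big + little > MAX_CORES:
        raise ValueError(f"{label} needs more than {MAX_CORES} cores")
    return big, little


for label in SLIDE_CONFIGS:
    big, little = parse_config(label)
    print(f"{label}: {big} big + {little} LITTLE = {big + little} cores")
```

Every label listed on the slide fits within the assumed eight-core cluster; three of them (1b+7L, 2b+6L, 4b+4L) use all eight slots.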

DynamIQ Shared Unit (DSU)
• Streamlines traffic across bridges
• Advanced power management features
• Latency and bandwidth optimizations
• Support for multiple performance domains
• Scalable interfaces for edge to cloud applications
• Supports large amounts of local memory
• Low latency interfaces for closely coupled accelerators
[Figure: DynamIQ cluster with Core 0 to Core 7 connected through asynchronous bridges to the DSU, which contains the snoop filter, power management, L3 cache, bus I/F, and ACP and peripheral port I/F]

Level 3 cache memory system
New memory system for Cortex-A clusters
Integrated snoop filter to improve efficiency
Enabling lower cache latencies
[Figure: DynamIQ cluster with Core0 to Core7, each with L1 and L2 caches, connected through asynchronous bridges to the DSU (snoop filter, power management, L3 cache, bus I/F, ACP and peripheral port I/F)]
Load-to-use cycles*      Cortex-A53  Cortex-A55  Cortex-A73  Cortex-A75
L1 hit                   3           2           3           3
L2 hit                   13          6           19          8
L3 hit                   -           21          -           25
Interconnect boundary    20          21          26          25
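
To make the load-to-use table easier to read, the sketch below computes how many cycles each new core saves over its predecessor at a given cache level. The numbers are copied from the table on this slide; the dictionary layout and function names are illustrative only.

```python
# Load-to-use latencies in cycles, copied from the table on this slide.
# None marks the cores that have no L3 entry in the table.
LATENCY = {
    "Cortex-A53": {"L1": 3, "L2": 13, "L3": None, "interconnect": 20},
    "Cortex-A55": {"L1": 2, "L2": 6, "L3": 21, "interconnect": 21},
    "Cortex-A73": {"L1": 3, "L2": 19, "L3": None, "interconnect": 26},
    "Cortex-A75": {"L1": 3, "L2": 8, "L3": 25, "interconnect": 25},
}


def cycles_saved(old, new, level):
    """Cycles saved at one cache level when moving from `old` to `new`."""
    return LATENCY[old][level] - LATENCY[new][level]


def percent_reduction(old, new, level):
    """Latency reduction at one cache level, as a percentage of `old`."""
    return 100.0 * cycles_saved(old, new, level) / LATENCY[old][level]


for old, new in [("Cortex-A53", "Cortex-A55"), ("Cortex-A73", "Cortex-A75")]:
    for level in ("L1", "L2"):
        print(f"{old} -> {new}, {level} hit: "
              f"{cycles_saved(old, new, level)} cycles saved "
              f"({percent_reduction(old, new, level):.1f}%)")
```

On these figures the largest change is at L2: 13 to 6 cycles for the LITTLE core and 19 to 8 cycles for the big core.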

Level 3 cache partition
Infrastructure
• Process 1 = data plane
• Process 2 = control plane
• Packet processing data sent through low latency ACP interface
[Figure: Example configuration with two Core groups in a DynamIQ cluster. Core0 to Core7 are split between Core group 1 (Process 1) and Core group 2 (Process 2). The L3 cache is partitioned into Group 1, Group 2 and a partition reserved for external accelerators via ACP; sensors or I/O agents connect through the ACP]

Level 3 cache partition
Automotive
• Each process could represent an independent ADAS algorithm
• Sensors linked through low latency ACP interface
[Figure: Example configuration with four Core groups in a DynamIQ cluster. Core group 1 (Process 1): Core0, Core1. Core group 2 (Process 2): Core2, Core3. Core group 3 (Process 3): Core4, Core5. Core group 4 (Process 4): Core6, Core7. The L3 cache is partitioned into Group 1,2, Group 3, Group 4 and a partition reserved for external accelerators via ACP; sensors or I/O agents connect through the ACP]
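
The four-group automotive example can be written down as a small data model: four core groups of two cores each, with groups 1 and 2 sharing one L3 partition, groups 3 and 4 each having their own, and a fourth partition reserved for ACP traffic. This is only a sketch of the layout shown in the figure; the partition names are hypothetical and no DSU register programming is modelled.

```python
# Core groups as labelled in the four-group example on this slide.
CORE_GROUPS = {
    "group1": {"process": "Process 1", "cores": [0, 1]},
    "group2": {"process": "Process 2", "cores": [2, 3]},
    "group3": {"process": "Process 3", "cores": [4, 5]},
    "group4": {"process": "Process 4", "cores": [6, 7]},
}

# L3 partitions as labelled in the figure; the key names are hypothetical.
L3_PARTITIONS = {
    "partition_groups_1_2": ["group1", "group2"],
    "partition_group_3": ["group3"],
    "partition_group_4": ["group4"],
    "partition_acp": [],  # reserved for external accelerators via ACP
}


def partition_of(core):
    """Return the name of the L3 partition used by the given core number."""
    for group, info in CORE_GROUPS.items():
        if core in info["cores"]:
            for partition, groups in L3_PARTITIONS.items():
                if group in groups:
                    return partition
    raise ValueError(f"core {core} is not assigned to any group")


def layout_is_consistent():
    """Check that cores 0..7 and all groups are each assigned exactly once."""
    cores = sorted(c for info in CORE_GROUPS.values() for c in info["cores"])
    groups = sorted(g for members in L3_PARTITIONS.values() for g in members)
    return cores == list(range(8)) and groups == sorted(CORE_GROUPS)


for core in range(8):
    print(f"Core{core} -> {partition_of(core)}")
```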

Increasing performance through cache stashing
Enables reads/writes into the shared L3 cache or per-core L2 cache
Allows closely coupled accelerators and I/O agents to gain access to core memory
AMBA 5 CHI and Accelerator Coherency Port (ACP) can be used for cache stashing
More throughput with Peripheral Port (PP) for acceleration, network, storage use-cases
[Figure: An accelerator or I/O agent attached to CoreLink CMN-600 (Agile System Cache, two DMC-620 controllers with DDR4) stashes critical data to any cache level of the Cortex-A clusters: the L3 cache or a per-CPU L2 cache]
[Figure: DynamIQ cluster with Core0 to Core7, each with L1 and L2 caches, connected through asynchronous bridges to the DSU (snoop filter, power management, L3 cache, bus I/F, ACP and peripheral port I/F)]

Increasing performance through tight integration
Offload acceleration
Example application: Offload crypto acceleration
(1) Configure registers for task
(2) Fetches data from Core memory
(3) Carries out acceleration
(4) Writes result into Core memory
I/O processing
Example application: Packet processing in network systems
(1) Writes data into Core memory
(2) Carries out computation
(3) Processing completed
(4) Reads result from Core memory or sends data for onward processing
[Figure: in both cases the accelerator or I/O agent connects to the DynamIQ cluster through the ACP and PP]
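
The four numbered steps of the offload acceleration flow can be walked through with a toy model. Everything in this sketch is hypothetical: `ToyAccelerator` is not an Arm interface, a `bytearray` stands in for core memory, and a single-byte XOR stands in for the crypto operation purely to give the accelerator something to compute.

```python
class ToyAccelerator:
    """Hypothetical accelerator; XOR stands in for a real crypto operation."""

    def __init__(self, core_memory):
        self.mem = core_memory  # shared view of core memory
        self.regs = {}
        self.log = []

    def configure(self, src, dst, length, key):
        # (1) The core configures the accelerator registers for the task.
        self.regs = {"src": src, "dst": dst, "len": length, "key": key}
        self.log.append("configure")

    def run(self):
        r = self.regs
        # (2) The accelerator fetches data from core memory.
        data = bytes(self.mem[r["src"]:r["src"] + r["len"]])
        self.log.append("fetch")
        # (3) The accelerator carries out the acceleration (toy XOR here).
        result = bytes(b ^ r["key"] for b in data)
        self.log.append("compute")
        # (4) The accelerator writes the result into core memory.
        self.mem[r["dst"]:r["dst"] + r["len"]] = result
        self.log.append("write_back")


core_memory = bytearray(64)
core_memory[0:5] = b"hello"

acc = ToyAccelerator(core_memory)
acc.configure(src=0, dst=32, length=5, key=0x5A)
acc.run()
print(acc.log)
print(core_memory[32:37].hex())
```

The point of the model is the ordering: the core only touches the accelerator in step (1), and all data movement in steps (2) and (4) happens against core memory rather than through the core.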

Automotive and industrial safety and reliability
ADAS and IVI compute performance
• DynamIQ provides performance required for autonomous cars
• Faster responsiveness
DynamIQ: Functional Safety
• Following ASIL D systematic flow
• Provides higher safety integrity
Industry’s broadest functional safety capable CPU portfolio
[Figure: Autonomous system pipeline (Sense, Perceive, Decide, Actuate) mapped onto an SoC with sensors, application cores (Cortex-A CPUs with private L2 caches and a shared L3 cache), a safety island, Cortex-M and Cortex-R cores, and a lock-step core]

Cortex-A75 – Increasing Performance
Cortex-A55 – Improving Efficiency

New levels of performance for smart solutions
All comparisons at ISO process and frequency
Benchmark         Cortex-A75 (baseline Cortex-A73)   Cortex-A55 (baseline Cortex-A53)
SPECINT2006       1.22x                              1.21x
SPECFP2006        1.33x                              1.42x
LMBench memcpy    1.16x                              1.97x
Octane 2.0        1.48x                              1.14x
Geekbench v4      1.34x                              1.22x
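
One conventional way to summarise a set of per-benchmark speedups is the geometric mean. The sketch below applies it to the figures on this slide. It assumes the two runs of numbers map to the cores as follows: the set containing the 1.97x LMBench memcpy figure belongs to Cortex-A55 (baseline Cortex-A53) and the set containing the 1.48x Octane figure belongs to Cortex-A75 (baseline Cortex-A73). The slide itself does not quote a combined figure.

```python
import math

# Speedups from the slide; the assignment of each set to a core is an
# assumption explained in the text above.
SPEEDUPS = {
    "Cortex-A75 vs Cortex-A73": {
        "SPECINT2006": 1.22,
        "SPECFP2006": 1.33,
        "LMBench memcpy": 1.16,
        "Octane 2.0": 1.48,
        "Geekbench v4": 1.34,
    },
    "Cortex-A55 vs Cortex-A53": {
        "SPECINT2006": 1.21,
        "SPECFP2006": 1.42,
        "LMBench memcpy": 1.97,
        "Octane 2.0": 1.14,
        "Geekbench v4": 1.22,
    },
}


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


for comparison, results in SPEEDUPS.items():
    print(f"{comparison}: geometric mean {geomean(results.values()):.2f}x")
```

A geometric mean is used rather than an arithmetic one because the inputs are ratios; a single outlier such as the memcpy result then has less influence on the summary.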

Architecture and Pipelines
Common features
• Armv8.2-A Architecture
• DynamIQ big.LITTLE
Cortex-A75 – performance focussed
• Out-of-Order, 11-13 stage integer pipeline
Cortex-A55 – efficiency focussed
• In-order, 8 stage integer pipeline
[Figure: Cortex-A55 pipeline: instruction fetch, decode, issue, then ALU/INT (MAC), ALU/INT (DIV), branch, AGU load, AGU store, NEON/FP F0 and NEON/FP F1 pipes, writeback]
[Figure: Cortex-A75 pipeline: instruction fetch, instruction queue, decode, rename, dispatch, issue queues (labelled IsQ with depths 12, 12, 8, 8, 20, 8, 8, 8), then Int I0 (MUL), Int I1 (DIV), Branch B, two AGU LD/ST, NEON/FP F0, NEON/FP F1 and NEON/FP store pipes, writeback]

Instruction fetch
[Figure: instruction fetch block diagram: AGU, branch predictor, conditional predictor, indirect predictor, L1 instruction cache, fill buffer, instruction extraction and parsing, instruction queue]
Common features
• 4-way set associative
• Virtually indexed, physically tagged (VIPT)
• Decoupled from Cores through instruction queue
Cortex-A75
• 64KB
• 4-wide instruction fetch
Cortex-A55
• 16KB / 32KB / 64KB
• 2-wide instruction fetch
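
The cache sizes and 4-way associativity above determine how many sets the L1 instruction cache has and how many index bits it needs. The sketch below works that out for each size. Two inputs are assumptions that do not appear on the slide: a 64-byte cache line and a 4KB page. Under those assumptions it also reports how many index bits sit above the page offset, which is where a virtually indexed cache uses address bits that translation can change.

```python
LINE_BYTES = 64    # assumption: 64-byte cache lines (not stated on the slide)
PAGE_BYTES = 4096  # assumption: 4KB pages
WAYS = 4           # from the slide: 4-way set associative


def geometry(cache_kb):
    """Derive set count and address-bit split for one cache size in KB."""
    size = cache_kb * 1024
    sets = size // (WAYS * LINE_BYTES)
    offset_bits = LINE_BYTES.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    page_bits = PAGE_BYTES.bit_length() - 1    # log2 of the page size
    return {
        "sets": sets,
        "way_bytes": size // WAYS,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        # Index bits above the page offset come from the virtual address only.
        "virtual_only_index_bits": max(0, offset_bits + index_bits - page_bits),
    }


for kb in (16, 32, 64):
    g = geometry(kb)
    print(f"{kb}KB: {g['sets']} sets, {g['index_bits']} index bits, "
          f"{g['virtual_only_index_bits']} above the page offset")
```

With these assumptions a 16KB cache is indexed entirely by page-offset bits, while the 64KB configuration uses two index bits that exist only in the virtual address.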

Branch prediction
[Figure: instruction fetch block diagram: AGU, branch predictor, conditional predictor, indirect predictor, L1 instruction cache, fill buffer, instruction extraction and parsing, instruction queue]
Cortex-A75
• Fine-tuned 0-cycle prediction
• State of the art, mobile focussed, table based conditional prediction
Cortex-A55
• Brand new 0-cycle predictors
• New main conditional predictor - Neural network based
• New loop predictors
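
The deck describes the Cortex-A55 conditional predictor only as "neural network based" and gives no further detail. For readers unfamiliar with the idea, the sketch below shows a textbook perceptron branch predictor in the style published by Jiménez and Lin. It is a stand-in chosen for illustration, not a description of the Cortex-A55 design; the table size and history length are arbitrary.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative, not Arm's design)."""

    def __init__(self, history_len=8, table_size=64):
        self.table_size = table_size
        # Training threshold suggested in the perceptron predictor literature.
        self.threshold = int(1.93 * history_len + 14)
        # One weight vector per table entry: a bias plus one weight per
        # history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history: +1 for taken, -1 for not taken.
        self.history = [1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the weighted sum is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the real outcome, then shift it into the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when confidence is below the threshold.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, hi in enumerate(self.history, start=1):
                w[i] += t * hi
        self.history = self.history[1:] + [t]


predictor = PerceptronPredictor()
hits = []
for i in range(200):
    taken = (i % 2 == 0)  # a branch that alternates taken / not taken
    hits.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)

late_hits = sum(hits[100:])
print(f"correct predictions in the last 100 branches: {late_hits}")
```

The alternating branch is a pattern a simple two-bit counter handles poorly, whereas the perceptron learns it from the correlation with recent history.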

Cortex-A75: Datapaths
3-way superscalar high-performance pipeline
• Single-cycle decode with instruction fusing and micro-ops
7 independent high-performance issue queues
• 2x Load/Store, 2x NEON/FPU, 1x Branch and 2x Integer core
Increased capacity to sustain operation under L1 miss / L2 hit
• 12 entries for integer core to maximise in-flight instructions and out-of-order capabilities
• 8 entries for Load/Store and NEON/FPU
[Figure: Cortex-A75 block diagram: instruction fetch with branch prediction and 64k I-cache; decode, rename, dispatch, issue; Arm register file and D.E. register file; ALUs, iDIV, MAC; AGUs; load/store with 64k D-cache, STB and main TLB; Advanced NEON and floating point; writeback; private L2 cache]

Cortex-A55: Datapaths
Dual issue of loads and stores
Improved latency for forwarding ALU results to the AGU
• Reduced by one cycle for many common ALU operations
Reduced L1 cache load-to-use latency for pointer chasing to two cycles
[Figure: Cortex-A55 datapath: decode, integer register file, NEON-FP regfile, two ALU pipes (shift, ALU), integer pipe (mult acc), divide pipe, NEON pipe, load pipe with AGU, store pipe, data cache address and data cache output; one block is marked x2]

L1 memory system
Common features
• 4-way set associative
• VIPT with PIPT programmer’s view
• Improved prefetchers
Cortex-A75
• 64KB
• Wider load-store than Cortex-A73
• Support Read-after-Write OoO with filtering
Cortex-A55
• 16KB / 32KB / 64KB
• Improved store buffer bandwidth to L1
• Larger 16-entry L1-TLB
[Figure: memory system blocks: store buffer, L1 data cache, prefetcher, L1 TLB, L2 TLB, L2 cache]

L2 memory system
[Figure: memory system blocks: store buffer, L1 data cache, prefetcher, L1 TLB, L2 TLB, L2 cache]
Common features
• Private L2 cache in each Core
• Running at Core speed
• Exclusive data cache
• Cache stashing into the L2
• Non-blocking 1024-entry TLB for hit-under-miss
Cortex-A75
• 256KB / 512KB
Cortex-A55
• 0KB / 64KB / 128KB / 256KB
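
A quick way to put the 1024-entry TLB in context is its reach: the amount of memory it can map without a miss. The sketch below computes that for the three Armv8-A translation granules, under the simplifying assumption that every entry maps exactly one page of the granule size. Larger block mappings would extend the reach and are not modelled here.

```python
TLB_ENTRIES = 1024  # from the slide


def tlb_reach_bytes(page_bytes, entries=TLB_ENTRIES):
    """Memory covered when every TLB entry maps one page of `page_bytes`."""
    return entries * page_bytes


# Armv8-A translation granules: 4KB, 16KB and 64KB.
for page_kb in (4, 16, 64):
    reach_mib = tlb_reach_bytes(page_kb * 1024) // (1024 * 1024)
    print(f"{page_kb}KB pages: reach {reach_mib} MiB")
```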

Next-generation features
Dot product and half-precision float for AI/ML processing
Virtualization Host Extensions (VHE) offering Type-2 hypervisor (KVM) performance improvements
Cache stashing and atomic operations improve multicore networking performance and latency
Cache clean to persistence to support storage class memory
Infrastructure class RAS enhancement including data poisoning and improved error management
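
The dot product feature mentioned above refers to instructions that multiply groups of four 8-bit values and accumulate the sum into a 32-bit lane. The sketch below emulates the signed form over a 128-bit vector in plain Python, to show the arithmetic only. The function names are hypothetical, and real code would use the instruction through a compiler or intrinsics rather than a loop like this.

```python
def wrap_int32(value):
    """Wrap an integer into the signed 32-bit range."""
    return (value + 2**31) % 2**32 - 2**31


def sdot(acc, a, b):
    """Emulate a signed 8-bit dot product over one 128-bit vector.

    acc holds four 32-bit accumulators; a and b each hold sixteen signed
    8-bit values. Lane i accumulates the products of elements 4*i..4*i+3.
    """
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulators and 16 elements per input")
    if any(not -128 <= x <= 127 for x in list(a) + list(b)):
        raise ValueError("inputs must fit in signed 8 bits")
    out = []
    for lane in range(4):
        total = acc[lane]
        for k in range(4):
            total += a[4 * lane + k] * b[4 * lane + k]
        out.append(wrap_int32(total))
    return out


result = sdot([0, 0, 0, 0], list(range(16)), [1] * 16)
print(result)
```

Sixteen multiplies and sixteen additions collapse into one operation per vector, which is why this kind of instruction suits the 8-bit matrix arithmetic common in ML inference.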

Innovating for the scalable future
2013-2017: The nature of compute is changing the landscape
Expanding Arm technologies for broad market applicability
New cluster design with new DynamIQ cores:
• Cortex-A75: Breakthrough performance
• Cortex-A55: Efficiency redefined
Functional safety for industrial and automotive applications
New features expanding microarchitecture capabilities:
• DynamIQ Shared Unit, new cache features, new branch prediction

Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
