The ARM Architecture
ARM
•Introduction and processor modes
•Instruction Set Architecture – I
•Instruction Set Architecture – II
•Pipelining in ARM
ARM
• ARM: Advanced RISC Machines
• Most widely used 32-bit RISC instruction set
  architecture
• The relative simplicity makes it suitable for low power
  devices
• Processor families include ARM7, ARM9, ARM11 and Cortex
• Accounts for approximately 90% of all embedded 32-bit RISC
  processors
• Used extensively in consumer electronics,
  including PDAs, mobile phones, digital media and music
  players, hand-held game consoles, calculators and
  computer peripherals such as hard drives and routers.
Product Code Description
• M: Multiplier
  The ARM processor has a hardware multiplier unit for
  multiplication
• I: EmbeddedICE macrocell
  Hardware circuit that provides on-chip debug support
  (breakpoints and watchpoints). Used in advanced debugging.
• E: Enhanced Instruction Set
• J: Java acceleration by Jazelle mode
  Hardware circuit used for running Java byte code
• F: Vector Floating Point
  Hardware implementation of floating-point operations.
• S: Synthesizable version
  The core is supplied as a soft (synthesizable) processor
  core, so its implementation can be modified.
Example
• ARM7TDMI
 This is an ARM7 family processor where T = Thumb
 instruction set, D = Debug unit, M = enhanced Multiplier
 (64-bit result), I = EmbeddedICE macrocell.
• ARM946E-S
  1. ARM9xx core
  2. Enhanced Instruction set
  3. Synthesizable
ARM
• ARM has 3 instruction set states
   1. 32-bit ARM instruction set
   2. 16-bit Thumb instruction set
   3. 8-bit Jazelle instruction set
• ARM – 32 bit Load/Store architecture with every instruction
  being conditional.
• Thumb- 16 bit with only branch instructions being conditional
  and only half of the registers used
• Jazelle- Allows Java byte code to be directly executed in ARM
  architecture. Improves performance by 5x-10x
ARM- Processor Modes
• Seven basic operating modes exist:
   1. User: Unprivileged mode under which most tasks run
   2. FIQ: Entered when a high priority interrupt is raised
   3. IRQ: Entered when a low priority interrupt is raised
   4. Supervisor: Entered on reset and when a software
      Interrupt instruction is executed
   5. Abort: Used to handle memory access violations
   6. Undef: Used to handle undefined instructions
   7. System: Privileged mode using the same registers as user
      mode.
Register Organization Summary
 Mode     Registers banked for this mode                   User mode registers also visible
 User     r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr       -
 FIQ      r8-r12, r13 (sp), r14 (lr), spsr                 r0-r7, r15 (pc), cpsr
 IRQ      r13 (sp), r14 (lr), spsr                         r0-r12, r15 (pc), cpsr
 SVC      r13 (sp), r14 (lr), spsr                         r0-r12, r15 (pc), cpsr
 Undef    r13 (sp), r14 (lr), spsr                         r0-r12, r15 (pc), cpsr
 Abort    r13 (sp), r14 (lr), spsr                         r0-r12, r15 (pc), cpsr

 Thumb state low registers: r0-r7. Thumb state high registers: r8-r15.

Note: System mode uses the User mode register set
ARM- The Registers
• ARM has 37 registers, all of which are 32 bits long.
    –   1 dedicated program counter
    –   1 dedicated current program status register
    –   5 dedicated saved program status registers
    –   30 general purpose registers

• The current processor mode governs which of several banks is
  accessible. Each mode can access
    –   a particular set of r0-r12 registers
    –   a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
    –   the program counter, r15(pc)
    –   the current program status register, cpsr

   Privileged modes (except System) can also access
    – a particular spsr (saved program status register)
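The register count above can be checked with a small model. This is only a sketch in Python, not ARM code: the function names and the naming scheme for banked registers (for example "r13_irq") are made up for illustration, and the banking rules are taken from the register organization summary.

```python
# Hypothetical model of the ARM register banks (names are illustrative only).
MODES = ["user", "fiq", "irq", "svc", "abort", "undef"]  # System shares the User bank

def physical_register(mode, n):
    """Return a label for the physical register behind rN in the given mode."""
    if n == 15:
        return "r15"                       # a single program counter for all modes
    if mode == "fiq" and 8 <= n <= 14:
        return f"r{n}_fiq"                 # FIQ has its own r8-r14
    if mode in ("irq", "svc", "abort", "undef") and n in (13, 14):
        return f"r{n}_{mode}"              # other exception modes bank only sp and lr
    return f"r{n}"                         # otherwise the User mode register is used

def all_registers():
    """Collect every distinct physical register across all modes."""
    regs = {"cpsr"}
    for mode in MODES:
        for n in range(16):
            regs.add(physical_register(mode, n))
        if mode != "user":
            regs.add(f"spsr_{mode}")       # one saved status register per exception mode
    return regs

registers = all_registers()
general_purpose = {r for r in registers if r.startswith("r") and r != "r15"}
```

Counting the set gives 30 general purpose registers plus the pc, the cpsr and five spsr registers, which matches the total of 37.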
Program Status Registers
    Bit:    31  30  29  28  27  26-25  24  23-8        7   6   5   4-0
    Field:  N   Z   C   V   Q   -      J   Undefined   I   F   T   mode
    Byte fields: f = bits [31:24], s = bits [23:16], x = bits [15:8], c = bits [7:0]

• Condition code flags
    – N = Negative result from ALU
    – Z = Zero result from ALU
    – C = ALU operation Carried out
    – V = ALU operation overflowed
• Sticky Overflow flag - Q flag
    – Architecture 5TE/J only
    – Indicates if saturation has occurred
• J bit
    – Architecture 5TEJ only
    – J = 1: Processor in Jazelle state
• Interrupt Disable bits
    – I = 1: Disables the IRQ.
    – F = 1: Disables the FIQ.
• T Bit
    – Architecture xT only
    – T = 0: Processor in ARM state
    – T = 1: Processor in Thumb state
• Mode bits
    – Specify the processor mode
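To make the bit layout concrete, here is a minimal Python sketch that unpacks a status register value into the fields listed above. The function name is hypothetical, and the 5-bit mode encodings are not given on the slide; they are filled in here as an assumption from the standard ARM architecture documentation.

```python
# Mode bit encodings are an assumption added for this sketch (not on the slide).
MODE_NAMES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "Supervisor",
    0b10111: "Abort", 0b11011: "Undef", 0b11111: "System",
}

def decode_psr(value):
    """Split a 32-bit CPSR/SPSR value into its named fields."""
    def bit(n):
        return (value >> n) & 1
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),  # condition flags
        "Q": bit(27),                                            # sticky overflow
        "J": bit(24),                                            # Jazelle state
        "I": bit(7), "F": bit(6),                                # interrupt disables
        "T": bit(5),                                             # Thumb state
        "mode": MODE_NAMES.get(value & 0x1F, "unknown"),         # bits [4:0]
    }

fields = decode_psr(0x600000D3)
```

For the value 0x600000D3 this yields Z and C set, both interrupts disabled, ARM state, and Supervisor mode.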
Program Counter (r15)
• When the processor is executing in ARM state:
    – All instructions are 32 bits wide
    – All instructions must be word aligned
    – Therefore the PC value is stored in bits [31:2] with bits [1:0] undefined (as
      instruction cannot be halfword or byte aligned).

• When the processor is executing in Thumb state:
    – All instructions are 16 bits wide
    – All instructions must be halfword aligned
    – Therefore the PC value is stored in bits [31:1] with bit [0] undefined (as
      instruction cannot be byte aligned).

• When the processor is executing in Jazelle state:
    – All instructions are 8 bits wide
    – Processor performs a word access to read 4 instructions at once
Exception Handling
• When an exception occurs, the ARM:
  – Copies CPSR into SPSR_<mode>
  – Sets appropriate CPSR bits
     • Change to ARM state
     • Change to exception mode
     • Disable interrupts (if appropriate)
  – Stores the return address in LR_<mode>
  – Sets PC to vector address
• To return, exception handler needs to:
  – Restore CPSR from SPSR_<mode>
  – Restore PC from LR_<mode>
  This can only be done in ARM state.

  Vector Table
    Address   Exception
    0x1C      FIQ
    0x18      IRQ
    0x14      (Reserved)
    0x10      Data Abort
    0x0C      Prefetch Abort
    0x08      Software Interrupt
    0x04      Undefined Instruction
    0x00      Reset
  Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices.
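The entry and return steps can be walked through with a toy model. This Python sketch is heavily simplified: the `cpu` dictionary, the function names, the mapping from exception to mode, and the rule that only FIQ and Reset also disable FIQ are assumptions added for illustration. It also ignores the per-exception adjustment of the return address.

```python
# Vector addresses come from the vector table; everything else is a sketch.
VECTORS = {
    "reset": 0x00, "undef": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
    "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C,
}
# Assumed mapping of each exception to the mode it enters.
ENTRY_MODE = {
    "reset": "svc", "undef": "undef", "swi": "svc", "prefetch_abort": "abort",
    "data_abort": "abort", "irq": "irq", "fiq": "fiq",
}

def enter_exception(cpu, kind, return_address):
    mode = ENTRY_MODE[kind]
    cpu["spsr_" + mode] = dict(cpu["cpsr"])      # copy CPSR into SPSR_<mode>
    cpu["cpsr"].update(T=0, mode=mode, I=1)      # ARM state, new mode, IRQ disabled
    if kind in ("fiq", "reset"):
        cpu["cpsr"]["F"] = 1                     # assumed: these also disable FIQ
    cpu["lr_" + mode] = return_address           # store return address in LR_<mode>
    cpu["pc"] = VECTORS[kind]                    # jump to the vector address

def return_from_exception(cpu):
    mode = cpu["cpsr"]["mode"]
    cpu["pc"] = cpu["lr_" + mode]                # restore PC from LR_<mode>
    cpu["cpsr"] = dict(cpu["spsr_" + mode])      # restore CPSR from SPSR_<mode>
```

Taking an IRQ from Thumb-state user code and then returning should leave the processor back in user mode, in Thumb state, at the saved address.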
Development of the ARM Architecture
• Versions 1, 2, 3: early ARM architectures
• Version 4
   – Halfword and signed halfword / byte support
   – System mode
   – Cores: SA-110, SA-1110
• Version 4T
   – Thumb instruction set
   – Cores: ARM7TDMI, ARM720T, ARM9TDMI, ARM940T
• Version 5TE
   – Improved ARM/Thumb interworking
   – CLZ
   – Saturated maths
   – DSP multiply-accumulate instructions
   – Cores: ARM1020E, XScale, ARM9E-S, ARM966E-S
• Version 5TEJ
   – Jazelle: Java bytecode execution
   – Cores: ARM9EJ-S, ARM926EJ-S, ARM7EJ-S, ARM1026EJ-S
• Version 6
   – SIMD instructions
   – Multi-processing
   – V6 memory architecture (VMSA)
   – Unaligned data support
   – Cores: ARM1136EJ-S
The ARM Instruction Set part1
Main features of the ARM Instruction Set
•   All instructions are 32 bits long.
•   Most instructions execute in a single cycle.
•   Every instruction can be conditionally executed.
•   A load/store architecture
    – Data processing instructions act only on registers
       • Three operand format
       • Combined ALU and shifter for high speed bit manipulation
    – Specific memory access instructions with powerful
      auto-indexing addressing modes.
Conditional Execution
• Most instruction sets only allow branches to be executed
  conditionally by postfixing them with the appropriate condition
  code field.
• However, by reusing the condition evaluation hardware, ARM
  effectively increases the number of instructions.
   – All instructions contain a condition field which determines whether
     the CPU will execute them.
   – Non-executed instructions soak up 1 cycle.
       • Still have to complete cycle so as to allow fetching and decoding of following
         instructions.
• This removes the need for many branches, which stall the pipeline
  (3 cycles to refill).
   – Allows very dense in-line code, without branches.
   – The time penalty of not executing several conditional instructions is
     frequently less than the overhead of the branch
     or subroutine call that would otherwise be needed.
The Condition Field
  Bits [31:28] of every ARM instruction hold the condition field (Cond).

  Code   Suffix    Flags tested                                      Meaning
  0000   EQ        Z set                                             equal
  0001   NE        Z clear                                           not equal
  0010   HS / CS   C set                                             unsigned higher or same
  0011   LO / CC   C clear                                           unsigned lower
  0100   MI        N set                                             negative
  0101   PL        N clear                                           positive or zero
  0110   VS        V set                                             overflow
  0111   VC        V clear                                           no overflow
  1000   HI        C set and Z clear                                 unsigned higher
  1001   LS        C clear or Z set                                  unsigned lower or same
  1010   GE        N set and V set, or N clear and V clear           signed >=
  1011   LT        N set and V clear, or N clear and V set           signed <
  1100   GT        Z clear, and either N set and V set,
                   or N clear and V clear                            signed >
  1101   LE        Z set, or N set and V clear,
                   or N clear and V set                              signed <=
  1110   AL        always
  1111   NV        reserved
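The condition codes can be read as a lookup from the 4-bit field to a test on the N, Z, C and V flags. The following Python sketch shows one way to express that; the function name is hypothetical, and the signed comparisons are written in the compact form N == V (GE) and N != V (LT).

```python
def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with condition field `cond` would execute.

    n, z, c, v are the current flag values as booleans.
    The reserved code 0b1111 (NV) is not handled and raises KeyError.
    """
    checks = {
        0b0000: z,                    # EQ
        0b0001: not z,                # NE
        0b0010: c,                    # HS / CS
        0b0011: not c,                # LO / CC
        0b0100: n,                    # MI
        0b0101: not n,                # PL
        0b0110: v,                    # VS
        0b0111: not v,                # VC
        0b1000: c and not z,          # HI
        0b1001: (not c) or z,         # LS
        0b1010: n == v,               # GE
        0b1011: n != v,               # LT
        0b1100: (not z) and n == v,   # GT
        0b1101: z or n != v,          # LE
        0b1110: True,                 # AL
    }
    return bool(checks[cond])
```

For example, after comparing 5 with 3 the flags are N clear, Z clear, C set, V clear, so GT, GE, HI and NE pass while LT, LE and EQ do not.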
Using and updating the Condition Field
• To execute an instruction conditionally, simply postfix it with the
  appropriate condition:
    – For example an add instruction takes the form:
        • ADD r0,r1,r2 ; r0 = r1 + r2 (ADDAL)
    – To execute this only if the zero flag is set:
        • ADDEQ r0,r1,r2             ; If zero flag set then…
                                     ; ... r0 = r1 + r2
• By default, data processing operations do not affect the condition
  flags (apart from the comparisons where this is the only effect). To
  cause the condition flags to be updated, the S bit of the instruction
  needs to be set by postfixing the instruction (and any condition
  code) with an “S”.
    – For example to add two numbers and set the condition flags:
        • ADDS r0,r1,r2              ; r0 = r1 + r2
                                     ; ... and set flags
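As an illustration of what "set flags" means for ADDS, the sketch below computes the 32-bit result together with N, Z, C and V. It is a Python model rather than ARM code, and the function name `adds` is chosen here only to mirror the mnemonic.

```python
MASK32 = 0xFFFFFFFF

def adds(a, b):
    """Model ADDS: return (32-bit result, flags) for a + b."""
    a &= MASK32
    b &= MASK32
    full = a + b                       # unbounded Python integer sum
    result = full & MASK32             # what fits in the destination register
    flags = {
        "N": result >> 31,                                  # top bit of the result
        "Z": int(result == 0),                              # result is zero
        "C": int(full > MASK32),                            # unsigned carry out
        "V": (((a ^ result) & (b ^ result)) >> 31) & 1,     # signed overflow
    }
    return result, flags
```

Adding 1 to 0x7FFFFFFF sets N and V (signed overflow) but not C, while adding 1 to 0xFFFFFFFF gives zero with Z and C set.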
Data processing Instructions
• Largest family of ARM instructions, all sharing the same
  instruction format.
• Contains:
   –   Arithmetic operations
   –   Comparisons (no results - just set condition codes)
   –   Logical operations
   –   Data movement between registers
• Remember, this is a load / store architecture
   – These instructions only work on registers, NOT memory.
• They each perform a specific operation on one or two
  operands.
   – First operand always a register - Rn
   – Second operand sent to the ALU via barrel shifter.
ARM Processor
Data Movement
• Operations are:
   – MOV      operand2
   – MVN      NOT operand2
  Note that these make no use of operand1, i.e. operand1
  is ignored.
• Syntax:
   – <Operation>{<cond>}{S} Rd, Operand2
• Examples:
   – MOV r0, r1
   – MOVS r2, #10
   – MVNEQ r1,#0
Arithmetic Operations
• Operations are:
   –   ADD       operand1 + operand2
   –   ADC       operand1 + operand2 + carry
   –   SUB       operand1 - operand2
   –   SBC       operand1 - operand2 + carry -1
   –   RSB       operand2 - operand1
   –   RSC       operand2 - operand1 + carry - 1
• Syntax:
   – <Operation>{<cond>}{S} Rd, Rn, Operand2
• Examples
   –   ADD r0, r1, r2
   –   SUBGT r3, r3, #1
   –   RSBLES r4, r5, #5
   –   SUB r4,r5,r7,LSR r2    ; Logical right shift R7 by the number in
                              ; the bottom byte of R2, subtract result
                              ; from R5, and put the answer into R4.
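The "+ carry - 1" term in SBC and RSC makes sense once subtraction is seen as addition of the inverted operand. The Python sketch below models that, assuming the usual ARM convention that the carry flag after a subtraction means "no borrow"; that convention and all function names are assumptions added here. The last function shows how a SUBS followed by an SBC forms a 64-bit subtraction.

```python
MASK32 = 0xFFFFFFFF

def sbc(op1, op2, carry):
    """SBC: op1 - op2 + carry - 1, returned as (32-bit result, carry out)."""
    full = (op1 & MASK32) + (~op2 & MASK32) + carry
    return full & MASK32, full >> 32

def sub(op1, op2):
    """SUB is SBC with the carry input forced to 1."""
    return sbc(op1, op2, 1)

def rsb(op1, op2):
    """RSB reverses the operands: op2 - op1."""
    return sub(op2, op1)

def sub64(a, b):
    """64-bit a - b built from a 32-bit SUBS (low words) and SBC (high words)."""
    lo, carry = sub(a & MASK32, b & MASK32)
    hi, _ = sbc(a >> 32, b >> 32, carry)
    return (hi << 32) | lo
```

When the low-word subtraction borrows, its carry out is 0, so the SBC on the high words subtracts one extra, which is exactly the borrow.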
Logical Operations
• Operations are:
   –   AND    operand1 AND operand2
   –   EOR    operand1 EOR operand2
   –   ORR    operand1 OR operand2
   –   BIC    operand1 AND NOT operand2 [ie bit clear]
• Syntax:
   – <Operation>{<cond>}{S} Rd, Rn, Operand2
• Examples:
   – AND      r0, r1, r2
   – BICEQ    r2, r3, #7
   – EORS     r1,r3,r0
Multiplication Instructions
• The basic ARM provides two multiplication instructions.
• Multiply
   – MUL{<cond>}{S} Rd, Rm, Rs            ; Rd = Rm * Rs
• Multiply Accumulate (does addition for free)
   – MLA{<cond>}{S} Rd, Rm, Rs,Rn         ; Rd = (Rm * Rs) + Rn
• Restrictions on use:
   – Rd and Rm cannot be the same register
       • Can be avoided by swapping Rm and Rs around. This works because
         multiplication is commutative.
   – Cannot use PC.
  These will be picked up by the assembler if overlooked.
• Operands can be considered signed or unsigned
   – Up to user to interpret correctly.
• The multiply form of the instruction gives Rd:=Rm*Rs. Rn is
  ignored, and should be set to zero for compatibility with
  possible future upgrades to the instruction set.
Multiplication Implementation
 • The ARM makes use of Booth’s Algorithm to perform integer
   multiplication.
 • On non-M ARMs this operates on 2 bits of Rs at a time.
        – For each pair of bits this takes 1 cycle (plus 1 cycle to start with).
        – However when there are no more 1’s left in Rs, the multiplication will
          early-terminate.
 • Example: Multiply 18 and -1 : Rd = Rm * Rs
      Rm = 18, Rs = -1 (0xFFFFFFFF, all 32 bits set):          17 cycles
      Rm = -1, Rs = 18 (0x00000012, only the low 5 bits used):  4 cycles


 • Note: Compiler does not use early termination criteria to
   decide on which order to place operands.
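The cycle counts in the example follow directly from the rule stated above: one cycle to start, one cycle per pair of Rs bits, and early termination once no 1s remain in Rs. A minimal Python sketch of that rule (the function name is hypothetical, and this models only the counting, not Booth's algorithm itself):

```python
def multiply_cycles(rs):
    """Cycle count for MUL on a non-M ARM, following the rule on the slide."""
    rs &= 0xFFFFFFFF          # treat Rs as a 32-bit pattern
    cycles = 1                # one cycle to start with
    while rs:                 # stop early when no 1s are left in Rs
        rs >>= 2              # two bits of Rs are consumed per cycle
        cycles += 1
    return cycles
```

With Rs = -1 all 16 bit pairs are consumed, giving 17 cycles; with Rs = 18 only three pairs contain 1s, giving 4 cycles. This is why putting the smaller magnitude operand in Rs is faster.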
Booth’s Algorithm
Extended Multiply Instructions
• M variants of ARM cores contain extended multiplication
  hardware. This provides three enhancements:
   – An 8 bit Booth’s Algorithm is used
       • Multiplication is carried out faster (maximum for standard instructions
         is now 5 cycles).
   – Early termination is improved, so that multiplication now completes
     when all remaining bit sets contain
       • all zeroes (as with non-M ARMs), or
       • all ones.
     Thus the previous example would early terminate in 2 cycles in
     both cases.
   – 64 bit results can now be produced from two 32bit operands
       • Higher accuracy.
       • Pair of registers used to store result.
Multiply-Long and Multiply-Accumulate Long
• Instructions are
    – MULL which gives RdHi,RdLo:=Rm*Rs
    – MLAL which gives RdHi,RdLo:=(Rm*Rs)+RdHi,RdLo
• However, the full 64 bits of the result now matter (the lower precision
  multiply instructions simply throw the top 32 bits away)
    – Need to specify whether operands are signed or unsigned
• Therefore the syntax of the new instructions is:
    –   UMULL{<cond>}{S} RdLo,RdHi,Rm,Rs
    –   UMLAL{<cond>}{S} RdLo,RdHi,Rm,Rs
    –   SMULL{<cond>}{S} RdLo, RdHi, Rm, Rs
    –   SMLAL{<cond>}{S} RdLo, RdHi, Rm, Rs
• Not generated by the compiler.
   Warning : Unpredictable on non-M ARMs.
Operand restrictions
  • R15 must not be used as an operand or as a destination
  register.
  • RdHi, RdLo, and Rm must all specify different registers.
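Why signedness matters only for the long multiplies can be seen in a short Python sketch. The function names mirror the mnemonics but are otherwise hypothetical; each long multiply returns the pair (RdLo, RdHi).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul(rm, rs):
    """MUL keeps only the low 32 bits, so signedness makes no difference."""
    return (rm * rs) & MASK32

def umull(rm, rs):
    """UMULL: unsigned 64-bit product as (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32

def smull(rm, rs):
    """SMULL: signed 64-bit product as (RdLo, RdHi)."""
    product = (to_signed(rm) * to_signed(rs)) & MASK64
    return product & MASK32, product >> 32
```

Multiplying 0xFFFFFFFF by 2 gives the same low word either way, but the high word is 1 when the operand is read as unsigned and 0xFFFFFFFF when it is read as -1.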
ISA part 1
Data Transfer

• ARM is a load/store architecture
• Involves
   -Load data from memory to register
   -Store data from register into memory
• ARM has three types of load/store instructions
  -LDR/STR
  -LDM/STM
  -SWP
LDR/STR Instructions
Types of load/store instructions

Simple load/store has options like the following
• LDR/STR       load/store words (32 bits)
• LDRB/STRB     load/store a byte
• In ARM v4 we also have support for halfwords (16 bits) and signed loads
   LDRH/STRH     halfword transfer without sign extension
   LDRSB/LDRSH   byte/halfword load with sign extension (there is no signed store)
• Condition codes can also be suffixed
   LDREQB/STREQB
• General syntax:
   <LDR|STR>{<cond>}{<size>} Rd, <address>
Base Register
• STR r0,[r1]   Stores the contents of r0 to the address contained in r1
  LDR r2,[r1]   Loads the contents of the address contained in r1 into r2

  Example: r0 = 0x5 (source register for STR), r1 = 0x200 (base register)
    STR r0,[r1]   writes 0x5 to memory address 0x200
    LDR r2,[r1]   reads address 0x200, so r2 (destination register for LDR) = 0x5
Offset from the base register

• ARM also supports accessing locations specified as an
  offset from the base register
• The offset can be
  - An unsigned 12 bit immediate value (0-4095)
  - A register, optionally shifted
• The offset can be added to (‘+’) or subtracted from (‘-’) the base register
• Offset can be applied
  - before the transfer is made (pre-indexed)
    optionally auto-increments the base register by using ‘!’
  - after the transfer is made (post-indexed)
    base register is always auto-incremented
Pre-Indexed Addressing

• Example: STR r0,[r1,#12]
    r0 = 0x5 (source register for STR), r1 = 0x200 (base register), offset = 12
    0x5 is stored at address 0x20c (0x200 + 12); r1 remains 0x200

  •Offset value can also be negative: STR r0,[r1,#-12]
  •To perform auto increment on the base register: STR r0,[r1,#12]!
    -updates base register to value 0x20C
  •If r2 contains 3 then this will yield the same result
   STR r0,[r1,r2,LSL#2]
  •Useful if only a particular element is to be accessed
Post Indexed Addressing
• Example: STR r0,[r1],#12
    r0 = 0x5 (source register for STR), r1 = 0x200 (original base register), offset = 12
    0x5 is stored at address 0x200; r1 is then updated to 0x20c

 •If r2 contains 3 then this will also yield the same result
   STR r0,[r1],r2,LSL #2
 •Useful if traversal is required through elements
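The difference between pre-indexed, pre-indexed with write-back (‘!’) and post-indexed stores can be modelled in a few lines. This is a Python sketch with made-up names (`store_word`, and dictionaries standing in for memory and the register file); the offset is passed in already computed, so a register offset such as r2,LSL #2 is simply `regs["r2"] << 2`.

```python
def store_word(memory, regs, rd, rn, offset, pre_indexed=True, writeback=False):
    """Model STR rd,[rn,#offset]{!} (pre-indexed) or STR rd,[rn],#offset (post-indexed).

    Returns the address that was written.
    """
    base = regs[rn]
    if pre_indexed:
        address = base + offset          # offset applied before the transfer
        if writeback:
            regs[rn] = address           # '!' updates the base register
    else:
        address = base                   # transfer uses the original base
        regs[rn] = base + offset         # base is always updated afterwards
    memory[address] = regs[rd]
    return address
```

With r0 = 0x5 and r1 = 0x200 as in the examples, the pre-indexed form writes to 0x20c and leaves r1 alone, the ‘!’ form writes to 0x20c and sets r1 to 0x20c, and the post-indexed form writes to 0x200 and then sets r1 to 0x20c.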
For half words/signed byte access

• Instructions can be used in much the same way except
  - the offset value is restricted to 8 bits(0-255)
  - the registers cannot be shifted
For LDRH/STRH register offset
For LDRH/STRH immediate offset
LDM/STM (Block data transfer)
• Allow transfer of between 1 and 16 registers to or from memory
• The transferred registers can be:
  - Any subset of the current bank of registers (default).
  - Any subset of the user mode bank of registers when in a
    privileged mode (postfix instruction with a ‘^’).
Instruction Format
Block Data Transfer

• Base register determines where the memory access occurs
• Base register can be updated after data transfer by suffixing a
  ‘!’
• These instructions are useful for
   - Saving and restoring context
   - moving large chunks of data to/from memory
Stack Example
Block Data Transfer

• One use of stacks is to temporarily create register space for
  subroutines
  STMFD sp!,{r0-r12, lr}         ; stack all registers
   ........                      ; and the return address
   ........
  LDMFD sp!,{r0-r12, pc}         ; load all the registers
                                 ; and return automatically

• If the pop instruction also had the ‘S’ bit set (using ‘^’) then
  the transfer of the PC when in a privileged mode would also
  cause the SPSR to be copied into the CPSR (see exception
  handling module).
Direct functionality Of Block Data Transfer

• When not being used for a stack operation these instructions
  can also be used in a generic way
• The LDM/STM support a further set of instructions
   – STMIA / LDMIA : Increment After
   – STMIB / LDMIB : Increment Before
   – STMDA / LDMDA : Decrement After
   – STMDB / LDMDB : Decrement Before
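The four addressing modes differ only in which addresses are touched and where the base register ends up. The Python sketch below computes both for a transfer of `count` words; the function name is hypothetical. It also relies on two facts not stated on the slide and therefore flagged as assumptions: the lowest numbered register always goes to the lowest address, and for a full descending stack STMFD behaves as STMDB while LDMFD behaves as LDMIA.

```python
def block_addresses(base, count, mode):
    """Return (word addresses in ascending order, base value after write-back)."""
    size = 4 * count
    if mode == "IA":                       # Increment After
        start, final = base, base + size
    elif mode == "IB":                     # Increment Before
        start, final = base + 4, base + size
    elif mode == "DA":                     # Decrement After
        start, final = base - size + 4, base - size
    elif mode == "DB":                     # Decrement Before
        start, final = base - size, base - size
    else:
        raise ValueError("mode must be IA, IB, DA or DB")
    return [start + 4 * i for i in range(count)], final

# Push three registers with STMDB (assumed equivalent of STMFD), then pop with LDMIA.
pushed, sp_after_push = block_addresses(0x1000, 3, "DB")
popped, sp_after_pop = block_addresses(sp_after_push, 3, "IA")
```

The push and the pop touch the same three addresses, and the stack pointer returns to its starting value, which is what makes the STMFD/LDMFD pair in the stack example symmetric.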
Criteria for different block data transfer
Swap Instruction

• The instruction is used to swap data between a register and a
  memory location
• This instruction is atomic (cannot be interrupted)
• The swap address is determined by the contents of the base
  register (Rn).
• The processor first reads the contents of the swap address.
  Then it writes the contents of the source register (Rm) to the
  swap address, and stores the old memory contents in the
  destination register (Rd).
• The same register may be specified as both the source and
  destination
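The read-then-write order described above can be sketched as follows. This Python model uses a dictionary for memory and another for registers, and the name `swp` simply mirrors the mnemonic; it cannot show atomicity, only the data movement.

```python
def swp(memory, regs, rd, rm, rn):
    """Model SWP Rd, Rm, [Rn]."""
    address = regs[rn]             # swap address comes from the base register
    old = memory[address]          # 1. read the old memory contents
    memory[address] = regs[rm]     # 2. write the source register to memory
    regs[rd] = old                 # 3. old contents go to the destination register
```

Because the source is written to memory before the destination is updated, using the same register for Rd and Rm exchanges the register with the memory location.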
Branch and Exchange




•Used to switch between the Thumb state and the ARM state
Branch and Branch Link
Branch and Branch with Link

• Branch instructions contain a signed 2’s complement 24 bit offset.
• This is shifted left two bits, sign extended to 32 bits, and added to
  the PC.
• The instruction can therefore specify a branch of +/- 32Mbytes.
• The branch offset must take account of the prefetch operation,
  which causes the PC to be 2 words (8 bytes) ahead of the current
  instruction.
• Branches beyond +/- 32Mbytes must use an offset or absolute
  destination which has been previously loaded into a register. In this
  case the PC should be manually saved in R14 if a Branch with Link
  type operation is required.
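The offset arithmetic above can be checked with a small encoder and decoder for the 24-bit field. This is a Python sketch with hypothetical function names; it handles only the offset field, not the rest of the instruction word.

```python
def encode_branch_offset(instr_address, target):
    """Return the 24-bit offset field for a branch at instr_address to target."""
    offset = target - (instr_address + 8)      # PC is 2 words ahead (prefetch)
    if offset % 4:
        raise ValueError("target must be word aligned")
    words = offset >> 2                        # field counts words, not bytes
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("target is out of range (about +/- 32 Mbytes)")
    return words & 0xFFFFFF                    # two's complement, 24 bits

def decode_branch_target(instr_address, field):
    """Recover the branch target from the 24-bit offset field."""
    if field & 0x800000:                       # sign extend the 24-bit value
        field -= 1 << 24
    return instr_address + 8 + (field << 2)    # shift left two bits, add to PC
```

A branch to itself encodes as 0xFFFFFE (an offset of -2 words), because the PC is already 8 bytes ahead when the offset is applied.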
Link Bit

• Branch with Link (BL) writes the old PC into the link register
  (R14) of the current bank.
• The PC value written into R14 is adjusted to allow for the
  prefetch, and contains the address of the instruction following
  the branch and link instruction.
• The CPSR is not saved with the PC
Barrel Shifter

• A barrel shifter is a digital circuit that can shift a data word by
  a specified number of bits in one clock cycle.
• It can be implemented as a sequence of multiplexers (mux.),
  and in such an implementation the output of one mux is
  connected to the input of the next mux in a way that depends
  on the shift distance.
• A barrel shifter is often implemented as a cascade of parallel
  2×1 multiplexers.
Using the Barrel Shifter




•There are 2 options for specifying the shift amount
 - in the bottom byte of a register
 - as a 5 bit unsigned integer (immediate)
Shift Operations

• Logical Shift Left
• Shifts left by the specified amount (multiplies by a power of two)
• Example: LSL #5 (multiply by 32)

          CF <- [ Destination ] <- 0
Shift Operations

• Logical Shift Right
• Shifts right without preserving sign bit

          0 -> [ Destination ] -> CF

• Arithmetic Shift Right
• Preserves the sign bit

          sign bit -> [ Destination ] -> CF   (sign bit shifted in)
Rotate

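• Rotate Right
  Like a right shift, but the bits wrap around: bits leaving the
  least significant end re-enter at the most significant end
   The last bit rotated out is also used as the carry flag

          [ Destination ] -> CF, with the same bit fed back into the MSB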
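The four shift operations and their carry outputs can be compared side by side in a short Python sketch. The function names mirror the mnemonics, each returns (result, carry out), and for simplicity the shift amount n is assumed to be between 1 and 31; the special cases for 0 and for 32 or more are not modelled.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left: zeros enter on the right, last bit out goes to carry."""
    x &= MASK32
    return (x << n) & MASK32, (x >> (32 - n)) & 1

def lsr(x, n):
    """Logical shift right: zeros enter on the left."""
    x &= MASK32
    return x >> n, (x >> (n - 1)) & 1

def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter on the left."""
    x &= MASK32
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> n) & MASK32, (x >> (n - 1)) & 1

def ror(x, n):
    """Rotate right: bits leaving the right re-enter on the left."""
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32, (x >> (n - 1)) & 1
```

Shifting 0x80000000 right by 4 gives 0x08000000 with LSR but 0xF8000000 with ASR, which is the difference between an unsigned and a signed divide by 16.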
Comparison
• The only effect of the comparisons is to
   – UPDATE THE CONDITION FLAGS. Thus no need to set S bit.
• Operations are:
   – CMP      operand1 - operand2, but result not written
   – CMN      operand1 + operand2, but result not written
   – TST      operand1 AND operand2, but result not written
   – TEQ      operand1 EOR operand2, but result not written
• Syntax:
   – <Operation>{<cond>} Rn, Operand2
• Examples:
   – CMP      r0, r1
   – TSTEQ r2, #5
Pipelining
• ARM initially implemented a 3-stage pipeline
  organization (up to ARM7)
  – Fetch
  – Decode
  – Execute
• 3-stage pipeline organization
  – Principal components
     • The register bank
     • The barrel shifter
        – Can shift or rotate one operand by any number of bits
     • The ALU
     • The address register and incrementer
        – Select and hold all memory addresses and generate
          sequential addresses
     • The data registers
     • The instruction decoder and associated control logic
• Fetch - The instruction is
  fetched from memory and
  placed in the instruction
  pipeline
• Decode - The instruction is
  decoded and the datapath
  control signals prepared for
  the next cycle
• Execute - The register bank
  is read, an operand shifted,
  the ALU result generated
  and written back into
  destination register
• At any time slice, 3 different instructions may occupy
  each of these stages, so the hardware in each stage has
  to be capable of independent operations

• When the processor is executing data processing
  instructions, the latency = 3 cycles and the throughput
  = 1 instruction/cycle

• Drawback: Every data transfer instruction causes a
  pipeline “stall”. (There is a single memory for data and
  instructions, so the next instruction cannot be fetched
  while data is being read)
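The latency and throughput figures above, and the cost of the stall, can be put into numbers with a rough cycle count. This Python sketch is a simplification: the function name is hypothetical, and the penalty of one extra cycle per data transfer instruction is an assumption made for illustration, since the slide does not give the exact stall length.

```python
def cycles_3_stage(instructions, stall_per_transfer=1):
    """Estimate cycles for a straight-line instruction sequence on a 3-stage pipeline.

    stall_per_transfer is an assumed penalty per LDR/STR, not a figure from the slide.
    """
    fill = 2                                   # first result appears after 3 cycles
    stalls = sum(stall_per_transfer
                 for op in instructions if op in ("LDR", "STR"))
    return fill + len(instructions) + stalls   # then one instruction per cycle
```

Three data processing instructions complete in 5 cycles (latency 3, then one per cycle); replacing one of them with a load adds the assumed stall cycle.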
5-stage Pipeline Organization
• Implemented in ARM9TDMI
• Tprog = Ninst * CPI / fclk
  – Tprog: the time taken to execute a given program
  – Ninst: the number of ARM instructions executed in
    the program (compiler dependent)
  – CPI: average number of clock cycles per
    instruction (hazards cause pipeline stalls, which increase it)
  – fclk: clock frequency
• Fetch
   – The instruction is fetched from
     memory and placed in the
     instruction pipeline
• Decode
   – The instruction is decoded and
     register operands read from the
     register files. There are 3
     operand read ports in the
     register file so most ARM
     instructions can source all their
     operands in one cycle
• Execute
   – An operand is shifted and the
     ALU result generated. If the
     instruction is a load or store,
     the memory address is
     computed in the ALU
• Buffer/Data
  – Data memory is accessed
    if required. Otherwise the
    ALU result is simply
    buffered for one cycle.
• Write back
  – The results generated by
    the instruction are written
    back to the register file,
    including any data loaded
    from memory.
5-stage pipeline organization
• Moved the register read step from the execute
  stage to the decode stage
• Execute stage was split into 3 stages- ALU,
  memory access, write back.
• Result: Better balanced pipeline with
  minimized latencies between stages, which
  can run at a faster clock speed.
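The trade-off can be quantified with the performance equation Tprog = Ninst * CPI / fclk given for the 5-stage pipeline. The Python sketch below evaluates it; the function name and all the numbers in the usage example are made up for illustration and are not measurements of any real core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Tprog = Ninst * CPI / fclk, in seconds."""
    return n_inst * cpi / f_clk_hz

# Illustrative figures only: the same program on a slower clock with more
# stalls, and on a faster clock with fewer stalls.
t_slow = program_time(1_000_000, 1.9, 50e6)
t_fast = program_time(1_000_000, 1.5, 100e6)
```

Lowering CPI and raising fclk both shorten Tprog, which is why the 5-stage design targets both at once.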
Pipeline Hazards
• There are situations, called hazards, that prevent the
  next instruction in the instruction stream from being
  executed during its designated clock cycle. Hazards
  reduce the performance from the ideal speedup
  gained by pipelining.
• There are three classes of hazards:
   – Structural Hazards
   – Data Hazards
   – Control Hazards
Structural Hazards
• When a machine is pipelined, the overlapped
  execution of instructions requires pipelining of
  functional units and duplication of resources
  to allow all possible combinations of
  instructions in the pipeline.
• If some combination of instructions cannot be
  accommodated because of a resource conflict,
  the machine is said to have a structural
  hazard.
• Ex. A machine has a single memory shared by data and
  instructions. As a result, when an instruction contains
  a data-memory reference (load), it will conflict with
  the instruction fetch for a later instruction (instr 3):
Solution
• To resolve this, we stall the pipeline for one clock
  cycle when a data-memory access occurs. The effect
  of the stall is actually to occupy the resources for
  that instruction slot. The following table shows how
  the stall is actually implemented.
Solution
• Another solution is to use separate instruction
  and data memories.
• ARM has moved from the von-Neumann
  architecture to the Harvard architecture in
  ARM9.
  – Implemented a 5-stage pipeline and separate data
    and instruction memory.
  – Doesn’t suffer from this hazard.
Data Hazards
• They arise when an instruction depends on the result of a
  previous instruction in a way that is exposed by the
  overlapping of instructions in the pipeline.
• The problem with data hazards can be solved with a
  hardware technique called data forwarding (by making
  use of feedback paths).
• Without forwarding, the pipeline would have to be
  stalled to get the results from the respective registers
• Example:
Data Hazards




•   The first forwarding is for value of R1 from EXadd to EXsub.
•   The second forwarding is also for value of R1 from MEMadd to EXand.
•   This code now can be executed without stalls.
•   Forwarding can be generalized to include passing the result directly
    to the functional unit that requires it: a result is forwarded from the
    output of one unit to the input of another, rather than just from the
    result of a unit to the input of the same unit.
Control Hazards
• They arise from the pipelining of branches and other
  instructions that change the PC.
Further Improvements
THANK YOU




•Alok Sharma
•Aniket Thakur
•Paritosh Ramanan
•Pavan A.R.

Editor's Notes

  • #71: Question: How does this result in a better-balanced pipeline? What is meant by a balanced pipeline?