### **JDEP 284H**

**Foundations of Computer Systems** 

# Machine-Level Programming I: Introduction

Dr. Steve Goddard goddard@cse.unl.edu

http://cse.unl.edu/~goddard/Courses/JDEP284

# Giving credit where credit is due

- Most of the slides for this lecture are based on slides created by Drs. Bryant and O'Hallaron, Carnegie Mellon University.
- I have modified them and added new slides.


# **Topics**

- Assembly Programmer's Execution Model
- Accessing Information
  - Registers
  - Memory
- Arithmetic operations

# **IA32 Processors**

**Totally Dominate Computer Market** 

### **Evolutionary Design**

- Starting in 1978 with 8086
- Added more features as time goes on
- Still support old features, although obsolete

### **Complex Instruction Set Computer (CISC)**

- Many different instructions with many different formats
  - But only a small subset is encountered with Linux programs
- Hard to match performance of Reduced Instruction Set Computers (RISC)
- But, Intel has done just that!

# X86 Evolution: Programmer's View

**8086** (1978, 29K transistors)

- 16-bit processor. Basis for IBM PC & DOS
- Limited to 1MB address space. DOS only gives you 640K

**80286** (1982, 134K transistors)

- Added elaborate, but not very useful, addressing scheme
- Basis for IBM PC-AT and Windows

**386** (1985, 275K transistors)

- Extended to 32 bits. Added "flat addressing"
- Capable of running Unix
- Linux/gcc uses no instructions introduced in later models

# X86 Evolution: Programmer's View

**486** (1989, 1.9M transistors)

**Pentium** (1993, 3.1M transistors)

**Pentium/MMX** (1997, 4.5M transistors)

- Added special collection of instructions for operating on 64-bit vectors of 1, 2, or 4 byte integer data

**PentiumPro** (1995, 6.5M transistors)

- Added conditional move instructions
- Big change in underlying microarchitecture


# X86 Evolution: Programmer's View

**Pentium III** (1999)

- Added "streaming SIMD" instructions for operating on 128-bit vectors of 1, 2, or 4 byte integer or floating point data
- Our fish machines

**Pentium 4** (42M transistors)

- Added 8-byte formats and 144 new instructions for streaming SIMD mode

# X86 Evolution: Clones

### Advanced Micro Devices (AMD)

- Historically
  - AMD has followed just behind Intel
  - A little bit slower, a lot cheaper
- Recently
  - Recruited top circuit designers from Digital Equipment Corp.
  - Exploited fact that Intel distracted by IA64
  - Now are close competitors to Intel
- Developing own extension to 64 bits

# X86 Evolution: Clones

#### **Transmeta**

- Recent start-up
  - Employer of Linus Torvalds
- Radically different approach to implementation
  - Translates x86 code into "Very Long Instruction Word" (VLIW) code
  - High degree of parallelism
- Shooting for low-power market

# New Species: IA64

**Itanium** (2001, 10M transistors)

- Extends to IA64, a 64-bit architecture
- Radically new instruction set designed for high performance
- Will be able to run existing IA32 programs
  - On-board "x86 engine"
- Joint project with Hewlett-Packard

**Itanium 2** (2002, 221M transistors)

- Big performance boost

# **Assembly Programmer's View**

[Figure: the CPU (program counter, registers, condition codes) exchanges addresses, data, and instructions with memory (object code, program data, OS data, stack).]

**Programmer-Visible State**

- Program Counter
  - Address of next instruction
- Register File
  - Heavily used program data
- Condition Codes
  - Store status information about most recent arithmetic operation
  - Used for conditional branching
- Memory
  - Byte addressable array
  - Code, user data, (some) OS data
  - Includes stack used to support procedures

# What Can be Disassembled?

```
% objdump -d WINWORD.EXE

WINWORD.EXE:     file format pei-i386

No symbols in "WINWORD.EXE".
Disassembly of section .text:

30001000 <.text>:
30001000: 55              push   %ebp
30001001: 8b ec           mov    %esp,%ebp
30001003: 6a ff           push   $0xffffffff
30001005: 68 90 10 00 30  push   $0x30001090
3000100a: 68 91 dc 4c 30  push   $0x304cdc91
```

- Anything that can be interpreted as executable code
- Disassembler examines bytes and reconstructs assembly

# **Moving Data**

movl Source, Dest:

- Move 4-byte ("long") word
- Lots of these in typical code

### **Operand Types**

- Immediate: Constant integer data
  - Like C constant, but prefixed with '\$'
  - E.g., \$0x400, \$-533
  - Encoded with 1, 2, or 4 bytes
- Register: One of 8 integer registers
  - But %esp and %ebp reserved for special use
  - Others have special uses for particular instructions
- Memory: 4 consecutive bytes of memory
  - Various "address modes"

The 8 integer registers: %eax, %edx, %ecx, %ebx, %esi, %edi, %esp, %ebp

# movl Operand Combinations

- Cannot do memory-memory transfers with a single instruction
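
The legal source/destination pairings have rough C analogues. The sketch below is an illustration only, not compiler output: the locals `temp` and `temp2` are hypothetical stand-ins for values held in registers, `p` stands for an address held in a register, and the register names in the comments are chosen arbitrarily.

```c
/* Rough C analogues of the five movl source/destination combinations.
 * Hypothetical mapping: locals play the role of registers and p plays
 * the role of an address held in a register. */
int movl_combinations(int *p)
{
    int temp, temp2;

    temp = 0x4;      /* Imm -> Reg:  movl $0x4,%eax    */
    *p = -147;       /* Imm -> Mem:  movl $-147,(%eax) */
    temp2 = temp;    /* Reg -> Reg:  movl %eax,%edx    */
    *p = temp2;      /* Reg -> Mem:  movl %eax,(%edx)  */
    temp = *p;       /* Mem -> Reg:  movl (%eax),%edx  */

    /* A memory-to-memory copy such as *q = *p takes two instructions:
     * a load into a register followed by a store from that register. */
    return temp;
}
```

Calling `movl_combinations(&v)` leaves 4 in `v` and returns 4, since the last store and the last load both move the value that started as the immediate.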

# **Simple Addressing Modes**

#### Normal (R) Mem[Reg[R]]

- Register R specifies memory address

movl (%ecx),%eax

#### Displacement D(R) Mem[Reg[R]+D]

- Register R specifies start of memory region
- Constant displacement D specifies offset

movl 8(%ebp),%edx

# **Example**

| Address | Value |
|---------|-------|
| 0x100   | 0xFF  |
| 0x104   | 0x00  |
| 0x108   | 0x13  |
| 0x10C   | 0x11  |

| Register | Value |
|----------|-------|
| %eax     | 0x100 |
| %ecx     | 0x1   |

| Operand   | Value |
|-----------|-------|
| %ecx      | 0x1   |
| (%eax)    | 0xFF  |
| 8(%eax)   | 0x13  |
| 263(%ecx) | 0x13  |

# **Exercise**

| Address | Value |
|---------|-------|
| 0x100   | 0x0   |
| 0x104   | 0x1   |
| 0x108   | 0x2   |
| 0x10C   | 0x3   |

| Register | Value |
|----------|-------|
| %eax     | 0x104 |
| %ecx     | 0x100 |

| Operand   | Value |
|-----------|-------|
| %eax      | 0x104 |
| (%ecx)    | 0x0   |
| 4(%eax)   | 0x2   |
| 0xC(%ecx) | 0x3   |

# **Exercise**

| Address | Value |
|---------|-------|
| 0x100   | 0xFF  |
| 0x104   | 0x00  |
| 0x108   | 0x13  |
| 0x10C   | 0x11  |

| Register | Value |
|----------|-------|
| %eax     | 0x100 |
| %ecx     | 0x104 |
| %edx     | 0x1   |
| %ebx     | 0x8   |

| Operand       | Value |
|---------------|-------|
| 3(%eax,%edx)  | 0x00  |
| 254(,%edx,2)  | 0xFF  |
| (%eax,%edx,4) | 0x00  |
| (%ecx,%ebx)   | 0x11  |
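
Each answer in this exercise comes from computing an effective address of the form D + Rb + S\*Ri and then reading the word stored there. A minimal sketch in C, assuming a toy memory that holds only the four words from the table; `effective_address`, `load_word`, and `mem_words` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Effective address of the operand form D(Rb,Ri,S): D + Rb + S*Ri.
 * Pass 0 for an omitted register or displacement and 1 for an
 * omitted scale factor. */
uint32_t effective_address(uint32_t d, uint32_t rb, uint32_t ri, uint32_t s)
{
    return d + rb + s * ri;
}

/* Toy memory: the four words from the exercise table, stored at
 * addresses 0x100, 0x104, 0x108, and 0x10C. */
static const uint32_t mem_words[4] = { 0xFF, 0x00, 0x13, 0x11 };

/* Value of a memory operand. addr must be one of the four addresses
 * above; this sketch does no bounds checking. */
uint32_t load_word(uint32_t addr)
{
    return mem_words[(addr - 0x100) / 4];
}
```

For instance, `254(,%edx,2)` with %edx = 0x1 corresponds to `load_word(effective_address(254, 0, 0x1, 2))`: the address is 254 + 2 = 0x100, so the value is 0xFF.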

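# **More on Data Movement**

**MOVB** and **MOVW**

MOVW moves two bytes. When one of its operands is a register, it must be one of the 8 two-byte registers.

e.g. MOVW %ax, %dx

MOVB moves a single byte. When one of its operands is a register, it must be one of the 8 single-byte registers.

e.g. MOVB %al, %ah

| 32-bit register | Byte register (bits 8-15) | Byte register (bits 0-7) |
|-----------------|---------------------------|--------------------------|
| %eax            | %ah                       | %al                      |
| %edx            | %dh                       | %dl                      |
| %ecx            | %ch                       | %cl                      |
| %ebx            | %bh                       | %bl                      |
| %esi            |                           |                          |
| %edi            |                           |                          |
| %esp            |                           |                          |
| %ebp            |                           |                          |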


# **More on Data Movement**

#### MOVSBL and MOVZBL

- MOVSBL sign-extends a single byte, and copies it into a double-word destination
- MOVZBL expands a single byte to 32 bits with 24 leading zeros, and copies it into a double-word destination

### Example:

%eax = 0x12345678, %edx = 0xAAAABBBB

| Instruction      | Result            |
|------------------|-------------------|
| MOVB %dh, %al    | %eax = 0x123456BB |
| MOVSBL %dh, %eax | %eax = 0xFFFFFFBB |
| MOVZBL %dh, %eax | %eax = 0x000000BB |
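
The three results differ only in what happens to the upper 24 bits of the destination. A small sketch in C that models each instruction on plain bit patterns; the function names are hypothetical, chosen here to mirror the mnemonics.

```c
#include <stdint.h>

/* MOVB into the low byte: only bits 0-7 of dest change. */
uint32_t movb_low(uint32_t dest, uint8_t src)
{
    return (dest & 0xFFFFFF00u) | src;
}

/* MOVZBL: the upper 24 bits of the result are zero. */
uint32_t movzbl(uint8_t src)
{
    return (uint32_t)src;
}

/* MOVSBL: the upper 24 bits are copies of bit 7 of the source byte. */
uint32_t movsbl(uint8_t src)
{
    return (src & 0x80u) ? (0xFFFFFF00u | src) : (uint32_t)src;
}
```

With the source byte 0xBB (bit 7 set), `movsbl` yields 0xFFFFFFBB and `movzbl` yields 0x000000BB; with a byte such as 0x22 (bit 7 clear) the two agree.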

# **Exercise**

%eax = 0x12345678, %edx = 0xAAAA22CC

| Instruction      | Result    |
|------------------|-----------|
| MOVB %dh, %ah    | %eax = #2 |
| MOVSBL %dh, %eax | %eax = #3 |
| MOVZBL %dh, %eax | %eax = #3 |
| MOVSBL %dl, %eax | %eax = #5 |

Choices:

1. 0x12345622
2. 0x12342278
3. 0x00000022
4. 0xFFFFFF22
5. 0xFFFFFFCC

# **Address Computation Instruction**

### leal Src,Dest

- Src is address mode expression
- Set Dest to address denoted by expression

### Uses

- Computing address without doing memory reference
  - E.g., translation of p = &x[i];
- Computing arithmetic expressions of the form x + k\*y
  - k = 1, 2, 4, or 8.

# **Example**

Assume register %eax holds value X and %ecx holds value Y

| Expression                | Result in %edx |
|---------------------------|----------------|
| leal 8(%eax), %edx        | X+8            |
| leal (%eax,%ecx), %edx    | X+Y            |
| leal 8(%eax,%ecx), %edx   | X+Y+8          |
| leal 8(%eax,%eax,4), %edx | 5X+8           |
| leal 8(%eax,%ecx,2), %edx | X+2Y+8         |
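
Both uses of leal correspond to ordinary C expressions. The functions below are a hypothetical illustration of source code whose result matches a single leal; whether a particular compiler actually emits leal for them depends on the compiler and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* p = &x[i]: the address is x + 4*i for 4-byte elements.
 * Only the address is computed; x[i] itself is never read. */
int32_t *element_address(int32_t *x, size_t i)
{
    return &x[i];
}

/* x + 4*x + 8, the value produced by leal 8(%eax,%eax,4),%edx */
int32_t five_x_plus_8(int32_t x)
{
    return x + 4 * x + 8;
}

/* x + 2*y + 8, the value produced by leal 8(%eax,%ecx,2),%edx */
int32_t x_plus_2y_plus_8(int32_t x, int32_t y)
{
    return x + 2 * y + 8;
}
```

For an array `a` of `int32_t`, `element_address(a, 3)` points 12 bytes past the start of `a`, which is the base-plus-scaled-index computation that an address mode expresses.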

# **Some Arithmetic Operations**

**Two Operand Instructions**

| Format         | Computation        | Notes            |
|----------------|--------------------|------------------|
| addl Src,Dest  | Dest = Dest + Src  |                  |
| subl Src,Dest  | Dest = Dest - Src  |                  |
| imull Src,Dest | Dest = Dest \* Src |                  |
| sall Src,Dest  | Dest = Dest << Src | Also called shll |
| sarl Src,Dest  | Dest = Dest >> Src | Arithmetic       |
| shrl Src,Dest  | Dest = Dest >> Src | Logical          |
| xorl Src,Dest  | Dest = Dest ^ Src  |                  |
| andl Src,Dest  | Dest = Dest & Src  |                  |
| orl Src,Dest   | Dest = Dest \| Src |                  |
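
The two right shifts differ only in how the vacated high bits are filled. In C, `>>` on a negative signed value is implementation-defined, so the sketch below works on unsigned 32-bit patterns and builds the arithmetic shift explicitly. The function names are hypothetical and mirror the mnemonics; the shift count is assumed to be in the range 0 to 31.

```c
#include <stdint.h>

/* shrl: logical right shift. Vacated high bits are filled with 0.
 * k must be in the range 0..31. */
uint32_t shrl(uint32_t x, unsigned k)
{
    return x >> k;
}

/* sarl: arithmetic right shift on a 32-bit pattern. Vacated high bits
 * are filled with copies of the original sign bit (bit 31).
 * k must be in the range 0..31. */
uint32_t sarl(uint32_t x, unsigned k)
{
    uint32_t shifted = x >> k;
    if ((x & 0x80000000u) && k > 0)
        shifted |= ~(UINT32_MAX >> k);   /* set the top k bits */
    return shifted;
}
```

Shifting the pattern 0x80000000 right by 4 gives 0x08000000 with `shrl` and 0xF8000000 with `sarl`; for patterns with bit 31 clear the two functions agree.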



# Using leal for Arithmetic Expressions

```
int arith
  (int x, int y, int z)
{
  int t1 = x+y;
  int t2 = z+t1;
  int t3 = x+4;
  int t4 = y * 48;
  int t5 = t3 + t4;
  int rval = t2 * t5;
  return rval;
}
```

```
arith:
	pushl %ebp                  # Set Up
	movl %esp,%ebp

	movl 8(%ebp),%eax           # Body
	movl 12(%ebp),%edx
	leal (%edx,%eax),%ecx
	leal (%edx,%edx,2),%edx
	sall $4,%edx
	addl 16(%ebp),%ecx
	leal 4(%edx,%eax),%eax
	imull %ecx,%eax

	movl %ebp,%esp              # Finish
	popl %ebp
	ret
```



```
movl 8(%ebp),%eax          # eax = x
movl 12(%ebp),%edx         # edx = y
leal (%edx,%eax),%ecx      # ecx = x+y (t1)
leal (%edx,%edx,2),%edx    # edx = 3*y
sall $4,%edx               # edx = 48*y (t4)
addl 16(%ebp),%ecx         # ecx = z+t1 (t2)
leal 4(%edx,%eax),%eax     # eax = 4+t4+x (t5)
imull %ecx,%eax            # eax = t5*t2 (rval)
```
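
One way to check the register-by-register reading of the body of arith is to replay it in C. This is a sketch under stated assumptions: `arith_steps` is a hypothetical helper, the variables `eax`, `ecx`, and `edx` merely stand in for the registers, and unsigned 32-bit arithmetic is used so that the shift and multiply wrap the way the machine instructions do.

```c
#include <stdint.h>

/* The C source of arith, as given on the slide. */
int arith(int x, int y, int z)
{
    int t1 = x + y;
    int t2 = z + t1;
    int t3 = x + 4;
    int t4 = y * 48;
    int t5 = t3 + t4;
    int rval = t2 * t5;
    return rval;
}

/* Replays the body of the assembly one instruction at a time. */
uint32_t arith_steps(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t eax = x;           /* movl 8(%ebp),%eax         eax = x       */
    uint32_t edx = y;           /* movl 12(%ebp),%edx        edx = y       */
    uint32_t ecx = edx + eax;   /* leal (%edx,%eax),%ecx     ecx = t1      */
    edx = edx + 2 * edx;        /* leal (%edx,%edx,2),%edx   edx = 3*y     */
    edx <<= 4;                  /* sall $4,%edx              edx = 48*y    */
    ecx += z;                   /* addl 16(%ebp),%ecx        ecx = t2      */
    eax = 4 + edx + eax;        /* leal 4(%edx,%eax),%eax    eax = t5      */
    eax *= ecx;                 /* imull %ecx,%eax           eax = rval    */
    return eax;
}
```

For x = 1, y = 2, z = 3 both functions produce 606: t2 = 6 and t5 = 5 + 96 = 101. The multiplication by 48 never appears as a multiply in the replay; it is split into 3\*y (one leal) followed by a left shift by 4.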

# Another Example

```
int logical(int x, int y)
{
  int t1 = x^y;
  int t2 = t1 >> 17;
  int mask = (1<<13) - 7;
  int rval = t2 & mask;
  return rval;
}
```

```
logical:
	pushl %ebp                  # Set Up
	movl %esp,%ebp

	movl 8(%ebp),%eax           # Body: eax = x
	xorl 12(%ebp),%eax          # eax = x^y (t1)
	sarl $17,%eax               # eax = t1>>17 (t2)
	andl $8185,%eax             # eax = t2 & 8185

	movl %ebp,%esp              # Finish
	popl %ebp
	ret
```

2<sup>13</sup> = 8192, 2<sup>13</sup> - 7 = 8185

# **Push and Pop**

PUSHL takes a single operand, the data source, and stores it at the top of the stack.

For example, PUSHL %eax has the same behavior as:

- subl \$4, %esp (the stack grows downward)
- movl %eax, (%esp)

POPL takes a single operand, the data destination, and pops the top element of the stack into it.

For example, POPL %eax has the same behavior as:

- movl (%esp), %eax
- addl \$4, %esp
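
The two-step equivalences can be sketched as a toy stack machine in C. Everything here is hypothetical scaffolding for illustration: `struct machine`, the 64-byte stack region, and the use of a byte offset for `esp` are assumptions, and the sketch does no overflow or underflow checking.

```c
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64

/* Toy machine: a small stack region plus a stack pointer, kept as a
 * byte offset into the region. esp == STACK_BYTES means empty. */
struct machine {
    uint8_t  stack[STACK_BYTES];
    uint32_t esp;
};

void machine_init(struct machine *m)
{
    memset(m->stack, 0, sizeof m->stack);
    m->esp = STACK_BYTES;
}

/* pushl src: decrement esp by 4, then store src at the new top. */
void pushl(struct machine *m, uint32_t src)
{
    m->esp -= 4;                             /* subl $4,%esp     */
    memcpy(&m->stack[m->esp], &src, 4);      /* movl src,(%esp)  */
}

/* popl dest: load the top word, then increment esp by 4. */
uint32_t popl(struct machine *m)
{
    uint32_t dest;
    memcpy(&dest, &m->stack[m->esp], 4);     /* movl (%esp),dest */
    m->esp += 4;                             /* addl $4,%esp     */
    return dest;
}
```

After `machine_init`, pushing two values and popping twice returns them in reverse order and restores `esp` to its starting offset, which is the last-in, first-out behavior the stack relies on.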


# **CISC Properties**

Instructions can reference different operand types

- Immediate, register, memory

Arithmetic operations can read/write memory

Memory reference can involve complex computation

- Rb + S\*Ri + D
- Useful for arithmetic expressions, too

Instructions can have varying lengths

- IA32 instructions can range from 1 to 15 bytes


# **Summary: Abstract Machines**

**C** (machine model: mem, proc)

- Data: 1) char, 2) int, float, 3) double, 5) pointer
- Control: 1) loops, 2) conditionals, 3) switch, 4) Proc. call, 5) Proc. return

**Assembly** (machine model: mem, regs, alu, condition codes)

- Data: 1) byte, 2) 2-byte word, 3) 4-byte long word, 4) contiguous byte allocation, 5) address of initial byte
- Control: 3) branch/jump, 4) call, 5) ret

# Pentium Pro (P6)

### History

- Announced in Feb. '95
- Basis for Pentium II, Pentium III, and Celeron processors
- Pentium 4 similar idea, but different details

#### **Features**

- Dynamically translates instructions to more regular format
  - Very wide, but simple instructions
- Executes operations in parallel
  - Up to 5 at once
- Very deep pipeline
  - 12–18 cycle latency


# PentiumPro Block Diagram



# **PentiumPro Operation**

Translates instructions dynamically into "Uops"

- 118 bits wide
- Holds operation, two sources, and destination

Executes Uops with "Out of Order" engine

- Uop executed when
  - Operands available
  - Functional unit available
- Execution controlled by "Reservation Stations"
  - Keeps track of data dependencies between uops
  - Allocates resources

### Consequences

- Indirect relationship between IA32 code & what actually gets executed
- Tricky to predict / optimize performance at assembly level


# Whose Assembler?

| Intel/Microsoft Format           | GAS/Gnu Format            |
|----------------------------------|---------------------------|
| lea eax,[ecx+ecx\*2]             | leal (%ecx,%ecx,2),%eax   |
| sub esp,8                        | subl \$8,%esp             |
| cmp dword ptr [ebp-8],0          | cmpl \$0,-8(%ebp)         |
| mov eax,dword ptr [eax\*4+100h]  | movl 0x100(,%eax,4),%eax  |

### Intel/Microsoft Differs from GAS

- Operands listed in opposite order
  - mov Dest, Src vs. movl Src, Dest
- Constants not preceded by '\$'; hex denoted with 'h' at end
  - 100h vs. \$0x100
- Operand size indicated by operands rather than operator suffix
  - sub vs. subl
- Addressing format shows effective address computation
  - [eax\*4+100h] vs. 0x100(,%eax,4)
