CS 465
Computer Architecture
Fall 2009
Lecture 01: Introduction
Daniel Barbará (cs.gmu.edu/~dbarbara)
[Adapted from Computer Organization and Design,
Patterson & Hennessy, © 2005, UCB]
Course Administration
 Instructor: Daniel Barbará
dbarbara@gmu.edu
4420 Eng. Bldg.
 Text: Required: Computer Organization & Design –
The Hardware Software Interface, Patterson &
Hennessy, the 4th Edition
Grading Information
 Grade determinants
 Midterm Exam ~25%
 Final Exam 1 ~35%
 Homeworks ~40%
- Due at the beginning of class (or, if it’s code to be submitted
electronically, by 17:00 on the due date). No late assignments
will be accepted.
 Course prerequisites
 grade of C or better in CS 367
Acknowledgements
 Slides adopted from Dr. Zhong
 Contributions from Dr. Setia
 Slides also adopt materials from many other universities
 IMPORTANT:
- Slides are not intended as replacement for the text
- You spent the money on the book, please read it!
Course Topics (Tentative)
 Instruction set architecture (Chapter 2)
 MIPS
 Arithmetic operations & data (Chapter 3)
 System performance (Chapter 4)
 Processor (Chapter 5)
 Datapath and control
 Pipelining to improve performance (Chapter 6)
 Memory hierarchy (Chapter 7)
 I/O (Chapter 8)
Focus of the Course
 How computers work
 MIPS instruction set architecture
 The implementation of MIPS instruction set architecture – MIPS
processor design
 Issues affecting modern processors
 Pipelining – processor performance improvement
 Cache – memory system, I/O systems
Why Learn Computer Architecture?
 You want to call yourself a “computer scientist”
 Computer architecture impacts every other aspect of computer science
 You need to make a purchasing decision or offer “expert” advice
 You want to build software people use – sell many, many copies
(you need performance)
 Both hardware and software affect performance
- Algorithm determines number of source-level statements
- Language/compiler/architecture determine machine instructions (Chapter 2
and 3)
- Processor/memory determine how fast instructions are executed (Chapter 5,
6, and 7)
- Assessing and understanding performance (Chapter 4)
Outline Today
 Course logistics
 Computer architectures overview
 Trends in computer architectures
Computer Systems
 Software
 Application software – Word Processors, Email, Internet
Browsers, Games
 Systems software – Compilers, Operating Systems
 Hardware
 CPU
 Memory
 I/O devices (mouse, keyboard, display, disks, networks,……..)
[Figure: software layers. Applications software (e.g., LaTeX) runs on
systems software – compilers (gcc), assemblers (as), and the operating
system (virtual memory, file system, I/O device drivers) – which runs on
the hardware.]
[Figure: the instruction set is the boundary between software and hardware.]
Instruction Set Architecture
 One of the most important abstractions is ISA
 A critical interface between HW and SW
 Example: MIPS
 Desired properties
 Convenience (from software side)
 Efficiency (from hardware side)
What is Computer Architecture
 Programmer’s view: a pleasant environment
 Operating system’s view: a set of resources (hw
& sw)
 System architecture view: a set of components
 Compiler’s view: an instruction set architecture
with OS help
 Microprocessor architecture view: a set of
functional units
 VLSI designer’s view: a set of transistors
implementing logic
 Mechanical engineer’s view: a heater!
What is Computer Architecture
 Patterson & Hennessy: Computer
architecture = Instruction set architecture
+ Machine organization + Hardware
 For this course, computer architecture
mainly refers to ISA (Instruction Set
Architecture)
 Programmer-visible, serves as the boundary
between the software and hardware
 Modern ISA examples: MIPS, SPARC,
PowerPC, DEC Alpha
Organization and Hardware
 Organization: high-level aspects of a computer’s
design
 Principal components: memory, CPU, I/O, …
 How components are interconnected
 How information flows between components
 E.g. AMD Opteron 64 and Intel Pentium 4: same ISA
but different organizations
 Hardware: detailed logic design and the
packaging technology of a computer
 E.g. Pentium 4 and Mobile Pentium 4: nearly identical
organizations but different hardware details
Types of computers and their applications
 Desktop
 Run third-party software
 Office to home applications
 30 years old
 Servers
 Modern version of what used to be called mainframes,
minicomputers and supercomputers
 Large workloads
 Built using the same technology as desktops but with higher capacity
- Expandable
- Scalable
- Reliable
 Large spectrum: from low-end (file storage, small businesses) to
supercomputers (high end scientific and engineering
applications)
- Gigabytes to Terabytes to Petabytes of storage
 Examples: file servers, web servers, database servers
Types of computers…
 Embedded
 Microprocessors everywhere! (washing machines, cell phones,
automobiles, video games)
 Run one or a few applications
 Specialized hardware integrated with the application (not your
common processor)
 Usually stringent limitations (battery power)
 Low tolerance for failure (you don’t want your airplane avionics to
fail!)
 Becoming ubiquitous
 Engineered using processor cores
- The core allows the engineer to integrate other functions into the
processor for fabrication on the same chip
- Using hardware description languages: Verilog, VHDL
Where is the Market?
[Bar chart: millions of computers sold per year, 1998–2002.]

Year  Embedded  Desktop  Servers
1998       290       93        3
1999       488      114        3
2000       892      135        4
2001       862      129        4
2002      1122      131        5
In this class you will learn
 How programs written in a high-level language (e.g.,
Java) translate into the language of the hardware and
how the hardware executes them.
 The interface between software and hardware and how
software instructs hardware to perform the needed
functions.
 The factors that determine the performance of a program
 The techniques that hardware designers employ to
improve performance.
As a consequence, you will understand what features may
make one computer design better than another for a
particular application
High-level to Machine Language
High-level language program
(in C)
Assembly language program
(for MIPS)
Binary machine language program
(for MIPS)
Compiler
Assembler
Evolution…
 In the beginning there were only bits… and people spent
countless hours trying to program in machine language
01100011001 011001110100
 Finally before everybody went insane, the assembler
was invented: write in mnemonics called assembly
language and let the assembler translate (a one to one
translation)
Add A,B
 This wasn’t for everybody, obviously… (imagine writing modern
applications in assembly!), so high-level languages were born (and
with them compilers to translate them into assembly, a one-to-many
translation: one source statement becomes many machine instructions)
C= A*(SQRT(B)+3.0)
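To make the one-to-many translation concrete, here is a sketch in C (a hypothetical function; the exact instructions emitted depend on the compiler and target, but the shape is typical of MIPS output):

    #include <math.h>

    double compute(double a, double b) {
        /* One high-level statement... */
        double c = a * (sqrt(b) + 3.0);
        /* ...might compile to several MIPS instructions, roughly:
             sqrt.d $f2, $f14        # f2 = sqrt(b)
             add.d  $f2, $f2, $f16   # f2 = f2 + 3.0 (constant loaded into f16)
             mul.d  $f0, $f12, $f2   # f0 = a * f2 (return value in f0)
           One source statement, many machine instructions. */
        return c;
    }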
THE BIG IDEA
 Levels of abstraction: each layer provides its own
(simplified) view and hides the details of the next.
Instruction Set Architecture (ISA)
 ISA: An abstract interface between the hardware and the
lowest level software of a machine that encompasses all
the information necessary to write a machine language
program that will run correctly, including instructions,
registers, memory access, I/O, and so on.
“... the attributes of a [computing] system as seen by the
programmer, i.e., the conceptual structure and functional
behavior, as distinct from the organization of the data flows and
controls, the logic design, and the physical implementation.”
– Amdahl, Blaauw, and Brooks, 1964
 Enables implementations of varying cost and performance to run
identical software
 ABI (application binary interface): The user portion of the
instruction set plus the operating system interfaces used
by application programmers. Defines a standard for
binary portability across computers.
ISA Type Sales
[Bar chart: millions of processors sold per year, 1998–2002, stacked by
ISA: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other.
PowerPoint “comic” bar chart with approximate values (see text for
correct values).]
Organization of a computer
[Figure: high-level block diagram of a computer’s organization.]
Anatomy of Computer
[Figure: a personal computer’s five classic components – the processor
(control, the “brain”, plus datapath, the “brawn”), memory (where
programs and data live when running), input devices (keyboard, mouse),
output devices (display, printer), and disk (where programs and data
live when not running).]
5 classic components
 Datapath: performs arithmetic operations
 Control: guides the operation of other components based on the user
instructions
PC Motherboard Closeup
[Photo: closeup of a PC motherboard.]
Inside the Pentium 4
[Photo: Pentium 4 die.]
Moore’s Law
 In 1965, Gordon Moore predicted that the number of
transistors that can be integrated on a die would double
every 18 to 24 months (i.e., grow exponentially with
time).
 Amazingly visionary – million transistor/chip barrier was
crossed in the 1980’s.
 2300 transistors, 1 MHz clock (Intel 4004) - 1971
 16 Million transistors (Ultra Sparc III)
 42 Million transistors, 2 GHz clock (Intel Xeon) – 2001
 55 Million transistors, 3 GHz, 130nm technology, 250 mm² die
(Intel Pentium 4) - 2004
 140 Million transistor (HP PA-8500)
Processor Performance Increase
[Chart: SPECint performance, 1987–2003, log scale from 1 to 10,000.
Data points include SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000,
HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha
5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667,
Intel Xeon/2000, Intel Pentium 4/3000.]
Moore’s Law
Trend: Microprocessor Capacity
[Chart: transistors per chip, 1970–2000, log scale from 1,000 to
100,000,000: i4004, i8080, i8086, i80286, i80386, i80486, Pentium.
CMOS improvements: die size 2X every 3 yrs; line width halves every 7 yrs.
Transistor counts: Itanium II 241 million; Pentium 4 55 million; Alpha
21264 15 million; Alpha 21164 9.3 million; PowerPC 620 6.9 million;
Pentium Pro 5.5 million; Sparc Ultra 5.2 million.]
Moore’s Law
 “Cramming More Components onto Integrated Circuits”
 Gordon Moore, Electronics, 1965
 # of transistors per cost-effective integrated circuit doubles every 18 months
“Transistor capacity doubles every 18-24 months”
Speed 2x / 1.5 years (since ‘85);
100X performance in last decade
Trend: Microprocessor Performance
[Chart: microprocessor performance over time.]
Memory
 Dynamic Random Access Memory (DRAM)
 The choice for main memory
 Volatile (contents go away when power is lost)
 Fast
 Relatively small
 DRAM capacity: 2x / 2 years (since ‘96);
64x size improvement in last decade
 Static Random Access Memory (SRAM)
 The choice for cache
 Much faster than DRAM, but less dense and more costly
 Magnetic disks
 The choice for secondary memory
 Non-volatile
 Slower
 Relatively large
 Capacity: 2x / 1 year (since ‘97)
250X size in last decade
 Solid state (Flash) memory
 The choice for embedded computers
 Non-volatile
Memory
 Optical disks
 Removable, therefore capacity is effectively unlimited
 Slower than disks
 Magnetic tape
 Even slower
 Sequential (non-random) access
 The choice for archival
DRAM Capacity Growth
[Chart: DRAM capacity (Kbits) per chip by year of introduction,
1976–2002, log scale: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M,
512M.]
Trend: Memory Capacity
[Chart: DRAM bits per chip by year, 1970–2000, log scale.]

Year   Size (Mbit)
1980   0.0625
1983   0.25
1986   1
1989   4
1992   16
1996   64
1998   128
2000   256
2002   512
2006   2048

• Now 1.4X/yr, or 2X every 2 years
• More than 10,000X since 1980!
Growth of capacity per chip
(Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta – each prefix is 1024×
the previous)
Come up with a clever mnemonic, fame!
Dramatic Technology Change
 State-of-the-art PC when you graduate:
(at least…)
 Processor clock speed: 5000 MegaHertz
(5.0 GigaHertz)
 Memory capacity: 4000 MegaBytes
(4.0 GigaBytes)
 Disk capacity: 2000 GigaBytes
(2.0 TeraBytes)
 New units! Mega => Giga, Giga => Tera
Example Machine Organization
 Workstation design target
 25% of cost on processor
 25% of cost on memory (minimum memory size)
 Rest on I/O devices, power supplies, box
[Figure: the computer – CPU (control + datapath), memory, and
input/output devices.]
MIPS R3000 Instruction Set Architecture
 Instruction Categories
 Load/Store
 Computational
 Jump and Branch
 Floating Point
- coprocessor
 Memory Management
 Special
Registers: R0 - R31, PC, HI, LO

3 Instruction Formats, all 32 bits wide:
R-type: OP | rs | rt | rd | sa | funct
I-type: OP | rs | rt | immediate
J-type: OP | jump target
Defining Performance
 Which airplane has the best performance?
(§1.4 Performance)
[Charts: Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50
compared on passenger capacity, cruising range (miles), cruising speed
(mph), and passengers × mph.]
Response Time and Throughput
 Response time
 How long it takes to do a task
 Throughput
 Total work done per unit time
- e.g., tasks/transactions/… per hour
 How are response time and throughput affected by
 Replacing the processor with a faster version?
 Adding more processors?
 We’ll focus on response time for now…
Relative Performance
 Define Performance = 1/Execution Time
 “X is n times faster than Y”
n = Performance_X / Performance_Y = Execution time_Y / Execution time_X
 Example: time taken to run a program
 10s on A, 15s on B
 Execution Time_B / Execution Time_A = 15s / 10s = 1.5
 So A is 1.5 times faster than B
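The same arithmetic as a minimal C sketch (values taken from the example above):

    #include <stdio.h>

    int main(void) {
        double time_a = 10.0, time_b = 15.0;   /* seconds on A and on B */

        /* Performance = 1/Execution time, so
           Performance_A / Performance_B = Time_B / Time_A. */
        double n = time_b / time_a;
        printf("A is %.1f times faster than B\n", n);   /* prints 1.5 */
        return 0;
    }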
Measuring Execution Time
 Elapsed time
 Total response time, including all aspects
- Processing, I/O, OS overhead, idle time
 Determines system performance
 CPU time
 Time spent processing a given job
- Discounts I/O time, other jobs’ shares
 Comprises user CPU time and system CPU time
 Different programs are affected differently by CPU and system
performance
CPU Clocking
 Operation of digital hardware is governed by a constant-rate clock
[Timing diagram: each clock period consists of data transfer and
computation, followed by a state update.]
 Clock period: duration of a clock cycle
 e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
 Clock frequency (rate): cycles per second
 e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
CPU Time
 Performance improved by
 Reducing number of clock cycles
 Increasing clock rate
 Hardware designer must often trade off clock rate against cycle
count
CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
CPU Time Example
 Computer A: 2GHz clock, 10s CPU time
 Designing Computer B
 Aim for 6s CPU time
 Can do faster clock, but causes 1.2 × clock cycles
 How fast must Computer B clock be?
Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20×10⁹ cycles
Clock Cycles_B = 1.2 × Clock Cycles_A = 24×10⁹ cycles
Clock Rate_B = Clock Cycles_B / CPU Time_B = 24×10⁹ / 6s = 4GHz
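A minimal C check of this example (variable names are illustrative):

    #include <stdio.h>

    int main(void) {
        /* Computer A: 2 GHz clock, 10 s of CPU time. */
        double rate_a   = 2e9;               /* Hz */
        double cycles_a = 10.0 * rate_a;     /* time x rate = 20e9 cycles */

        /* Computer B: must finish in 6 s but needs 1.2x as many cycles. */
        double cycles_b = 1.2 * cycles_a;    /* 24e9 cycles */
        double rate_b   = cycles_b / 6.0;    /* required clock rate */

        printf("Computer B needs a %.1f GHz clock\n", rate_b / 1e9);  /* 4.0 */
        return 0;
    }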
Instruction Count and CPI
 Instruction Count for a program
 Determined by program, ISA and compiler
 Average cycles per instruction
 Determined by CPU hardware
 If different instructions have different CPI
- Average CPI affected by instruction mix
Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
CPI Example
 Computer A: Cycle Time = 250ps, CPI = 2.0
 Computer B: Cycle Time = 500ps, CPI = 1.2
 Same ISA
 Which is faster, and by how much?
CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250ps = 500ps × I        ← A is faster…
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500ps = 600ps × I
CPU Time_B / CPU Time_A = (600ps × I) / (500ps × I) = 1.2   ← …by this much
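The comparison in C (the instruction count cancels, so any value works):

    #include <stdio.h>

    int main(void) {
        double ic = 1e9;   /* any count; the ratio is independent of it */

        /* CPU time = Instruction count x CPI x Cycle time */
        double time_a = ic * 2.0 * 250e-12;  /* A: CPI 2.0, 250 ps cycle */
        double time_b = ic * 1.2 * 500e-12;  /* B: CPI 1.2, 500 ps cycle */

        printf("A: %.2f s  B: %.2f s  B/A = %.1f\n",
               time_a, time_b, time_b / time_a);  /* B/A = 1.2 */
        return 0;
    }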
CPI in More Detail
 If different instruction classes take different numbers of
cycles
Clock Cycles = Σ (CPI_i × Instruction Count_i), summed over the n
instruction classes i = 1…n

 Weighted average CPI
CPI = Clock Cycles / Instruction Count
    = Σ (CPI_i × Instruction Count_i / Instruction Count)

where Instruction Count_i / Instruction Count is the relative frequency
of class i
CPI Example
 Alternative compiled code sequences using instructions in classes A,
B, C
Class             A  B  C
CPI for class     1  2  3
IC in sequence 1  2  1  2
IC in sequence 2  4  1  1
 Sequence 1: IC = 5
 Clock Cycles
= 2×1 + 1×2 + 2×3
= 10
 Avg. CPI = 10/5 = 2.0
 Sequence 2: IC = 6
 Clock Cycles
= 4×1 + 1×2 + 1×3
= 9
 Avg. CPI = 9/6 = 1.5
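A short C sketch of the weighted-CPI computation for the two sequences above:

    #include <stdio.h>

    int main(void) {
        double cpi[3]  = {1, 2, 3};      /* CPI for classes A, B, C */
        double seq1[3] = {2, 1, 2};      /* instruction counts, sequence 1 */
        double seq2[3] = {4, 1, 1};      /* instruction counts, sequence 2 */
        const double *seqs[2] = {seq1, seq2};

        for (int s = 0; s < 2; s++) {
            double cycles = 0, count = 0;
            for (int i = 0; i < 3; i++) {
                cycles += cpi[i] * seqs[s][i];   /* sum of CPI_i x IC_i */
                count  += seqs[s][i];
            }
            printf("Sequence %d: %.0f cycles, avg CPI = %.1f\n",
                   s + 1, cycles, cycles / count);
        }
        return 0;   /* prints 10 cycles, CPI 2.0; then 9 cycles, CPI 1.5 */
    }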
Performance Summary
 Performance depends on
 Algorithm: affects IC, possibly CPI
 Programming language: affects IC, CPI
 Compiler: affects IC, CPI
 Instruction set architecture: affects IC, CPI, Tc
The BIG Picture
CPU Time = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle
Power Trends
 In CMOS IC technology (§1.5 The Power Wall)
Power = Capacitive load × Voltage² × Frequency
[Chart: clock frequency grew ×1000 while supply voltage dropped from
5V to 1V; power grew ×30.]
Reducing Power
 Suppose a new CPU has
 85% of capacitive load of old CPU
 15% voltage and 15% frequency reduction
P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) /
                (C_old × V_old² × F_old)
              = 0.85⁴ ≈ 0.52
 The power wall
 We can’t reduce voltage further
 We can’t remove more heat
 How else can we improve performance?
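The power ratio as a one-liner in C:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Power ~ C x V^2 x F. With 85% capacitive load and 15% reductions
           in both voltage and frequency, every factor scales by 0.85. */
        printf("P_new / P_old = %.2f\n", pow(0.85, 4));   /* about 0.52 */
        return 0;
    }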
Uniprocessor Performance
(§1.6 The Sea Change: The Switch to Multiprocessors)
[Chart: uniprocessor performance over time – growth now constrained by
power, instruction-level parallelism, and memory latency.]
Multiprocessors
 Multicore microprocessors
 More than one processor per chip
 Requires explicitly parallel programming
 Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
 Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
SPEC CPU Benchmark
 Programs used to measure performance
 Supposedly typical of actual workload
 Standard Performance Evaluation Corp (SPEC)
 Develops benchmarks for CPU, I/O, Web, …
 SPEC CPU2006
 Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
 Normalize relative to reference machine
 Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
SPECratio summary = (∏ Execution time ratio_i)^(1/n), the geometric
mean over the n benchmarks
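A small C sketch of the geometric-mean summary (the SPECratios are taken from the CINT2006 table on the next slide):

    #include <stdio.h>
    #include <math.h>

    /* Geometric mean: the n-th root of the product of the ratios.
       Computed in log space to avoid overflow for large n. */
    double geometric_mean(const double *ratios, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ratios[i]);
        return exp(log_sum / n);
    }

    int main(void) {
        double specratios[] = {15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
                               14.5, 19.8, 22.3, 9.1, 9.1, 6.0};
        int n = sizeof specratios / sizeof specratios[0];
        printf("Geometric mean = %.1f\n",
               geometric_mean(specratios, n));   /* about 11.7 */
        return 0;
    }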
CINT2006 for Opteron X4 2356
Name        Description                    IC×10⁹   CPI   Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing   2,118   0.75   0.40      637       9,777      15.3
bzip2       Block-sorting compression       2,389   0.85   0.40      817       9,650      11.8
gcc         GNU C Compiler                  1,050   1.72   0.47      724       8,050      11.1
mcf         Combinatorial optimization        336  10.00   0.40    1,345       9,120       6.8
go          Go game (AI)                    1,658   1.09   0.40      721      10,490      14.6
hmmer       Search gene sequence            2,783   0.80   0.40      890       9,330      10.5
sjeng       Chess game (AI)                 2,176   0.96   0.48      837      12,100      14.5
libquantum  Quantum computer simulation     1,623   1.61   0.40    1,047      20,720      19.8
h264avc     Video compression               3,102   0.80   0.40      993      22,130      22.3
omnetpp     Discrete event simulation         587   2.94   0.40      690       6,250       9.1
astar       Games/path finding              1,082   1.79   0.40      773       7,020       9.1
xalancbmk   XML parsing                     1,058   2.70   0.40    1,143       6,900       6.0
Geometric mean                                                                             11.7
(High-CPI entries such as mcf reflect high cache miss rates.)
SPEC Power Benchmark
 Power consumption of server at different workload levels
 Performance: ssj_ops/sec
 Power: Watts (Joules/sec)
Overall ssj_ops per Watt = Σ ssj_ops_i / Σ power_i, summed over the 11
load levels i = 0…10
SPECpower_ssj2008 for X4
Target Load % Performance (ssj_ops/sec) Average Power (Watts)
100% 231,867 295
90% 211,282 286
80% 185,803 275
70% 163,427 265
60% 140,160 256
50% 118,324 246
40% 92,035 233
30% 70,500 222
20% 47,126 206
10% 23,066 180
0% 0 141
Overall sum 1,283,590 2,605
∑ssj_ops/ ∑power 493
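Verifying the overall figure in C, using the rows of the table above (with the 40% entry read as 92,035):

    #include <stdio.h>

    int main(void) {
        /* Performance (ssj_ops/sec) and average power (W), 100% load down
           to 0%. */
        double ssj_ops[11] = {231867, 211282, 185803, 163427, 140160,
                              118324, 92035, 70500, 47126, 23066, 0};
        double watts[11]   = {295, 286, 275, 265, 256,
                              246, 233, 222, 206, 180, 141};

        double ops_sum = 0, power_sum = 0;
        for (int i = 0; i < 11; i++) {
            ops_sum   += ssj_ops[i];
            power_sum += watts[i];
        }
        printf("Overall ssj_ops per watt = %.0f\n",
               ops_sum / power_sum);   /* about 493 */
        return 0;
    }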
Pitfall: Amdahl’s Law
 Improving an aspect of a computer and expecting a proportional
improvement in overall performance
(§1.8 Fallacies and Pitfalls)

T_improved = T_affected / improvement factor + T_unaffected

 Example: multiply accounts for 80s/100s
 How much improvement in multiply performance to get 5× overall?
 We would need 20 = 80/n + 20, i.e., 80/n = 0 – can’t be done!
 Corollary: make the common case fast
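A small C loop showing how the overall speedup saturates (program and split taken from the example above):

    #include <stdio.h>

    int main(void) {
        /* Amdahl's Law: T_improved = T_affected / factor + T_unaffected.
           Multiply accounts for 80 s of a 100 s program. */
        double t_affected = 80.0, t_unaffected = 20.0;

        for (double n = 2; n <= 64; n *= 2) {
            double t = t_affected / n + t_unaffected;
            printf("multiply %2.0fx faster: total %5.1f s (%.2fx overall)\n",
                   n, t, 100.0 / t);
        }
        /* Even an infinitely fast multiply leaves the 20 s unaffected part,
           so the overall speedup can never reach 100/20 = 5x. */
        return 0;
    }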
Fallacy: Low Power at Idle
 Look back at X4 power benchmark
 At 100% load: 295W
 At 50% load: 246W (83%)
 At 10% load: 180W (61%)
 Google data center
 Mostly operates at 10% – 50% load
 At 100% load less than 1% of the time
 Consider designing processors to make power
proportional to load
Pitfall: MIPS as a Performance Metric
 MIPS: Millions of Instructions Per Second
 Doesn’t account for
- Differences in ISAs between computers
- Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)
 CPI varies between programs on a given CPU
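A hypothetical two-machine comparison in C showing why the metric misleads (both machines and all numbers are made up for illustration):

    #include <stdio.h>

    int main(void) {
        double clock = 2e9;          /* 2 GHz clock on both machines */

        /* Same program: machine 1's ISA needs more, simpler instructions. */
        double ic1 = 4e9, cpi1 = 1.0;
        double ic2 = 1e9, cpi2 = 2.0;

        double time1 = ic1 * cpi1 / clock;     /* 2.0 s */
        double time2 = ic2 * cpi2 / clock;     /* 1.0 s */
        double mips1 = clock / (cpi1 * 1e6);   /* 2000 MIPS */
        double mips2 = clock / (cpi2 * 1e6);   /* 1000 MIPS */

        /* Machine 1 has twice the MIPS rating yet takes twice as long. */
        printf("machine 1: %4.0f MIPS, %.1f s\n", mips1, time1);
        printf("machine 2: %4.0f MIPS, %.1f s\n", mips2, time2);
        return 0;
    }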
Concluding Remarks (§1.9)
 Cost/performance is improving
 Due to underlying technology development
 Hierarchical layers of abstraction
 In both hardware and software
 Instruction set architecture
 The hardware/software interface
 Execution time: the best performance measure
 Power is a limiting factor
 Use parallelism to improve performance