License Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
inst.eecs.berkeley.edu/~cs61c   UCB CS61C: Machine Structures
Lecture 40 – Parallelism in Processor Design   2008-05-05   Lecturer SOE Dan Garcia
How parallel is your processor?
UC Berkeley has partnered with Intel and Microsoft to build the world’s #1 research lab to “accelerate developments in parallel computing and advance the powerful benefits of multi-core processing to mainstream consumer and business computers.” (parlab.eecs.berkeley.edu)
Background: Threads
A thread (“thread of execution”) is a single stream of instructions. A program can split, or fork, itself into separate threads, which can (in theory) execute simultaneously. Each thread has its own registers, PC, etc. Threads from the same process operate in the same virtual address space, so switching threads is faster than switching processes! Threads are an easy way to describe and think about parallelism. A single CPU can execute many threads by time-division multiplexing.
[Figure: CPU time shared in turn among Thread 0, Thread 1, Thread 2]
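To make the fork/join picture concrete, here is a minimal sketch using POSIX threads. It is not from the lecture; the worker function and array names are illustrative only.

```c
/* Minimal POSIX-threads sketch of the fork/join idea on the slide:
 * the program splits into worker threads that share the process's
 * address space (the global array), then joins them. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define LEN      1000

static int data[LEN];                 /* shared: same virtual address space */

static void *worker(void *arg) {
    long id = (long)arg;              /* each thread has its own registers/PC/stack */
    for (int i = id; i < LEN; i += NTHREADS)
        data[i] = i * i;              /* each thread handles a strided slice */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)       /* "fork" into threads */
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)        /* wait for all to finish */
        pthread_join(t[id], NULL);
    printf("data[999] = %d\n", data[999]);
    return 0;
}
```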
Background: Multithreading
Multithreading is running multiple threads through the same hardware. Could we do time-division multiplexing better in hardware? Sure, if we had the HW to support it!
Background: Multicore
Put multiple CPUs on the same die. Why is this better than multiple dies? Smaller and cheaper; closer, so lower inter-processor latency; can share an L2 cache (complicated); less power. Cost of multicore: complexity and slower single-thread execution.
Cell Processor (heart of the PS3)
9 cores (1 PPE, 8 SPEs) at 3.2 GHz.
Power Processing Element (PPE): supervises all activities and allocates work; is multithreaded (2 threads).
Synergistic Processing Element (SPE): where the work gets done; very superscalar; no cache, only a “Local Store” (aka scratchpad RAM). During testing, one SPE is “locked out”, i.e., if it didn’t work, it is shut down.
Peer Instruction
A. The majority of the PS3’s processing power comes from the Cell processor.
B. Berkeley profs believe multicore is the future of computing.
C. Current multicore techniques can scale well to many (32+) cores.
Answer choices (ABC): 0: FFF   1: FFT   2: FTF   3: FTT   4: TFF   5: TFT   6: TTF   7: TTT
Peer Instruction Answer
A. FALSE – The whole PS3 is 2.18 TFLOPS; the Cell alone is only 204 GFLOPS (the GPU can do a lot…).
B. FALSE – Not multicore, manycore!
C. FALSE – Shared memory and caches are a huge barrier; that’s why the Cell has a Local Store!
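A quick arithmetic check of statement A, using only the peak figures quoted on the slide (this calculation is not from the lecture):

\[
\frac{204\ \mathrm{GFLOPS\ (Cell)}}{2.18\ \mathrm{TFLOPS\ (whole\ PS3)}} \approx 0.094 \approx 9\%,
\]

so the Cell supplies well under half of the PS3’s peak compute, which is why A is false.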
Conventional Wisdom (CW) in Computer Architecture
Old CW: Power is free, but transistors are expensive.
New CW: Power wall. Power is expensive, transistors are “free” – we can put more transistors on a chip than we have the power to turn on.
Old CW: Multiplies are slow, but loads are fast.
New CW: Memory wall. Loads are slow, multiplies are fast – 200 clocks to DRAM, but even an FP multiply takes only 4 clocks.
Old CW: More ILP via compiler / architecture innovation – branch prediction, speculation, out-of-order execution, VLIW, …
New CW: ILP wall. Diminishing returns on more ILP.
Old CW: 2X CPU performance every 18 months.
New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall.
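To see why the memory wall dominates, a back-of-the-envelope example using the slide’s latencies. The cache parameters are assumptions, not from the slide: 64-byte cache lines and 8-byte doubles, so a streaming loop takes one DRAM miss per 8 elements while doing one FP multiply per element:

\[
\text{cycles/element} \approx \underbrace{4}_{\text{FP multiply}} + \underbrace{\tfrac{200}{8}}_{\text{amortized DRAM miss}} = 29,
\]

so roughly \(25/29 \approx 86\%\) of the time is spent waiting on memory, not multiplying.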
Uniprocessor Performance (SPECint)
[Chart from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006; chart annotation: 3X]
VAX: 25%/year, 1978 to 1986
RISC + x86: 52%/year, 1986 to 2002
RISC + x86: ??%/year, 2002 to present
⇒ Sea change in chip design: multiple “cores” or processors per chip
Sea Change in Chip Design
Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10-micron PMOS, 11 mm² chip.
RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3-micron NMOS, 60 mm² chip.
A 125 mm² chip in 0.065-micron CMOS = 2312 RISC IIs + FPU + I-cache + D-cache; RISC II shrinks to ≈0.02 mm² at 65 nm.
Caches via DRAM, or 1-transistor SRAM, or 3D chip stacking. Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
The processor is the new transistor!
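A rough check of the “shrinks to ≈0.02 mm²” claim (not from the slide), assuming die area scales with the square of the feature size and ignoring wiring, design-rule, and layout differences:

\[
60\ \mathrm{mm}^2 \times \left(\frac{0.065\ \mu\mathrm{m}}{3\ \mu\mathrm{m}}\right)^2 \approx 60 \times 4.7\times10^{-4} \approx 0.028\ \mathrm{mm}^2,
\]

in the same ballpark as the slide’s ≈0.02 mm².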
Parallelism again? What’s different this time?
“This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism; instead, this plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures.” – Berkeley View, December 2006 (view.eecs.berkeley.edu)
The HW/SW industry has bet its future that breakthroughs will appear before it’s too late.
Need a New Approach
Berkeley researchers from many backgrounds met between February 2005 and December 2006 to discuss parallelism: circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis. Participants: Krste Asanovic, Ras Bodik, Jim Demmel, Edward Lee, John Kubiatowicz, George Necula, Kurt Keutzer, Dave Patterson, Koushik Sen, John Shalf, Kathy Yelick and others. They tried to learn from successes in embedded and high-performance computing (HPC). This led to 7 questions to frame parallel research.
7 Questions for Parallelism
Applications: 1. What are the apps? 2. What are kernels of apps?
Architecture & Hardware: 3. What are HW building blocks? 4. How to connect them?
Programming Model & Systems Software: 5. How to describe apps & kernels? 6. How to program the HW?
Evaluation: 7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)
Hardware Tower: What are the problems?
Power limits leading-edge chip designs (the Intel Tejas Pentium 4 was cancelled due to power issues). Yield on leading-edge processes is dropping dramatically (IBM quotes yields of 10–20% on the 8-processor Cell). Design/validation of a leading-edge chip is becoming unmanageable (verification teams are now larger than design teams on leading-edge processors).
HW Solution: Small is Beautiful
Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, and Single-Instruction Multiple-Data (SIMD) Processing Elements (PEs); small cores are not much slower than large cores. Parallelism is the energy-efficient path to performance: Power ∝ Voltage². Lowering threshold and supply voltages lowers energy per op. Redundant processors can improve chip yield (Cisco Metro: 188 CPUs + 4 spares; Cell in the PS3). Small, regular processing elements are easier to verify.
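The Power ∝ Voltage² point, made concrete with the usual dynamic-power model (this worked example is not from the slide; it assumes frequency scales roughly linearly with supply voltage, the workload parallelizes perfectly, and leakage is ignored):

\[
P \approx \alpha C V^{2} f, \qquad
\frac{P_{\text{half-V, half-f core}}}{P_{\text{fast core}}} = \left(\tfrac{1}{2}\right)^{2}\cdot\tfrac{1}{2} = \tfrac{1}{8}.
\]

Four such slow cores then deliver about \(4 \times \tfrac{1}{2} = 2\times\) the throughput for \(4 \times \tfrac{1}{8} = \tfrac{1}{2}\) the power, which is the sense in which parallel is the energy-efficient path to performance.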
Number of Cores/Socket
We need revolution, not evolution. Software or architecture alone can’t fix the parallel programming problem; we need innovations in both. “Multicore”: 2X cores per generation: 2, 4, 8, … “Manycore”: 100s of cores gives the highest performance per unit area and per Watt, then 2X per generation: 64, 128, 256, 512, 1024, … Multicore architectures and programming models that are good for 2 to 32 cores won’t evolve into manycore systems of 1000s of processors ⇒ we desperately need HW/SW models that work for manycore, or we will run out of steam (as ILP ran out of steam at 4 instructions).
Measuring Success: What are the problems?
Only companies can build HW, and it takes years. Software people don’t start working hard until the hardware arrives; 3 months after the HW arrives, the SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW. How do we get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion on algorithms, compilers, languages, OSes, architectures, …? Can we avoid waiting years between HW/SW iterations?
Build Academic Manycore from FPGAs
Since ≈16 CPUs will fit in a Field-Programmable Gate Array (FPGA), build a 1000-CPU system from ≈64 FPGAs? In 2004, 8 simple 32-bit “soft core” RISC processors ran at 100 MHz in a Virtex-II. FPGA generations come every 1.5 years, with ≈2X CPUs and ≈1.2X clock rate. The HW research community does the logic design (“gate shareware”) to create an out-of-the-box manycore: e.g., a 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer at ≈150 MHz/CPU in 2007. RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington. The “Research Accelerator for Multiple Processors” is a vehicle to attract many to the parallel challenge.
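A quick check of the slide’s RAMP sizing (not from the lecture), taking the 2004 Virtex-II numbers and two 1.5-year FPGA generations to 2007, and treating the ≈2X-CPU and ≈1.2X-clock scaling rules as given:

\[
16 \times 64 = 1024\ \text{CPUs}, \qquad
100\ \mathrm{MHz} \times 1.2^{2} = 144\ \mathrm{MHz} \approx 150\ \mathrm{MHz\ per\ CPU\ in\ 2007},
\]

consistent with the ≈1000-CPU, ≈150 MHz/CPU target.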
And in Conclusion…
Everything is changing; the old conventional wisdom is out. We desperately need a new approach to HW and SW based on parallelism, since industry has bet its future that parallelism works. We need to create a “watering hole” to bring everyone together to quickly find that solution: architects, language designers, application experts, numerical analysts, algorithm designers, programmers, …
Bonus slides
These are extra slides that used to be included in the lecture notes but have been moved to this “bonus” area to serve as a supplement. The slides appear in the order they would have in the normal presentation.
Why is Manycore Good for Research?

Criterion                    SMP                    Cluster                Simulate               RAMP
Scalability (1k CPUs)        C                      A                      A                      A
Cost (1k CPUs)               F ($40M)               C ($2-3M)              A+ ($0M)               A ($0.1-0.2M)
Cost of ownership            A                      D                      A                      A
Power/Space (kW, racks)      D (120 kW, 12 racks)   D (120 kW, 12 racks)   A+ (0.1 kW, 0.1 racks) A (1.5 kW, 0.3 racks)
Community                    D                      A                      A                      A
Observability                D                      C                      A+                     A+
Reproducibility              B                      D                      A+                     A+
Reconfigurability            D                      C                      A+                     A+
Credibility                  A+                     A+                     F                      B+/A-
Performance (clock)          A (2 GHz)              A (3 GHz)              F (0 GHz)              C (0.1 GHz)
GPA                          C                      B-                     B                      A-
Multiprocessing Watering Hole
Killer app: all CS research and advanced development. RAMP attracts many communities to a shared artifact ⇒ cross-disciplinary interactions. RAMP as the next standard research/AD platform? (e.g., VAX/BSD Unix in the 1980s)
Example uses clustered around RAMP: parallel file system, flight data recorder, transactional memory, fault insertion to check dependability, data center in a box, Internet in a box, dataflow language/computer, security enhancements, router design, compile to FPGA, parallel languages, 128-bit floating-point libraries.
Reasons for Optimism about the Parallel Revolution This Time
End of the sequential microprocessor and ever-faster clock rates: no looming sequential juggernaut to kill the parallel revolution, and the SW & HW industries are fully committed to parallelism (end of the La-Z-Boy Programming Era). Moore’s Law continues, so soon we can put 1000s of simple cores on an economical chip. Communication between cores within a chip has low latency (20X) and high bandwidth (100X): processor-to-processor is fast even if memory is slow, all cores are an equal distance from shared main memory, and there are fewer data-distribution challenges. The Open Source Software movement means the SW stack can evolve more quickly than in the past. RAMP is a vehicle to ramp up parallel research.

Editor's Notes

  • #16 / #17 (identical note on autotuners): Autotuners is fine; a slightly longer list: PHiPAC – dense linear algebra (Krste, Jim, Jeff Bilmes and others at UCB); PHiPAC = Portable High Performance ANSI C. FFTW – Fastest Fourier Transforms in the West (Matteo Frigo and Steve Johnson at MIT; Matteo is now at IBM). Atlas – dense linear algebra, now the “standard” for many BLAS implementations, used in Matlab, for example (Jack Dongarra, Clint Whaley et al.). Sparsity – sparse linear algebra (Eun-Jin Im and Kathy at UCB). Spiral – DSP algorithms including FFTs and other transforms (Markus Pueschel, José M. F. Moura et al.). OSKI – sparse linear algebra (Rich Vuduc, Jim and Kathy, from the Bebop project at UCB). In addition, groups at Rice, USC, UIUC, Cornell, UT Austin, UCB (Titanium), LLNL and others are working on compilers that include an auto-tuning (search-based) optimization phase. Both the Bebop group and the Atlas group have done work on automatic tuning of collective communication routines for supercomputers/clusters, but this is ongoing. I’ll send a slide with an autotuning example later. – Kathy
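The note above describes search-based autotuners. The sketch below is not from the lecture and uses a deliberately tiny search space; it only illustrates the basic idea of timing candidate implementations and keeping the fastest, which real autotuners (ATLAS, PHiPAC, OSKI, …) do over far larger parameter spaces.

```c
/* Minimal sketch of the "search-based" autotuning idea described in the note:
 * time a blocked matrix multiply for several candidate block sizes and keep
 * the fastest. The kernel and candidate list are illustrative only. */
#include <stdio.h>
#include <time.h>

#define N 256
static double A[N][N], B[N][N], C[N][N];

/* Candidate kernel: blocked matrix multiply with block size bs (bs divides N). */
static void mm_blocked(int bs) {
    for (int ii = 0; ii < N; ii += bs)
        for (int jj = 0; jj < N; jj += bs)
            for (int kk = 0; kk < N; kk += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int j = jj; j < jj + bs; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + bs; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}

int main(void) {
    const int candidates[] = {8, 16, 32, 64};   /* tiny search space, for illustration */
    int best_bs = candidates[0];
    double best_time = 1e30;

    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        clock_t t0 = clock();
        mm_blocked(candidates[c]);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %2d: %.3f s\n", candidates[c], secs);
        if (secs < best_time) { best_time = secs; best_bs = candidates[c]; }
    }
    printf("autotuner picks block size %d\n", best_bs);
    return 0;
}
```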