Compiler Optimization Techniques
CP 7031
Dr.K.Thirunadana Sikamani
Principal Sources of Optimization
The elimination of unnecessary instructions in object code, or the replacement of one sequence of instructions by a faster sequence that does the same thing, is usually called "code improvement" or "code optimization".
- Redundancy
- Semantics-preserving transformations
- Global common subexpressions
- Copy propagation
- Dead-code elimination
- Code motion
The speed of a program run on a processor with instruction-level parallelism depends on
1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Our ability to extract parallelism from the original sequential program.
4. Our ability to find the best parallel schedule given scheduling constraints.
Processor Architecture
1. Instruction pipelines and branch delays
2. Pipelined execution
3. Multiple instruction issue - VLIW (Very Long Instruction Word)
Code Scheduling Constraints
1. Control-dependence constraints
2. Data-dependence constraints
3. Resource constraints
Control-Dependence Constraints
All the operations executed in the original program must be executed in the optimized one.
Data-Dependence Constraints
The operations in the optimized program must produce the same results as the corresponding ones in the original program.
Resource Constraints
The schedule must not oversubscribe the resources on the machine.
Data Dependence
True dependence - read after write (RAW)
Antidependence - write after read (WAR)
Output dependence - write after write (WAW)
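For instance, all three kinds arise in this three-statement fragment (illustrative C, on the variable x):

  x = a + b;    /* S1 writes x */
  c = x * 2;    /* S2 reads x:  S1 -> S2 is a true dependence (RAW)   */
  x = d - e;    /* S3 writes x: S2 -> S3 is an antidependence (WAR),
                                S1 -> S3 is an output dependence (WAW) */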
Classify the dependences for the following statements:
1. a = b
2. c = d
3. b = c
4. d = a
5. c = d
6. a = b

Check the dependences for the pairs 1 and 4, 3 and 5, and 1 and 6.
(Answers: 1 -> 4 is a true dependence on a; 3 -> 5 is an antidependence on c; 1 -> 6 is an output dependence on a.)
Give the register-level machine code that provides maximum parallelism for the expression ((u+v) + (w+x)) + (y+z); also give the solution with minimal register usage.

Minimal register usage (three registers, fully sequential):
  LD r1,u
  LD r2,v
  ADD r1,r1,r2
  LD r2,w
  LD r3,x
  ADD r2,r2,r3
  ADD r1,r1,r2
  LD r2,y
  LD r3,z
  ADD r2,r2,r3
  ADD r1,r1,r2
Maximum parallelism (six registers, four clocks):
  Clock 1:  LD r1,u   LD r2,v   LD r3,w   LD r4,x   LD r5,y   LD r6,z
  Clock 2:  ADD r1,r1,r2   ADD r3,r3,r4   ADD r5,r5,r6
  Clock 3:  ADD r1,r1,r3
  Clock 4:  ADD r1,r1,r5
Implementation of the parallelism in 4 clocks.
Finding Dependences among Memory Accesses
1. Array data-dependence analysis
     for ( i = 0; i < n; i++)
         A[2*i] = A[2*i+1];
   (the writes touch only even elements and the reads only odd ones, so the iterations are independent; a GCD-style test for this is sketched below)
2. Pointer-alias analysis
   Two pointers are aliased if they refer to the same object.
3. Interprocedural analysis
   Determines whether the same variable is passed as two or more different arguments in a language that passes parameters by reference.
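A minimal sketch of the classic GCD independence test in C for affine subscripts of the form a*i + b (the function names and simplifications are mine; real array dependence analysis also uses loop bounds and direction vectors):

  #include <stdio.h>

  /* Greatest common divisor, used by the GCD dependence test. */
  static int gcd(int a, int b)
  {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* Accesses A[a*i + b] and A[c*j + d] can touch the same element only
   * if gcd(a, c) divides (d - b); otherwise they are provably
   * independent. */
  static int may_depend(int a, int b, int c, int d)
  {
      int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
      if (g == 0)
          return b == d;        /* both subscripts are constants */
      return (d - b) % g == 0;
  }

  int main(void)
  {
      /* A[2*i] vs. A[2*i+1]: gcd(2,2) = 2 does not divide 1, so the
       * write and the read are independent -- prints 0. */
      printf("%d\n", may_depend(2, 0, 2, 1));
      return 0;
  }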
Tradeoff between Register Usage and Parallelism
E.g., machine-independent intermediate-representation code:
  LD t1, a
  ST b, t1
  LD t2, c
  ST d, t2
The code above copies the values of a and c to b and d. If all the memory locations are distinct, the two copies can proceed in parallel. If, on the other hand, t1 and t2 are assigned the same register to minimize register usage, the second copy cannot start until the first one has finished.
Tradeoff between Register Usage and Parallelism
[Syntax tree for the expression ((a + b) + c) + (d + e): four + nodes, with a, b, c, d, e at the leaves]

Machine code (one accumulator, minimal registers):
  LD r1, a
  LD r2, b
  ADD r1,r1,r2
  LD r2, c
  ADD r1,r1,r2
  LD r2, d
  LD r3, e
  ADD r2,r2,r3
  ADD r1,r1,r2

Parallel evaluation of the expression (one register per value):
  r1 = a    r2 = b    r3 = c    r4 = d    r5 = e
  r6 = r1 + r2        r7 = r4 + r5
  r8 = r6 + r3
  r9 = r8 + r7
Phase Ordering between Register Allocation and Code Scheduling
- If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling.
- Conversely, if scheduling is done first, the schedule created may require so many registers that register spilling becomes necessary.
Spilling - storing the contents of a register in a memory location so the register can be used for some other purpose.
Which phase comes first is decided based on the characteristics of the program, e.g., numeric, non-numeric, etc.
Control Dependence
- if ( c ) s1; else s2;   /* s1 and s2 are control dependent on c */
- while ( c ) s;          /* s is control dependent on c */
- if ( a > t )
      b = a * a;
  d = a + c;              /* d = a + c is not control dependent on a > t */
Speculative Execution Support
- Prefetching - bringing data from memory into the cache before it is used.
- Poison bits - support speculative loads from memory into the register file. Each register is augmented with a poison bit; the bit is set when an illegal memory address is accessed, so that the exception is raised only when the loaded value is actually used later.
Predicated Execution
- Predicated instructions were invented to reduce the number of branches in a program.
- A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution.
- E.g., CMOVZ R2, R3, R1 has the semantics of moving the contents of R3 to R2 if R1 is zero.
if ( a == 0 ) b = c + d; can be implemented as
  ADD R3, R4, R5     /* a, b, c, d are allotted R1, R2, R4, R5 */
  CMOVZ R2, R3, R1
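The same idea sketched in C (a hypothetical function; whether the compiler actually emits a conditional move here depends on the target and optimization level):

  #include <stdio.h>

  /* Branch-free form of "if (a == 0) b = c + d;".  The sum is computed
   * unconditionally, like the ADD above, and the select plays the role
   * of CMOVZ. */
  static int predicated(int a, int b, int c, int d)
  {
      int t = c + d;             /* always executed */
      return (a == 0) ? t : b;   /* guarded move */
  }

  int main(void)
  {
      printf("%d\n", predicated(0, 9, 2, 3));  /* prints 5 */
      printf("%d\n", predicated(1, 9, 2, 3));  /* prints 9 */
      return 0;
  }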
Basic Machine Model
Many machines can be represented as
  M = <R, T>
T - a set of operation types, such as loads, stores, and arithmetic operations.
R - a vector R = [r1, r2, ...] of hardware resources;
ri is the number of units available of the ith kind of resource.
Resources - memory-access units, ALUs, floating-point functional units.
Basic Machine Model
- Each operation has a set of input operands, a set of output operands, and a resource requirement.
- RTt - the resource-reservation table of operation type t.
- RTt[i, j] is the number of units of the jth resource used by an operation of type t, i clocks after it is issued.
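A tiny data-structure sketch of RTt in C (the resource kinds, latencies, and numbers are made up for illustration):

  #define NUM_TYPES 2    /* 0 = load, 1 = add         */
  #define MAX_LAT   2    /* rows: clocks after issue  */
  #define NUM_RES   2    /* 0 = memory unit, 1 = ALU  */

  /* RT[t][i][j]: units of resource j used by an operation of type t,
   * i clocks after issue. */
  static const int RT[NUM_TYPES][MAX_LAT][NUM_RES] = {
      /* load: occupies the memory unit for two clocks */
      { {1, 0}, {1, 0} },
      /* add: occupies the ALU for one clock */
      { {0, 1}, {0, 0} },
  };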
Basic-Block Scheduling
Data-Dependence Graphs
Graph G = (N, E)
N - a set of nodes representing the operations in the machine instructions.
E - a set of directed edges representing the data-dependence constraints among the operations.
1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n.
2. Each edge e in E is labeled with a delay de, indicating that the destination node must be issued no earlier than de clocks after the source node is issued.
Data-Dependence Graph
  i1: LD R2, 0(R1)
  i2: ST 4(R1), R2
  i3: ADD R3,R3,R2
  i4: ADD R3, R3, R4
  i5: LD R3, 8(R1)
  i6: ST 0(R7), R7
  i7: ST 12(R1), R3
[Data-dependence graph over i1-i7, with each edge labeled by its delay of 1 or 2 clocks]
Notes:
1. A load operation takes 2 clock cycles.
2. R1 is a stack pointer with offsets from 0 to 12.
List Scheduling of Basic Blocks
- This involves visiting each node of the data-dependence graph in "prioritized topological order".
- Machine-resource vector R = [r1, r2, r3, ...], where ri is the number of units available of the ith kind of resource.
- G = (N, E) is the data-dependence graph.
- RTn is the resource-reservation table of node n.
- An edge e = n1 -> n2 with delay de indicates that n2 may be issued only de clocks after n1.
List Scheduling Algorithm
RT = an empty reservation table;
for (each n in N in prioritized topological order) {
    s = max over edges e = p -> n in E of (S(p) + de);
        /* the earliest time this instruction could begin,
           given when its predecessors started */
    while (there exists i such that RT[s + i] + RTn[i] > R)
        s = s + 1;
        /* delay the instruction further until the needed
           resources are available */
    S(n) = s;
    for (all i)
        RT[s + i] = RT[s + i] + RTn[i];
}
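A compilable sketch of the same algorithm in C, under assumed limits (MAX_N operations, NUM_RES resource kinds, MAX_LAT-row reservation tables); the data layout is mine, not the slides':

  #include <string.h>

  #define MAX_N   64    /* max operations in the basic block     */
  #define MAX_CLK 256   /* max schedule length                   */
  #define NUM_RES 2     /* hypothetical: 0 = ALU, 1 = memory     */
  #define MAX_LAT 4     /* rows in a resource-reservation table  */

  typedef struct {
      int rt[MAX_LAT][NUM_RES]; /* RTn for this operation            */
      int npred;                /* number of dependence predecessors */
      int pred[MAX_N];          /* predecessor node ids              */
      int delay[MAX_N];         /* edge delay from each predecessor  */
  } Op;

  /* order[0..n-1] is a prioritized topological order of the nodes;
   * R[j] is the number of units of resource j; on return, S[n] is
   * the issue clock of every operation. */
  void list_schedule(const Op *nodes, int n, const int *order,
                     const int R[NUM_RES], int S[])
  {
      static int busy[MAX_CLK + MAX_LAT][NUM_RES]; /* global RT */
      memset(busy, 0, sizeof busy);

      for (int k = 0; k < n; k++) {
          const Op *op = &nodes[order[k]];
          int s = 0;
          /* earliest start: after every predecessor plus its delay */
          for (int p = 0; p < op->npred; p++) {
              int t = S[op->pred[p]] + op->delay[p];
              if (t > s) s = t;
          }
          /* push the start later until no resource is oversubscribed */
          for (;;) {
              int ok = 1;
              for (int i = 0; ok && i < MAX_LAT; i++)
                  for (int j = 0; j < NUM_RES; j++)
                      if (busy[s + i][j] + op->rt[i][j] > R[j]) {
                          ok = 0;
                          break;
                      }
              if (ok) break;
              s++;
          }
          S[order[k]] = s;
          for (int i = 0; i < MAX_LAT; i++)  /* commit reservations */
              for (int j = 0; j < NUM_RES; j++)
                  busy[s + i][j] += op->rt[i][j];
      }
  }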
Prioritized Topological Order
Possible prioritized orderings:
1) Critical path - the longest path through the data-dependence graph.
   Height of a node - the length of the longest path in the graph originating from that node (a sketch of its computation follows below).
2) The length of the schedule is constrained by the resources available.
   Critical resource - the one with the largest ratio of uses to the number of units of that resource available. Operations using more critical resources may be given higher priority.
3) Source ordering - an operation that shows up earlier in the source program is scheduled first.
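A memoized height computation in C, assuming a successor adjacency list with per-edge delays (all the names are illustrative):

  #define MAX_N 64   /* max nodes in the dependence graph */

  /* height(n): longest delay-weighted path starting at node n.
   * The nsucc[n] successors of n are succ[n][k], each with
   * delay[n][k]; memo[] must be initialized to -1. */
  int height(int n, int nsucc[], int succ[][MAX_N],
             int delay[][MAX_N], int memo[])
  {
      if (memo[n] >= 0)
          return memo[n];
      int h = 0;                       /* leaves have height 0 */
      for (int k = 0; k < nsucc[n]; k++) {
          int c = delay[n][k] +
                  height(succ[n][k], nsucc, succ, delay, memo);
          if (c > h) h = c;
      }
      return memo[n] = h;
  }

Scheduling nodes in decreasing height approximates critical-path-first priority.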
Result of Applying List Scheduling
(for the data-dependence graph example above, using height as the priority function)

  ALU               Memory
                    LD R3, 8(R1)
                    LD R2, 0(R1)
  ADD R3,R3,R4      /* issued after the 2-clock load delay */
  ADD R3,R3,R2      ST 4(R1), R2
                    ST 12(R1), R3
                    ST 0(R7), R7
Global Code Scheduling
- Strategies that consider more than one basic block at a time are referred to as global scheduling.
- Conditions (control and data dependences must be obeyed):
1. All instructions in the original program are executed in the optimized one, and
2. while the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.
Basic Block
A basic block is a sequence of instructions in which control enters through the first instruction and leaves via the last instruction, with no halt or possibility of a jump or branch in between (the flow is linear). A sketch of how blocks are delimited follows.
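A minimal C sketch of partitioning code into basic blocks by finding leaders (the Instr layout is assumed for illustration):

  #include <stdbool.h>

  typedef struct {
      bool is_branch; /* conditional or unconditional jump    */
      int  target;    /* index of the jump target, -1 if none */
  } Instr;

  /* Leaders: (1) the first instruction, (2) any branch target,
   * (3) any instruction immediately following a branch.  Each basic
   * block runs from a leader up to, but not including, the next. */
  void find_leaders(const Instr *code, int n, bool leader[])
  {
      for (int i = 0; i < n; i++)
          leader[i] = false;
      if (n > 0)
          leader[0] = true;                      /* rule 1 */
      for (int i = 0; i < n; i++) {
          if (!code[i].is_branch)
              continue;
          if (code[i].target >= 0 && code[i].target < n)
              leader[code[i].target] = true;     /* rule 2 */
          if (i + 1 < n)
              leader[i + 1] = true;              /* rule 3 */
      }
  }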
Primitive Code Motion
Source program:
  if ( a == 0 ) goto L
  c = b
  L:  e = d + d
Locally Scheduled Machine Code
  B1:  LD R6, 0(R1)
       nop
       BEQZ R6, L
  B2:  LD R7, 0(R2)
       nop
       ST 0(R3), R7
  B3:  L:
       LD R8, 0(R4)
       nop
       ADD R8,R8,R8
       ST 0(R5), R8
Globally Scheduled Machine Code
  B1:  LD R6, 0(R1)
       LD R8, 0(R4)
       LD R7, 0(R2)
       ADD R8,R8,R8
       BEQZ R6, L
  B3': ST 0(R5), R8    /* copy of the moved store on the fall-through path */
       ST 0(R3), R7
  B3:  L:
       ST 0(R5), R8
Upward Code Motion
Moves an operation from block src up a control-flow path to block dst; such a move must not violate any data dependences, and it should make the paths through dst and src run faster.
Case 1: src does not postdominate dst.
In this case there exists a path that passes through dst but does not reach src, so the moved operation is executed speculatively. The code motion is illegal unless the operation moved has no unwanted side effects.
Contd...
Case 2: dst does not dominate src.
In this case there exists a path that reaches src without first going through dst, so we need to move copies of the moved operation along such paths.
Constraints:
1. The operands of the operation must hold the same values as in the original,
2. the result must not overwrite a value that is still needed, and
3. it must not itself be subsequently overwritten before reaching src.
Downward Code Motion
Moves an operation from block src down a control-flow path to block dst.
Case 1: src does not dominate dst - there exists a path to dst that does not pass through src.
Case 2: dst does not postdominate src - there exists a path through src that does not pass through dst.
E.g.,
  if ( x == 0 ) a = b;
  else a = c;
  d = a;
Memory locations: x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9)

  B1 (x == 0):
    LD R1, x
    nop
    BEQZ R1, L
  B2 (a = c):
    LD R3, c
    nop
    ST a, R3
  B3 (L: a = b):
    LD R2, b
    nop
    ST a, R2
  B4 (d = a):
    LD R4, a
    nop
    ST d, R4
E.g., after global code motion:
  if ( x == 0 ) a = b;
  else a = c;
  d = a;
Memory locations: x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9)

  B1:
    LD R1, 0(R5)    LD R3, 0(R7)
    LD R2, 0(R6)
    ST 0(R8), R3    /* a = c, executed speculatively */
    BEQZ R1, L      /* the branch and store can be replaced by CMOVZ 0(R8), R2, R1 */
  B2 (L: a = b):
    ST 0(R8), R2
  B4 (d = a):
    LD R4, 0(R8)
    nop
    ST 0(R9), R4
Updating Data Dependences
- Code motion can change the data-dependence relations between operations, so the data dependences must be updated after each code motion.
E.g., two parallel branches assign x = 1 and x = 2, and x is not live before the code motion. If one of the assignments is moved up above the branch, the other can no longer be moved: the motion creates an output dependence between the two.
Global Scheduling Algorithms
- Region-based scheduling
The two easiest forms of code motion:
1. moving operations up to control-equivalent basic blocks, and
2. moving operations speculatively up one branch to a dominating predecessor.
Assignment: the region-based scheduling algorithm.
Loop Unrolling
Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism.
  for (i = 0; i < N; i++) {
    S(i);
  }
can be unrolled to
  for (i = 0; i + 4 < N; i += 4) {
    S(i);
    S(i+1);
    S(i+2);
    S(i+3);
  }
(with a cleanup loop for the leftover iterations), and
  repeat
    S;
  until C;
can be unrolled to
  repeat {
    S;
    if (C) break;
    S;
    if (C) break;
    S;
  } until C;
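A compilable sketch of the 4x unrolling in C, with the bound written as i + 4 <= N and an explicit cleanup loop so no iteration is skipped (the body S() is a stand-in):

  void S(int i);  /* stand-in for the loop body */

  void run_unrolled(int N)
  {
      int i;
      /* main unrolled loop: four independent copies of the body
       * give the scheduler more instructions to overlap */
      for (i = 0; i + 4 <= N; i += 4) {
          S(i);
          S(i + 1);
          S(i + 2);
          S(i + 3);
      }
      /* cleanup loop: up to three leftover iterations */
      for (; i < N; i++)
          S(i);
  }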
Neighborhood Compaction
- Examine each pair of basic blocks that are executed one after the other, and check whether any operation can be moved up or down between them to improve the execution time of those blocks.
- If such a pair is found, check whether the instruction to be moved needs to be duplicated along other paths.
Advanced Code Motion Techniques
- Add new basic blocks along control-flow edges originating from blocks with more than one predecessor, and move instructions out of basic blocks so that a block can be eliminated completely.
- The code to be executed in each basic block is scheduled once and for all as each block is visited, because the algorithms only move operations up to dominating blocks.
- Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order; we move all operations that
  i) can be moved, and
  ii) cannot be executed in their native block.
Interaction with Dynamic Schedulers
- A dynamic scheduler can create new schedules according to run-time conditions.
- High-latency instructions are issued early.
- Data-prefetch instructions help the dynamic scheduler by making the data available in advance.
- Data-dependent operations must be put in the correct order to ensure program correctness. For best performance, the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not.
- Branch misprediction must be avoided.
Software Pipelining
Software Pipelining
- Numerical applications often have loops whose iterations are completely independent of one another.
- Such loops with many iterations have enough parallelism to saturate all the resources in a processor; it is up to the scheduler to take full advantage of the available parallelism.
- Software pipelining schedules an entire loop at a time, to take full advantage of the parallelism across iterations.
Machine Model
- The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation.
- The machine has a loop-back operation
    BL R, L
  which decrements register R and, unless the result is 0, branches to location L.
Machine Model
- Memory operations have an auto-increment addressing mode, denoted by ++ after the register. The register is automatically incremented to point to the next consecutive address after each access.
- The arithmetic operations are fully pipelined; they can be initiated every clock, but their results are not available until 2 clocks later. All other instructions have a single-clock latency.
Typical Do-All Loop
  for ( i = 0; i < n; i++)
    D[i] = A[i] * B[i] + c;

Locally scheduled code:
  // R1, R2, R3 = &A, &B, &D
  // R4 = c
  // R10 = n - 1
  L: LD R5, 0(R1++)
     LD R6, 0(R2++)
     MUL R7, R5, R6
     nop
     ADD R8, R7, R4
     nop
     ST 0(R3++), R8    BL R10, L
Five unrolled iterations of
  for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c;

  Clock   j=1    j=2    j=3    j=4    j=5
    1     LD
    2     LD
    3     MUL    LD
    4            LD
    5            MUL    LD
    6     ADD           LD
    7                   MUL    LD
    8     ST     ADD           LD
    9                          MUL    LD
   10            ST     ADD           LD
   11                                 MUL
   12                   ST     ADD
   13
   14                          ST     ADD
   15
   16                                 ST
Software-Pipelined Code

  Clock   j=1    j=2    j=3    j=4
    1     LD
    2     LD
    3     MUL    LD
    4            LD
    5            MUL    LD
    6     ADD           LD
    7 L:                MUL    LD
    8     ST     ADD           LD     BL (L)
    9                          MUL
   10            ST     ADD
   11
   12                   ST     ADD
   13
   14                          ST
- A new iteration can be started on the pipeline every 2 clocks.
- When the first iteration proceeds to stage three, the second iteration starts to execute.
- By clock 7 the pipeline is fully filled with the first four iterations.
- In the steady state, four consecutive iterations are executing at the same time.
- The sequence of instructions in clocks 1 through 6 is called the prolog.
- Clocks 7 and 8 are the steady state.
- Clocks 9 through 14 are called the epilog.
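The same prolog / steady-state / epilog shape sketched in C (illustrative only; real software pipelining operates on machine instructions, and this version overlaps just two stages):

  /* D[i] = A[i] * B[i] + c, with the multiply of iteration i issued
   * one iteration ahead of the add/store of iteration i-1. */
  void pipelined(const int *A, const int *B, int *D, int c, int n)
  {
      if (n <= 0)
          return;
      int m = A[0] * B[0];            /* prolog: first multiply      */
      for (int i = 1; i < n; i++) {   /* steady state: two iterations */
          int next = A[i] * B[i];     /*   in flight at once          */
          D[i - 1] = m + c;
          m = next;
      }
      D[n - 1] = m + c;               /* epilog: drain the pipeline  */
  }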