Compiler Optimization Techniques
CP 7031
Dr.K.Thirunadana Sikamani
Principal Sources of Optimization
The elimination of unnecessary instructions in object code, or the replacement of one sequence of instructions by a faster sequence that does the same thing, is usually called "code improvement" or "code optimization".
- Redundancy
- Semantics-preserving transformations
- Global common subexpressions
- Copy propagation
- Dead-code elimination
- Code motion
The speed of a program run on a processor with instruction-level parallelism depends on
1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Our ability to extract parallelism from the original sequential program.
4. Our ability to find the best parallel schedule given scheduling constraints.
Processor Architecture
1. Instruction pipelines and branch delays
2. Pipelined execution
3. Multiple instruction issue - VLIW (Very Long Instruction Word)
Code Scheduling Constraints
1. Control-dependence constraints
2. Data-dependence constraints
3. Resource constraints
Control-Dependence Constraints
All the operations executed in the original program must be executed in the optimized one.
Data-Dependence Constraints
The operations in the optimized program must produce the same results as the corresponding ones in the original program.
Resource Constraints
The schedule must not oversubscribe the resources on the machine.
Data Dependence
True dependence - read after write (RAW)
Antidependence - write after read (WAR)
Output dependence - write after write (WAW)
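For instance, all three kinds arise in this three-statement fragment (illustrative C, on the variable x):

  x = a + b;    /* S1 writes x */
  c = x * 2;    /* S2 reads x:  S1 -> S2 is a true dependence (RAW)   */
  x = d - e;    /* S3 writes x: S2 -> S3 is an antidependence (WAR),
                                S1 -> S3 is an output dependence (WAW) */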
Classify the dependences for the following statements:
1. a = b
2. c = d
3. b = c
4. d = a
5. c = d
6. a = b

Check the dependences for the pairs 1 and 4, 3 and 5, and 1 and 6.
(Answers: 1 -> 4 is a true dependence on a; 3 -> 5 is an antidependence on c; 1 -> 6 is an output dependence on a.)
Give the register-level machine code that provides maximum parallelism for the expression ((u+v) + (w+x)) + (y+z); also give the solution with minimal register usage.

Minimal register usage (three registers, fully sequential):
  LD r1,u
  LD r2,v
  ADD r1,r1,r2
  LD r2,w
  LD r3,x
  ADD r2,r2,r3
  ADD r1,r1,r2
  LD r2,y
  LD r3,z
  ADD r2,r2,r3
  ADD r1,r1,r2
Maximum parallelism (six registers, four clocks):
  Clock 1:  LD r1,u   LD r2,v   LD r3,w   LD r4,x   LD r5,y   LD r6,z
  Clock 2:  ADD r1,r1,r2   ADD r3,r3,r4   ADD r5,r5,r6
  Clock 3:  ADD r1,r1,r3
  Clock 4:  ADD r1,r1,r5
Implementation of the parallelism in 4 clocks.
Finding Dependences among Memory Accesses
1. Array data-dependence analysis
     for ( i = 0; i < n; i++)
         A[2*i] = A[2*i+1];
   (the writes touch only even elements and the reads only odd ones, so the iterations are independent; a GCD-style test for this is sketched below)
2. Pointer-alias analysis
   Two pointers are aliased if they refer to the same object.
3. Interprocedural analysis
   Determines whether the same variable is passed as two or more different arguments in a language that passes parameters by reference.
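A minimal sketch of the classic GCD independence test in C for affine subscripts of the form a*i + b (the function names and simplifications are mine; real array dependence analysis also uses loop bounds and direction vectors):

  #include <stdio.h>

  /* Greatest common divisor, used by the GCD dependence test. */
  static int gcd(int a, int b)
  {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* Accesses A[a*i + b] and A[c*j + d] can touch the same element only
   * if gcd(a, c) divides (d - b); otherwise they are provably
   * independent. */
  static int may_depend(int a, int b, int c, int d)
  {
      int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
      if (g == 0)
          return b == d;        /* both subscripts are constants */
      return (d - b) % g == 0;
  }

  int main(void)
  {
      /* A[2*i] vs. A[2*i+1]: gcd(2,2) = 2 does not divide 1, so the
       * write and the read are independent -- prints 0. */
      printf("%d\n", may_depend(2, 0, 2, 1));
      return 0;
  }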
Tradeoff between Register Usage and Parallelism
E.g., machine-independent intermediate-representation code:
  LD t1, a
  ST b, t1
  LD t2, c
  ST d, t2
The code above copies the values of a and c to b and d. If all the memory locations are distinct, the two copies can proceed in parallel. If, on the other hand, t1 and t2 are assigned the same register to minimize register usage, the second copy cannot start until the first one has finished.
Tradeoff between Register Usage and Parallelism
[Syntax tree for the expression ((a + b) + c) + (d + e): four + nodes, with a, b, c, d, e at the leaves]

Machine code (one accumulator, minimal registers):
  LD r1, a
  LD r2, b
  ADD r1,r1,r2
  LD r2, c
  ADD r1,r1,r2
  LD r2, d
  LD r3, e
  ADD r2,r2,r3
  ADD r1,r1,r2

Parallel evaluation of the expression (one register per value):
  r1 = a    r2 = b    r3 = c    r4 = d    r5 = e
  r6 = r1 + r2        r7 = r4 + r5
  r8 = r6 + r3
  r9 = r8 + r7
Phase Ordering between Register Allocation and Code Scheduling
- If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling.
- Conversely, if scheduling is done first, the schedule created may require so many registers that register spilling becomes necessary.
Spilling - storing the contents of a register in a memory location so the register can be used for some other purpose.
Which phase comes first is decided based on the characteristics of the program, e.g., numeric, non-numeric, etc.
Control Dependence
- if ( c ) s1; else s2;   /* s1 and s2 are control dependent on c */
- while ( c ) s;          /* s is control dependent on c */
- if ( a > t )
      b = a * a;
  d = a + c;              /* d = a + c is not control dependent on a > t */
Speculative Execution Support
- Prefetching - bringing data from memory into the cache before it is used.
- Poison bits - support speculative loads from memory into the register file. Each register is augmented with a poison bit; the bit is set when an illegal memory address is accessed, so that the exception is raised only when the loaded value is actually used later.
Predicated Execution
- Predicated instructions were invented to reduce the number of branches in a program.
- A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution.
- E.g., CMOVZ R2, R3, R1 has the semantics of moving the contents of R3 to R2 if R1 is zero.
if ( a == 0 ) b = c + d; can be implemented as
  ADD R3, R4, R5     /* a, b, c, d are allotted R1, R2, R4, R5 */
  CMOVZ R2, R3, R1
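The same idea sketched in C (a hypothetical function; whether the compiler actually emits a conditional move here depends on the target and optimization level):

  #include <stdio.h>

  /* Branch-free form of "if (a == 0) b = c + d;".  The sum is computed
   * unconditionally, like the ADD above, and the select plays the role
   * of CMOVZ. */
  static int predicated(int a, int b, int c, int d)
  {
      int t = c + d;             /* always executed */
      return (a == 0) ? t : b;   /* guarded move */
  }

  int main(void)
  {
      printf("%d\n", predicated(0, 9, 2, 3));  /* prints 5 */
      printf("%d\n", predicated(1, 9, 2, 3));  /* prints 9 */
      return 0;
  }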
Basic Machine Model
Many machines can be represented as
  M = <R, T>
T - a set of operation types, such as loads, stores, and arithmetic operations.
R - a vector R = [r1, r2, ...] of hardware resources;
ri is the number of units available of the ith kind of resource.
Resources - memory-access units, ALUs, floating-point functional units.
Basic Machine Model
- Each operation has a set of input operands, a set of output operands, and a resource requirement.
- RTt - the resource-reservation table of operation type t.
- RTt[i, j] is the number of units of the jth resource used by an operation of type t, i clocks after it is issued.
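A tiny data-structure sketch of RTt in C (the resource kinds, latencies, and numbers are made up for illustration):

  #define NUM_TYPES 2    /* 0 = load, 1 = add         */
  #define MAX_LAT   2    /* rows: clocks after issue  */
  #define NUM_RES   2    /* 0 = memory unit, 1 = ALU  */

  /* RT[t][i][j]: units of resource j used by an operation of type t,
   * i clocks after issue. */
  static const int RT[NUM_TYPES][MAX_LAT][NUM_RES] = {
      /* load: occupies the memory unit for two clocks */
      { {1, 0}, {1, 0} },
      /* add: occupies the ALU for one clock */
      { {0, 1}, {0, 0} },
  };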
Basic-Block Scheduling
Data-Dependence Graphs
Graph G = (N, E)
N - a set of nodes representing the operations in the machine instructions.
E - a set of directed edges representing the data-dependence constraints among the operations.
1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n.
2. Each edge e in E is labeled with a delay de, indicating that the destination node must be issued no earlier than de clocks after the source node is issued.
Data-Dependence Graph
  i1: LD R2, 0(R1)
  i2: ST 4(R1), R2
  i3: ADD R3,R3,R2
  i4: ADD R3, R3, R4
  i5: LD R3, 8(R1)
  i6: ST 0(R7), R7
  i7: ST 12(R1), R3
[Data-dependence graph over i1-i7, with each edge labeled by its delay of 1 or 2 clocks]
Notes:
1. A load operation takes 2 clock cycles.
2. R1 is a stack pointer with offsets from 0 to 12.
List Scheduling of Basic Blocks
- This involves visiting each node of the data-dependence graph in "prioritized topological order".
- Machine-resource vector R = [r1, r2, r3, ...], where ri is the number of units available of the ith kind of resource.
- G = (N, E) is the data-dependence graph.
- RTn is the resource-reservation table of node n.
- An edge e = n1 -> n2 with delay de indicates that n2 may be issued only de clocks after n1.
List Scheduling Algorithm
RT = an empty reservation table;
for (each n in N in prioritized topological order) {
    s = max over edges e = p -> n in E of (S(p) + de);
        /* the earliest time this instruction could begin,
           given when its predecessors started */
    while (there exists i such that RT[s + i] + RTn[i] > R)
        s = s + 1;
        /* delay the instruction further until the needed
           resources are available */
    S(n) = s;
    for (all i)
        RT[s + i] = RT[s + i] + RTn[i];
}
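A compilable sketch of the same algorithm in C, under assumed limits (MAX_N operations, NUM_RES resource kinds, MAX_LAT-row reservation tables); the data layout is mine, not the slides':

  #include <string.h>

  #define MAX_N   64    /* max operations in the basic block     */
  #define MAX_CLK 256   /* max schedule length                   */
  #define NUM_RES 2     /* hypothetical: 0 = ALU, 1 = memory     */
  #define MAX_LAT 4     /* rows in a resource-reservation table  */

  typedef struct {
      int rt[MAX_LAT][NUM_RES]; /* RTn for this operation            */
      int npred;                /* number of dependence predecessors */
      int pred[MAX_N];          /* predecessor node ids              */
      int delay[MAX_N];         /* edge delay from each predecessor  */
  } Op;

  /* order[0..n-1] is a prioritized topological order of the nodes;
   * R[j] is the number of units of resource j; on return, S[n] is
   * the issue clock of every operation. */
  void list_schedule(const Op *nodes, int n, const int *order,
                     const int R[NUM_RES], int S[])
  {
      static int busy[MAX_CLK + MAX_LAT][NUM_RES]; /* global RT */
      memset(busy, 0, sizeof busy);

      for (int k = 0; k < n; k++) {
          const Op *op = &nodes[order[k]];
          int s = 0;
          /* earliest start: after every predecessor plus its delay */
          for (int p = 0; p < op->npred; p++) {
              int t = S[op->pred[p]] + op->delay[p];
              if (t > s) s = t;
          }
          /* push the start later until no resource is oversubscribed */
          for (;;) {
              int ok = 1;
              for (int i = 0; ok && i < MAX_LAT; i++)
                  for (int j = 0; j < NUM_RES; j++)
                      if (busy[s + i][j] + op->rt[i][j] > R[j]) {
                          ok = 0;
                          break;
                      }
              if (ok) break;
              s++;
          }
          S[order[k]] = s;
          for (int i = 0; i < MAX_LAT; i++)  /* commit reservations */
              for (int j = 0; j < NUM_RES; j++)
                  busy[s + i][j] += op->rt[i][j];
      }
  }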
Prioritized Topological Order
Possible prioritized orderings:
1) Critical path - the longest path through the data-dependence graph.
   Height of a node - the length of the longest path in the graph originating from that node (a sketch of its computation follows below).
2) The length of the schedule is constrained by the resources available.
   Critical resource - the one with the largest ratio of uses to the number of units of that resource available. Operations using more critical resources may be given higher priority.
3) Source ordering - an operation that shows up earlier in the source program is scheduled first.
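A memoized height computation in C, assuming a successor adjacency list with per-edge delays (all the names are illustrative):

  #define MAX_N 64   /* max nodes in the dependence graph */

  /* height(n): longest delay-weighted path starting at node n.
   * The nsucc[n] successors of n are succ[n][k], each with
   * delay[n][k]; memo[] must be initialized to -1. */
  int height(int n, int nsucc[], int succ[][MAX_N],
             int delay[][MAX_N], int memo[])
  {
      if (memo[n] >= 0)
          return memo[n];
      int h = 0;                       /* leaves have height 0 */
      for (int k = 0; k < nsucc[n]; k++) {
          int c = delay[n][k] +
                  height(succ[n][k], nsucc, succ, delay, memo);
          if (c > h) h = c;
      }
      return memo[n] = h;
  }

Scheduling nodes in decreasing height approximates critical-path-first priority.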
Result of Applying List Scheduling
(for the data-dependence graph example above, using height as the priority function)

  ALU               Memory
                    LD R3, 8(R1)
                    LD R2, 0(R1)
  ADD R3,R3,R4      /* issued after the 2-clock load delay */
  ADD R3,R3,R2      ST 4(R1), R2
                    ST 12(R1), R3
                    ST 0(R7), R7
Global Code Scheduling
- Strategies that consider more than one basic block at a time are referred to as global scheduling.
- Conditions (control and data dependences must be obeyed):
1. All instructions in the original program are executed in the optimized one, and
2. while the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.
Basic Block
A basic block is a sequence of instructions in which control enters through the first instruction and leaves via the last instruction, with no halt or possibility of a jump or branch in between (the flow is linear). A sketch of how blocks are delimited follows.
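A minimal C sketch of partitioning code into basic blocks by finding leaders (the Instr layout is assumed for illustration):

  #include <stdbool.h>

  typedef struct {
      bool is_branch; /* conditional or unconditional jump    */
      int  target;    /* index of the jump target, -1 if none */
  } Instr;

  /* Leaders: (1) the first instruction, (2) any branch target,
   * (3) any instruction immediately following a branch.  Each basic
   * block runs from a leader up to, but not including, the next. */
  void find_leaders(const Instr *code, int n, bool leader[])
  {
      for (int i = 0; i < n; i++)
          leader[i] = false;
      if (n > 0)
          leader[0] = true;                      /* rule 1 */
      for (int i = 0; i < n; i++) {
          if (!code[i].is_branch)
              continue;
          if (code[i].target >= 0 && code[i].target < n)
              leader[code[i].target] = true;     /* rule 2 */
          if (i + 1 < n)
              leader[i + 1] = true;              /* rule 3 */
      }
  }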
Primitive Code Motion
Source program:
  if ( a == 0 ) goto L
  c = b
  L:  e = d + d
Locally Scheduled Machine Code
  B1:  LD R6, 0(R1)
       nop
       BEQZ R6, L
  B2:  LD R7, 0(R2)
       nop
       ST 0(R3), R7
  B3:  L:
       LD R8, 0(R4)
       nop
       ADD R8,R8,R8
       ST 0(R5), R8
Globally Scheduled Machine Code
  B1:  LD R6, 0(R1)
       LD R8, 0(R4)
       LD R7, 0(R2)
       ADD R8,R8,R8
       BEQZ R6, L
  B3': ST 0(R5), R8    /* copy of the moved store on the fall-through path */
       ST 0(R3), R7
  B3:  L:
       ST 0(R5), R8
Upward Code Motion
Moves an operation from block src up a control-flow path to block dst; such a move must not violate any data dependences, and it should make the paths through dst and src run faster.
Case 1: src does not postdominate dst.
In this case there exists a path that passes through dst but does not reach src, so the moved operation is executed speculatively. The code motion is illegal unless the operation moved has no unwanted side effects.
Contd...
Case 2: dst does not dominate src.
In this case there exists a path that reaches src without first going through dst, so we need to move copies of the moved operation along such paths.
Constraints:
1. The operands of the operation must hold the same values as in the original,
2. the result must not overwrite a value that is still needed, and
3. it must not itself be subsequently overwritten before reaching src.
Downward Code Motion
Moves an operation from block src down a control-flow path to block dst.
Case 1: src does not dominate dst - there exists a path to dst that does not pass through src.
Case 2: dst does not postdominate src - there exists a path through src that does not pass through dst.
E.g.,
  if ( x == 0 ) a = b;
  else a = c;
  d = a;
Memory locations: x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9)

  B1 (x == 0):
    LD R1, x
    nop
    BEQZ R1, L
  B2 (a = c):
    LD R3, c
    nop
    ST a, R3
  B3 (L: a = b):
    LD R2, b
    nop
    ST a, R2
  B4 (d = a):
    LD R4, a
    nop
    ST d, R4
E.g., after global code motion:
  if ( x == 0 ) a = b;
  else a = c;
  d = a;
Memory locations: x - 0(R5), b - 0(R6), c - 0(R7), a - 0(R8), d - 0(R9)

  B1:
    LD R1, 0(R5)    LD R3, 0(R7)
    LD R2, 0(R6)
    ST 0(R8), R3    /* a = c, executed speculatively */
    BEQZ R1, L      /* the branch and store can be replaced by CMOVZ 0(R8), R2, R1 */
  B2 (L: a = b):
    ST 0(R8), R2
  B4 (d = a):
    LD R4, 0(R8)
    nop
    ST 0(R9), R4
Updating Data Dependences
- Code motion can change the data-dependence relations between operations, so the data dependences must be updated after each code motion.
E.g., two parallel branches assign x = 1 and x = 2, and x is not live before the code motion. If one of the assignments is moved up above the branch, the other can no longer be moved: the motion creates an output dependence between the two.
Global Scheduling Algorithms
- Region-based scheduling
The two easiest forms of code motion:
1. moving operations up to control-equivalent basic blocks, and
2. moving operations speculatively up one branch to a dominating predecessor.
Assignment: the region-based scheduling algorithm.
Loop Unrolling
Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism.
  for (i = 0; i < N; i++) {
    S(i);
  }
can be unrolled to
  for (i = 0; i + 4 < N; i += 4) {
    S(i);
    S(i+1);
    S(i+2);
    S(i+3);
  }
(with a cleanup loop for the leftover iterations), and
  repeat
    S;
  until C;
can be unrolled to
  repeat {
    S;
    if (C) break;
    S;
    if (C) break;
    S;
  } until C;
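A compilable sketch of the 4x unrolling in C, with the bound written as i + 4 <= N and an explicit cleanup loop so no iteration is skipped (the body S() is a stand-in):

  void S(int i);  /* stand-in for the loop body */

  void run_unrolled(int N)
  {
      int i;
      /* main unrolled loop: four independent copies of the body
       * give the scheduler more instructions to overlap */
      for (i = 0; i + 4 <= N; i += 4) {
          S(i);
          S(i + 1);
          S(i + 2);
          S(i + 3);
      }
      /* cleanup loop: up to three leftover iterations */
      for (; i < N; i++)
          S(i);
  }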
Neighborhood Compaction
- Examine each pair of basic blocks that are executed one after the other, and check whether any operation can be moved up or down between them to improve the execution time of those blocks.
- If such a pair is found, check whether the instruction to be moved needs to be duplicated along other paths.
Advanced Code Motion Techniques
- Add new basic blocks along control-flow edges originating from blocks with more than one predecessor, and move instructions out of basic blocks so that a block can be eliminated completely.
- The code to be executed in each basic block is scheduled once and for all as each block is visited, because the algorithms only move operations up to dominating blocks.
- Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order; we move all operations that
  i) can be moved, and
  ii) cannot be executed in their native block.
Interaction with Dynamic Schedulers
- A dynamic scheduler can create new schedules according to run-time conditions.
- High-latency instructions are issued early.
- Data-prefetch instructions help the dynamic scheduler by making the data available in advance.
- Data-dependent operations must be put in the correct order to ensure program correctness. For best performance, the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not.
- Branch misprediction must be avoided.
Software Pipelining
Software Pipelining
- Numerical applications often have loops whose iterations are completely independent of one another.
- Such loops with many iterations have enough parallelism to saturate all the resources in a processor; it is up to the scheduler to take full advantage of the available parallelism.
- Software pipelining schedules an entire loop at a time, to take full advantage of the parallelism across iterations.
Machine Model
- The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation.
- The machine has a loop-back operation
    BL R, L
  which decrements register R and, unless the result is 0, branches to location L.
Machine Model
- Memory operations have an auto-increment addressing mode, denoted by ++ after the register. The register is automatically incremented to point to the next consecutive address after each access.
- The arithmetic operations are fully pipelined; they can be initiated every clock, but their results are not available until 2 clocks later. All other instructions have a single-clock latency.
Typical Do-All Loop
  for ( i = 0; i < n; i++)
    D[i] = A[i] * B[i] + c;

Locally scheduled code:
  // R1, R2, R3 = &A, &B, &D
  // R4 = c
  // R10 = n - 1
  L: LD R5, 0(R1++)
     LD R6, 0(R2++)
     MUL R7, R5, R6
     nop
     ADD R8, R7, R4
     nop
     ST 0(R3++), R8    BL R10, L
Five unrolled iterations of
  for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c;

  Clock   j=1    j=2    j=3    j=4    j=5
    1     LD
    2     LD
    3     MUL    LD
    4            LD
    5            MUL    LD
    6     ADD           LD
    7                   MUL    LD
    8     ST     ADD           LD
    9                          MUL    LD
   10            ST     ADD           LD
   11                                 MUL
   12                   ST     ADD
   13
   14                          ST     ADD
   15
   16                                 ST
Software-Pipelined Code

  Clock   j=1    j=2    j=3    j=4
    1     LD
    2     LD
    3     MUL    LD
    4            LD
    5            MUL    LD
    6     ADD           LD
    7 L:                MUL    LD
    8     ST     ADD           LD     BL (L)
    9                          MUL
   10            ST     ADD
   11
   12                   ST     ADD
   13
   14                          ST
- A new iteration can be started on the pipeline every 2 clocks.
- When the first iteration proceeds to stage three, the second iteration starts to execute.
- By clock 7 the pipeline is fully filled with the first four iterations.
- In the steady state, four consecutive iterations are executing at the same time.
- The sequence of instructions in clocks 1 through 6 is called the prolog.
- Clocks 7 and 8 are the steady state.
- Clocks 9 through 14 are called the epilog.
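The same prolog / steady-state / epilog shape sketched in C (illustrative only; real software pipelining operates on machine instructions, and this version overlaps just two stages):

  /* D[i] = A[i] * B[i] + c, with the multiply of iteration i issued
   * one iteration ahead of the add/store of iteration i-1. */
  void pipelined(const int *A, const int *B, int *D, int c, int n)
  {
      if (n <= 0)
          return;
      int m = A[0] * B[0];            /* prolog: first multiply      */
      for (int i = 1; i < n; i++) {   /* steady state: two iterations */
          int next = A[i] * B[i];     /*   in flight at once          */
          D[i - 1] = m + c;
          m = next;
      }
      D[n - 1] = m + c;               /* epilog: drain the pipeline  */
  }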