Tuning and Performance / Benchmarks on Multi-Core Processors |
Tuning and Performance of Application Programs using Compiler optimization techniques,
Code restructuring techniques
on Multi-Core Processors is challenging. Understanding Programming Programming Paradigms
(MPI, OpenMP, Pthreads), effective use of right
Compiler Optimization flags and obtaining correct results for given application
is important. Enhance performance and scalability on multiple core processors for given
application with respect to increase in problem size require serious efforts. Several Optimization
techniques are discussed below.
|
Example 1.1
|
Write a efficient sequential program and demonstrate the performance of a loop using
with/without Loop Unrolling optimization technique. Use Compiler flags and demonstrate
performance.
|
Example 1.2
|
Write a efficient Sequential program for implementation of given loop with/without Memory
reference optimization. Use Compiler flags and demonstrate
performance.
|
Example 1.3
|
Write a efficient Sequential Program for implementation of given Loop to avoid IF
and GOTO in the given loop for better memory access to reduce to execution time.
Use Compiler flags and demonstrate performance.
|
Example 1.4
|
Write a efficient Sequential program for efficient implementation of given loop
with/without "LOOP INTERCHANGE" to ease
the memory access. Use Compiler flags and demonstrate performance.
|
Example 1.5
|
Write a Sequential Program for efficient implementation of given Loop with/without removing
neighbor data dependency. Use Compiler flags and demonstrate performance.
|
Example 1.6
|
Write a sequential program to increase the performance using unrolling,
interchanging or blocking to ease the memory access the loop in subroutine tryparam
Description for implementation of a given loop in subroutine with Loop Unrolling,
Loop Interchanging to optimize Memory References and avoid TLB misses in the same loop.
Use Compiler flags and demonstrate performance.
|
Example 1.7
|
Write a Sequential program for efficient implementation of given loop with/without
"LOOP INTERCHANGE" to avoid TLB misses. Use Compiler flags and demonstrate performance.
|
Example 1.8
|
Write a Sequential program for efficient implementation of given loop with/without applying
"LOOP INTERCHANGE" to move the computations for better memory access. Use Compiler flags and demonstrate performance.
|
|
|
Description of Single Core Programs - Optimization Techniques |
Example 1.1 :
Write a efficient sequential program and demonstrate the performance of a loop
using with/without Loop Unrolling optimization technique. Use Compiler flags and
demonstrate performance.
( Download source code :
single-core-prg-loop-unroll.c
)
|
- Objective
Write a program to demonstrate the execution time for the following
loop with/without Loop Unrolling.
- Description
for (i=0; i<n; i++)
a[i]= a[i] <b[i] ? b[i] :c[i];
- Input
Size of Vectors
- Output
Time taken in Microseconds by the loop with/without unrolling.
|
|
|
Example 1.2 :
Write a efficient Sequential program for implementation of given loop with/without
Memory reference optimizations. Use Compiler flags and demonstrate performance.
( Download source code :
single-core-prg-memory-manage.f
)
|
- Objective
Write a efficient Sequential program for implementation of given loop with/without
Memory reference optimizations. Use Compiler flags and demonstrate performance.
- Description
Identify the bottlenecks for performance in Memory References in the following fragment of the code and write
a program to demonstrate the execution time for the given fragment
with/without better memory accesses.
do 10, i = 1, 10000000
do 20, j= i+1, 20000000
A(i) = A(i)*B(j) + A(j)*B(i)
B(i) = A(i)*B(i) + A(j)*B(j)
20 continue
10 continue
In the inner loop, A(i) and B(i)
are calculated at every iteration but only the values A(i) and B(i)
in the inner loop calculated at j=20000000 are reflected in the
next iteration. To improve the efficiency of the loop, we replace A(i)
and B(i) in the loop with temporary variables tempA and
tempB, thus avoiding unnecessary repeated memory
references in the inner loop. The modified loop will have more efficiency
when compared with the original loop.
- Input
None
- Output
The time taken by the two different implementations of the same loop.
|
|
|
Example 1.3 :
Write a efficient Sequential Program for implementation of given Loop to avoid IF and
GOTO in
in the given loop for better memory access to reduce to execution time. Use Compiler flags
and demonstrate performance.
( Download source code :
single-core-prg-if-goto.f
)
|
- Objective
Write a efficient Sequential Program for implementation of given Loop to avoid
IF and GOTO in the given loop for better memory access to reduce to execution time.
Use Compiler flags and demonstrate performance.
- Description
Write a program to avoid "IF" and
"GOTO" from the following loop for better performance.
i=0
10 i = i+1
if(i
.gt. 100000) goto 30
a(i)
= a(i) + b(i) * c(i)
go
to 10
30 continue
- Input
None
- Output
The time taken by the two different implementations and
Message showing whether the modified construct is performing the same operations as
original construct.
|
|
|
Example 1.4 :
Write a efficient Sequential program for efficient implementation of given loop with/without "LOOP
INTERCHANGE" to ease the memory access. Use Compiler flags and demonstrate performance.
( Download source code :
single-core-prg-loop-interchange.f
)
|
- Objective
Write a efficient Sequential program for efficient implementation of given loop
with/without "LOOP INTERCHANGE" to ease the memory access. Use Compiler flags and
demonstrate performance.
- Description
Write a program to demonstrate the execution time for the following
loop with/without applying "LOOP INTERCHANGE" to ease the memory
access for the . Explain the reasons behind to improve the
memory access for the arrays A and B.
do 10, i = 1, n
do 20, j = 1, n
A(j,i) = B(i,j) + A(i,j)
20 continue
10 continue
The array elements are stored in a Column Major
order in Fortran. Loop Interchange can be applied to this loop in
order to make the access of the two arrays in column wise fashion and
increase the performance. But, just interchanging the loops doesn't
work here as the data being modified in one iteration in next
iterations. However, the interchange can be made by making the outer
loop (the loop of j which is presently the inner loop) roll on a
negative stride which removes data dependency between one iteration
and other.
- Input
None
- Output
The time taken by the two different implementations
of the same loop without/with Loop Interchange.
|
|
|
Example 1.5 :
Write a efficient Sequential program for efficient implementation of given loop with/without
removing neighbor data dependency. Use Compiler flags and demonstrate performance.
( Download source code :
single-core-prg-neighbour-data-dependency.c
)
|
- Objective
Write a Sequential Program for efficient implementation of given Loop with/without
removing neighbor data dependency. Use Compiler flags and demonstrate performance
- Description
Demonstrate the execution time for the following fragment of code
with/without removing neighbor data dependency.
jwrap = array_size-1;
for(i=0; i <array_size; i++)
{
b[i] = (a[i] + a[jwrap] *0.5;
jwrap = i;
}
The given example contains an induction variable
"jwrap". Removal of an induction variable(i.e.
variable which is assigned a value in each iteration of a loop) if it
is possible to remove it , results in better performance. In the given
example, the variable "jwrap" is being assigned a
value "i" which is being used in next iteration
resulting in Neighbor Data Dependency. The induction variable
can be replaced with the value it is being assigned (a[jwrap] can be
written as a[i+1]) inside the loop which results in removing Dependency
of one iteration on another thus giving scope for parallelism and
resulting in better performance.
- Input
Size of the vectors
- Output
Time taken in microseconds by the loop with/without Neighbor
Data Dependency.
|
|
|
Example 1.6 :
Write a sequential program to increase the performance using unrolling, interchanging
or blocking to ease
the memory access the loop in subroutine tryparam.
Description for implementation of a given loop in subroutine with Loop Unrolling,
Loop Interchanging to optimize Memory References and avoid TLB misses in the same loop.
Use Compiler flags and demonstrate performance
( Download source code :
single-core-prg-loop-unroll-interchange-block.f
single-core-prg-tryparamorig.f
single-core-prg-tryparammod.f
)
|
- Objective
Write a sequential program to increase the performance using unrolling, interchanging or blocking to ease
the memory access the loop in subroutine tryparam.
Description for implementation of a given loop in subroutine with
Loop Unrolling, Loop Interchanging to optimize Memory References and avoid
TLB misses in the same loop.
- Description
Try un-rolling, interchanging or blocking the loop in
subroutine tryparam to increase the performance. What method or combination
of methods work best ? (NOTE : Compile the main routine and tryparam
separately. Use compiler's default optimization level.)
program main
implicit none
integer m,n,i,j
parameter (n=512, m=640, ntimes = 500)
double precision q(n,m), r(n,m)
do 10 i = 1, m
do 10 j = 1, n
q(j,i) = 1.0d0
r(j,i) = 1.0d0
10 continue
do 20 i = 1, ntimes
20
call tryparam (q,r,n,m)
subroutine tryparam
implicit none
integer m,n,i,j
double precision a(n,m), b(n,m)
do 10 i = 1, n
 
do 20 j = 1, n
a(j,i) = a(j,i) * b(i,j)
20 continue
10 continue
return
end
The loop in subroutine tryparam involves
two vectors. One is referenced with unit stride, the other with a
stride of N. We can interchange the loops, but one way or another we
will have N-strided array references on either A or B, either of which
is undesirable. The solution is to block references so that the
program grabs a few elements of A, and then a few of B, and then a few
of A, and so on- in neighborhoods. This is achieved by combining
inner and outer unrolling.
The loop in the subroutine can be further optimized to avoid TLB misses. If the value of N
is reasonably large (say 256) such that arrays A and B are each 64KB * 8
bytes = 1/2 MB - larger than can be handled by the TLBs and caches of
most workstations. So, the loop is divided into two loops to divide and
conquer the a large memory address space to be handled by TLB without
this division into two little pieces which avoid TLB misses and result
in better performance.
- Input
Size of the vectors
- Output
Time taken in microseconds by the loop with/without Neighbor
Data Dependency.
|
|
|
Example 1.7 :
Write a Sequential program for efficient implementation of given loop with/without
"LOOP INTERCHANGE" to avoid TLB misses. Use Compiler flags and demonstrate performance.
( Download source code :
single-core-prg-tlb-miss.f
)
|
- Objective
Write a Sequential program for efficient implementation of given loop with/without "LOOP
INTERCHANGE" to avoid TLB misses. Use Compiler flags and demonstrate performance.
- Description
To demonstrate the execution time for the following loop
with/without re-organizing it to avoid TLB misses.
real x(1001000)
common x,y
do 10 i = 0, 1000
do 20 j=1, 1000000,10000
y = x(j+i)
20 continue
10 continue
TLB (Translation Lookaside Buffer) can store 64
addresses of 4K sized pages totaling up-to 256K of memory management
on most of the machines. Once, a data element is accessed from out of
limits of the presently loaded memory pages, it results in TLB miss
and also in cache misses. The given loop accesses the elements of the
array X such that each memory reference is 1000 elements farther from
previous reference which results in frequent TLB misses. So, the loop
is modified such that the memory references are made for the array X
in a successive manner.
- Input
Size of the vectors
- Output
The time taken by the two different implementations of
the same loop without/with Loop Optimizations to avoid TLB misses.
|
|
|
Example 1.8 :
Write a Sequential program for efficient implementation of given loop
with/without applying "LOOP INTERCHANGE" to move the computations for better
memory access. Use Compiler flags and demonstrate performance.
( Download source code :
single-core-prg-loop-interchange-center.f
)
|
- Objective
Write a Sequential program for efficient implementation of given loop with/without applying
"LOOP INTERCHANGE" to move the computations for better memory access. Use Compiler flags
and demonstrate performance.
- Description
To demonstrate the execution time for the following
fragment of code with/without applying "LOOP interchange"
to move the computations to the center in the following loop for better
performance.
parameter(idim = 1000, jdim = 1000, kdim = 4)
...
do 10 i = 1, idim
do 20 j = 1, jdim
do 30 k = 1,kdim
d(i,j,k) = d(i,j,k) + v(i,j,k) * dt
30 continue
20 continue
10 continue
A model expressed as it is in an algorithm
often works on one point in space at a time, which tends
to give insignificant inner loops. We employ "Loop Interchange",
pulling more computations into the center and then apply "Loop
Unrolling" to the inner loop to extract more performance.
- Input
Size of the vectors
- Output
The time taken by the two different implementations
of the same loop without/with Loop Interchange (making it possible to
apply Loop Unrolling to the inner loop)
|
|
|
|
| |
|