Tuning and Performance on Multi-Core Processors
Tuning the performance of application programs on multi-core processors through compiler optimization and code restructuring techniques is challenging. It requires an understanding of programming paradigms (MPI, OpenMP, Pthreads), effective use of the right compiler optimization flags, and confirmation that the application still produces correct results. Improving the performance and scalability of a given application on multi-core processors as the problem size increases takes serious effort. Several optimization techniques are discussed below.
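As a concrete illustration of the point about optimization flags (assuming a GCC-style compiler; app.c is a placeholder name, and flag names differ between vendors), typical builds might look like:

gcc -O2 app.c -o app                   # enable general optimizations
gcc -O3 -funroll-loops app.c -o app    # optimize more aggressively
gcc -O2 -fopenmp app.c -o app          # build an OpenMP-parallel program

Results should be re-validated at each optimization level, since aggressive settings can alter numerical behavior with some compilers.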
Classical Optimization Techniques
(a). Copy Propagation
Consider the following two assignment statements:
X = Y
Z = 1. + X
As written, the second statement cannot execute until the first has completed, because it depends on X. The dependence disappears if we rewrite the statements in the following way:
X = Y
Z = 1. + Y
This is called propagating a copy of Y. It allows the compiler to remove assignments such as X = Y entirely when X is not used in any later statement, and it exposes parallelism between the two instructions, resulting in better performance.
(b). Constant Folding
Constant folding is the technique by which the compiler traces expressions in which variables can be replaced by constants and evaluates those expressions ahead of time, increasing performance. Consider the following piece of program:
PROGRAM MAIN
INTEGER I,K
PARAMETER (I = 100)
K = 200
J = I+K
END
Because I and K are individually constant, J is a constant as well. The compiler reduces constant expressions such as I+K to constants through constant folding. It does this via data-flow analysis: if it finds that every expression path leading to a variable in the data-flow graph yields the same value, and that value is a constant, it replaces all references to that variable with the constant. This allows the compiler to calculate the value of some expressions at compile time, provided the expressions contain only constants.
(c). Dead Code Removal
Dead code refers to parts of the program that have no effect on its results. Dead code comes in the following types:
- Instructions that are unreachable.
- Instructions that produce results that are never used.
Such code may be introduced by the programmer, but it most commonly arises as a by-product of optimizations performed by the compiler.
For local variables, the compiler detects such code by analyzing each variable's definitions and uses: it can see whether any other part of the routine references the variable. It cannot tell the fate of variables that are external or common, so computations on those will always be kept (as long as they are reachable). Based on this analysis, the compiler can either remove dead statements or issue warning messages at compilation. A small sketch follows.
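As a rough C sketch (hypothetical code, not from the original text), both kinds of dead code look like this:

#include <stdio.h>

int absolute(int a)
{
    int unused = a * a;      /* result is never used: dead code */
    if (a >= 0)
        return a;
    else
        return -a;
    printf("done\n");        /* unreachable: both branches return above */
}

An optimizing compiler will typically drop both the unused multiplication and the unreachable call, and many compilers warn about them.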
(d). Strength Reduction
Operations and expressions have various time costs associated with them. It is possible to replace a more expensive calculation with a cheaper one; this is called strength reduction, and it results in better performance. For example, exponentiation carries the overhead of a subroutine call and then involves computations such as logarithms and multiplications. For low powers, the exponentiation can be replaced by far less costly multiplications: Y = X**2 can be replaced by Y = X*X. Likewise, multiplications can sometimes be replaced by much cheaper additions.
Many such replacements happen during peephole optimization, in which the compiler crawls through the intermediate code looking for cheap and easy stocking-stuffer optimizations of this kind.
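A minimal C sketch (hypothetical function names) of the reductions described above:

#include <math.h>

double square_slow(double x)
{
    return pow(x, 2.0);   /* library call: expensive for such a low power */
}

double square_fast(double x)
{
    return x * x;         /* one multiplication, mathematically identical */
}

unsigned times_eight(unsigned n)
{
    return n << 3;        /* integer multiply by 8 reduced to a cheap shift */
}

Most modern compilers perform these particular replacements automatically at normal optimization levels.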
(e). Variable Renaming
Variable renaming is a technique employed by the compiler, or at the source level by the programmer, to expose parallelism between two or more calculations that use the same variable but are independent of each other. Consider the following fragment:
x = y * z;
q = r + x + x;
x = a + b;
When the compiler recognizes that a variable is being recycled, and that its current and former uses
are independent, it can substitute a new variable to keep the calculations separate.
x0 = y * z;
q = r + x0 + x0;
x = a + b;
Variable renaming, which clarifies the flow of the program and also exposes parallelism, can take place both locally, at the basic-block level, and globally, as a product of data-flow analysis.
(f). Common Sub-expression Elimination
Sub-expressions are pieces of expressions. For instance, A + B is a sub-expression of C * (A + B). If A + B appears in several places, as it does below, we call it a common sub-expression:
D = C * (A + B)
E = (A + B)/2
Rather than calculate A + B twice, the compiler can generate a temporary variable and use it wherever A + B is required:
temp = A + B
D = C * temp
E = temp/2
Information about expressions can be collected during data-flow analysis along with variable use/definition information. By examining the two together, the compiler can detect which sub-expressions reach which basic blocks and whether any of the variables within them have been redefined along the way. The technique does not apply to sub-expressions that are merely mathematically equivalent, such as A + B + C and B + C + A. The same elimination can also be applied when there are several references to the same array element with no intervening modification of that element.
(g). Loop Invariant Code Motion
Expressions whose values do not change upon entering a loop are called loop-invariant expressions. These are expressions that are needlessly executed inside the loop on every iteration, even though a single evaluation outside the loop would suffice. From use/definition information on the basic blocks associated with a loop, the compiler can identify expressions containing variables that are unchanging within it. Moving these repeated, unchanging calculations outside the loop is called loop-invariant code motion, as illustrated below.
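A minimal C sketch (hypothetical) of the transformation:

/* Before: x * x is loop invariant but recomputed on every iteration. */
void scale_before(double *a, int n, double x)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * (x * x);
}

/* After: the invariant calculation is hoisted outside the loop. */
void scale_after(double *a, int n, double x)
{
    double xx = x * x;
    for (int i = 0; i < n; i++)
        a[i] = a[i] * xx;
}

Optimizing compilers usually perform this hoisting themselves, but writing the invariant computation outside the loop makes the intent explicit.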
(h). Induction Variable Simplification
Variables that are repeatedly assigned values as a function of the iteration number on each pass through a loop are called induction variables. Eliminating such variables, or simplifying the expressions that involve them, removes some of the computation from the loop and results in better performance. Consider the following loop with K as the induction variable:
DO 10 I = 1,N
K = I*4 + M
10 CONTINUE
Induction variable simplification replaces the calculations for variables like K with simpler ones. Given a starting point and the expression's first derivative, you can arrive at K's value for the nth iteration by stepping through the n-1 intervening iterations:
K = M
DO 10 I = 1,N
K = K + 4
10 CONTINUE
Induction variable simplification is especially useful on processors that can automatically increment a register each time it is used as a pointer for a memory reference. While stepping through a loop, the memory reference and the address arithmetic can both be squeezed into a single instruction, a great savings.
(i). Register Variable Detection
On many older CISC processors, there were few general-purpose registers. Because many of the instruction formats could operate directly on memory locations, you could choose to leave the less critical variables out in main memory and bring the more important ones into registers. On RISC designs there are often many more registers to choose from, and everything has to be brought into a register anyway, which means that, for at least a little while, all variables will be register resident.
The most frequently accessed variables can be kept in registers, avoiding repeated loading from and storing to memory and resulting in better performance. To determine which variables should be register resident, the compiler has to examine how frequently each appears to be used and whether those uses are independent of each other. This information comes from data-flow analysis. A brief sketch follows.
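As a rough C sketch (hypothetical; the register keyword is only a hint, and modern optimizing compilers generally make this decision on their own):

long sum(const long *a, int n)
{
    register long s = 0;              /* ask for the accumulator in a register */
    for (register int i = 0; i < n; i++)
        s += a[i];                    /* frequent uses of s and i favor registers */
    return s;
}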
(j). Arithmetic Optimizations
Many of the optimizations performed by the best compilers can be seen as arithmetic optimizations, because computers are so frequently applied to numerically intensive applications.
Strength Reduction: Sometimes it is possible to replace a more expensive calculation with a cheaper one. Strength reduction lowers the computational cost of an operation while producing mathematically identical results. Expensive operations, taking perhaps hundreds of machine cycles, can be replaced by cheaper alternatives: exponentiations and divisions can be translated into multiplications, and multiplications into sums. Most compilers perform these optimizations automatically.
Data Type Conversions: The type and precision of a data value determine the amount of storage required to hold it and the way it can be operated on, both of which have a significant effect on performance. Statements that contain runtime type conversions suffer a performance penalty each time they are executed (a sketch appears at the end of this subsection).
Common Sub-expression Elimination: If a sub-expression appears in several places, it is called a common sub-expression. The compiler detects the repeated pattern, computes it once into a temporary variable, and replaces the remaining occurrences with that variable, eliminating the need to compute the same value repeatedly.
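A short C sketch (hypothetical) of the data-type-conversion point, where a mixed-precision constant forces a conversion on every iteration:

/* Before: 2.5 is a double constant, so each float element is widened
   to double, multiplied, and narrowed back on every iteration. */
void scale_mixed(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.5;
}

/* After: the float constant 2.5f keeps all arithmetic in single precision. */
void scale_single(float *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.5f;
}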
(k). Inlining
Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine. This can speed up execution by eliminating parameter passing and the call-and-return overhead. It also allows the compiler to optimize the inlined body together with the surrounding code. Used indiscriminately, however, inlining can result in much larger code with no increase in execution speed. A small sketch follows.
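A minimal C sketch (hypothetical names) using the standard inline qualifier:

/* A tiny function: a good inlining candidate, since call-and-return
   overhead would dwarf the work done in the body. */
static inline double square(double x)
{
    return x * x;
}

double sum_of_squares(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += square(a[i]);   /* the call can be replaced by a[i] * a[i] */
    return s;
}

Compilers also expose inlining controls, for example GCC's -finline-functions flag, and inline only where their heuristics predict a benefit.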