



Tuning and Performance on Multi-Core Processors

Tuning the performance of application programs on multi-core processors using compiler optimization and code restructuring techniques is challenging. Understanding programming paradigms (MPI, OpenMP, Pthreads), making effective use of the right compiler optimization flags, and obtaining correct results for a given application are all important. Enhancing the performance and scalability of an application on multi-core processors as the problem size grows requires serious effort. Several optimization techniques are discussed below.


Classical Optimization Techniques
(a). Copy Propagation

    Consider the following two assignment statements:

    X = Y
    Z = 1. + X

    The second statement cannot execute until the first statement has completed, but this dependency disappears if we rewrite the pair as follows:

    X = Y
    Z = 1. + Y

    This is called propagating a copy of Y. It allows the compiler to remove statements like X = Y entirely when X is not referenced in any following statement, and it makes the two instructions independent, so they can execute in parallel, resulting in better performance.


(b). Constant Folding

    Constant folding is the technique by which the compiler identifies expressions whose variables can be replaced by constants and evaluates them at compile time to improve performance. Consider the following piece of program:

    PROGRAM MAIN
    INTEGER I,K
    PARAMETER (I = 100)
    K = 200
    J = I+K
    END

    Because I and K are each constant, J is a constant as well. The compiler reduces constant expressions like I+K to a single constant using constant folding. Through data-flow analysis, if the compiler finds that every path in the data-flow graph leading to a variable yields the same constant value, it replaces all references to that variable with that constant. In this way the compiler can calculate the value of an expression at compile time whenever the expression contains only constants.


(c). Dead Code Removal

    Dead code is code that is never executed, or whose results are never used. It comes in the following two types:

  • Instructions that are unreachable.
  • Instructions that produce results that are never used.

    Dead code may be written by the programmer, but it most commonly arises as a by-product of other optimizations performed by the compiler.


    In the case of local variables, the compiler detects dead code by analyzing each variable's definitions and uses: it can see whether any other part of the routine references the variable. It cannot tell the fate of variables that are external or common, so those computations are always kept (as long as they are reachable). Based on this analysis, the compiler can either remove dead statements or issue warning messages at compilation.
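
    As a rough illustration, the following hypothetical C fragment (not from the original text) contains both kinds of dead code:

    /* Hypothetical example: both kinds of dead code. */
    int magnitude(int a)
    {
        int twice = a * 2;      /* result is never used: dead */
        if (a < 0)
            return -a;
        return a;
        a = a + 1;              /* unreachable: dead */
    }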


(d). Strength Reduction

    Operations and expressions have different time costs associated with them. Replacing a more expensive calculation with a cheaper one is called strength reduction, and it results in better performance than the unreduced form. For example, the exponentiation operation has the overhead of calling a subroutine and then performing computations such as logarithms and multiplications. For low powers, exponentiation can be replaced by much less costly operations such as multiplication, i.e. Y = X**2 can be replaced by Y = X*X. Multiplications can similarly be replaced by cheaper additions to improve performance.

    A related compiler optimization is peephole optimization, in which the compiler crawls through the intermediate code looking for cheap and easy "stocking stuffer" improvements over small windows of instructions.
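
    For instance, in C (a minimal sketch, not from the original text), a small power can be computed with a single multiply instead of a math-library call; many compilers perform this replacement automatically:

    #include <math.h>

    double square_slow(double x) { return pow(x, 2.0); }  /* library call overhead */
    double square_fast(double x) { return x * x; }        /* a single multiplication */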


(e). Variable Renaming

    This technique is employed by the compiler, or at the source-code level by the programmer, to expose parallelism between two or more calculations when the expressions use the same variable but are independent of each other. Consider the following fragment:

    x = y * z;
    q = r + x + x;
    x = a + b;

    When the compiler recognizes that a variable is being recycled, and that its current and former uses are independent, it can substitute a new variable to keep the calculations separate.

    x0 = y * z;
    q = r + x0 + x0;
    x = a + b;

    Variable renaming, which clarifies the flow of the program and also exposes parallelism, can take place both locally, at the basic-block level, and globally, as a product of data-flow analysis.


(f). Common Sub-expression Elimination

    Sub-expressions are pieces of expressions. For instance, A + B is a sub-expression of C * (A + B). If A + B appears in several places, as it does below, we call it a common sub-expression:

    D = C * (A + B)
    E = (A + B)/2

    Rather than calculate A + B twice, the compiler can generate a temporary variable and use it wherever A + B is required:

    temp = A + B
    D = C * temp
    E = temp/2

    Information about expressions can be collected during data-flow analysis along with variable use/definition information. By examining both together, the compiler can detect which sub-expressions reach which basic blocks, and whether any of the variables within them have been redefined along the way. Note that the compiler will generally not recognize sub-expressions written in different forms, such as A + B + C and B + C + A, as common. The same optimization applies when there is more than one reference to the same array element with no modification of that element in between.
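
    The array-element case in C might look like this (a hypothetical sketch, not from the original text):

    /* Before: a[i] is read from memory twice. */
    void before(double *y, double *z, const double a[], int i, double c)
    {
        *y = a[i] * 2.0;
        *z = a[i] + c;          /* a[i] unchanged since the previous read */
    }

    /* After: the common reference is loaded once into a temporary. */
    void after(double *y, double *z, const double a[], int i, double c)
    {
        double t = a[i];        /* single load */
        *y = t * 2.0;
        *z = t + c;
    }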


(g). Loop Invariant Code Motion

    Expressions that do not change once the loop is entered are called loop-invariant expressions. When such an expression is executed inside a loop, it is needlessly recomputed on every iteration, even though computing it once outside the loop would suffice. From use/definition information on the basic blocks associated with the loop, the compiler can identify expressions whose variables are unchanging. The act of moving these repeated, unchanging calculations outside the loop is called loop-invariant code motion.
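
    In C this might look as follows (a minimal sketch, not from the original text), where x * y is invariant and is hoisted out of the loop:

    void scale(double a[], const double b[], int n, double x, double y)
    {
        double t = x * y;       /* loop-invariant expression computed once */
        for (int i = 0; i < n; i++)
            a[i] = t + b[i];    /* was: a[i] = x * y + b[i]; */
    }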


(h). Induction Variable Simplification

    Variables that are reassigned on every iteration of a loop, with a value that is a function of the iteration number, are called induction variables. Eliminating such variables, or simplifying the expressions that involve them, removes computation from the loop and results in better performance. Consider the following loop, with K as the induction variable:

    Do 10 I = 1,N
        K = I*4 + M
    10 CONTINUE

    Induction variable simplification replaces calculations for variables like K with simpler ones. Given a starting point and the expression's first derivative, you can arrive at K's value for each iteration by adding the constant increment once per pass instead of recomputing the full expression:

    K = M
    DO 10 I = 1,N
        K = K + 4
    10 CONTINUE

    Induction variable simplification is especially useful on processors that can automatically increment a register each time it is used as a pointer for a memory reference. While stepping through a loop, the memory reference and the address arithmetic can both be squeezed into a single instruction: a great savings.
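
    The same idea applies to array addressing in C (a hypothetical sketch, not from the original text): the address computation a + i can be strength-reduced to a pointer that is simply bumped on each iteration, a transformation most optimizing compilers perform themselves:

    double sum_indexed(const double a[], int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];              /* address recomputed as a + i each time */
        return s;
    }

    double sum_pointer(const double a[], int n)
    {
        double s = 0.0;
        for (const double *p = a; p < a + n; p++)
            s += *p;                /* pointer bumped by one add per iteration */
        return s;
    }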


(i). Register Variable Detection

    On many older CISC processors there were few general-purpose registers. Because many of the instruction formats could operate directly on memory locations, you could choose to leave the less critical variables out in main memory and bring the more important ones into registers. On RISC designs, there are often many more registers to choose from, and everything has to be brought into a register anyway. This means that, for at least a little while, all variables will be register resident.

    Variables that are accessed most frequently can be kept in registers, avoiding repeated loads from and stores to memory and resulting in better performance. To determine which variables should be register resident, the compiler examines how frequently each is used and whether its uses are independent of one another. This information comes from data-flow analysis.
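
    In C the register storage class is only a hint, and modern compilers make this decision themselves from data-flow analysis, but it illustrates the idea (a minimal sketch, not from the original text):

    long dot(const int a[], const int b[], int n)
    {
        register long s = 0;        /* hint: keep the accumulator in a register */
        for (int i = 0; i < n; i++)
            s += (long)a[i] * b[i];
        return s;
    }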


(j). Arithmetic Optimizations

    Many of the optimizations performed by good compilers can be seen as arithmetic optimizations, because computers are so frequently used for numerically intensive applications.

    Strength Reduction: Sometimes it is possible to replace a more expensive calculation with a cheaper one. Strength reduction lowers the computational cost of an operation while producing mathematically identical results. Expensive operations, taking perhaps hundreds of machine cycles, can be replaced by cheaper alternatives: exponentiations and divisions can be translated into multiplications, and multiplications into sums. Most compilers perform these optimizations automatically.

    Data Type Conversions: The type and precision of a data value determine the amount of storage required to hold it and the way in which it can be operated on, and they have a significant effect on performance. Statements that contain runtime type conversions suffer a performance penalty each time they are executed.
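
    For instance (a hypothetical C fragment, not from the original text), writing a float expression with a double constant forces conversions on every call:

    float scale_mixed(float x) { return x * 0.1; }   /* 0.1 is a double: x is promoted, the result demoted */
    float scale_float(float x) { return x * 0.1f; }  /* all-float arithmetic, no runtime conversions */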

    Common Sub-expression Elimination: If a sub-expression appears in several places, it is called a common sub-expression. The compiler detects such repeated patterns in the code and replaces all but one with a temporary variable, eliminating the need to compute the same value repeatedly.


(k). Inlining

    Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine. This can speed up execution by eliminating parameter passing and the call and return overhead, and it also allows the compiler to optimize the inlined body together with the surrounding code. Using function inlining indiscriminately, however, can result in much larger code with no increase in execution speed.
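
    A common way to request inlining in C is shown below (a sketch, not from the original text; inline is only a hint, and compilers often inline small functions on their own):

    static inline int square(int n) { return n * n; }

    int sum_of_squares(int k)
    {
        int s = 0;
        for (int i = 1; i <= k; i++)
            s += square(i);     /* call overhead likely eliminated by inlining */
        return s;
    }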


Centre for Development of Advanced Computing