



Tuning and Performance on Multi-Core Processors

Tuning the performance of application programs on multi-core processors using compiler optimization and code restructuring techniques is challenging. Understanding programming paradigms (MPI, OpenMP, Pthreads), making effective use of the right compiler optimization flags, and obtaining correct results for a given application are all important. Enhancing the performance and scalability of an application on multi-core processors as the problem size grows requires serious effort. Several optimization techniques are discussed below.


Classical Optimization Techniques
(a). Copy Propagation

    Consider the following two assignment statements:

    X = Y
    Z = 1. + X

    The second statement cannot execute until the first statement has completed, but this dependency disappears if we rewrite the pair as follows:

    X = Y
    Z = 1. + Y

    This is called propagating a copy of Y. It allows the compiler to remove statements like X = Y entirely when X is not referenced in any following statement, and it makes the two instructions independent, so they can execute in parallel, resulting in better performance.


(b). Constant Folding

    Constant folding is the technique by which the compiler identifies expressions whose variables can be replaced by constants and evaluates them at compile time to improve performance. Consider the following piece of program:

    PROGRAM MAIN
    INTEGER I,K
    PARAMETER (I = 100)
    K = 200
    J = I+K
    END

    Because I and K are each constant, J is a constant as well. The compiler reduces constant expressions like I+K to a single constant using constant folding. Through data-flow analysis, if the compiler finds that every path in the data-flow graph leading to a variable yields the same constant value, it replaces all references to that variable with that constant. In this way the compiler can calculate the value of an expression at compile time whenever the expression contains only constants.


(c). Dead Code Removal

    Dead code is code that is never executed, or whose results are never used. It comes in the following two types:

  • Instructions that are unreachable.
  • Instructions that produce results that are never used.

    Dead code may be written by the programmer, but it most commonly arises as a by-product of other optimizations performed by the compiler.


    In the case of local variables, the compiler detects dead code by analyzing each variable's definitions and uses: it can see whether any other part of the routine references the variable. It cannot tell the fate of variables that are external or common, so those computations are always kept (as long as they are reachable). Based on this analysis, the compiler can either remove dead statements or issue warning messages at compilation.
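
    As a rough illustration, the following hypothetical C fragment (not from the original text) contains both kinds of dead code:

    /* Hypothetical example: both kinds of dead code. */
    int magnitude(int a)
    {
        int twice = a * 2;      /* result is never used: dead */
        if (a < 0)
            return -a;
        return a;
        a = a + 1;              /* unreachable: dead */
    }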


(d). Strength Reduction

    Operations and expressions have different time costs associated with them. Replacing a more expensive calculation with a cheaper one is called strength reduction, and it results in better performance than the unreduced form. For example, the exponentiation operation has the overhead of calling a subroutine and then performing computations such as logarithms and multiplications. For low powers, exponentiation can be replaced by much less costly operations such as multiplication, i.e. Y = X**2 can be replaced by Y = X*X. Multiplications can similarly be replaced by cheaper additions to improve performance.

    A related compiler optimization is peephole optimization, in which the compiler crawls through the intermediate code looking for cheap and easy "stocking stuffer" improvements over small windows of instructions.
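
    For instance, in C (a minimal sketch, not from the original text), a small power can be computed with a single multiply instead of a math-library call; many compilers perform this replacement automatically:

    #include <math.h>

    double square_slow(double x) { return pow(x, 2.0); }  /* library call overhead */
    double square_fast(double x) { return x * x; }        /* a single multiplication */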


(e). Variable Renaming

    This technique is employed by the compiler, or at the source-code level by the programmer, to expose parallelism between two or more calculations when the expressions use the same variable but are independent of each other. Consider the following fragment:

    x = y * z;
    q = r + x + x;
    x = a + b;

    When the compiler recognizes that a variable is being recycled, and that its current and former uses are independent, it can substitute a new variable to keep the calculations separate.

    x0 = y * z;
    q = r + x0 + x0;
    x = a + b;

    Variable renaming, which clarifies the flow of the program and also exposes parallelism, can take place both locally, at the basic-block level, and globally, as a product of data-flow analysis.


(f). Common Sub-expression Elimination

    Sub-expressions are pieces of expressions. For instance, A + B is a sub-expression of C * (A + B). If A + B appears in several places, as it does below, we call it a common sub-expression:

    D = C * (A + B)
    E = (A + B)/2

    Rather than calculate A + B twice, the compiler can generate a temporary variable and use it wherever A + B is required:

    temp = A + B
    D = C * temp
    E = temp/2

    Information about expressions can be collected during data-flow analysis along with variable use/definition information. By examining both together, the compiler can detect which sub-expressions reach which basic blocks, and whether any of the variables within them have been redefined along the way. Note that the compiler will generally not recognize sub-expressions written in different forms, such as A + B + C and B + C + A, as common. The same optimization applies when there is more than one reference to the same array element with no modification of that element in between.
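
    The array-element case in C might look like this (a hypothetical sketch, not from the original text):

    /* Before: a[i] is read from memory twice. */
    void before(double *y, double *z, const double a[], int i, double c)
    {
        *y = a[i] * 2.0;
        *z = a[i] + c;          /* a[i] unchanged since the previous read */
    }

    /* After: the common reference is loaded once into a temporary. */
    void after(double *y, double *z, const double a[], int i, double c)
    {
        double t = a[i];        /* single load */
        *y = t * 2.0;
        *z = t + c;
    }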


(g). Loop Invariant Code Motion

    Expressions that do not change once the loop is entered are called loop-invariant expressions. When such an expression is executed inside a loop, it is needlessly recomputed on every iteration, even though computing it once outside the loop would suffice. From use/definition information on the basic blocks associated with the loop, the compiler can identify expressions whose variables are unchanging. The act of moving these repeated, unchanging calculations outside the loop is called loop-invariant code motion.
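
    In C this might look as follows (a minimal sketch, not from the original text), where x * y is invariant and is hoisted out of the loop:

    void scale(double a[], const double b[], int n, double x, double y)
    {
        double t = x * y;       /* loop-invariant expression computed once */
        for (int i = 0; i < n; i++)
            a[i] = t + b[i];    /* was: a[i] = x * y + b[i]; */
    }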


(h). Induction Variable Simplification

    Variables that are reassigned on every iteration of a loop, with a value that is a function of the iteration number, are called induction variables. Eliminating such variables, or simplifying the expressions that involve them, removes computation from the loop and results in better performance. Consider the following loop, with K as the induction variable:

    Do 10 I = 1,N
        K = I*4 + M
    10 CONTINUE

    Induction variable simplification replaces calculations for variables like K with simpler ones. Given a starting point and the expression's first derivative, you can arrive at K's value for each iteration by adding the constant increment once per pass instead of recomputing the full expression:

    K = M
    DO 10 I = 1,N
        K = K + 4
    10 CONTINUE

    Induction variable simplification is especially useful on processors that can automatically increment a register each time it is used as a pointer for a memory reference. While stepping through a loop, the memory reference and the address arithmetic can both be squeezed into a single instruction: a great savings.
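
    The same idea applies to array addressing in C (a hypothetical sketch, not from the original text): the address computation a + i can be strength-reduced to a pointer that is simply bumped on each iteration, a transformation most optimizing compilers perform themselves:

    double sum_indexed(const double a[], int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];              /* address recomputed as a + i each time */
        return s;
    }

    double sum_pointer(const double a[], int n)
    {
        double s = 0.0;
        for (const double *p = a; p < a + n; p++)
            s += *p;                /* pointer bumped by one add per iteration */
        return s;
    }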


(i). Register Variable Detection

    On many older CISC processors there were few general-purpose registers. Because many of the instruction formats could operate directly on memory locations, you could choose to leave the less critical variables out in main memory and bring the more important ones into registers. On RISC designs, there are often many more registers to choose from, and everything has to be brought into a register anyway. This means that, for at least a little while, all variables will be register resident.

    Variables that are accessed most frequently can be kept in registers, avoiding repeated loads from and stores to memory and resulting in better performance. To determine which variables should be register resident, the compiler examines how frequently each is used and whether its uses are independent of one another. This information comes from data-flow analysis.
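
    In C the register storage class is only a hint, and modern compilers make this decision themselves from data-flow analysis, but it illustrates the idea (a minimal sketch, not from the original text):

    long dot(const int a[], const int b[], int n)
    {
        register long s = 0;        /* hint: keep the accumulator in a register */
        for (int i = 0; i < n; i++)
            s += (long)a[i] * b[i];
        return s;
    }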


(j). Arithmetic Optimizations

    Many of the optimizations performed by good compilers can be seen as arithmetic optimizations, because computers are so frequently used for numerically intensive applications.

    Strength Reduction: Sometimes it is possible to replace a more expensive calculation with a cheaper one. Strength reduction lowers the computational cost of an operation while producing mathematically identical results. Expensive operations, taking perhaps hundreds of machine cycles, can be replaced by cheaper alternatives: exponentiations and divisions can be translated into multiplications, and multiplications into sums. Most compilers perform these optimizations automatically.

    Data Type Conversions: The type and precision of a data value determine the amount of storage required to hold it and the way in which it can be operated on, and they have a significant effect on performance. Statements that contain runtime type conversions suffer a performance penalty each time they are executed.
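
    For instance (a hypothetical C fragment, not from the original text), writing a float expression with a double constant forces conversions on every call:

    float scale_mixed(float x) { return x * 0.1; }   /* 0.1 is a double: x is promoted, the result demoted */
    float scale_float(float x) { return x * 0.1f; }  /* all-float arithmetic, no runtime conversions */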

    Common Sub-expression Elimination: If a sub-expression appears in several places, it is called a common sub-expression. The compiler detects such repeated patterns in the code and replaces all but one with a temporary variable, eliminating the need to compute the same value repeatedly.


(k). Inlining

    Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine. This can speed up execution by eliminating parameter passing and the call and return overhead, and it also allows the compiler to optimize the inlined body together with the surrounding code. Using function inlining indiscriminately, however, can result in much larger code with no increase in execution speed.
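
    A common way to request inlining in C is shown below (a sketch, not from the original text; inline is only a hint, and compilers often inline small functions on their own):

    static inline int square(int n) { return n * n; }

    int sum_of_squares(int k)
    {
        int s = 0;
        for (int i = 1; i <= k; i++)
            s += square(i);     /* call overhead likely eliminated by inlining */
        return s;
    }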


Centre for Development of Advanced Computing