Tuning & Performance / Benchmarks on Multi-Core Processors
Tuning and benchmarking application programs on multi-core processors using compiler optimisation techniques and code restructuring techniques is challenging. It requires an understanding of programming paradigms (MPI, OpenMP, Pthreads), effective use of the right compiler optimisation flags, and verification that the application still produces correct results. Enhancing the performance and scalability of a given application on multi-core processors as the problem size increases requires serious effort. Several optimisation techniques are discussed below.
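As a simple illustration of the role of compiler flags before any code restructuring, an application might be compiled and timed at several optimisation levels. The commands below are common GCC examples only, shown as a sketch; the file name app.c is hypothetical, and the best flag set depends on the compiler and the application:
gcc -O0 -o app app.c
gcc -O2 -o app app.c
gcc -O3 -funroll-loops -o app app.c
gcc -O3 -march=native -funroll-loops -o app app.c
Timing the same run at each level shows how much performance the compiler can recover on its own before manual restructuring, such as the techniques below, is needed.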
Memory Reference Optimisation Techniques
Good performance is becoming more closely tied to good memory access patterns and careful reuse of operands. No practical memory system is fast enough to satisfy every reference immediately, so the memory performance a program achieves depends on its behaviour: bad memory access patterns give bad performance. A single program, structured in two different ways, can run at very different speeds because of the way it plays to the memory system. The speed of the memory system matters for parallelism, and getting good memory throughput is important for the performance of a parallel program. When the compiler can see how memory references interact, it can safely schedule them alongside other operations for improved performance.
(a). Memory Access Patterns
For an array with a single dimension, stepping through one element at a time gives the best performance on modern hierarchical-memory machines. For multidimensional arrays, access is fastest if you iterate on the array subscript offering the smallest stride, or step size. In FORTRAN programs this is the leftmost subscript; in C it is the rightmost. The FORTRAN loop below has unit stride and therefore runs quickly:
      DO 10 J = 1, N
        DO 10 I = 1, N
          A(I,J) = B(I,J) + C(I,J)*D
10    CONTINUE
In contrast, the next loop is slower because its stride is N (assume N is greater than 1). The larger the value of N, the more significant the performance difference will be:
      DO 10 J = 1, N
        DO 10 I = 1, N
          A(J,I) = B(J,I) + C(J,I)*D
10    CONTINUE
In C programs, the subscripts appear in reverse order. Here is a unit-stride loop like the first one above, but written in C:
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[i][j]*d;
Unit stride gives the best performance because it conserves cache entries. The worst-case patterns are those that jump through memory, especially through a large amount of memory, and particularly those that do so without apparent reason. On large jobs you pay a penalty not only for cache misses but for TLB misses too (the Translation Lookaside Buffer is a cache of translations from virtual memory addresses to physical memory addresses). It would be nice to be able to restructure such jobs so that they make better use of memory; in other words, programs should be written so that their memory access patterns give better performance.
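To see the effect in practice, the following C program, a minimal sketch (the array size of 2048 and the use of the standard clock() timer are illustrative choices, not from the original text), sums the same array in row order and in column order. On a cache-based machine the unit-stride row-order version is typically several times faster for large arrays:
#include <stdio.h>
#include <time.h>

#define N 2048

static double a[N][N];

int main(void)
{
    int i, j;
    double sum = 0.0;
    clock_t t0;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;

    /* Unit stride: the rightmost subscript varies fastest in C */
    t0 = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    printf("row order:    %.3f s (sum = %g)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, sum);

    /* Stride N: the leftmost subscript varies fastest, wasting
       most of each cache line that is fetched */
    sum = 0.0;
    t0 = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];
    printf("column order: %.3f s (sum = %g)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, sum);

    return 0;
}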
(b). Loop Interchange to Ease Memory Access Patterns
Loop interchange is a good technique for lessening the impact of strided memory references. Consider the following FORTRAN loop with non-unit stride; we will trade its N-strided memory references for unit strides.
      DO 10 J = 1, N
        DO 10 I = 1, N
          A(J,I) = B(J,I) + C(J,I)*D
10    CONTINUE
Loop interchange is possible in the above loop because each iteration is independent of the others. After interchange, A, B, and C are referenced with the leftmost subscript varying most quickly, as shown below:
      DO 10 I = 1, N
        DO 10 J = 1, N
          A(J,I) = B(J,I) + C(J,I)*D
10    CONTINUE
This can make an important difference in performance: we traded three N-strided memory references for unit strides. Often, however, you will find some mix of variables with small and large strides, in which case interchanging the loops merely moves the damage around without making it go away. The following example shows such a loop and its interchanged form:
      DO 10 I = 1, N
        DO 20 J = 1, M
          A(J,I) = B(I,J)
20      CONTINUE
10    CONTINUE
and its interchanged form:
      DO 10 J = 1, M
        DO 20 I = 1, N
          A(J,I) = B(I,J)
20      CONTINUE
10    CONTINUE
It is clear from the loops above that whichever way you interchange them, you will break the memory access pattern for either A or B.
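The same transformation carries over to C with the subscript roles reversed. The sketch below is illustrative only (the function names and the C99 variable-length array parameters are assumptions, not from the original text); the interchange is legal because each iteration is independent:
/* Before interchange: the inner loop walks the leftmost subscript,
   so every reference strides n doubles through memory */
void strided_version(int n, double a[n][n], double b[n][n],
                     double c[n][n], double d)
{
    int i, j;
    for (j = 0; j < n; j++)
        for (i = 0; i < n; i++)
            a[i][j] = b[i][j] + c[i][j]*d;
}

/* After interchange: the rightmost subscript varies fastest,
   so a, b, and c are all referenced with unit stride */
void interchanged_version(int n, double a[n][n], double b[n][n],
                          double c[n][n], double d)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = b[i][j] + c[i][j]*d;
}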
(c). Blocking to Ease Memory Access Patterns
Blocking is another kind of memory reference optimisation. As with loop interchange, the challenge is to retrieve as much data as possible with as few cache misses as possible. Consider a simple two-dimensional vector sum; we will rearrange its loop to ease the memory access pattern.
      DO 10 I = 1, N
        DO 20 J = 1, N
          A(J,I) = A(J,I) + B(I,J)
20      CONTINUE
10    CONTINUE
The above loop involves two arrays: one is referenced with unit stride, the other with a stride of N. We can interchange the loops, but one way or another we will still have an N-strided reference on either A or B, either of which is undesirable. The trick is to block the references so that you grab a few elements of A, then a few of B, then a few of A again, and so on, in neighbourhoods. Combining inner and outer loop unrolling achieves this; the rearranged loop is given below.
      DO 10 I = 1, N, 2
        DO 20 J = 1, N, 2
          A(J,I)     = A(J,I)     + B(I,J)
          A(J+1,I)   = A(J+1,I)   + B(I,J+1)
          A(J,I+1)   = A(J,I+1)   + B(I+1,J)
          A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
20      CONTINUE
10    CONTINUE
To make programming easier, the compiler provides the illusion that two-dimensional arrays such as A and B are rectangular plots of memory. Actually, memory is sequential storage. In FORTRAN, a two-dimensional array is constructed from memory by logically lining memory strips up against each other: array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of the next column. Stepping through the array with unit stride traces out the shape of a backwards "N", repeated over and over, moving to the right.
If we could somehow rearrange the loop so that it consumed the arrays in small rectangles rather than strips, we could conserve some of the cache entries that are being discarded. This would improve cache performance and lower the run time.
For large problem sizes, more than cache entries are at stake. Many clusters are virtual memory machines, meaning that memory references have to be translated through a Translation Lookaside Buffer (TLB). A typical workstation's TLB can accommodate entries covering only several hundred KB of data, so if a program has large arrays, TLB misses will add to the run time in addition to cache misses.
We rewrite the above loop yet again, this time blocking the references at two different levels: in 2 x 2 squares to save cache entries, and by cutting the original loop in two parts to save TLB entries.
      DO 11 I = 1, N, 2
        DO 21 J = 1, N/2, 2
          A(J,I)     = A(J,I)     + B(I,J)
          A(J+1,I)   = A(J+1,I)   + B(I,J+1)
          A(J,I+1)   = A(J,I+1)   + B(I+1,J)
          A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
21      CONTINUE
11    CONTINUE
      DO 10 I = 1, N, 2
        DO 20 J = N/2+1, N, 2
          A(J,I)     = A(J,I)     + B(I,J)
          A(J+1,I)   = A(J+1,I)   + B(I,J+1)
          A(J,I+1)   = A(J,I+1)   + B(I+1,J)
          A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
20      CONTINUE
10    CONTINUE
The above loop-splitting strategy works well for reasonably large values of N, say 256, and blocking can be taken even further for larger problems. Thus, for many sequential programs, it is necessary to know such tricks for easing memory access patterns, so that the optimised loops outperform the original, particularly for large values of N.
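In C, the same blocking idea is usually written with an explicit tile size rather than manual unrolling. The sketch below is a minimal illustration, not the original code: BLOCK is a hypothetical tuning parameter (16 is only an example), and n is assumed to be a multiple of BLOCK for brevity:
#define BLOCK 16   /* tile size; tune so a BLOCK x BLOCK tile of each
                      array fits comfortably in cache */

/* Blocked form of A(J,I) = A(J,I) + B(I,J): visit the arrays in
   BLOCK x BLOCK rectangles instead of long strips, so cache lines
   of both a and b are reused before they are evicted.
   Assumes n is a multiple of BLOCK. */
void blocked_update(int n, double a[n][n], double b[n][n])
{
    int ii, jj, i, j;

    for (ii = 0; ii < n; ii += BLOCK)
        for (jj = 0; jj < n; jj += BLOCK)
            for (i = ii; i < ii + BLOCK; i++)
                for (j = jj; j < jj + BLOCK; j++)
                    a[j][i] = a[j][i] + b[i][j];
}
Within each tile, the strided references stay inside a small set of rows that remain cached, which is the same effect the 2 x 2 unrolled FORTRAN version achieves by hand.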
Getting memory references right is one of the most important challenges of application performance tuning. As explained earlier, as CPUs get faster and memory systems fail to keep up, it will become the most important challenge.