C-DAC,Pune : High-Perf. Comp. Frontier Technologies Exploration Group and CMSD, University of Hyderabad, Technology Workshop hyPACK (October 15-18), 2013

hyPACK-2013 : Tuning and Performance of Programs/Benchmarks Using Math Libraries

Application programs on multi-core processors can be tuned for performance using compiler optimisation techniques, code restructuring techniques, and system-tuned mathematical libraries. Sustaining performance and scalability on multi-core processors as the problem size increases requires serious effort. The system-provided tuned mathematical libraries on Intel and IBM P690 systems are discussed below.

IBM ESSL Mathematical Libraries

(a). ESSL Libraries

The performance of a computer depends on how fast the system can move data between processors and memories. The mathematical libraries are tuned to the architecture, and the best sustained performance is obtained by combining them with the most suitable compiler flags. Fortran and C programs are compiled with xlf and xlc, the compilers provided on IBM AIX systems.

Besides the standard libraries, the sequential programs use the BLAS libraries and the IBM AIX ESSL libraries to demonstrate the performance of some matrix operations using the subroutines these libraries provide. The BLAS (Basic Linear Algebra Subprograms) are high quality "building block" routines for performing basic vector and matrix operations. Level 1 BLAS does vector-vector operations, Level 2 BLAS does matrix-vector operations, and Level 3 BLAS does matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software such as LINPACK and LAPACK. The BLAS are available at www.netlib.org/blas/, and further information can be found at www.netlib.org/blas/faq.html.

The ESSL libraries provide subroutines for vector and matrix operations tuned to the IBM POWER5/POWER6 machine architecture (shared-memory processor architecture). The operations include the solution of linear systems of equations, dot products of vectors, and matrix-matrix multiplication. The routines are highly optimized with the memory and cache hierarchy of the POWER4 architecture in mind, resulting in high performance for linear algebra problems with large problem sizes.
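The three BLAS levels described above can be illustrated with plain C loops. This is only a sketch of what each level computes, not how BLAS or ESSL implement it: the function names ref_axpy, ref_gemv and ref_gemm are hypothetical, the matrices are assumed to be stored column by column (Fortran order), and a tuned library would block these loops for the cache hierarchy.

```c
#include <stddef.h>

/* Level 1 (vector-vector): y = alpha*x + y. */
void ref_axpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y = A*x + y, where A is m x n,
 * stored column-major with leading dimension lda >= m. */
void ref_gemv(int m, int n, const double *a, int lda,
              const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            y[i] += a[i + (size_t)j * lda] * x[j];
}

/* Level 3 (matrix-matrix): C = A*B + C, where A is m x k, B is k x n and
 * C is m x n, all column-major. The innermost loop walks down a column,
 * so consecutive iterations touch consecutive memory locations. */
void ref_gemm(int m, int n, int k,
              const double *a, int lda,
              const double *b, int ldb,
              double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int p = 0; p < k; p++) {
            double bpj = b[p + (size_t)j * ldb];
            for (int i = 0; i < m; i++)
                c[i + (size_t)j * ldc] += a[i + (size_t)p * lda] * bpj;
        }
    }
}
```

Level 3 routines perform on the order of n^3 arithmetic operations on n^2 data, which is why tuned libraries gain the most there: each matrix element brought into cache can be reused many times.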

For information on the ESSL libraries, see "Engineering and Scientific Subroutine Library for AIX Version 3 Release 3: Guide and Reference" at

http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/essl.html

The subroutines from BLAS and ESSL libraries used in this module are:

ddot subroutine: This BLAS Level 1 subroutine calculates the dot product of two double-precision vectors X and Y. The leading letter d refers to double precision. The return value is a double-precision value.

Calling sequence in Fortran	dot = ddot(N, DX, INCX, DY, INCY)
Calling sequence in C	double ddot(int N, double *DX, int INCX, double *DY, int INCY)
Arguments:
N	Number of elements in the vector; Default=0.
DX	Input double-precision vector X; the size of array X must be at least max(1, N*|INCX|).
INCX	Specifies the storage spacing between successive elements of the vector X. A value of one indicates that the elements of the vector are consecutive in memory.
DY	Input double-precision vector Y; the size of array Y must be at least max(1, N*|INCY|).
INCY	Specifies the storage spacing between successive elements of the vector Y. A value of one indicates that the elements of the vector are consecutive in memory.
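The meaning of the N, INCX and INCY arguments can be made concrete with a plain C reference version. This is a minimal sketch, not the ESSL or BLAS implementation: the name ref_ddot is hypothetical, and the handling of negative increments follows the convention of the reference BLAS, which is assumed here rather than taken from the text above.

```c
/* Reference dot product with BLAS-style strides.
 * n    : number of elements to multiply and sum
 * incx : spacing between successive elements of dx
 * incy : spacing between successive elements of dy
 * A negative increment walks the vector backwards, starting from its
 * last used element (reference BLAS convention, assumed here). */
double ref_ddot(int n, const double *dx, int incx,
                const double *dy, int incy)
{
    double sum = 0.0;

    if (n <= 0)
        return 0.0;

    int ix = (incx >= 0) ? 0 : (1 - n) * incx;
    int iy = (incy >= 0) ? 0 : (1 - n) * incy;

    for (int i = 0; i < n; i++) {
        sum += dx[ix] * dy[iy];
        ix += incx;
        iy += incy;
    }
    return sum;
}
```

With INCX = 2, for example, only every second element of the array DX takes part in the sum, which is how a row of a column-major matrix (or any regularly spaced subset) can be passed without copying.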

dgesv subroutine: This subroutine solves a linear system AX = B for a square general matrix A and general matrices B and X. The leading letter d refers to double precision. It is one of the LAPACK subroutines available in the IBM ESSL libraries.

Calling sequence in Fortran	call dgesv (N, NRHS, DA, LDA, IPIVOT, DB, LDB, INFO)
Calling sequence in C	void dgesv (int N, int NRHS, double *DA, int LDA, int *IPIVOT, double *DB, int LDB, int *INFO)
Arguments:
N	Order of matrix A; Default=0.
NRHS	Number of right-hand sides, equal to the number of columns of the matrix B. Default=0.
DA	On entry, the N*N matrix A.
LDA	Leading dimension of the array A as specified in a dimension or type statement. Default: LDA = max(1, N).
IPIVOT	On exit, pivot indices as computed by the DGETRF routine.
DB	On entry, the N*NRHS right-hand side matrix B. On exit, the N*NRHS solution matrix X.
LDB	Leading dimension of the array B as specified in a dimension or type statement. LDB >= max(1, N).

The INFO argument reports whether the routine completed successfully.

On exit:
INFO = 0: The subroutine completed normally.
INFO < 0: The ith argument, where i = |INFO|, had an illegal value.
INFO > 0: U(i,i), where i = INFO, is exactly zero and U is therefore singular. The LU factorization has been completed, but the solution could not be computed.
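The argument list and the INFO convention above can be mirrored in a small plain C routine. This is a sketch under stated assumptions, not ESSL's dgesv: the name ref_dgesv is hypothetical, INFO is returned as the function result instead of through an output argument, matrices are stored column-major, and the routine uses unblocked LU factorization with partial pivoting.

```c
#include <stddef.h>

/* Column-major position of element (i, j) with leading dimension ld. */
static size_t idx(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

static double absval(double v) { return v < 0.0 ? -v : v; }

static void swap_doubles(double *p, double *q)
{
    double t = *p; *p = *q; *q = t;
}

/* Solve A*X = B in place. Returns 0 on success, -i if argument i (counted
 * as in the dgesv list N, NRHS, DA, LDA, IPIVOT, DB, LDB) is illegal,
 * or i > 0 if U(i,i) is exactly zero. On exit a holds L and U,
 * ipivot holds 1-based pivot rows, and b holds the solution X. */
int ref_dgesv(int n, int nrhs, double *a, int lda,
              int *ipivot, double *b, int ldb)
{
    int min_ld = (n > 1) ? n : 1;
    if (n < 0) return -1;
    if (nrhs < 0) return -2;
    if (lda < min_ld) return -4;
    if (ldb < min_ld) return -7;

    int info = 0;
    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the largest entry in column k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (absval(a[idx(i, k, lda)]) > absval(a[idx(p, k, lda)]))
                p = i;
        ipivot[k] = p + 1;
        if (a[idx(p, k, lda)] == 0.0) {
            if (info == 0) info = k + 1;   /* remember first zero pivot */
            continue;
        }
        if (p != k)
            for (int j = 0; j < n; j++)
                swap_doubles(&a[idx(k, j, lda)], &a[idx(p, j, lda)]);
        for (int i = k + 1; i < n; i++) {
            a[idx(i, k, lda)] /= a[idx(k, k, lda)];
            for (int j = k + 1; j < n; j++)
                a[idx(i, j, lda)] -= a[idx(i, k, lda)] * a[idx(k, j, lda)];
        }
    }
    if (info != 0)
        return info;                       /* singular: no solve */

    for (int c = 0; c < nrhs; c++) {
        for (int k = 0; k < n; k++)        /* apply row interchanges */
            if (ipivot[k] - 1 != k)
                swap_doubles(&b[idx(k, c, ldb)], &b[idx(ipivot[k] - 1, c, ldb)]);
        for (int k = 0; k < n; k++)        /* forward solve, unit lower L */
            for (int i = k + 1; i < n; i++)
                b[idx(i, c, ldb)] -= a[idx(i, k, lda)] * b[idx(k, c, ldb)];
        for (int k = n - 1; k >= 0; k--) { /* back solve with U */
            b[idx(k, c, ldb)] /= a[idx(k, k, lda)];
            for (int i = 0; i < k; i++)
                b[idx(i, c, ldb)] -= a[idx(i, k, lda)] * b[idx(k, c, ldb)];
        }
    }
    return 0;
}
```

As with the library routine, the caller should check the returned status before using the contents of b: a positive value means the matrix is singular and b does not hold a solution.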

(b). Compilation & Execution

Compilation, Linking and Execution of Sequential Programs on PARAM Padma (IBM AIX - POWER5)
The IBM AIX cluster runs AIX OS 5.1L and provides the following programming tools:

Compilers Available:

XL C Compiler
XL Fortran Compiler
GNU C Compiler

Libraries Available:

ESSL - BLAS Level 1,2,3, LAPACK, LINPACK

ESSLSMP - Threaded versions of ESSL libraries

PESSL - Parallel version of the ESSL libraries for MPI BLACS

Using BLAS Libraries: BLAS downloadable from NetLib.org

Using BLAS/LAPACK/LINPACK Libraries: IBM ESSL/ESSL-SMP libraries

How to compile and link:

For more control over compiling and linking sequential programs, use a Makefile. A Makefile is particularly convenient for programs spread over a large number of files. In the Makefile, the user specifies the names of the program files and the paths of the libraries to be linked.

To compile a C/Fortran program with or without the BLAS, ESSL or ESSL-SMP libraries, edit the Makefile as per the guidelines given in it. A routine from the ESSL library is used by linking the program with the -lessl option. The multi-threaded version of the routine is used by linking with the ESSL-SMP library, which is done by specifying -lesslsmp instead of -lessl.

The appropriate lines containing "F77=", "FFLAGS=", "LINKFLAGS=", "COBJECTS=", "FOBJECTS=" and "BLASLIBS=" have to be uncommented based on the guidelines given in the Makefile. One of the lines containing "COBJECTS=" has to be uncommented to compile a C program, and one of the lines containing "FOBJECTS=" has to be uncommented to compile a Fortran program.

After editing the Makefile, type on the command line

make runc

for compilation of a C program and

make runf

for compilation of a Fortran program.

This creates an executable runc or runf for C and Fortran programs respectively. For the Hands-On Session on IBM AIX cluster, the application user can use the Makefile.

How to execute:

After the creation of an executable runc or runf, execution of the program can be done by issuing a command

./runc or ./runf

However, if the program is linked with ESSL-SMP library routines, the program will execute using multiple threads. The Makefile and the procedure used in the Hands-on session for linking with ESSL-SMP routines are intended to create a multi-threaded environment using OpenMP threads. After the Makefile has been edited as per its guidelines and the program has been compiled with the ESSL-SMP libraries, the executable runc or runf is created. The number of threads is set using the environment variable OMP_NUM_THREADS prior to execution. Note that the shell does not allow spaces around the = sign:

export OMP_NUM_THREADS=<number of threads>

For example, to execute runc or runf using 4 threads, the number of threads has to be set prior to execution using

export OMP_NUM_THREADS=4

After setting the number of threads, the executable runc or runf can be executed.
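A program can check which thread count it was given by reading the same environment variable. This is a minimal sketch for logging or sanity checks only: the helper names parse_thread_count and requested_threads are hypothetical, and the threaded library reads OMP_NUM_THREADS on its own, so nothing here is required for ESSL-SMP to work.

```c
#include <limits.h>
#include <stdlib.h>

/* Parse a thread-count string such as "4". Returns fallback when the
 * string is missing, empty, not a plain positive integer, or too large.
 * List values such as "4,2" (nested levels) are deliberately rejected
 * in this sketch. */
int parse_thread_count(const char *value, int fallback)
{
    if (value == NULL || *value == '\0')
        return fallback;

    char *end = NULL;
    long n = strtol(value, &end, 10);

    if (end == value || *end != '\0' || n < 1 || n > INT_MAX)
        return fallback;
    return (int)n;
}

/* Thread count requested through OMP_NUM_THREADS, or 1 if it is unset. */
int requested_threads(void)
{
    return parse_thread_count(getenv("OMP_NUM_THREADS"), 1);
}
```

Printing requested_threads() at program start is a quick way to confirm that the export command took effect in the shell that launches runc or runf.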

Centre for Development of Advanced Computing