• Mode-1 Multi-Core • Memory Allocators • OpenMP • Intel TBB • Pthreads • Java - Threads • Charm++ Prog. • Message Passing (MPI) • MPI - OpenMP • MPI - Intel TBB • MPI - Pthreads • Compiler Opt. Features • Threads-Perf. Math.Lib. • Threads-Prof. & Tools • Threads-I/O Perf. • PGAS : UPC / CAF / GA • Power-Perf. • Home




hyPACK-2013 : Tuning and Performance of Programs/Benchmarks Using Math Libraries


Tuning and Performance of Application Programs using Compiler optimisation techniques, Codre restructuring techniques and system tuned mathematical libraries on Multi-Core Processors will enhance performance. Performance and scalability of application on multi-core processors with respect to increase in problem size require serious effrots. System provided tuned mathematical libraries on Intel, IBM P690 are discussed below.


IBM ESSL Mathematical Libraries

(a). ESSL Libraries

The performance of computer depends how fast the system can move data between processors and memories.The mathematical libraries are tuned to architecture and one can use the best compiler falgs to get the best sustained performance. The compilers used for compiling Fortran and C programs are xlf and xlc provided on IBM AIX Systems.

Besides the standard libraries, the Sequential Programs use BLAS libraries and IBM AIX -ESSL libraries for demonstrating the performance of some of the matrix operations using the subroutines provided by these libraries. The BLAS (Basic Linear Algebra Subprograms) are high quality "building block" routines for performing basic vector and matrix operations. Level 1 BLAS does vector-vector operations, Level 2 BLAS does matrix-vector operations, and Level 3 BLAS does matrix-matrix operations. Because the BLAS is efficient, portable, and widely available, it is commonly used in the development of high quality linear algebra software like LINPACK and LAPACK. They are available at www.netlib.org/blas/. Information about BLAS can be found at www.netlib.org/blas/faq.html. The ESSL libraries are the libraries providing the various subroutines for matrix-vector operations tuned to the IBM POWER5/Power6 machine a rchitecture (shared-memory processor architecture). The operations include solution of linear system of equations, dot product of vectors, matrix-matrix multiplication. These are highly optimized keeping in mind the memory and cache hierarchy of POWER4 architecture resulting in high performance for Linear Algebra problems with large problem sizes.

For information on ESSL libraries , one can go through "Engineering and Scientific Subroutine Library for AIX Version 3 Release 3: Guide and Reference" at

IBM ESSL Library and I http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/essl.html

The subroutines from BLAS and ESSL libraries used in this module are:

ddot subroutine: This is the subroutine from BLAS level 1 libraries which calculates the dot product of two double precision vectors given by X and Y. The starting letter d refers to double precision operation. The return value is a double precision value.

Calling sequence in Fortran dot = ddot(N, DX, INCX, DY, INCY)
Calling Sequence in C double ddot(int N, double *DX, int INCX, double *DY, int INCY)
Arguments :
N Number of elements in the vector; Default=0. 
DX Input double-precision vector X; the size of array X must be at least max(1,N*|INCX|).
INCX Specifies the storage spacing between successive elements of the vector X. A value of one indicates that the elements of the vector are consecutive in memory.
DY Input double-precision vector Y; the size of array Y must be at least max(1,N*|INCY|). 
INCY Specifies the storage spacing between successive elements of the vector Y. A value of one indicates that the elements of the vector are consecutive in memory.


dgesv subroutine: This subroutine solves a linear system AX = B for a square general matrix A and general matrices B and X. The starting letter d refers to double precision operation.This is the present in LAPACK subroutines in the  IBM ESSL libraries.



Calling sequence in Fortran call dgesv (N, NRHS, DA, LDA, IPIVOT, DB, LDB, INFO)
Calling Sequence in C void dgesv (int N, int NRHS, double *DA, int LDA, int *IPIVOT, double *DB, int LDB, int *INFO)
Arguments
N Order of Matrix A; Default=0
NRHS Number of right-hand sides, equal to the number of columns of the matrix B. Default=0. 
DA On entry, the N*N matrix A.
LDA Leading dimension of the array A as specified in a dimension or type statement. Default : LDA= max(1, N). 
IPIVOT On exit, pivot indices as computed by DGETRF routine.
DB On entry, the N*NRHS right-hand side matrix B. On exit, the N*NRHS solution matrix X.
LDB LDB Leading dimension of the array B as specified in a dimension or type statement. LDB . max(1, N).

Below information is about successful completion of mathematical routine.

On exit:
INFO = 0: Subroutine completed normally:
INFO < 0 The ith argument, where i = | INFO |, had an illegal value.
INFO > 0 U(i,i), where i = INFO, is exactly zero and U is therefore singular. The LU factorization has been completed, but the solution could not be computed
(b). Compilation & Execution

Compilation, Linking and Execution of Sequential Programs on PARAM Padma (IBM AIX -Power 5)
IBM AIX cluster runs AIX OS 5.1 L. It has the following Programming tools:

Compilers Available:

  • XL C Compiler
  • XL Fortran Compiler
  • GNU C Compiler 


  • Libraries Available:

  • ESSL - BLAS Level 1,2,3, LAPACK, LINPACK
  • ESSLSMP - Threaded versions of ESSL libraries
  • PESSL - Parallel version of the ESSL libraries for MPI BLACS 


  • Using BLAS Libraries: 

  • Using BLAS Downloadable from NetLib.org


  • Using BLAS/LAPACK/LINPACK Libraries:
  • Using IBM ESSL/ESSL-SMP Libraries
  • How to compile and link:

    For more control over the process of compiling and linking programs for Sequential Programs, you should use a 'Makefile'. You may also use some commands in Makefile particularly for programs contained in a large number of files.  The user has to specify the names of the program and appropriate paths to link some of the libraries required for the programs in the Makefile.

    To compile a C/Fortran program linking with/without BLAS or ESSL or ESSL-SMP libraries, the file Makefile  has to be edited as per the guidelines given in the Makefile. A routine from ESSL library can be used by linking the program with -lessl option and multi-threaded version of routine can be used by linking with ESSL-SMP library which is achieved by keeping -lesslsmp instead of -lessl. 

    Appropriate lines consisting of "F77=","FFLAGS=","LINKFLAGS="," COBJECTS="," FOBJECTS=","BLASLIBS=" have to be uncommented based on the guidelines given in the Makefile. One of the lines consisting of "COBJECTS=" has to be uncommented for compilation of a C program and one of the lines consisting of "FOBJECTS=" has to be uncommented for compilation of a Fortran program.

    After editing the Makefile, one can type on command-line 

                    make runc

    for compilation of a C program and

                     make runf

    for compilation of a Fortran program.

    This creates an executable runc or runf for C and Fortran programs respectively. For the Hands-On Session on IBM AIX cluster, the application user can use the Makefile.

    How to execute: 

    After the creation of an executable runc or runf, execution of the program can be done by issuing a command

                    ./runc or ./runf

    However, if the program is linked with ESSL-SMP library routines, the program will execute using multiple threads. The Makefile and the procedure used in the Hands-on session for linking with ESSL-SMP routines is intended to create a multi-threaded environment using OpenMP threads. After editing the Makefile using the guidelines in the Makefile and after compilation using ESSL-SMP libraries, runc or runf are created. The number of threads is set using the environment variable OMP_NUM_THREADS prior to execution

                    export OMP_NUM_THREADS = <number of threads >

    For example, to execute runc or runf using 4 threads, the number of threads have to be set prior to execution using

                    export OMP_NUM_THREADS = 4

    After setting the number of threads, the executable runc or runf can be executed.

    Centre for Development of Advanced Computing