hyPACK-2013 : Multi-Core Processors : Performance Using Math Kernel Libraries

Example programs using different APIs are described in this section. The compilation and execution of sequential programs for numerical and non-numerical computations, with and without system-provided mathematical libraries, are discussed to understand performance issues on multi-core processors.

Example 1.1

Write a Sequential program for efficient evaluation of a polynomial by Horner's rule.

Example 1.2

Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance.

Example 1.3

Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance.

Example 1.4

Write an efficient Sequential program for matrix-matrix multiplication, implementing a dot product in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

Example 1.5

Write an efficient Sequential Program for matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

Example 1.6

Write a Sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using BLAS-I, II, III libraries. Use the best Compiler flags and demonstrate the performance.

Example 1.7

Write a Sequential Program for efficient implementation of matrix-matrix multiplication, implementing the dot product in the inner loop using system-provided mathematical libraries to extract performance. Use the best Compiler flags and demonstrate the performance.

Example 1.8

Write a sequential program to solve the system of linear equations Ax = b by a Direct Method (Gauss Method), in which A is a symmetric positive definite matrix. Use Compiler optimizations and demonstrate the performance.

Example 1.9

Write a sequential program to solve the system of linear equations Ax = b by a Direct Method (Gauss Method), in which A is a symmetric positive definite matrix, using system-provided mathematical libraries. Use Compiler optimizations and demonstrate the performance.

Example 1.10

Write a sequential program to solve the system of linear equations Ax = b by an Iterative Method (Jacobi Method), in which A is a symmetric positive definite matrix, using system-provided mathematical libraries. Use Compiler optimizations and demonstrate the performance.

Description of Programs with/without using Math Kernel Library

Example 1.1 : Write a sequential program and estimate the computational time for evaluation of a function expressed as a polynomial of degree 'p' by the direct method and by Horner's rule. (Download source code : mathlib-core-horner-rule.f )

  • Objective
  • Write a sequential program and estimate the computational time for evaluation of a function expressed as a polynomial of degree 'p' by the direct method and by Horner's rule.

  • Description
  • This program reads the degree of a polynomial, prepares the polynomial with some coefficients, and reads the value of the variable at which the polynomial is to be evaluated. The polynomial is then evaluated by the normal (direct) method and by Horner's rule, and the time taken in each case is printed. Horner's rule states that a polynomial

    A(x) = a0 + a1*x + a2*(x power 2) + a3*(x power 3) + ...

    may be written as

    A(x) = a0 + x(a1 + x(a2 + x(a3 + ...))).

    A polynomial of degree n may be evaluated at a point x', that is A(x') computed, in O(n) time using Horner's rule, using only repeated multiplications and additions. The naive method of raising x to each power, multiplying by the coefficient, and accumulating requires O(n^2) operations when each power is formed by repeated multiplication. A minimal sketch of both schemes follows this list.

  • Input
  • Degree of the polynomial

  • Output
  • Time taken in seconds for computation of polynomial using normal method and by Horner's Rule.
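
    A minimal sketch of both evaluation schemes is given below in free-form Fortran (the downloadable code is fixed-form; the program and variable names used here are illustrative only):

        ! Sketch only: evaluate a(0) + a(1)*x + ... + a(p)*x**p by the
        ! direct method and by Horner's rule, and time both.
        program horner_demo
          implicit none
          integer :: p, i, k
          double precision, allocatable :: a(:)
          double precision :: x, val_direct, val_horner, xpow, t0, t1

          print *, 'Enter degree p and point x:'
          read *, p, x
          allocate(a(0:p))
          a = 1.0d0                        ! sample coefficients

          call cpu_time(t0)
          val_direct = 0.0d0
          do i = 0, p                      ! direct method: O(p**2) multiplications
             xpow = 1.0d0
             do k = 1, i
                xpow = xpow * x
             end do
             val_direct = val_direct + a(i) * xpow
          end do
          call cpu_time(t1)
          print *, 'Direct method :', val_direct, '  time =', t1 - t0, 's'

          call cpu_time(t0)
          val_horner = a(p)
          do i = p - 1, 0, -1              ! Horner: p multiplications, p additions
             val_horner = val_horner * x + a(i)
          end do
          call cpu_time(t1)
          print *, 'Horner''s rule :', val_horner, '  time =', t1 - t0, 's'
        end program horner_demo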

    Example 1.2 : Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance. (Download source code : mathlib-core-mat-vect-mult-rowwise.f )

  • Objective
  • Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The elements of the matrix are accessed in a row-wise fashion, and the time taken and the performance for the Matrix-Vector multiplication are calculated. In FORTRAN, arrays are stored in memory in Column Major order. So, as the matrix size increases, accessing the matrix in row-wise fashion results in frequent cache misses, because each referenced array element has to be loaded into the cache if it is not already present. In C, arrays are stored in memory in row-major order, so they should be accessed in row-wise order to reduce cache overheads. The access pattern is sketched after this list.

  • Input
  • Number of Rows and Columns of the Matrix and the number of Rows in the vector.

  • Output
  • The time taken in seconds for the multiplication in Row wise fashion and performance in MFLOPS.
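
    The core loop structure of the row-wise version is sketched below in free-form Fortran (the array names a, x, y and size n are illustrative only). The inner loop runs along a row, so in Fortran's column-major storage a(i,j) is touched with stride n:

        ! Sketch only: row-wise matrix-vector product y = A*x
        do i = 1, n
           y(i) = 0.0d0
           do j = 1, n                  ! inner loop strides across row i
              y(i) = y(i) + a(i,j) * x(j)
           end do
        end do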

    Example 1.3 : Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance. (Download source code : mathlib-core-mat-vect-mult-columnwise.f )

  • Objective
  • Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The elements of the matrix are accessed in a column-wise fashion, and the time taken and the performance for the Matrix-Vector multiplication are calculated. In FORTRAN, arrays are stored in memory in Column Major order, so accessing the matrix in column-wise fashion results in better performance because successively accessed elements are already present in the cache. In C, arrays are stored in memory in row-major order, so they should be accessed in row-wise order to reduce cache overheads. The access pattern is sketched after this list.

  • Input
  • Number of Rows and Columns of the Matrix and the number of Rows in the vector.

  • Output
  • The time taken in seconds for the multiplication in column-wise fashion and performance in MFLOPS.
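
    The column-wise version interchanges the loops so that the inner loop walks down a column with unit stride, which is the cache-friendly order in Fortran (a sketch with illustrative names):

        ! Sketch only: column-wise matrix-vector product y = A*x
        y(1:n) = 0.0d0
        do j = 1, n
           do i = 1, n                  ! inner loop walks down column j (unit stride)
              y(i) = y(i) + a(i,j) * x(j)
           end do
        end do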

    Example 1.4 : Write an efficient sequential program for matrix-matrix multiplication, implementing a dot product in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-dotproduct.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing a dot product in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a dot-product inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with a dot-product inner loop. The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The loop ordering is sketched after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication and the performance in MFLOPS.
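
    A sketch of the dot-product (ijk) form in free-form Fortran follows (names are illustrative; the MFLOPS figure is normally computed as 2*n**3 divided by the measured time, in units of 10**6 operations per second):

        ! Sketch only: C = A*B with a scalar dot product in the innermost loop
        do i = 1, n
           do j = 1, n
              temp = 0.0d0
              do k = 1, n
                 temp = temp + a(i,k) * b(k,j)
              end do
              c(i,j) = temp
           end do
        end do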

    Example 1.5 : Write an efficient sequential program for matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-daxpy.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a daxpy inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with a daxpy operation (y = a*x + y) in the inner loop. The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The daxpy library call can be used, as in the sketch after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication and the performance in MFLOPS.
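
    A sketch of the daxpy (jki) form follows (illustrative names; the commented line shows the equivalent BLAS-1 daxpy call, assuming the arrays are declared with leading dimension n):

        ! Sketch only: C = A*B with a daxpy-style update in the innermost loop
        c(1:n,1:n) = 0.0d0
        do j = 1, n
           do k = 1, n
              ! equivalently: call daxpy(n, b(k,j), a(1,k), 1, c(1,j), 1)
              do i = 1, n                ! unit-stride update of column j of C
                 c(i,j) = c(i,j) + a(i,k) * b(k,j)
              end do
           end do
        end do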

    Example 1.6 : Write an efficient sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using BLAS-I, II, III libraries. Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-dotproduct-blas.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using BLAS-I, II, III library routines to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a dot-product inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with the dot-product inner loop implemented using BLAS-I, II, III routines.

    The BLAS routine ddot can be obtained from www.netlib.org

    The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The BLAS-I, II, III libraries can be used, as in the sketch after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication using BLAS-I, II, III library routines and the performance in MFLOPS.
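
    A sketch of the inner loop replaced by the BLAS-1 routine ddot follows (illustrative names; it is assumed that a, b and c are declared with leading dimension n, so row i of A is traversed with stride n):

        ! Sketch only: C = A*B using the BLAS-1 dot product ddot
        double precision, external :: ddot
        do i = 1, n
           do j = 1, n
              ! dot product of row i of A (stride n) with column j of B (stride 1)
              c(i,j) = ddot(n, a(i,1), n, b(1,j), 1)
           end do
        end do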

    Example 1.7 : Write an efficient sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using system-provided mathematical libraries (such as Intel MKL). Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-dotproduct-intel-mkl.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing the BLAS-I, II, III routines from system-provided mathematical libraries in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a dot-product inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with the inner loop implemented using the system-tuned BLAS-I, II, III mathematical libraries. The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The system-tuned BLAS-I, II, III libraries can be used, as in the sketch after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication using BLAS-I, II, III library routines and the performance in MFLOPS.
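
    With a system-tuned library such as Intel MKL, the whole multiplication is usually handed to the BLAS-3 routine dgemm; a sketch of the call (illustrative names, arrays declared with leading dimension n, and the program linked against the MKL/BLAS library) is:

        ! Sketch only: C := 1.0*A*B + 0.0*C using the BLAS-3 routine dgemm
        call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)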

    Example 1.8 : Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector. Obtain better performance using the best Compiler flags.
    (Download source code : mathlib-core-linear-system-gauss-solver.f )

  • Objective
  • Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector. Obtain better performance using the best Compiler flags.

  • Description
  • Given a linear system of equations of the form Ax = b, where A is a real square symmetric positive definite matrix of order n and b is a real vector of order n, the program finds the inverse of the real square matrix A using the Gauss-Jordan method and multiplies the inverse with the vector b to obtain the solution vector x, i.e. the operation can be represented as x = inverse(A) * b. The time taken and the performance are printed in seconds and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The elimination step is sketched after this list.

  • Input
  • Size of the real square matrix and the real vector.

  • Output
  • The time taken in seconds for computation of Ax = b and performance in MFLOPS.
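
    The downloadable code forms the inverse explicitly; the sketch below applies the same Gauss-Jordan elimination directly to the augmented system [A | b], which yields the same solution (illustrative names; no pivoting is shown because the pivots of a symmetric positive definite matrix stay positive):

        ! Sketch only: Gauss-Jordan elimination on [A | b]; on exit b holds x
        do k = 1, n
           piv = 1.0d0 / a(k,k)
           a(k,1:n) = a(k,1:n) * piv               ! scale pivot row
           b(k)     = b(k) * piv
           do i = 1, n
              if (i /= k) then
                 factor   = a(i,k)
                 a(i,1:n) = a(i,1:n) - factor * a(k,1:n)   ! eliminate column k
                 b(i)     = b(i) - factor * b(k)
              end if
           end do
        end do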

    Example 1.9 : Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, using system-provided mathematical libraries. Obtain better performance using the best Compiler flags. (Download source code : mathlib-core-linear-system-gauss-intel-mkl.f )

  • Objective
  • Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, using system-provided mathematical libraries. Obtain better performance using the best Compiler flags.

  • Description
  • Given a linear system of equations of the form Ax = b, where A is a real square symmetric positive definite matrix of order n and b is a real vector of order n, the program finds the inverse of the real square matrix A using the Gauss-Jordan method and multiplies the inverse with the vector b to obtain the solution vector x, i.e. the operation can be represented as x = inverse(A) * b.

    The system-provided mathematical libraries for the solution of Ax = b, or for the inverse of A, can be used in the numerical computations, as in the sketch after this list. The time taken and the performance are printed in seconds and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide.

  • Input
  • Size of the real square matrix and the real vector.

  • Output
  • The time taken in seconds for computation of Ax = b and performance in MFLOPS.
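
    With a system-provided library such as Intel MKL, the Gaussian-elimination solve can be handed to the LAPACK routine dgesv (LU factorization with partial pivoting; dposv is the alternative that exploits the symmetric positive definite structure). A sketch of the call with illustrative names follows:

        ! Sketch only: solve A*x = b with LAPACK dgesv; on exit b holds x
        integer :: info
        integer :: ipiv(n)
        call dgesv(n, 1, a, n, ipiv, b, n, info)
        if (info /= 0) print *, 'dgesv failed, info =', info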

    Example 1.10 : Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, by an Iterative Method (Jacobi Method), using system-provided mathematical libraries. Obtain better performance using the best Compiler flags. (Assignment)

  • Objective
  • Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, by the Jacobi Method, using system-provided mathematical libraries. Obtain better performance using the best Compiler flags.

  • Description
  • Given a linear system of equations of the form Ax = b, where A is a real square symmetric positive definite matrix of order n and b is a real vector of order n. Starting with an initial solution vector {x0}, the program computes the next solution vector {x1} by the Jacobi method, and the process is repeated iteratively. The iteration is stopped once the convergence criterion is satisfied, giving the final solution vector {x}.

    The system-provided mathematical libraries for the iterative solution of Ax = b can be used in the numerical computations; one Jacobi sweep is sketched after this list. The time taken and the performance are printed in seconds and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide.

  • Input
  • Size of the real square matrix and the real vector.

  • Output
  • The time taken in seconds for computation of AX=b and performance in MFLOPS.
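
    One Jacobi sweep is sketched below in free-form Fortran (illustrative names; x_old holds the previous iterate, x_new the updated one, and tol and max_iter are user-chosen convergence parameters):

        ! Sketch only: Jacobi iteration for A*x = b
        do iter = 1, max_iter
           do i = 1, n
              s = 0.0d0
              do j = 1, n
                 if (j /= i) s = s + a(i,j) * x_old(j)
              end do
              x_new(i) = (b(i) - s) / a(i,i)           ! update using diagonal entry
           end do
           if (maxval(abs(x_new - x_old)) < tol) exit  ! convergence test
           x_old = x_new
        end do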

    Centre for Development of Advanced Computing