hyPACK-2013 : Multi-Core Processors : Performance Using Math Kernel Libraries

Example programs using different APIs are described in this section. The compilation and execution of sequential programs for numerical and non-numerical computations, with and without system-provided mathematical libraries, are discussed to understand performance issues on multi-core processors.

Example 1.1

Write a Sequential program for efficient evaluation of a polynomial by Horner's rule.

Example 1.2

Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance.

Example 1.3

Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance.

Example 1.4

Write an efficient Sequential program for matrix-matrix multiplication, implementing a dot product in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

Example 1.5

Write an efficient Sequential Program for matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

Example 1.6

Write a Sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using BLAS-I, II, III libraries. Use the best Compiler flags and demonstrate the performance.

Example 1.7

Write a Sequential Program for efficient implementation of matrix-matrix multiplication, implementing the dot product in the inner loop using system-provided mathematical libraries to extract performance. Use the best Compiler flags and demonstrate the performance.

Example 1.8

Write a sequential program to solve the system of linear equations Ax = b by a Direct Method (Gauss Method), in which A is a symmetric positive definite matrix. Use Compiler optimizations and demonstrate the performance.

Example 1.9

Write a sequential program to solve the system of linear equations Ax = b by a Direct Method (Gauss Method), in which A is a symmetric positive definite matrix, using system-provided mathematical libraries. Use Compiler optimizations and demonstrate the performance.

Example 1.10

Write a sequential program to solve the system of linear equations Ax = b by an Iterative Method (Jacobi Method), in which A is a symmetric positive definite matrix, using system-provided mathematical libraries. Use Compiler optimizations and demonstrate the performance.

Description of Programs with/without using Math Kernel Library

Example 1.1 : Write a sequential program and estimate the computational time for evaluation of a function expressed as a polynomial of degree 'p' by the direct method and by Horner's rule. (Download source code : mathlib-core-horner-rule.f )

  • Objective
  • Write a sequential program and estimate the computational time for evaluation of a function expressed as a polynomial of degree 'p' by the direct method and by Horner's rule.

  • Description
  • This program reads the degree of a polynomial, prepares the polynomial with some coefficients, and reads the value of the variable at which the polynomial is to be evaluated. The polynomial is then evaluated by the normal (direct) method and by Horner's rule, and the time taken in each case is printed. Horner's rule states that a polynomial

    A(x) = a0 + a1*x + a2*(x power 2) + a3*(x power 3) + ...

    may be written as

    A(x) = a0 + x(a1 + x(a2 + x(a3 + ...))).

    A polynomial of degree n may be evaluated at a point x', that is A(x') computed, in O(n) time using Horner's rule, using only repeated multiplications and additions. The naive method of raising x to each power, multiplying by the coefficient, and accumulating requires O(n^2) operations when each power is formed by repeated multiplication. A minimal sketch of both schemes follows this list.

  • Input
  • Degree of the polynomial

  • Output
  • Time taken in seconds for computation of polynomial using normal method and by Horner's Rule.
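
    A minimal sketch of both evaluation schemes is given below in free-form Fortran (the downloadable code is fixed-form; the program and variable names used here are illustrative only):

        ! Sketch only: evaluate a(0) + a(1)*x + ... + a(p)*x**p by the
        ! direct method and by Horner's rule, and time both.
        program horner_demo
          implicit none
          integer :: p, i, k
          double precision, allocatable :: a(:)
          double precision :: x, val_direct, val_horner, xpow, t0, t1

          print *, 'Enter degree p and point x:'
          read *, p, x
          allocate(a(0:p))
          a = 1.0d0                        ! sample coefficients

          call cpu_time(t0)
          val_direct = 0.0d0
          do i = 0, p                      ! direct method: O(p**2) multiplications
             xpow = 1.0d0
             do k = 1, i
                xpow = xpow * x
             end do
             val_direct = val_direct + a(i) * xpow
          end do
          call cpu_time(t1)
          print *, 'Direct method :', val_direct, '  time =', t1 - t0, 's'

          call cpu_time(t0)
          val_horner = a(p)
          do i = p - 1, 0, -1              ! Horner: p multiplications, p additions
             val_horner = val_horner * x + a(i)
          end do
          call cpu_time(t1)
          print *, 'Horner''s rule :', val_horner, '  time =', t1 - t0, 's'
        end program horner_demo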

    Example 1.2 : Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance. (Download source code : mathlib-core-mat-vect-mult-rowwise.f )

  • Objective
  • Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The elements of the matrix are accessed in a row-wise fashion, and the time taken and the performance for the Matrix-Vector multiplication are calculated. In FORTRAN, arrays are stored in memory in Column Major order. So, as the matrix size increases, accessing the matrix in row-wise fashion results in frequent cache misses, because each referenced array element has to be loaded into the cache if it is not already present. In C, arrays are stored in memory in row-major order, so they should be accessed in row-wise order to reduce cache overheads. The access pattern is sketched after this list.

  • Input
  • Number of Rows and Columns of the Matrix and the number of Rows in the vector.

  • Output
  • The time taken in seconds for the multiplication in Row wise fashion and performance in MFLOPS.
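
    The core loop structure of the row-wise version is sketched below in free-form Fortran (the array names a, x, y and size n are illustrative only). The inner loop runs along a row, so in Fortran's column-major storage a(i,j) is touched with stride n:

        ! Sketch only: row-wise matrix-vector product y = A*x
        do i = 1, n
           y(i) = 0.0d0
           do j = 1, n                  ! inner loop strides across row i
              y(i) = y(i) + a(i,j) * x(j)
           end do
        end do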

    Example 1.3 : Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance. (Download source code : mathlib-core-mat-vect-mult-columnwise.f )

  • Objective
  • Write a Sequential Program for Matrix-Vector Multiplication in which the matrix entries are accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The elements of the matrix are accessed in a column-wise fashion, and the time taken and the performance for the Matrix-Vector multiplication are calculated. In FORTRAN, arrays are stored in memory in Column Major order, so accessing the matrix in column-wise fashion results in better performance because successively accessed elements are already present in the cache. In C, arrays are stored in memory in row-major order, so they should be accessed in row-wise order to reduce cache overheads. The access pattern is sketched after this list.

  • Input
  • Number of Rows and Columns of the Matrix and the number of Rows in the vector.

  • Output
  • The time taken in seconds for the multiplication in column-wise fashion and performance in MFLOPS.
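
    The column-wise version interchanges the loops so that the inner loop walks down a column with unit stride, which is the cache-friendly order in Fortran (a sketch with illustrative names):

        ! Sketch only: column-wise matrix-vector product y = A*x
        y(1:n) = 0.0d0
        do j = 1, n
           do i = 1, n                  ! inner loop walks down column j (unit stride)
              y(i) = y(i) + a(i,j) * x(j)
           end do
        end do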

    Example 1.4 : Write an efficient sequential program for matrix-matrix multiplication, implementing a dot product in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-dotproduct.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing a dot product in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a dot-product inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with a dot-product inner loop. The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The loop ordering is sketched after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication and the performance in MFLOPS.
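
    A sketch of the dot-product (ijk) form in free-form Fortran follows (names are illustrative; the MFLOPS figure is normally computed as 2*n**3 divided by the measured time, in units of 10**6 operations per second):

        ! Sketch only: C = A*B with a scalar dot product in the innermost loop
        do i = 1, n
           do j = 1, n
              temp = 0.0d0
              do k = 1, n
                 temp = temp + a(i,k) * b(k,j)
              end do
              c(i,j) = temp
           end do
        end do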

    Example 1.5 : Write an efficient sequential program for matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-daxpy.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a daxpy inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with a daxpy operation (y = a*x + y) in the inner loop. The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The daxpy library call can be used, as in the sketch after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication and the performance in MFLOPS.
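
    A sketch of the daxpy (jki) form follows (illustrative names; the commented line shows the equivalent BLAS-1 daxpy call, assuming the arrays are declared with leading dimension n):

        ! Sketch only: C = A*B with a daxpy-style update in the innermost loop
        c(1:n,1:n) = 0.0d0
        do j = 1, n
           do k = 1, n
              ! equivalently: call daxpy(n, b(k,j), a(1,k), 1, c(1,j), 1)
              do i = 1, n                ! unit-stride update of column j of C
                 c(i,j) = c(i,j) + a(i,k) * b(k,j)
              end do
           end do
        end do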

    Example 1.6 : Write an efficient sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using BLAS-I, II, III libraries. Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-dotproduct-blas.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using BLAS-I, II, III library routines to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a dot-product inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with the dot-product inner loop implemented using BLAS-I, II, III routines.

    The BLAS routine ddot can be obtained from www.netlib.org

    The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The BLAS-I, II, III libraries can be used, as in the sketch after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication using BLAS-I, II, III library routines and the performance in MFLOPS.
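
    A sketch of the inner loop replaced by the BLAS-1 routine ddot follows (illustrative names; it is assumed that a, b and c are declared with leading dimension n, so row i of A is traversed with stride n):

        ! Sketch only: C = A*B using the BLAS-1 dot product ddot
        double precision, external :: ddot
        do i = 1, n
           do j = 1, n
              ! dot product of row i of A (stride n) with column j of B (stride 1)
              c(i,j) = ddot(n, a(i,1), n, b(1,j), 1)
           end do
        end do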

    Example 1.7 : Write an efficient sequential program for matrix-matrix multiplication, implementing the dot product in the inner loop using system-provided mathematical libraries (such as Intel MKL). Use the best Compiler flags and demonstrate the performance.
    (Download source code : mathlib-core-mat-mat-mult-dotproduct-intel-mkl.f )

  • Objective
  • Write an efficient sequential program for matrix-matrix multiplication, implementing the BLAS-I, II, III routines from system-provided mathematical libraries in the inner loop to get better performance. Use the best Compiler flags and demonstrate the performance.

  • Description
  • The aim is to multiply two real square matrices with a dot-product inner loop and to use compiler optimizations to extract the performance. Assume that the array dimension is 2 to the power i, where i = 4, 8. This is a simple Matrix-Matrix multiplication with the inner loop implemented using the system-tuned BLAS-I, II, III mathematical libraries. The elements of the matrices are accessed either in row-wise or in column-wise fashion. In FORTRAN, arrays are stored in memory in Column Major order, whereas in C arrays are stored in row-major order. We try to achieve the maximum performance from the program using compiler optimizations. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time taken for the Matrix-Matrix multiplication is reported. The system-tuned BLAS-I, II, III libraries can be used, as in the sketch after this list.

  • Input
  • Number of Rows and Columns of the two real square matrices

  • Output
  • The time taken in seconds for the Matrix-Matrix multiplication using BLAS-I, II, III library routines and the performance in MFLOPS.
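
    With a system-tuned library such as Intel MKL, the whole multiplication is usually handed to the BLAS-3 routine dgemm; a sketch of the call (illustrative names, arrays declared with leading dimension n, and the program linked against the MKL/BLAS library) is:

        ! Sketch only: C := 1.0*A*B + 0.0*C using the BLAS-3 routine dgemm
        call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)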

    Example 1.8 : Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector. Obtain better performance using the best Compiler flags.
    (Download source code : mathlib-core-linear-system-gauss-solver.f )

  • Objective
  • Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector. Obtain better performance using the best Compiler flags.

  • Description
  • Given a linear system of equations of the form Ax = b, where A is a real square symmetric positive definite matrix of order n and b is a real vector of order n, the program finds the inverse of the real square matrix A using the Gauss-Jordan method and multiplies the inverse with the vector b to obtain the solution vector x, i.e. the operation can be represented as x = inverse(A) * b. The time taken and the performance are printed in seconds and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The elimination step is sketched after this list.

  • Input
  • Size of the real square matrix and the real vector.

  • Output
  • The time taken in seconds for computation of Ax = b and performance in MFLOPS.
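
    The downloadable code forms the inverse explicitly; the sketch below applies the same Gauss-Jordan elimination directly to the augmented system [A | b], which yields the same solution (illustrative names; no pivoting is shown because the pivots of a symmetric positive definite matrix stay positive):

        ! Sketch only: Gauss-Jordan elimination on [A | b]; on exit b holds x
        do k = 1, n
           piv = 1.0d0 / a(k,k)
           a(k,1:n) = a(k,1:n) * piv               ! scale pivot row
           b(k)     = b(k) * piv
           do i = 1, n
              if (i /= k) then
                 factor   = a(i,k)
                 a(i,1:n) = a(i,1:n) - factor * a(k,1:n)   ! eliminate column k
                 b(i)     = b(i) - factor * b(k)
              end if
           end do
        end do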

    Example 1.9 : Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, using system-provided mathematical libraries. Obtain better performance using the best Compiler flags. (Download source code : mathlib-core-linear-system-gauss-intel-mkl.f )

  • Objective
  • Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, using system-provided mathematical libraries. Obtain better performance using the best Compiler flags.

  • Description
  • Given a linear system of equations of the form Ax = b, where A is a real square symmetric positive definite matrix of order n and b is a real vector of order n, the program finds the inverse of the real square matrix A using the Gauss-Jordan method and multiplies the inverse with the vector b to obtain the solution vector x, i.e. the operation can be represented as x = inverse(A) * b.

    The system-provided mathematical libraries for the solution of Ax = b, or for the inverse of A, can be used in the numerical computations, as in the sketch after this list. The time taken and the performance are printed in seconds and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide.

  • Input
  • Size of the real square matrix and the real vector.

  • Output
  • The time taken in seconds for computation of Ax = b and performance in MFLOPS.
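
    With a system-provided library such as Intel MKL, the Gaussian-elimination solve can be handed to the LAPACK routine dgesv (LU factorization with partial pivoting; dposv is the alternative that exploits the symmetric positive definite structure). A sketch of the call with illustrative names follows:

        ! Sketch only: solve A*x = b with LAPACK dgesv; on exit b holds x
        integer :: info
        integer :: ipiv(n)
        call dgesv(n, 1, a, n, ipiv, b, n, info)
        if (info /= 0) print *, 'dgesv failed, info =', info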

    Example 1.10 : Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, by an Iterative Method (Jacobi Method), using system-provided mathematical libraries. Obtain better performance using the best Compiler flags. (Assignment)

  • Objective
  • Write an efficient sequential program for the solution of the system of linear equations Ax = b, where A is a symmetric positive definite matrix and b is a real vector, by the Jacobi Method, using system-provided mathematical libraries. Obtain better performance using the best Compiler flags.

  • Description
  • Given a linear system of equations of the form Ax = b, where A is a real square symmetric positive definite matrix of order n and b is a real vector of order n. Starting with an initial solution vector {x0}, the program computes the next solution vector {x1} by the Jacobi method, and the process is repeated iteratively. The iteration is stopped once the convergence criterion is satisfied, giving the final solution vector {x}.

    The system-provided mathematical libraries for the iterative solution of Ax = b can be used in the numerical computations; one Jacobi sweep is sketched after this list. The time taken and the performance are printed in seconds and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide.

  • Input
  • Size of the real square matrix and the real vector.

  • Output
  • The time taken in seconds for computation of AX=b and performance in MFLOPS.
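
    One Jacobi sweep is sketched below in free-form Fortran (illustrative names; x_old holds the previous iterate, x_new the updated one, and tol and max_iter are user-chosen convergence parameters):

        ! Sketch only: Jacobi iteration for A*x = b
        do iter = 1, max_iter
           do i = 1, n
              s = 0.0d0
              do j = 1, n
                 if (j /= i) s = s + a(i,j) * x_old(j)
              end do
              x_new(i) = (b(i) - s) / a(i,i)           ! update using diagonal entry
           end do
           if (maxval(abs(x_new - x_old)) < tol) exit  ! convergence test
           x_old = x_new
        end do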

    Centre for Development of Advanced Computing