hyPACK-2013 : Multi-Core Processors : Performance Using Math Kernel Libraries |
Example programs using different APIs. Compilation and execution
of Pthread programs, programs numerical and non-numerical computations
are discussed using
different thread APIs to understand Performance issues on mutli-core processors.
|
Example 1.1
|
Write a Sequential program for efficient implementation to evaluate the polynomial
by Horner's rule.
|
Example 1.2
|
Write a Sequential Program for Matrix-Vector Multiplication in which the
matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the
performance.
|
Example 1.3
|
Write a Sequential Program for Matrix-Vector Multiplication in which the
matrix entries should be accessed in column-wise fashion. Use the best Compiler flags and demonstrate the
performance.
|
Example 1.4
|
Write a efficient Sequential program for efficient implementation of matrix-matrix multiplication,
implementing dot product in inner loop to get better performance. Use the best Compiler flags and demonstrate
their Performance.
|
Example 1.5
|
Write a efficient Sequential Program for implementation of
matrix-matrix multiplication, implementing daxpy in inner loop to get better performance.
Use the best Compiler flags and demonstrate the performance.
|
Example 1.6
|
Write a Sequential program for matrix-matrix multiplication program, implementing dot product
in inner loop, using BLAS-I, II, III libraries. Use the best Compiler flags and demonstrate the performance.
|
Example 1.7
|
Write a Sequential Program for efficient implementation of matrix-matrix multiplication program,
implementing " dot product" in inner loop using system provided Mathematical libraries to extract performance.
Use the best Compiler flags and demonstrate the performance.
|
Example 1.8
|
Write a sequential program to solve the matrix system of linear equations by Direct Method (Gauss Method) in which A is symmetric positive
definite matrix. Use Compiler optimizations and demonstrate the performance.
|
Example 1.9
|
Write a sequential program to solve the matrix system of linear equations by Direct Method (Gauss Method) in which A is symmetric positive
definite matrix using system provided Mathematical libraries. Use Compiler optimizations and demonstrate the
performance.
|
Example 1.10
|
Write a sequential program to solve the matrix system of linear equations by Iterative Method (Jacobi Method) in which A is symmetric positive
definite matrix using system provided Mathematical libraries. Use Compiler optimizations and demonstrate the
performance.
|
Description of Programs with/without using Math Kernel Library
|
Example 1.1 :
Write a sequential program and estimate computational time for evalaution of function
expressed in terms of polynomial of degree 'p' by using direct method and Horner's rule.
(Download source code :
mathlib-core-horner-rule.f )
|
Objective
Write a sequential program and estimate computational time for evaluation of function
expressed in terms of polynomial of degree 'p' by using direct method and Horner's rule.
Description
This program reads a polynomial by reading its degree and then
prepares the polynomial with some coefficients and reads the value of variable in the
polynomial at which
the value of the polynomial is calculated by using normal computation of a
polynomial and then by using Horner's Rule and prints the time taken in each
of the cases.
Horner's rule states that a polynomial
A(x) = a0 + a1*(x power 0) +
a2*(x power 2) + a3*(x power 3) + ...
may be written as
A(x) = a0 + x(a1 + x(a2 + x(a3 + ...))).
A polynomial may be evaluated at a point x', that is A(x') computed, in
O(n) time using Horner's rule. That is, repeated multiplications and additions,
rather than the naive methods of raising x to powers, multiplying by the
coefficient, and accumulating which results in O(n raised to power degree)
time for the computation.
Input
Degree of the polynomial
Output
Time taken in seconds for computation of polynomial
using normal method and by Horner's Rule.
|
Example 1.2 :
Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are
accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance.
(Download source code :
mathlib-core-mat-vect-mult-rowwise.f )
|
Objective
Write a Sequential Program for Matrix-Vector Multiplication in which the
matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and
demonstrate the performance.
Description
The elements of the
Matrix are accessed in a Row wise fashion and the time taken and
the performance for the Matrix-Vector multiplication is calculated. In
FORTRAN, the arrays are stored in
memory in Column Major order. So, as the matrix size
increases, accessing the Matrix in Row wise fashion results in frequent
cache misses as the array element referred is to be loaded into cache
if it is not present in cache. In C, the arrays are stored in memory
in Row wise order. So, the arrays should be accesses in Row
wise order to reduce cache overheads.
Input
Number of Rows and Columns of the Matrix and the
number of Rows in the vector.
Output
The time taken in seconds for
the multiplication in Row wise fashion and performance in MFLOPS.
|
Example 1.3 :
Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are
accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance.
(Download source code :
mathlib-core-mat-vect-mult-columnwise.f )
|
Objective
Write a Sequential Program for Matrix-Vector Multiplication in which the
matrix entries are accessed in column-wise fashion. Use the best Compiler flags and
demonstrate the performance.
Description
The elements of the
Matrix are accessed in a Row wise fashion and the time taken and
the performance for the Matrix-Vector multiplication is calculated. In
FORTRAN, the arrays are stored in
memory in Column Major order. So, as the matrix size
increases, accessing the Matrix in Row wise fashion results in frequent
cache misses as the array element referred is to be loaded into cache
if it is not present in cache. In C, the arrays are stored in memory
in Row wise order. So, the arrays should be accesses in Row
wise order to reduce cache overheads.Accessing the Matrix in Column wise
fashion results in better performance as the elements accessed are already
present in the cache in fortran code.
Input
Number of Rows and Columns of the Matrix and the
number of Rows in the vector.
Output
The time taken in seconds for
the multiplication in column-wise fashion and performance in MFLOPS.
|
Example 1.4 :
Write a efficient Sequential program for efficient implementation of matrix into matrix multiplication,
implementing dotproduct in inner loop to get better performance. Use the best Compiler flags and
demonstrate their Performance.
(Download source code :
mathlib-core-mat-mat-mult-dotproduct.f )
|
Objective
Write a efficient Sequential program for efficient implementation of
matrix into matrix multiplication, implementing dotproduct in inner loop to get better performance.
Use the best Compiler flags and demonstrate the
Performance.
Description
The aim is to compute two real square matrices with dot product inner loop and use compiler
optimizations to extract the performance. Assume that the arrays dimension is of 2 the
power i where i = 4, 8. This is a simple Matrix-Matrix Multiplication with dot-product
inner-loop. The elements of the
Matrix are accessed either in Row wise fashion or column-wise fashion.
In FORTRAN, the arrays are stored in memory in Column Major order where as
C-language the matrix arrays are stored in memory row major order .
We try to achieve the maximum performance from the program using compiler optimizations.
For compiler optimizations, refer Vendour supplied tuning & Performance Guide. The time
taken for computation of Matrix into Matrix Multiplication is reported.
Input
Number of Rows and Columns of the two real square matrices
Output
The time taken in seconds for computation of Matrix into Matrix Multiplication
and performance in MFLOPS.
|
Example 1.5 :
Write a efficient Sequential program for efficient implementation of matrix into matrix multiplication,
implementing daxpy in inner loop to get better performance. Use the best Compiler flags and
demonstrate ther Performance
(Download source code :
mathlib-core-mat-mat-mult-daxpy.f )
|
Objective
Write a efficient Sequential program for efficient implementation of
matrix into matrix multiplication, implementing daxpy in inner loop to get better performance.
in inner loop to get better performance.
Use the best Compiler flags and demonstrate the
Performance.
Description
The aim is to compute two real square matrices with dot product inner loop and use compiler
optimizations to extract the performance. Assume that the arrays dimension is of 2 the
power i where i = 4, 8. This is a simple Matrix-Matrix Multiplication with dot-product
inner-loop implementing daxpy in inner loop. The elements of the
Matrix are accessed either in Row wise fashion or column-wise fashion.
In FORTRAN, the arrays are stored in memory in Column Major order where as
C-language the matrix arrays are stored in memory row major order .
We try to achieve the maximum performance from the program using compiler optimizations.
For compiler optimizations, refer Vendour supplied tuning & Performance Guide. The time
taken for computation of Matrix into Matrix Multiplication is reported. The daxpy
library call can be used.
Input
Number of Rows and Columns of the two real square matrices
Output
The time taken in seconds for computation of Matrix into Matrix Multiplication
and performance in MFLOPS.
|
Example 1.6 :
Write a efficient Sequential program for efficient implementation of matrix into matrix multiplication,
implementing dot product in inner loop, using BLAS-I, II, III
libraries. Use the best Compiler flags
and demonstrate their Performance.
(Download source code :
mathlib-core-mat-mat-mult-dotproduct-blas.f )
|
Objective
Write a efficient Sequential program for efficient implementation of
matrix into matrix multiplication, implementing implementing BLAS-I, II, III in inner loop to get
better performance. in inner loop to get better performance.Use the best Compiler flags and demonstrate the
Performance.
Description
The aim is to compute two real square matrices with dot product inner loop and use compiler
optimizations to extract the performance. Assume that the arrays dimension is of 2 the
power i where i = 4, 8. This is a simple Matrix-Matrix Multiplication with dot-product
inner-loop implementing BLAS-I, II, III in inner loop.
The BLAS library ddot can be obtained from
www.netlib.org
The elements of the Matrix are accessed either in Row wise fashion or column-wise fashion.
In FORTRAN, the arrays are stored in memory in Column Major order where as
C-language the matrix arrays are stored in memory row major order .
We try to achieve the maximum performance from the program using compiler optimizations.
For compiler optimizations, refer Vendor supplied tuning & Performance Guide. The time
taken for computation of Matrix into Matrix Multiplication is reported. The BLAS-I,II, III
libraries can be used.
Input
Number of Rows and Columns of the two real square matrices
Output
The time taken in seconds for computation of Matrix into Matrix Multiplication
using BLAS-I, II, III library routines
and performance in MFLOPS.
|
Example 1.7 :
Write a efficient Sequential program for efficient implementation of matrix into matrix multiplication,
implementing dot product in inner loop, using BLAS-I, II, III
libraries. Use the best Compiler flags
and demonstrate their Performance.
(Download source code :
mathlib-core-mat-mat-mult-dotproduct-intel-mkl.f )
|
Objective
Write a efficeint Sequential program for efficient implementation of
matrix into matrix multiplication, implementing BLAS-I, II, III in
and using system Provided Mathematical
libraries to get
better performance.
in inner loop to get better performance.
Use the best Compiler flags and demonstrate the
Performance.
Description
The aim is to compute two real square matrices with dot product inner loop and use compiler
optimizations to extract the performance. Assume that the arrays dimension is of 2 the
power i where i = 4, 8. This is a simple Matrix-Matrix Multiplication with dot-product
inner-loop implementing system tuned BLAS-I, II, III mathematical libraries
in inner loop.
The elements of the
Matrix are accessed either in Row wise fashion or column-wise fashion.
In FORTRAN, the arrays are stored in memory in Column Major order where as
C-language the matrix arrays are stored in memory row major order .
We try to achieve the maximum performance from the program using compiler optimizations.
For compiler optimizations, refer Vendor supplied tuning & Performance Guide. The time
taken for computation of Matrix into Matrix Multiplication is reported.
The system tuned BLAS-I,II, III libraries can be used.
Input
Number of Rows and Columns of the two real square matrices
Output
The time taken in seconds for computation of Matrix into Matrix Multiplication
using BLAS-I, II, III library routines
and performance in MFLOPS.
|
Example 1.8 :
Write a efficient Sequential program for efficient implementation of solution of
system
of linear equations Ax= b where A is symmetric positive definite
Matrix and b is real vector.
Obtain better performance using the best Compiler flags
(Download source code :
mathlib-core-linear-system-gauss-solver.f )
|
Objective
Write a efficient Sequential program for efficient implementation of solution
of matrix system of linear equations Ax = b where A is symmetric positive
definite Matrix and b is real vector.
Obtain better performance using the best Compiler flags.
Description
Given a linear system of equations of form AX = b where A is a real square positive definite
symmetric matrix of order n and b is a real vector of order n . The program finds the
inverse of the real squre matrix A using Gauss Jordan method and the inverse is multiplied with
vector b to get the
solutions for matrix X . i.e. the operation can be represented as X = inverse(A) * B.
The time taken for this algorithm to be implemented and performance is printed in seconds.
and MFLOPS respectively. For compiler optimizations, refer Vendor supplied Tuning and Performance
Guide.
Input
size of real square matrix and the real vector
Output
The time taken in seconds for computation of B> AX=b and performance in MFLOPS.
|
Example 1.9 :
Write a efficient Sequential program for efficient implementation of solution of matrix
system of linear equations Ax= b where A is symmetric positive definite
Matrix and b is real vector using system provided mathematical libraries.
Obtain better performance using the best Compiler
flags
(Download source code :
mathlib-core-linear-system-gauss-intel-mkl.f )
|
Objective
Write a efficeint Sequential program for efficient implementation of solution
of matrix system of linear equations Ax = b where A is symmetric positive
definite Matrix and b is real vector using system provided mathematical libraries.
Obtain better performance using the best Compiler flags.
Description
Given a linear system of equations of form AX = b where A is a real square positive definite
symmetric matrix of order n and b is a real vector of order n . The program finds the
inverse of the real squre matrix A using Gauss Jordan method and the inverse is multiplied with
vector b to get the
solutions for matrix X . i.e. the operation can be represented as X = inverse(A) * B.
The System Provided mathematical libraries for solution of AX = b
or Inverse of the A can be used in numerical computations.
The time taken for this algorithm to be implemented and performance is printed in seconds.
and MFLOPS respectively.
For compiler optimizations, refer Vendor supplied Tuning and Performance
Guide.
Input
size of real square matrix and the real vector
Output
The time taken in seconds for computation of B> AX=b and performance in MFLOPS.
|
Example 1.10 :
Write a efficient Sequential program for efficient implementation of solution of matrix system
of linear equations Ax= b where A is symmetric positive definite
Matrix and b is real vector by
Iterative Method using Jacobi Method and Use System Provided mathematical libraries.
Obtain better
performance using the best Compiler flags (Assignment)
|
Objective
Write a efficient Sequential program for efficient implementation of solution
of matrix system of linear equations Ax = b where A is symmetric positive
definite Matrix and b is real vector by Jacobi Method and Use system provided
mathematical libraries.
Obtain better performance using the best Compiler flags.
Description
Given a linear system of equations of form AX = b where A is a real square positive definite
symmetric matrix of order n and b is a real vector of order n .
With Initialsolution vector {x0} vector, the program computes the
next solution vector {x1} by Jacobi Method iteratively. The iterative
method is stopped once the convergence criteria is satisfied, resulting the final solution
vector {x}.
The System Provided mathematical libraries for solution of AX = b by Iterative
method can be used in numerical computations.
The time taken for this algorithm to be implemented and performance is printed in seconds.
and MFLOPS respectively.
For compiler optimizations, refer Vendor supplied Tuning and Performance
Guide.
Input
size of real square matrix and the real vector
Output
The time taken in seconds for computation of AX=b and performance in MFLOPS.
|
|
|
| |
|