|  
  
     hyPACK-2013 :   Multi-Core Processors :  Performance  Using  Math Kernel Libraries       |    
 
  
 
  
    | 
   
   
 Example programs using different APIs. Compilation and execution
   of Pthread programs and of programs for numerical and non-numerical computations
   are discussed using
    different thread APIs to understand performance issues on multi-core processors.
   
 |    
 
 
 
   
    
     |  
       
          
         
              Example 1.1     
            
         
       | 
   
      
         
          Write a sequential program that efficiently evaluates a polynomial
   by Horner's rule.
       | 
      
 
    
     |  
       
          
         
              Example 1.2    
        
         
       | 
   
      
           
        Write a  Sequential Program for Matrix-Vector Multiplication in which the  
        matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the
        performance.
       
      | 
    
   
    
     |  
       
           
         
              Example 1.3         
           
         
       | 
   
      
         
       Write a  Sequential Program for Matrix-Vector Multiplication in which the  
        matrix entries should be accessed in column-wise fashion. Use the best Compiler flags and demonstrate the
        performance.
       
      | 
      
 
    
     |  
       
            
         
              Example 1.4     
          
        
      | 
   
      
         
        Write an efficient sequential program for matrix-matrix multiplication,
         implementing the dot product in the inner loop to get better performance. Use the best compiler flags and demonstrate
         the performance.
      
      | 
    
   
    
     | 
       
              
         
              Example 1.5     
          
       
      | 
   
      
          
          Write an efficient sequential program for
         matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance.
        Use the best compiler flags and demonstrate the performance.
         
      | 
      
 
    
     | 
       
           
         
              Example 1.6     
          
      
      | 
   
      
         
         Write a sequential program for matrix-matrix multiplication, implementing the dot product
         in the inner loop using the BLAS-I, II, III libraries. Use the best compiler flags and demonstrate the performance.
        
      | 
    
   
    
     | 
       
         
         
              Example 1.7    
          
      
      | 
   
      
        
         Write an efficient sequential program for matrix-matrix multiplication,
        implementing the dot product in the inner loop using system-provided mathematical libraries to extract performance.
 Use the best compiler flags and demonstrate the performance.
       
      | 
      
 
    
     |  
       
        
        
              Example 1.8      
          
       
      | 
   
      
       
      Write a sequential program to solve a matrix system of linear equations by a direct method (Gauss method), in which A is a symmetric positive
       definite matrix. Use compiler optimizations and demonstrate the performance.
      
      | 
    
   
    
     | 
       
            
         
              Example 1.9    
          
        
      | 
   
      
       
        Write a sequential program to solve a matrix system of linear equations by a direct method (Gauss method), in which A is a symmetric positive
       definite matrix, using system-provided mathematical libraries. Use compiler optimizations and demonstrate the
       performance.
           
      | 
      
     
    
     |  
       
           
         
             Example 1.10   
          
       
      | 
      
        
         Write a sequential program to solve a matrix system of linear equations by an iterative method (Jacobi method), in which A is a symmetric positive
       definite matrix, using system-provided mathematical libraries. Use compiler optimizations and demonstrate the
       performance.
         
      | 
     
 
  
  
 
|   
 
      Description of  Programs with/without using Math Kernel Library  
 
 |  
  
   
 
 
    
 
| 
       
        Example 1.1 :     
   Write a sequential program and estimate the computational time for evaluation of a function
   expressed as a polynomial of degree 'p' by using the direct method and Horner's rule.
 
     (Download source code : 
    
           mathlib-core-horner-rule.f   ) 
  
    
     
    
 
  | 
   
 
 
   
  
  
   
    
  
 
 
 
    Objective
 
   
 Write a sequential program and estimate the computational time for evaluation of a function
    expressed as a polynomial of degree 'p' by using the direct method and Horner's rule.
   
    
     Description
    
    
   This program reads the degree of a polynomial, prepares the polynomial
          with some coefficients, and reads the value of the variable at which
          the polynomial is to be evaluated. It computes the value first by
          direct evaluation and then by Horner's rule, and prints the time taken in each
          of the cases.

          Horner's rule states that a polynomial
           A(x) = a0 + a1*x + a2*x^2 + a3*x^3 + ...

          may be written as

          A(x) = a0 + x(a1 + x(a2 + x(a3 + ...))).


          A polynomial of degree n may be evaluated at a point x', that is A(x') computed, in
          O(n) time using Horner's rule, that is, by repeated multiplications and additions.
          The naive method of raising x to each power, multiplying by the
          coefficient, and accumulating takes O(n^2) multiplications when every
          power is recomputed from scratch.
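
          The two evaluation strategies can be sketched in C as follows. This is a minimal
          illustration, not the downloadable FORTRAN source: the function names
          poly_direct and poly_horner and the coefficient layout (a[i] holds the
          coefficient of x^i) are assumptions made for this example.

```c
#include <stddef.h>

/* Direct evaluation: x^i is recomputed from scratch for every term,
   so a polynomial of degree n costs O(n^2) multiplications. */
double poly_direct(const double *a, size_t degree, double x)
{
    double sum = 0.0;
    for (size_t i = 0; i <= degree; i++) {
        double power = 1.0;
        for (size_t k = 0; k < i; k++)
            power *= x;
        sum += a[i] * power;
    }
    return sum;
}

/* Horner's rule: a0 + x(a1 + x(a2 + ...)), evaluated from the innermost
   bracket outwards with one multiply and one add per coefficient. */
double poly_horner(const double *a, size_t degree, double x)
{
    double result = a[degree];
    for (size_t i = degree; i > 0; i--)
        result = result * x + a[i - 1];
    return result;
}
```

          Timing both functions for increasing degree should show the gap between the
          quadratic and the linear operation count.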
  
    
 
 
  
  Input
            
   Degree of the polynomial    
 
  Output
    
    
    Time taken in seconds for computation of polynomial
          using normal method and by Horner's Rule.
    
    
 
 |  
    
       
 
 
    
 
| 
       
        Example 1.2 :     
   Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are
    
   accessed in Row-wise fashion. Use the best Compiler flags and demonstrate the performance.
 
     (Download source code : 
    
           mathlib-core-mat-vect-mult-rowwise.f   ) 
     
     
    
 
  | 
   
 
 
   
  
  
   
 
  
  
        
 
 
    Objective
  
    
 Write a  Sequential Program for Matrix-Vector Multiplication in which the  
         matrix entries are accessed in Row-wise fashion. Use the best Compiler flags and 
         demonstrate the performance.    
 
    
 
           
     Description
    
 
    
    The elements of the
          matrix are accessed in row-wise fashion, and the time taken and
          the performance of the matrix-vector multiplication are calculated. In
          FORTRAN, arrays are stored in
          memory in column-major order. So, as the matrix size
          increases, accessing the matrix in row-wise fashion results in frequent
          cache misses, because consecutive elements of a row are far apart in memory
          and each one must be loaded into cache if it is not already present. In C, arrays are stored in memory
          in row-major order, so the arrays should be accessed in row-wise
          order to reduce cache overheads.
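
          As a small illustration of the C case, the sketch below multiplies a matrix stored
          row-major in a flat array by a vector, walking each row in the inner loop. The
          function name matvec_rowwise and the flat-array layout are assumptions for this
          example; the downloadable program is written in FORTRAN.

```c
#include <stddef.h>

/* y = A * x for a rows-by-cols matrix stored row-major in a flat array.
   The inner loop walks along one row, so consecutive iterations read
   adjacent memory locations of A. */
void matvec_rowwise(size_t rows, size_t cols,
                    const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}
```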
  
 
    
 
 
  
  Input
        
 
    
   Number of Rows and Columns of the Matrix and the
          number of Rows in the vector. 
   
    
 
  Output
    
    
    The time taken in seconds for 
        the multiplication in Row wise fashion and performance in MFLOPS.
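
    One common way to turn the measured time into MFLOPS is sketched below. It assumes
    one multiplication and one addition per matrix entry, that is 2 * rows * cols
    floating-point operations in total; the helper name matvec_mflops is hypothetical.

```c
#include <stddef.h>

/* MFLOPS for a rows-by-cols matrix-vector product that took `seconds`,
   counting one multiply and one add per matrix entry. */
double matvec_mflops(size_t rows, size_t cols, double seconds)
{
    double flops = 2.0 * (double)rows * (double)cols;
    return flops / (seconds * 1.0e6);
}
```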
   
    
 
 
 |  
     
       
 
 
    
 
| 
       
        Example 1.3 :     
   Write a Sequential Program for Matrix-Vector Multiplication in which matrix entries are 
    accessed in column-wise fashion. Use the best Compiler flags and demonstrate the performance.
 
     (Download source code : 
    
           mathlib-core-mat-vect-mult-columnwise.f   ) 
       
    
     
    
 
  | 
   
 
 
   
  
  
   
    
  
       
 
 
    Objective
    
     
     Write a  Sequential Program for Matrix-Vector Multiplication in which the  
         matrix entries are accessed in column-wise  fashion. Use the best Compiler flags and 
         demonstrate the performance.    
     
     
     
           
     Description
    
    
     
        The elements of the
          matrix are accessed in column-wise fashion, and the time taken and
          the performance of the matrix-vector multiplication are calculated. In
          FORTRAN, arrays are stored in
          memory in column-major order. So, as the matrix size
          increases, accessing the matrix in row-wise fashion results in frequent
          cache misses, because each element referenced must be loaded into cache
          if it is not already present, whereas accessing the matrix in column-wise
          fashion gives better performance in FORTRAN code because the elements accessed
          are mostly already present in the cache. In C, arrays are stored in memory
          in row-major order, so there the arrays should be accessed in row-wise
          order to reduce cache overheads.
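
         The column-wise traversal can be mimicked in C by storing the matrix column-major
         in a flat array, as FORTRAN does. The sketch below is an illustration under that
         assumption; the function name matvec_columnwise is hypothetical.

```c
#include <stddef.h>

/* y = A * x for a rows-by-cols matrix stored column-major in a flat
   array (element (i, j) lives at a[j * rows + i]). The inner loop walks
   down one column, so it reads adjacent memory locations of A. */
void matvec_columnwise(size_t rows, size_t cols,
                       const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < rows; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < cols; j++) {
        double xj = x[j];
        for (size_t i = 0; i < rows; i++)
            y[i] += a[j * rows + i] * xj;
    }
}
```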
    
    
   
 
  
 
  Input
        
   
    
     Number of Rows and Columns of the Matrix and the
          number of Rows in the vector. 
    
    
    
  Output
    
    
    The time taken in seconds for 
        the multiplication in column-wise  fashion and performance in MFLOPS.
   
   
   
 |  
     
       
 
 
    
 
| 
       
        Example 1.4 :     
   Write an efficient sequential program for matrix-matrix multiplication,
          implementing the dot product in the inner loop to get better performance. Use the best compiler flags and
          demonstrate the performance.
 
     (Download source code : 
    
           mathlib-core-mat-mat-mult-dotproduct.f   ) 
   
    
     
    
 
  | 
   
 
 
   
  
  
   
    
  
  
       
 
 
 
    Objective
   
    
     Write an efficient sequential program for
   matrix-matrix multiplication, implementing the dot product in the inner loop to get better performance.
   Use the best compiler flags and demonstrate the
   performance.
     
   
   
           
     Description
    
   
    
 The aim is to compute the product of two real square matrices with a dot-product inner loop and to use compiler
 optimizations to extract performance. Assume that the array dimension is 2^i,
 where i = 4, 8. This is a simple matrix-matrix multiplication with a dot-product
 inner loop. The elements of the
 matrices are accessed either in row-wise or in column-wise fashion.
 In FORTRAN, arrays are stored in memory in column-major order, whereas in
 C matrix arrays are stored in memory in row-major order.
 We try to achieve the maximum performance from the program using compiler optimizations.
 For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time
 taken for the matrix-matrix multiplication is reported.
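
 A dot-product inner loop can be sketched in C as follows. This is an illustration only,
 assuming square matrices stored row-major in flat arrays; the function name
 matmul_dotproduct is hypothetical and the downloadable program is written in FORTRAN.

```c
#include <stddef.h>

/* C = A * B for n-by-n matrices stored row-major in flat arrays.
   The innermost loop is the dot product of row i of A with
   column j of B. */
void matmul_dotproduct(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

 In this loop order row i of A is read with unit stride while column j of B is read
 with stride n, which is one reason the loop ordering matters for cache behaviour.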
  
  
  
  
  
 
 Input
        
   
    
     Number of Rows and Columns of the two real square matrices
  
   
   
  Output
  
    
    The time taken in seconds for computation of Matrix into Matrix Multiplication 
    and performance in MFLOPS.
 
   
   
   
 |  
    
       
 
 
    
 
| 
       
        Example 1.5 :     
Write an efficient sequential program for matrix-matrix multiplication,
          implementing daxpy in the inner loop to get better performance. Use the best compiler flags and
          demonstrate the performance.
 
     (Download source code : 
    
           mathlib-core-mat-mat-mult-daxpy.f   )  
       
    
     
    
 
  | 
   
 
 
   
  
  
   
 
  
  
       
 
 
    Objective
  
    
    
  Write an efficient sequential program for
   matrix-matrix multiplication, implementing daxpy in the inner loop to get better performance.
   Use the best compiler flags and demonstrate the
   performance.
  
   
   
    
     Description
    
    
    
 The aim is to compute the product of two real square matrices with a daxpy inner loop and to use compiler
 optimizations to extract performance. Assume that the array dimension is 2^i,
 where i = 4, 8. This is a simple matrix-matrix multiplication in which the
 inner loop is a daxpy operation (y = y + alpha * x). The elements of the
 matrices are accessed either in row-wise or in column-wise fashion.
 In FORTRAN, arrays are stored in memory in column-major order, whereas in
 C matrix arrays are stored in memory in row-major order.
 We try to achieve the maximum performance from the program using compiler optimizations.
 For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time
 taken for the matrix-matrix multiplication is reported. The daxpy
 library call can be used.
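
 A daxpy-style inner loop can be sketched in C as follows. The helper daxpy_sketch is a
 hand-written stand-in for the operation y = y + alpha * x, not the BLAS routine itself,
 and matmul_daxpy is a hypothetical name; both assume row-major flat arrays.

```c
#include <stddef.h>

/* Hand-written stand-in for the daxpy operation: y = y + alpha * x. */
static void daxpy_sketch(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* C = A * B for n-by-n row-major matrices. Row i of C is built up as
   a sum of scaled rows of B: C(i,:) += A(i,k) * B(k,:). */
void matmul_daxpy(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            daxpy_sketch(n, a[i * n + k], &b[k * n], &c[i * n]);
}
```

 With row-major storage both vectors in the inner loop are rows, so every access is
 unit stride. In FORTRAN the analogous ordering updates columns of C instead.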
  
  
   
   
 
  
 
 Input
        
  
    
   Number of Rows and Columns of the two real square matrices  
   
   
   
  Output
 
  
    
    The time taken in seconds for computation of Matrix into Matrix Multiplication 
    and performance in MFLOPS.
     
   
   
 |  
     
       
 
 
    
 
| 
       
        Example 1.6 :     
 Write an efficient sequential program for matrix-matrix multiplication,
          implementing the dot product in the inner loop using the BLAS-I, II, III
     libraries. Use the best compiler flags
          and demonstrate the performance.
 
     (Download source code : 
    
           mathlib-core-mat-mat-mult-dotproduct-blas.f   ) 
   
    
     
    
 
  | 
   
 
 
   
  
  
   
 
  
  
       
 
 
  Objective
   
    
   Write an efficient sequential program for
   matrix-matrix multiplication, implementing the inner loop with the BLAS-I, II, III libraries to get
   better performance. Use the best compiler flags and demonstrate the
   performance.
   
   
   
           
     Description
    
    
     
 The aim is to compute the product of two real square matrices with a dot-product inner loop and to use compiler
 optimizations to extract performance. Assume that the array dimension is 2^i,
 where i = 4, 8. This is a simple matrix-matrix multiplication whose dot-product
 inner loop is implemented with the BLAS-I, II, III libraries.

 The BLAS library routine ddot can be obtained from
  www.netlib.org
The elements of the matrices are accessed either in row-wise or in column-wise fashion.
 In FORTRAN, arrays are stored in memory in column-major order, whereas in
 C matrix arrays are stored in memory in row-major order.
 We try to achieve the maximum performance from the program using compiler optimizations.
 For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time
 taken for the matrix-matrix multiplication is reported. The BLAS-I, II, III
 libraries can be used.
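
 To show how a strided level-1 dot product fits into the inner loop without needing a
 BLAS installation, the sketch below uses a hand-written function ddot_ref. It is a
 stand-in that takes a length and two vector/stride pairs in the style of the level-1
 ddot routine, handles non-negative strides only, and is not the library routine; the
 names ddot_ref and matmul_ddot are assumptions for this example.

```c
#include <stddef.h>

/* Stand-in for a level-1 dot product with strides: returns the sum of
   x[i * incx] * y[i * incy] for i = 0 .. n-1. */
static double ddot_ref(size_t n, const double *x, size_t incx,
                       const double *y, size_t incy)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
}

/* C = A * B for n-by-n row-major matrices: each entry of C is one
   dot-product call on row i of A (stride 1) and column j of B (stride n). */
void matmul_ddot(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = ddot_ref(n, &a[i * n], 1, &b[j], n);
}
```

 Replacing ddot_ref with a tuned library call keeps the loop structure unchanged, which
 makes the hand-written and library versions easy to compare.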
  
  
   
   
 
  
 
 Input
        
  
    
   Number of Rows and Columns of the two real square matrices  
   
   
   
  Output
    
    
    The time taken in seconds for computation of Matrix into Matrix Multiplication 
     using  BLAS-I, II, III  library routines
    and performance in MFLOPS.
    
   
   
 |  
    
       
 
 
    
 
| 
       
        Example 1.7 :     
 Write an efficient sequential program for matrix-matrix multiplication,
          implementing the dot product in the inner loop using the system-provided BLAS-I, II, III
     mathematical libraries. Use the best compiler flags
          and demonstrate the performance.
 
     (Download source code : 
    
           mathlib-core-mat-mat-mult-dotproduct-intel-mkl.f   ) 
 
     
    
 
  | 
   
 
   
  
  
   
    
  
     
 
 
    Objective
     
    
  Write an efficient sequential program for
   matrix-matrix multiplication, implementing the inner loop with the BLAS-I, II, III routines
   of the system-provided mathematical
   libraries to get
   better performance.
   Use the best compiler flags and demonstrate the
   performance.
    
   
   
           
     Description
    
    
    
 The aim is to compute the product of two real square matrices with a dot-product inner loop and to use compiler
 optimizations to extract performance. Assume that the array dimension is 2^i,
 where i = 4, 8. This is a simple matrix-matrix multiplication whose dot-product
 inner loop is implemented with the system-tuned BLAS-I, II, III mathematical libraries.

 The elements of the
 matrices are accessed either in row-wise or in column-wise fashion.
 In FORTRAN, arrays are stored in memory in column-major order, whereas in
 C matrix arrays are stored in memory in row-major order.
 We try to achieve the maximum performance from the program using compiler optimizations.
 For compiler optimizations, refer to the vendor-supplied Tuning and Performance Guide. The time
 taken for the matrix-matrix multiplication is reported.
 The system-tuned BLAS-I, II, III libraries can be used.
  
   
   
   
  
  
 
 Input
        
   
    
    Number of Rows and Columns of the two real square matrices  
       
   
   
  Output
    
    
    The time taken in seconds for computation of Matrix into Matrix Multiplication 
     using  BLAS-I, II, III  library routines
    and performance in MFLOPS.
       
   
   
 |  
    
       
 
 
    
 
| 
       
        Example 1.8 :     
 Write an efficient sequential program to solve a
 system
  of linear equations  Ax = b  where  A  is a symmetric positive definite
   matrix and  b  is a real vector.
  Obtain better performance using the best compiler flags.
 
     (Download source code : 
    
           mathlib-core-linear-system-gauss-solver.f   ) 
 
    
     
    
 
  | 
   
 
 
   
  
  
   
    
  
  
       
 
 
    Objective
     
 
    
   Write an efficient sequential program to solve a
      matrix system of linear equations  Ax = b  where  A  is a symmetric positive
     definite matrix and  b  is a real vector.
          Obtain better performance using the best compiler flags.
     
   
   
   
           
     Description
    
    
  
    
 Given a linear system of equations of the form  AX = b,  where  A  is a real square symmetric positive definite
matrix of order  n  and  b  is a real vector of order  n,  the program finds the
 inverse of the real square matrix  A  using the Gauss-Jordan method and multiplies the inverse by the
vector  b  to get the
 solution vector  X,  i.e. the operation can be represented as  X = inverse(A) * b.
The time taken by the algorithm and the performance are printed in seconds
and MFLOPS respectively. For compiler optimizations, refer to the vendor-supplied Tuning and Performance
 Guide.
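
 The inverse-then-multiply scheme described above can be sketched in C as follows. This is
 a minimal illustration, not the downloadable FORTRAN source: it assumes row-major flat
 arrays, does no pivoting (relying on A being symmetric positive definite), and the names
 gauss_jordan_inverse and multiply_inverse are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Invert the n-by-n row-major matrix a into inv by Gauss-Jordan
   elimination on a working copy. No pivoting is done, on the assumption
   that a is symmetric positive definite.
   Returns 0 on success, -1 on a zero pivot or allocation failure. */
int gauss_jordan_inverse(size_t n, const double *a, double *inv)
{
    double *w = malloc(n * n * sizeof *w);
    if (w == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            w[i * n + j] = a[i * n + j];
            inv[i * n + j] = (i == j) ? 1.0 : 0.0;
        }
    for (size_t p = 0; p < n; p++) {
        double pivot = w[p * n + p];
        if (pivot == 0.0) {
            free(w);
            return -1;
        }
        /* Scale the pivot row so the pivot becomes 1. */
        for (size_t j = 0; j < n; j++) {
            w[p * n + j] /= pivot;
            inv[p * n + j] /= pivot;
        }
        /* Eliminate column p from every other row. */
        for (size_t i = 0; i < n; i++) {
            if (i == p)
                continue;
            double factor = w[i * n + p];
            for (size_t j = 0; j < n; j++) {
                w[i * n + j] -= factor * w[p * n + j];
                inv[i * n + j] -= factor * inv[p * n + j];
            }
        }
    }
    free(w);
    return 0;
}

/* x = inv * b, the final step X = inverse(A) * b. */
void multiply_inverse(size_t n, const double *inv, const double *b, double *x)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += inv[i * n + j] * b[j];
        x[i] = sum;
    }
}
```

 Forming the explicit inverse costs more than a factorization followed by triangular
 solves, so this version is mainly useful as the baseline for the performance comparison.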
  
  
   
   
   
 
  
 
 Input
        
  
    
    size of real square matrix and the real vector 
     
   
   
  Output
   
    
    The time taken in seconds for the solution of  AX = b  and performance in MFLOPS.
      
   
   
 
 |  
    
       
 
 
    
 
| 
       
        Example 1.9 :     
 Write an efficient sequential program to solve a matrix
 system of linear equations  Ax = b  where  A  is a symmetric positive definite
  matrix and  b  is a real vector, using system-provided mathematical libraries.
  Obtain better performance using the best compiler
  flags.
 
     (Download source code : 
    
           mathlib-core-linear-system-gauss-intel-mkl.f   )  
       
    
     
    
 
  | 
   
 
 
   
  
  
   
    
  
  
       
 
 
 
    Objective
 
 
    
   Write an efficient sequential program to solve a
      matrix system of linear equations  Ax = b  where  A  is a symmetric positive
     definite matrix and  b  is a real vector, using system-provided mathematical libraries.
          Obtain better performance using the best compiler flags.
     
     
   
   
           
     Description
    
  
    
 Given a linear system of equations of the form  AX = b,  where  A  is a real square symmetric positive definite
matrix of order  n  and  b  is a real vector of order  n,  the program finds the
 inverse of the real square matrix  A  using the Gauss-Jordan method and multiplies the inverse by the
vector  b  to get the
 solution vector  X,  i.e. the operation can be represented as  X = inverse(A) * b.


The system-provided mathematical libraries for the solution of  AX = b
or for the inverse of  A  can be used in the numerical computations.
The time taken by the algorithm and the performance are printed in seconds
and MFLOPS respectively.
 For compiler optimizations, refer to the vendor-supplied Tuning and Performance
 Guide.
  
     
   
   
 
  
 
 Input
        
  
    
  size of real square matrix and the real vector     
    
   
   
  Output
  
    
    The time taken in seconds for the solution of  AX = b  and performance in MFLOPS.
        
   
   
 |  
    
       
 
 
    
 
| 
       
        Example 1.10 :     
 Write an efficient sequential program to solve a matrix system
  of linear equations  Ax = b  where  A  is a symmetric positive definite
  matrix and  b  is a real vector, by an
 iterative method (Jacobi method), using system-provided mathematical libraries.
 Obtain better
 performance using the best compiler flags.  (Assignment)
 
      
 
   | 
   
 
 
   
  
  
   
    
  
  
       
 
 
  
    Objective
 
     
   Write an efficient sequential program to solve a
      matrix system of linear equations  Ax = b  where  A  is a symmetric positive
     definite matrix and  b  is a real vector, by the Jacobi method, using system-provided
      mathematical libraries.
          Obtain better performance using the best compiler flags.
     
   
   
   
           
     Description
    
   
     
 Given a linear system of equations of the form  AX = b,  where  A  is a real square symmetric positive definite
matrix of order  n  and  b  is a real vector of order  n.
Starting from an initial solution vector  {x0},  the program computes the
next solution vector  {x1}  by the Jacobi method, and repeats this iteratively. The iteration
is stopped once the convergence criterion is satisfied, giving the final solution
vector  {x}.
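
One possible shape of the Jacobi iteration is sketched below in C. It is an illustration
only: the stopping test (largest change between successive iterates below tol), the
iteration cap max_iter, and the name jacobi_solve are assumptions for this example, and
convergence is only guaranteed for suitable matrices such as strictly diagonally dominant ones.

```c
#include <stddef.h>
#include <stdlib.h>

/* Jacobi iteration for A x = b with A stored row-major in a flat array.
   On entry x holds the initial guess {x0}; on exit it holds the last
   iterate. Returns the number of iterations used, or -1 if the change
   did not drop below tol within max_iter sweeps (or allocation failed). */
int jacobi_solve(size_t n, const double *a, const double *b, double *x,
                 double tol, int max_iter)
{
    double *x_new = malloc(n * sizeof *x_new);
    if (x_new == NULL)
        return -1;
    for (int iter = 1; iter <= max_iter; iter++) {
        double max_change = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* x_new[i] = (b[i] - sum over j != i of a[i][j] * x[j]) / a[i][i] */
            double sum = b[i];
            for (size_t j = 0; j < n; j++)
                if (j != i)
                    sum -= a[i * n + j] * x[j];
            x_new[i] = sum / a[i * n + i];
            double change = x_new[i] - x[i];
            if (change < 0.0)
                change = -change;
            if (change > max_change)
                max_change = change;
        }
        for (size_t i = 0; i < n; i++)
            x[i] = x_new[i];
        if (max_change < tol) {
            free(x_new);
            return iter;
        }
    }
    free(x_new);
    return -1;
}
```

Every component of the new iterate depends only on the previous iterate, which is why
the update is written into a separate array before being copied back.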
 
  
   
The system-provided mathematical libraries for the solution of  AX = b  by an iterative
method can be used in the numerical computations.
The time taken by the algorithm and the performance are printed in seconds
and MFLOPS respectively.
 For compiler optimizations, refer to the vendor-supplied Tuning and Performance
 Guide.
  
   
   
   
  
 
 Input
        
  
    
    size of real square matrix and the real vector  
 
   
   
   
  Output
   
    
    The time taken in seconds for computation of  AX=b   and performance in MFLOPS.
  
   
   
  
 |  
     
       
 
 | 
 
 
  
         
 |  
  
 
          
 
 |    |    
      
 |