Performance Visualization Tools: Programming on Multi-Core Processors

Performance visualization tools such as Intel Thread Checker, Intel VTune Performance Analyzer, Intel Thread Debugger, Sun Studio, the Etnus TotalView debugger, Upshot, Jumpshot (public domain), and IBM tools have been developed to help the programmer understand the behavior and performance of a parallel (MPI/Pthreads) program.
Example programs using different APIs are provided. Compilation and execution of Pthreads programs for numerical and non-numerical computations are discussed using different thread APIs, with the aim of understanding performance issues on multi-core processors.
 
 
 
 
 
   
    
      
       
         
Example 1.1

Write a multi-threaded program to print Hello World. Analyze the performance using a thread visualization tool.
      
 
    
      
       
         
Example 1.2

Write a multi-threaded (Pthreads) program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1. Analyze the performance using a thread visualization tool.
    
    
      
       
         
Example 1.3

Write a multi-threaded (OpenMP) program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1. Use the OpenMP PARALLEL directive. Analyze the performance using a thread visualization tool.
    
   
    
      
       
         
Example 1.4

Write an MPI program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1. Use MPI point-to-point communication library calls. Analyze the performance using an MPI visualization tool.
      
 
    
      
       
         
Example 1.5

Write an MPI-Pthreads program to calculate the infinity norm of a matrix using block-striped partitioning with row-wise data distribution, using p processes and t threads. Analyze the performance using a thread visualization tool.
    
 
    
      
       
         
Example 1.6

Write an MPI program to compute matrix-matrix multiplication using block checkerboard partitioning and MPI Cartesian topology. Analyze the performance using an MPI visualization tool. (Assignment)
    
   
 
  
 
     
(Source - References: Books, Multi-threading, OpenMP - [MCMPI-06], [MCMPI-07], [MCMPI-09], [MCMPI-12], [MCMTh-15], [MCMTh-21], [MCBW-44], [MCOMP-01], [MCOMP-02], [MCOMP-04], [MCOMP-12], [MCOMP-19], [MCOMP-25])
     
  

Description of Multi-threaded (Pthreads/OpenMP), MPI & MPI-Pthreads Programs
   
 
 
 
    
     
      
Example 1.1:

Write a multi-threaded program to print Hello World and analyze the performance using a thread visualization tool.

(Download source code: tools-pthreads-hello-world.c)
  
  
 
 
-  Objective  
  
-  Description  
 
This is a very simple program that gives a feel for how threads work. The main thread creates two child threads, which print the words "Hello" and "World!" respectively. No real parallelism is involved; the program only demonstrates how threads operate. Note, however, that depending on the system load and the implementation of the Pthreads standard, the message may not always be "Hello World!". It can be "World! Hello", depending on which thread is scheduled to execute first. The program also demonstrates the use of pthread_join.
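
The behavior described above can be sketched as follows. This is a minimal illustration, not the downloadable tools-pthreads-hello-world.c; the names `print_word` and `run_hello_world` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Thread body: print the word passed in as the argument. */
static void *print_word(void *arg)
{
    assert(arg != NULL);
    printf("%s ", (const char *)arg);
    return NULL;
}

/* Create two child threads and wait for both; returns 0 on success. */
int run_hello_world(void)
{
    pthread_t hello_thread, world_thread;

    if (pthread_create(&hello_thread, NULL, print_word, "Hello") != 0)
        return 1;
    if (pthread_create(&world_thread, NULL, print_word, "World!") != 0) {
        pthread_join(hello_thread, NULL);
        return 1;
    }

    /* pthread_join blocks the main thread until each child has finished.
     * The two words may still appear in either order, because the
     * scheduler decides which child runs first. */
    pthread_join(hello_thread, NULL);
    pthread_join(world_thread, NULL);
    printf("\n");
    return 0;
}
```

Calling `run_hello_world()` from `main` and building with a Pthreads-enabled compiler (for example `gcc -pthread`) is enough to try it. Joining the first thread before creating the second would force the "Hello World!" order, at the cost of running the threads one after the other.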
								
 
 
- Input 
 
  
- Output
 
 
 
 
 
 
    
     
         
Example 1.2:

Write a multi-threaded program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1.

(Download source code: tools-pthread-num-int-pie.c)
  
 
  
 
 
 
-  Objective  
 
Write a multi-threaded program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1 using t threads.
 
 
-  Description  
 
										
This program computes the value of PI by numerical integration over a given interval. The master thread assigns a part of the interval to each thread, and each thread determines the number of subintervals it has to compute. Each thread calculates its part of the integral and finally adds it to the shared result variable. Each thread locks a mutex before this update to guarantee the atomicity of the operation.

Thread APIs support critical sections and atomic operations through mutex locks (mutual exclusion locks). A mutex lock has two states, locked and unlocked, and at any point in time only one thread can hold the lock. Locking a mutex is an atomic operation, and a mutex lock is generally associated with a piece of code that manipulates shared data. To access the shared data, a thread must first try to acquire the mutex lock. If the mutex lock is already locked, the thread trying to acquire it is blocked until the lock is released.
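
A minimal sketch of this scheme is shown below. It is not the downloadable source; it assumes a midpoint-rule sum and a block split of the subintervals among the threads, and the names `pi_pthreads`, `partial_pi`, `sum_lock`, and `MAX_THREADS` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64

/* Shared state: read-only inputs plus the mutex-protected result. */
static long n_intervals;
static int n_threads;
static double pi_sum;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread integrates its own block of subintervals (midpoint rule). */
static void *partial_pi(void *arg)
{
    int id = *(const int *)arg;
    long first = (long)id * n_intervals / n_threads;
    long last = (long)(id + 1) * n_intervals / n_threads;
    double h = 1.0 / (double)n_intervals;
    double local = 0.0;

    for (long i = first; i < last; i++) {
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }

    /* Critical section: one thread at a time updates the shared result. */
    pthread_mutex_lock(&sum_lock);
    pi_sum += h * local;
    pthread_mutex_unlock(&sum_lock);
    return NULL;
}

/* Approximate PI with n subintervals split across t threads. */
double pi_pthreads(long n, int t)
{
    pthread_t tid[MAX_THREADS];
    int ids[MAX_THREADS];

    assert(n > 0 && t > 0 && t <= MAX_THREADS);
    n_intervals = n;
    n_threads = t;
    pi_sum = 0.0;

    for (int i = 0; i < t; i++) {
        ids[i] = i;
        int rc = pthread_create(&tid[i], NULL, partial_pi, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < t; i++)
        pthread_join(tid[i], NULL);
    return pi_sum;
}
```

Each thread takes the lock only once, after accumulating into a private `local` variable. Locking inside the loop would also be correct but would serialize the threads on every iteration, which is the kind of contention a thread visualization tool makes visible.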
   
- Input 
 
- Output
 
 
 
 
 
     
      
           
             
Example 1.3:

-  Objective
 
	    
Write an OpenMP program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1 using the OpenMP PARALLEL directive.
             
            -  Description 
 
            
There are several approaches to parallelizing a serial program. One approach is to partition the data among the threads. That is, we partition the interval of integration [0,1] among the threads, and each thread estimates the local integral over its own subinterval. The local results produced by the individual threads are combined to produce the final result. To perform this integration numerically, divide the interval from 0 to 1 into n subintervals and add up the areas of the rectangles as shown in Figure 1 (n = 5). Larger values of n give more accurate approximations of PI.

Fig. 1: Numerical integration of the PI function

This program uses the OpenMP PARALLEL FOR directive and a CRITICAL section. The CRITICAL directive specifies a region of the program that must be executed by only one thread at a time. If a thread is currently executing inside a CRITICAL region and another thread reaches that region and attempts to execute it, the second thread blocks until the first thread exits the region.
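
The following sketch shows one way the work-sharing loop and the CRITICAL section can fit together. It is an illustration under stated assumptions rather than the course's program: it uses a midpoint-rule sum, a `parallel` region containing a `for` work-sharing loop, and a hypothetical function name `pi_openmp`.

```c
#include <assert.h>

/* Approximate PI with n subintervals using OpenMP work sharing. */
double pi_openmp(long n)
{
    assert(n > 0);
    double h = 1.0 / (double)n;
    double pi = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;   /* private to each thread */

        /* The loop iterations are divided among the threads. */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = h * ((double)i + 0.5);
            local += 4.0 / (1.0 + x * x);
        }

        /* Only one thread at a time adds its partial sum to the result. */
        #pragma omp critical
        pi += h * local;
    }
    return pi;
}
```

With GCC, OpenMP is enabled by the `-fopenmp` option; without it the pragmas are ignored and the function runs serially with the same result. A `reduction(+:pi)` clause would be a common alternative to the explicit CRITICAL section.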
             
            
            -  Input
 
           
            
            -  Output
 
            
      
 
 
     
       
  
 
 
 
    
     
     
         
Example 1.4:

Write an MPI program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1 using MPI point-to-point communication library calls. Analyze the performance using an MPI visualization tool.

(Download source code: tools-numint-pie-pt-to-pt.c)
   
 
  
 
	
	
- Objective
 
Write an MPI program to compute the value of PI by numerical integration of the function f(x) = 4/(1+x*x) between the limits 0 and 1 using MPI point-to-point communication library calls.
 
 
	
- Description 
 
In this example, the interval of integration [0,1] is partitioned among the processes, and each process estimates the local integral over its own subinterval. The local results produced by the individual processes are combined to produce the final result. Each process sends its integral to process 0, which adds them and prints the result. To perform this integration numerically, divide the interval from 0 to 1 into n subintervals and add up the areas of the rectangles as shown in Figure 2 (n = 5). Larger values of n give more accurate approximations of PI. Use MPI point-to-point communication library calls.

Figure 2: Numerical integration of the PI function

We assume that n is the total number of subintervals, p is the number of processes, and p < n. One simple way to distribute the subintervals among the processes is to divide n by p. There are two kinds of mappings that balance the load. One is a block mapping, which partitions the array elements into blocks of consecutive entries and assigns one block to each process. The other is a cyclic mapping, which assigns the first element to the first process, the second element to the second, and so on. If n > p, we return to the first process and repeat the assignment for the remaining elements. This is repeated until all the elements are assigned. We have used a cyclic mapping to partition the interval [0,1] onto p processes.
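
The cyclic mapping can be illustrated without MPI by computing what each rank would own. The sketch below is a serial stand-in under stated assumptions: it uses a midpoint-rule sum, and `partial_integral_cyclic` and `gather_partial_integrals` are hypothetical names. In an MPI program, each process would obtain its rank from MPI_Comm_rank, compute only its own partial integral, and send it to process 0 with MPI_Send, where process 0 would collect the values with MPI_Recv.

```c
#include <assert.h>

/* Partial integral owned by process `rank` out of `p` under a cyclic
 * mapping: the rank handles subintervals rank, rank + p, rank + 2p, ... */
double partial_integral_cyclic(int rank, int p, long n)
{
    assert(p > 0 && rank >= 0 && rank < p && n > 0);
    double h = 1.0 / (double)n;
    double local = 0.0;

    for (long i = rank; i < n; i += p) {
        double x = h * ((double)i + 0.5);   /* midpoint of subinterval i */
        local += 4.0 / (1.0 + x * x);
    }
    return h * local;
}

/* Serial stand-in for process 0 adding up one partial result per rank. */
double gather_partial_integrals(int p, long n)
{
    double total = 0.0;
    for (int rank = 0; rank < p; rank++)
        total += partial_integral_cyclic(rank, p, n);
    return total;
}
```

Because rank r takes every p-th subinterval starting at r, the number of subintervals per rank differs by at most one even when p does not divide n, which is why the cyclic mapping balances the load.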
 
- Input 
 
The process with rank 0 reads the input parameter n, the number of intervals, from the command line.
- Output 
 
 
	
 
 
 
 
    
     
       
Example 1.5:

Write an MPI-Pthreads program to calculate the infinity norm of a matrix using block-striped partitioning with row-wise data distribution. Analyze the performance using a thread visualization tool.

(Download source code: tools-mpi-pthreads-infinity-norm.c)
   
  
 
 
 
-  Objective 
 
Write an MPI-Pthreads program to calculate the infinity norm of a matrix using block-striped partitioning with row-wise data distribution, using p processes and t threads.
 
 
-  Description 
 
										
Infinity norm of a matrix: the row-wise infinity norm of a matrix is defined as the maximum, over all rows, of the sum of the absolute values of the elements in a row.
After the initial validity checks, each process reads the input matrix and determines the number of rows it has to operate on. Using its rank, each process determines the specific rows assigned to it.
After the distribution of rows, the main thread on each process creates as many child threads as it has rows to operate on. Each thread operates on its assigned row, calculates the sum of the absolute values of the elements, and updates a common variable. After all the threads on a process complete their share of the work, the common variable holds the maximum row sum among the rows assigned to that process. Finally, the root process determines the infinity norm using a collective MPI call, MPI_Reduce, and prints the value.
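
The per-process part of this computation can be sketched serially as follows. This is not the downloadable source: it assumes a row-major matrix stored in a flat array, it replaces the one-thread-per-row step with a plain loop, and the names `row_abs_sum` and `local_infinity_norm` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the absolute values of one row of length `cols`. */
static double row_abs_sum(const double *row, int cols)
{
    double sum = 0.0;
    for (int j = 0; j < cols; j++)
        sum += (row[j] < 0.0) ? -row[j] : row[j];
    return sum;
}

/* Largest row sum over rows [first_row, last_row) of a row-major matrix.
 * This is the value one process holds once its rows have been handled. */
double local_infinity_norm(const double *a, int cols,
                           int first_row, int last_row)
{
    assert(a != NULL && cols > 0 && first_row <= last_row);
    double local_max = 0.0;

    for (int i = first_row; i < last_row; i++) {
        double s = row_abs_sum(a + (size_t)i * cols, cols);
        if (s > local_max)
            local_max = s;
    }
    return local_max;
}
```

For the 3 x 3 matrix with rows (1, -2, 3), (-4, 5, -6), (7, 8, -9), the row sums are 6, 15, and 24, so the infinity norm is 24. In the MPI-Pthreads version, each process would pass its local maximum to MPI_Reduce; using the MPI_MAX operation there is an assumption, since the text names only MPI_Reduce.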
 
 
- Input  
 
 - Output 
 
 
 
 
 
 
    
     
      
Example 1.6:

Write an MPI program to compute matrix-matrix multiplication using checkerboard partitioning of the input matrices. Analyze the performance using a thread visualization tool.
   
  
 
 
 
-  Objective 
Write an MPI-Pthreads program to compute matrix-matrix multiplication using checkerboard partitioning of the input matrices, using p processes and t threads.
 
 
 -  Description 
										
In checkerboard partitioning, the matrix is divided into smaller square or rectangular blocks (submatrices) that are distributed among the processes. A checkerboard partitioning splits both the rows and the columns of the matrix, so no process is assigned any complete row or column. Like striped partitioning, checkerboard partitioning can be block or cyclic.
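
The difference between the block and cyclic variants can be made concrete with a few index helpers. The sketch below assumes an n x n matrix on a q x q process grid with ranks numbered row by row; that numbering and the names `block_range`, `block_owner`, and `cyclic_owner` are assumptions made for illustration only.

```c
#include <assert.h>

/* Half-open index range [*start, *end) of block `idx` when n rows (or
 * columns) are split into q nearly equal consecutive blocks. */
void block_range(int n, int q, int idx, int *start, int *end)
{
    assert(n > 0 && q > 0 && idx >= 0 && idx < q);
    *start = (int)((long)idx * n / q);
    *end = (int)((long)(idx + 1) * n / q);
}

/* Rank owning element (i, j) under block checkerboard partitioning.
 * Each coordinate is mapped back to the block that contains it. */
int block_owner(int n, int q, int i, int j)
{
    assert(q > 0 && n >= q && i >= 0 && i < n && j >= 0 && j < n);
    int grid_row = (int)(((long)q * (i + 1) - 1) / n);
    int grid_col = (int)(((long)q * (j + 1) - 1) / n);
    return grid_row * q + grid_col;
}

/* Rank owning element (i, j) under cyclic checkerboard partitioning:
 * rows and columns are dealt out to the grid in round-robin order. */
int cyclic_owner(int q, int i, int j)
{
    assert(q > 0 && i >= 0 && j >= 0);
    return (i % q) * q + (j % q);
}
```

For an 8 x 8 matrix on a 2 x 2 grid, the block variant gives each process one contiguous 4 x 4 submatrix, while the cyclic variant spreads each process's 16 elements across the whole matrix.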
 
 
 - Input  
 
An input file holding the two square matrices (number of rows and columns of each matrix).
 
 - Output 
 