C-DAC,Pune : High-Perf. Comp. Frontier Technologies Exploration Group and CMSD, University of Hyderabad, Technology Workshop hyPACK (October 15-18), 2013

Overview Venue : CMSD, UoH Key-Note/Invited Talks Faculty / Speakers Proceedings Downloads Past Tech. Workshops Target Audience Benefits Organisers Accommodation Local Travel Sponsors Feedback Acknowledgements Contact Home

Topics of Interest Tech. Prog. Schedule Topic : Multi-Core Topic : ARM Proc. Topic : Coprocessors Topic : GPGPUs Topic : HPC Cluster Topic : App. Kernels. Topic : Lab. Session Key-Note / Invited Talks Home

Mode-1 Multi-Core Memory Allocators OpenMP Intel TBB Pthreads Java - Threads Charm++ Prog. Message Passing (MPI) MPI - OpenMP MPI - Intel TBB MPI - Pthreads Compilers - Opt. Features Threads-Perf. Math. Lib. Threads-Prof. & Tools Threads - I/O Perf. PGAS : UPC / CAF/ GA Power & Perf. Home

Mode-2 ARM Prog. Env Benchmarks Power & Perf. Home

Mode-3 Coprocessors Arch. Software Compiler & Vect. Prog. Env. Benchmarks Power & Perf. Home

Mode-4 GPGPUs NVIDIA - CUDA/OpenCL AMD APP - OpenCL GPGPUs - OpenCL GPGPUs : Power & Perf. Home

Mode-5 HPC Cluster HPC MPI Cluster GPU Cluster - NVIDIA GPU Cluster - AMD APP Cluster - Intel Coprocessors Cluster- Power & Perf. Home

Mode-6 App. Kernels PDE Solvers : FDM/FEM Image Processing - FFT Monte Carlo Methods String Srch. Seq. Analy. Video Process. Intr. Detcn. Sys App. Power & Perf. Home

Reg. Overview Pvt. Sector Pub. Sector Govt. Acad. Staff Students Reg. On-line Reg. Accommodation Contact Home

Mode-1 Multi-Core Memory Allocators OpenMP Intel TBB Pthreads Java - Threads Charm++ Prog. Message Passing (MPI) MPI - OpenMP MPI - Intel TBB MPI - Pthreads Compiler Opt. Features Threads-Perf. Math.Lib. Threads-Prof. & Tools Threads-I/O Perf. PGAS : UPC / CAF / GA Power-Perf. Home

Performance Visualization Tools : Programming on Multi-Core Processors

Performance visualization tools such as Intel Thread Checker, Intel Vtune Performance analyzer, Intel thread Debugger, Sun Studio, Etnus totalview Debugger, Upshot, Jumpshot (Public domain) , and IBM Tools have been developed in order to help the programmer to understand the behavior of a parallel (MPI/Pthreaded) program and understand performance of application. Example programs using different APIs. Compilation and execution of Pthread programs, programs numerical and non-numerical computations are discussed using different thread APIs and understand Performance issues on mutli-core processors

Example 1.1	Write a Multi-threaded program to print Hello World . Analyze the Performance using thread Visualization tool.
Example 1.2	Write a Multi-threaded (Pthreads) program to compute the value of PI pie function f(x) = 4/(1+x²) between the limits 0 and 1 by Numerical Integration. Analyze the Performance using thread Visualization tool.
Example 1.3	Write a Multi-threaded (OpenMP) program to compute the value of PI pie function f(x) = 4/(1+x²) between the limits 0 and 1 by Numerical Integration. Use OpenMP PARALLEL directive. Analyze the Performance using thread Visualization tool.
Example 1.4	Write a MPI program to compute the value of PI pie function f(x) = 4/(1+x²) between the limits 0 and 1 by Numerical Integration. you have to use MPI point-point communication library calls. Analyze the Performance using MPI Visualization tool.
Example 1.5	Write a MPI-Pthreads program to calculate Infinity norm of a matrix using block striped partitioning with row wise data distribution using p processe and t threads. Analyze the Performance using thread Visualization tool.
Example 1.6	Write a MPI program to compute Matrix and Matrix Multiplication using block checkerboard partitioning and MPI Cartesian topology. Analyze the Performance using MPI Visualization tool. (Assignment)

(Source - References : Books Multi-threading OpenMP - [MCMPI-06],
[MCMPI-07], [MCMPI-09], [MCMPI-12], [MCMTh-15], [MCMTh-21], [MCBW-44],
[MCOMP-01],[MCOMP-02],[MCOMP-04], [MCOMP-12], [MCOMP-19],
[MCOMP-25])

Description of Muli-threaded (PThreads/OpenMP); MPI & MPI-Pthread Programs

Example 1.1:

Write an Multi-threaded program to print Hello World and analyze the Performance using thread Visualization tool.
(Download source code : tools-pthreads-hello-world.c)

Objective

Write a Multi-threaded program to print Hello World using p processe and t threads

Description

This is a very simple program to get the feel of threads and to get a view of how threads actually work. The implementation is as follows: The main thread creates two child threads. These threads print the words Hello and World! individually. Though there is no actual parallelism involved, this is just to demonstrate the working of threads. It is to be however noted that depending on the system load and the implementation of Pthreads Standard, the message may not always be "Hello World!". It can be "World! Hello" depending on which thread is scheduled to execute first. This also demonstrates the use of "Pthread_join".

Input

None

Output

Hello World! or World! Hello

Example 1.2:

Write an Multi-threaded program to compute the value of PI function by numerical integration of a function f(x) = 4/(1+x² ) between the limits 0 and 1.
(Download source code : tools-pthread-num-int-pie.c)

Objective

Write a Multi-threaded program to compute the value of PI function by numerical integration of a function f(x) = 4/(1+x² ) between the limits 0 and 1 using p processes and t threads

Description

This program computes the value of PI over a given interval using Numerical integration. All the threads determine the number of intervals to be calculated by it. The master thread assigns an interval to each thread. Threaded APIs provide support for implementing critical sections and atmoic operations using mutex-locks (mutual exclusision locks). Each thread calculates its part of the interval and finally adds it up to the result variable. Mutex-Locks have two states : locked and unlocked. At any point of time, only one thread can lock a mutex lock.

Each thread locks a mutex before doing the same to guarantee the atomicity of the operation. The Mutex-lock is an atomic operation generally associated with a piece of code that manipulates shared data.To access the shared data, a thread must first try to acquire a mutex-lock. If the mutex-lock is already locked, the process trying to acquire the lock is blocked.

Input

Number of intervals.

Output

Calculated Value of PI.

Example 1.3:

Compute the value of PI function by Numerical Integration using OpenMP PARALLEL directive. Analyze the Performance using thread Visualization tool.
(Download source code : tools-omp-pi-calculation.c / tools-omp-pi-calculation.f )

Objective

Write an OpenMP program to compute the value of PI by numerical integration of a function f(x) = 4/(1+x*x ) between the limits 0 and 1 using OpenMP PARALLEL directive.

Description

There are several approaches to parallelizing a serial program. One approach is to partition the data among the threads. That is we partition the interval of integration [0,1] among the threads, and each thread estimates local integral over its own subinterval. The local calculations produced by the individual threads are combined to produce the final result. To perform this integration numerically, divide the interval from 0 to 1 into n subintervals and add up the areas of the rectangles as shown in the Figure 1 (n = 5). Large values of n give more accurate approximations of PI value.

Fig. 1 : Numerical Integration of PI function

In this program OpenMP PARALLEL FOR directive, and CRITICAL section is used. The CRITICAL directive specifies a region of program that must be executed by only one thread at a time. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.

Input

Number of threads and Number of intervals.

Output

Computed value of pie and time taken for the computation.

Example 1.4:

Write a MPI program to compute the value of PI function by numerical integration of a function f(x) = 4/(1+x² ) between the limits 0 and 1 using MPI point-to-point communication library calls. Analyze the Performance using MPI Visualization tool.
(Download source code : tools-numint-pie-pt-to-pt.c)

Objective

Write a MPI program to compute the value of pie function by numerical integration of a function f(x) = 4/(1+x²) between the limits 0 and 1 using MPI point-to-point communication library calls.

Description

In this example, partition the interval of integration [0,1] among the processes is done, and each process estimates local integral over its own subinterval. The local calculations produced by the individual processes are combined to produce the final result. Each process sends its integral to process 0, which adds them and prints the result.

To perform this integration numerically, divide the interval from 0 to 1 into n subintervals and add up the areas of the rectangles as shown in the Figure 2 (n = 5). Large values of n give more accurate approximations of pi . Use MPI point-to-point communication library calls.

Figure 2 Numerical integration of pie function

We assume that n is total number of subintervals, p is the number of processes and p < n. One simple way to distribute the total number of subintervals to each process is to divide n by p. There are two kinds of mappings that balance the load. One is a block mapping, partitions the array elements into blocks of consecutive entries and assigns the block to the processes. The other mapping is a cyclic mapping. It assigns the first element to the first process, the second element to the second, and so on. If n > p, we get back to the first process, and repeat the assignment process for remaining elements. This process is repeated until all the elements are assigned. We have used a cyclic mapping for partition of interval [0,1] onto p processes.

Input

Process with rank0 reads the input parameter n, the number of intervals on command line.

Output

Process with rank 0 prints the computed value of pi function.

Example 1.5:

Write an MPI-Pthreads program to calculate Infinity norm of a matrix using block striped partitioning with row wise data distribution. Analyze the Performance using thread Visualization tool.
(Download source code : tools-mpi-pthreads-infinity-norm.c )

Objective

Write an MPI-Pthreads program to calculate Infinity norm of a matrix using block striped partitioning with row wise data distribution using p processe and t threads

Description

Infinity Norm of a Matrix: The Row-Wise infinity norm of a matrix is defined to be the maximum of sums of absolute values of elements in a row, over all rows. After the initial validity checks, each process reads the input matrix and determines the number of rows to be operated by it. Using its rank, each process determines the specific rows to be operated by it. After the distribution of rows, the main thread on each process creates the child threads as the number of rows it is to operate. Each thread operates on the specified row and calculates the sum of the absolute values of the elements and updates a common variable. After all the threads on a process complete their share of work, the common variable holds the maximum value of the rows assigned to it. Finally, the Root process determines the infinity norm using a Collective MPI call, MPI_Reduce and prints the value.

Input

The input file holding the square matrix

Output

Infinity Norm of the Matrix Value

Example 1.6:

Write a MPI program to compute the matrix into matrix multiplication using Checker-Board Partitoning of input Matrices.Analyze the Performance using thread Visualization tool.

Objective

Write an MPI-Pthreads program to compute the Matrix into Matrix multiplication using Checker-Board Partitoning of input Matrices. using p processe and t threads

Description

In checkerboard partitioning, the matrix is divided into smaller square or rectangular blocks (submatrices) that are distributed among processes. A checkerboard partitioning splits both the rows and the columns of the matrix, so no process is assigned any complete row or column. Like striping partitioning, checkerboard partitioning can be block or cyclic.

Input

The input file holding the two square matrices (Number of Rows, Columns of the matrix)

Output

Output Array : Product of Matrix matrix Multiplication.

Centre for Development of Advanced Computing