



hyPACK-2013 Mode 1 : Programming Using MPI 1.X - Implementing Solutions of Partial Differential Equations using Finite Difference & Finite Element Method computation algorithms

Module 8 : MPI programs on Dense Matrix Computations - Parallel programs implementing the solution of Partial Differential Equations using the Finite Difference / Finite Element Method, to be executed on message passing clusters or multi-core systems that support the MPI library.

Example 8.1

Parallel program implementing the solution of the Poisson equation (PDE) by the Finite Difference Method (one-dimensional decomposition).

Example 8.2

Parallel program implementing the solution of the Poisson equation (PDE) by the Finite Difference Method (two-dimensional decomposition).

Example 8.3

Parallel program implementing the solution of the Poisson equation (PDE) by the Finite Element Method (two-dimensional decomposition).


(Source - References : Books - Multi-threading : [MC-MPI-02], [MCMPI-06], [MCMPI-07], [MCMPI-09], [MC-MPI10], [MCMPI-11], [MCMTh-12], [MCBW-44], [MCOMP-01])

Description of Programs - Solution of PDEs using MPI Library Calls
Example 8.1:



Write an MPI program to implement the solution of the Poisson equation (partial differential equation) by the Finite Difference Method using a one-dimensional decomposition of the mesh
( Download WinRAR ZIP archive:
Poisson (Two Dim) (WinRAR ZIP archive)
  • Objective :

    Write an MPI program to solve the Poisson equation with Dirichlet boundary conditions in two space dimensions by the finite difference method on a structured rectangular grid. Use the Jacobi iteration method to solve the discretized equations.

  • Description:

    In the Poisson problem, the FD method imposes a regular grid on the physical domain. It then approximates the derivative of the unknown quantity u at a grid point by the ratio of the difference in u at two adjacent grid points to the distance between the grid points. In a simple situation, consider a square domain discretized by n x n grid points, as shown in Figure 8.1(a). Assume that the grid points are numbered in row-major order from left to right and from top to bottom, as shown in Figure 8.1(b). This ordering is called natural ordering. Given a total of n^2 points in the n x n grid, this numbering scheme labels the immediate neighbors of point i on the top, left, right, and bottom as i-n, i-1, i+1 and i+n, respectively. Figure 8.1(b) shows the partitioning of the mesh using a one-dimensional decomposition.





      Figure 8.1(a). Finite difference grid: Five point stencil approximation     Figure 8.1(b). One dimensional decomposition for 2-D problem, with n=7
  • Formulation of Poisson Problem :  

    The Poisson problem is expressed by the equations

          Lu  =  f(x, y)  in the interior of domain [0,1] x [0,1]

    Where L is Laplacian operator in two space dimensions.

          u(x,y)  =  g(x,y) on the boundary of the domain [0,1] x [0,1] 

    We define a square mesh (also called a grid) consisting of the points (x_i, y_j), given by 

               x_i = i/(n+1), i = 0, ..., n+1, 

               y_j = j/(n+1), j = 0, ..., n+1, 

    where there are n+2 points along each edge of the mesh. We will find an approximation to u(x, y) only at the points (x_i, y_j). Let u_{i,j} be the approximate solution to u at (x_i, y_j), and let h = 1/(n+1). Now, we can approximate at each of these points with the formula 

       (u_{i-1,j} + u_{i,j+1} + u_{i,j-1} + u_{i+1,j} - 4u_{i,j}) / h^2 = f_{i,j}. 

    Rewriting the above equation as 
       u_{i,j} = (1/4)(u_{i-1,j} + u_{i,j+1} + u_{i,j-1} + u_{i+1,j} - h^2 f_{i,j}), 

    the Jacobi iteration method is employed to obtain the final solution: starting with an initial guess u_{i,j}^k for k = 0 at all mesh points, the following equation is solved iteratively until the solution converges: 

       u_{i,j}^{k+1} = (1/4)(u_{i-1,j}^k + u_{i,j+1}^k + u_{i,j-1}^k + u_{i+1,j}^k - h^2 f_{i,j}). 

    The resulting discretized banded matrix is shown in Figure 8.2.
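
    The update above translates directly into a sweep over the interior points. A minimal serial sketch in C is given below, assuming the arrays u, unew and f are stored as (n+2) x (n+2) blocks with the boundary values already set; the function and variable names are illustrative only.

        /* One Jacobi sweep over the interior of an (n+2) x (n+2) grid.
           u holds the current iterate, unew receives the new iterate,
           f holds the right-hand side, and h = 1/(n+1). */
        void jacobi_sweep(int n, double h,
                          double u[n+2][n+2], double unew[n+2][n+2],
                          double f[n+2][n+2])
        {
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    unew[i][j] = 0.25 * (u[i-1][j] + u[i][j+1] +
                                         u[i][j-1] + u[i+1][j] -
                                         h * h * f[i][j]);
        }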

  • Parallelisation strategy :

    To parallelize this algorithm, the physical domain is sliced into slabs, with the computations on each slab being handled by a different process. We have used block row-wise strip partitioning as shown in Figure 8.1(a). Each process gets its own partition as shown in Figure 8.1(b). To perform the computation in each partition, we need to obtain grid point information from adjacent processes to compute the values u_{i,j}^k at iteration k for all mesh points u_{i,j} near the partition boundary. The elements of the array that are used to hold data from other processes are called ghost points.

             
    Figure 8.2   A 16 x 16 block-tridiagonal matrix.

    In a simple situation, each process sends data to the process on top and then receives data from the process below it. The order is then reversed, and data is sent to the process below and received from the process above. Although this strategy is simple, it is not necessarily the best way to implement the exchange of ghost points. MPI topologies can be used to describe the arrangement of processes and determine the decomposition of the arrays. MPI allows the user to define a topology particular to the application, or virtual topology. An important virtual topology is the Cartesian topology, which is simply a decomposition in the natural coordinate (e.g. x, y, z) directions. In the implementation of this algorithm, row and column MPI communicators are created.

    The iteration is terminated when the difference between successively computed solutions at all grid points is strictly less than a prescribed tolerance. The routine MPI_Allreduce is used to ensure that all processes compute the same value for the difference over all of the elements.
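
    As an illustration, this convergence test can be written with a single MPI_Allreduce. The sketch below assumes each process owns rows i_start to i_end of its local arrays u and unew; these names and the helper function itself are illustrative, not part of the distributed program described above.

        #include <math.h>
        #include <mpi.h>

        /* Returns 1 on every process when the global difference between
           successive iterates falls below tol. The loops run only over
           the rows owned by this process. */
        int converged(int n, int i_start, int i_end,
                      double u[][n+2], double unew[][n+2], double tol)
        {
            double local_diff = 0.0, global_diff;
            for (int i = i_start; i <= i_end; i++)
                for (int j = 1; j <= n; j++) {
                    double d = unew[i][j] - u[i][j];
                    local_diff += d * d;
                }
            /* Every process obtains the same global value, so all
               processes take the same decision about stopping. */
            MPI_Allreduce(&local_diff, &global_diff, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
            return sqrt(global_diff) < tol;
        }
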
    For higher-order approximations, involving more than four neighbors in the approximation of the unknown variable or its derivatives in the PDE, the communication at the interface of each subdomain may increase. In this case, more interface information may be required from the remote processes.

    More elaborate problems and algorithms often have the same communication structure that we will use here to solve this problem. Thus, by studying how MPI can be used here, we are providing fundamentals on how communication patterns appear in more complex PDE problems. At the same time, we can demonstrate a wide variety of message-passing techniques and how MPI may be used to express them. We emphasize that the Poisson problem is a useful example for describing the features of MPI that can be used in solving PDEs and other problems that involve decomposition across many processes.

    The parallel algorithm used in this model is not efficient and may give poor performance relative to more recent and sophisticated, freely available parallel solvers for PDEs that use MPI. This example was chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but because it illustrates the basic MPI send and receive operations and MPI topologies in the context of a fundamental type of parallel algorithm applicable in many situations.

  • Input
  • The user specifies on the command line the total number of mesh points along each side of the square mesh (also called the grid) and the number of iterations for convergence of the method. There are n+2 grid points on each side of the square mesh. The process with rank 0 performs all the input: grid point neighbors, coordinate values of the nodes, the number of boundary nodes, Dirichlet boundary condition information on the boundary nodes, and the initial guess solution.


    The program automatically generates all necessary input data; the user has to specify only the total number of grid points. To keep the program as simple as possible to follow, error conditions are not checked in the MPI library calls. In general one should check these, and if a call fails, the program should exit gracefully.

    Various special MPI library calls can be used for the solution of the Poisson equation by the finite difference method. A good choice of decomposition depends on the details of the underlying hardware. MPI allows the user to define a topology particular to the application, or virtual topology. An important virtual topology is the Cartesian topology, which is simply a decomposition of a grid.

    Advanced point-to-point communication library calls such as MPI_Send and MPI_Recv (blocking communication calls), ordered send and receive blocking calls, MPI_Sendrecv (combined send and receive), MPI_Bsend (buffered send), MPI_Isend and MPI_Irecv (non-blocking communication calls) and MPI_Ssend (synchronous send) can be used in the implementation of the parallel program on an IBM AIX cluster. One-dimensional and two-dimensional decompositions of the mesh are used. The Jacobi method is employed as the iterative method.

  • Implementation using MPI advanced Point-to-Point Communications library calls


  • One Dimensional Decomposition of Mesh : 

    We consider a square grid in a two-dimensional region and assume that the number of grid points in the x-direction and y-direction is the same. The actual message passing in our program is carried out by the MPI functions MPI_Send and MPI_Recv. The first sends a message to a designated process; the second receives a message from a process. We have used the simple MPI point-to-point communication calls MPI_Send and MPI_Recv in various programs. We study the advanced point-to-point communication library calls in more detail below.

    Consider an example in which process 0 wants to send a message to process 1, and there is some type of physical connection between 0 and 1. The message envelope contains at least the rank of the receiver, the rank of the sender, a tag and a communicator. A couple of natural questions might occur. What if process 1 is not ready to receive the message? What can process 0 do? There are three possibilities:

    1. Process 0 can stop and wait until process 1 is ready to receive the message,

    2. It can copy the message out of the send buffer into some internal buffer. This buffer may be located on process 0, on process 1, or somewhere else,

    3. It can return from the MPI_Send call, or the call can fail.

    As long as there is space available to hold a copy of the message, the message-passing system should provide this service to the programmer rather than making the process stop dead in its tracks until the matching receive is called. Also, there is no information specifying where the receiving process should store the message. Thus, until process 1 calls MPI_Recv, the system doesn't know where the data that process 0 is sending should be stored. When process 1 calls MPI_Recv, the system software that executes the receive can see which (if any) buffered message has an envelope that matches the parameters specified by MPI_Recv. If there isn't a message, it waits until one arrives. If the MPI implementation does copy the send buffer into internal storage, we say that it buffers the data.

    Different buffering strategies provide different levels of convenience and performance. For large applications that are already using a large amount of memory, even requiring the message-passing system to use all available memory may not provide enough memory to make this code work. Sometimes, the code runs (it does not deadlock) but it does not execute in parallel.

    All of these issues suggest that programmers should be aware of the pitfalls in assuming that the system will provide adequate buffering. If a system has no buffering, then Process 0 cannot send its data until it knows that Process 1 is ready to receive the message, and consequently memory is available for the data. Programming under the assumption that a system has buffering is very common (most systems automatically provide some buffering), although in MPI parlance, it is unsafe. This means that if the program is run on a system that does not provide buffering, the program will deadlock. If the system has no buffering, Process 0's first message cannot be received until Process 1 has signaled that it is ready to receive the first send, and Process 1 will hang while it waits for Process 0 to execute the second send.

    We will describe ways in which the MPI programmer can ensure that the correct parallel execution of a program does not depend on the amount of buffering, if any, provided by the message-passing system.



  • (1) Ordered Send and Receive 
  • We now discuss MPI functions that fail if there is no system buffering, explain why they are unsafe, and provide alternative solutions. First, why are the functions unsafe? Consider the case in which we have two processes, as shown in Table 3.1, and a single exchange of data takes place between process 0 and process 1. If we synchronize the processes before the exchange, we'll have approximately the following sequence of events:

    Time    Process 0          Process 1
    1       MPI_Send to 1      MPI_Send to 0
    2       MPI_Recv from 1    MPI_Recv from 0

    Table 3.1 Process 0 sends a message to Process 1

    As explained earlier, if there's no buffering, the MPI_Send will never return, so process 0 will wait in MPI_Send until process 1 calls MPI_Recv. We've encountered this situation before; it is called deadlock. Two options can make the exchange safe: either reorganize the sends and receives, or let MPI make them safe through the use of different MPI library calls.

    To make the exchange safe by reorganizing the sends and receives, we need to decide who will send first and who will receive first. One of the easiest ways to correct for a dependence on buffering is to order the sends and receives so that they are paired up. That is, the sends and receives are ordered so that if one process is sending to another, the destination will do a receive that matches that send before doing a send of its own. The ordered sequence of events is listed in Table 3.2.

    Time    Process 0          Process 1
    1       MPI_Send to 1      MPI_Recv from 0
    2       MPI_Recv from 1    MPI_Send to 0

    Table 3.2 Process 0 sends a message to Process 1

    So there may be a delay in the completion of one communication, but the program will complete successfully. 
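
    A minimal sketch of this ordering for the ghost-row exchange of the one-dimensional decomposition is given below: the boundary rows are shifted upward, with even-ranked processes sending first and odd-ranked processes receiving first, so that every send meets a posted receive even without buffering. The names first_row, ghost_bottom, top (rank-1, or MPI_PROC_NULL on the first process) and bottom (rank+1, or MPI_PROC_NULL on the last) are illustrative.

        MPI_Status status;

        /* Each process sends its first interior row to the process above
           and receives its bottom ghost row from the process below.
           Pairing the calls by rank parity avoids deadlock. */
        if (rank % 2 == 0) {
            MPI_Send(first_row,    n, MPI_DOUBLE, top,    0, MPI_COMM_WORLD);
            MPI_Recv(ghost_bottom, n, MPI_DOUBLE, bottom, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(ghost_bottom, n, MPI_DOUBLE, bottom, 0, MPI_COMM_WORLD, &status);
            MPI_Send(first_row,    n, MPI_DOUBLE, top,    0, MPI_COMM_WORLD);
        }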



  • (2) Combined Send and Receive  

    The approach of pairing the sends and receives is effective but can be difficult to implement when the arrangement of processes is complex (for example, with an irregular grid). An alternative is to use special MPI calls, not worrying about deadlock from a lack of buffering.  A simpler solution is to let MPI take care of the problem. The function MPI_Sendrecv, as its name implies, performs both send and receive, and it organizes them so that even in systems with no buffering the calling program won't deadlock, at least not in the way that the MPI_Send / MPI_Recv implementation deadlocks. The syntax of MPI_Sendrecv is 

          int MPI_Sendrecv(void* send_buf, int send_count, MPI_Datatype send_type,
              int destination, int send_tag, void* recv_buf, int recv_count,
              MPI_Datatype recv_type, int source, int recv_tag, MPI_Comm comm,
              MPI_Status *status)


    Notice that the parameter list is basically just a concatenation of the parameter lists for the MPI_Send and MPI_Recv. The only difference is that the communicator parameter is not repeated. The destination and the source parameters can be the same. The "send" in an MPI_Sendrecv can be matched by an ordinary MPI_Recv, and the "receive" can be matched by an ordinary MPI_Send. The basic difference between a call to this function and MPI_Send followed by MPI_Recv (or vice versa) is that MPI can try to arrange that no deadlock occurs since it knows that the sends and receives will be paired.

    Recollect that MPI doesn't allow a single variable to be passed to two distinct parameters if one of them is an output parameter. Thus, we can't call MPI_Sendrecv with send_buf equal to recv_buf. Since it is very common in paired send/receive operations to use the same buffer, MPI provides a variant, MPI_Sendrecv_replace, that does allow us to use a single buffer. Note that this implies some system buffering is required.

    int MPI_Sendrecv_replace(void* buffer, int count, MPI_Datatype datatype,
        int destination, int send_tag, int source, int recv_tag, MPI_Comm comm,
        MPI_Status *status)
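
    As an illustration, the ghost-row exchange for the one-dimensional decomposition can be written with two MPI_Sendrecv calls, one per direction. The sketch assumes each process stores m local rows (plus two ghost rows) of length n in a local array u, and that top and bottom are the neighbouring ranks, with MPI_PROC_NULL at the ends; all of these names are illustrative.

        MPI_Status status;

        /* Send my first interior row to the process above and receive
           the bottom ghost row from the process below. */
        MPI_Sendrecv(&u[1][1],   n, MPI_DOUBLE, top,    0,
                     &u[m+1][1], n, MPI_DOUBLE, bottom, 0,
                     MPI_COMM_WORLD, &status);

        /* Send my last interior row to the process below and receive
           the top ghost row from the process above. */
        MPI_Sendrecv(&u[m][1], n, MPI_DOUBLE, bottom, 1,
                     &u[0][1], n, MPI_DOUBLE, top,    1,
                     MPI_COMM_WORLD, &status);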


  • (3) Buffered Send and Receive 
  • Instead of requiring the programmer to determine a safe ordering of the send and receive operations, MPI allows the programmer to provide a buffer into which data can be placed until it is delivered (or at least until it leaves the buffer). The change in the exchange routine is simple; one just replaces the MPI_Send calls with MPI_Bsend. If the programmer allocates some memory (buffer space) for temporary storage on the sending process, we can perform a type of non-blocking send. The routines for buffered send mode are:

    int MPI_Bsend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag,
        MPI_Comm comm, MPI_Request *request)


    Builds a handle for a buffered send

    int MPI_Bsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag,
        MPI_Comm comm)


    In addition to the change in the exchange routine, MPI requires that the programmer provide the storage into which the message may be placed with the routine MPI_Buffer_attach. This buffer should be large enough to hold all of the messages.

    Attaches a user-specified buffering for sending
    int MPI_Buffer_attach(void* buf, int size)

    Removes an existing buffer (for use with MPI_Bsend)
    int MPI_Buffer_detach(void** buf, int* size)
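
    A minimal sketch of the buffered-send sequence is shown below, assuming one ghost row of n doubles is outstanding at a time; the constant MPI_BSEND_OVERHEAD must be added to the buffer size for each pending message. The names first_row and top are illustrative.

        /* Allocate and attach a buffer big enough for one outstanding
           message of n doubles plus the per-message overhead. */
        int   bufsize = n * sizeof(double) + MPI_BSEND_OVERHEAD;
        char *buffer  = (char *) malloc(bufsize);
        MPI_Buffer_attach(buffer, bufsize);

        /* The send copies the row into the attached buffer and returns,
           whether or not the matching receive has been posted yet. */
        MPI_Bsend(first_row, n, MPI_DOUBLE, top, 0, MPI_COMM_WORLD);

        /* ... the matching MPI_Recv calls for the ghost rows go here ... */

        /* Detaching waits until the buffered messages have been
           delivered; the memory can then be freed. */
        MPI_Buffer_detach(&buffer, &bufsize);
        free(buffer);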

  • (4) Non-blocking Send and Receive 

    Another approach involves using communications operations that do not block but permit the program to continue to execute instead of waiting for communications to complete. However, the ability to overlap computation with communication is just one reason to consider the non-blocking operations in MPI; the ability to prevent one communication operation from preventing others from finishing is just as important. Non-blocking communication is explicitly designed to meet these needs.

    A non-blocking SEND or RECEIVE does not wait for the message transfer to actually complete, but returns control to the calling process immediately. The process is then free to perform any computation that does not depend on the completion of the communication operation. Later in the program, the process can check whether or not a non-blocking SEND or RECEIVE has completed and, if necessary, wait until it completes. The syntax of the non-blocking communication calls is given below. A call to a non-blocking send or receive simply starts, or posts, the communication operation. It is then up to the user program to explicitly complete the communication at some later point in the program. Thus, any non-blocking operation requires a minimum of two function calls: a call to start the operation and a call to complete the operation.

    The basic functions in MPI for starting non-blocking communications are MPI_Isend and MPI_Irecv. The "I" stands for "immediate," i.e., they return (more or less) immediately. Their syntax is very similar to the syntax of MPI_Send and MPI_Recv ;

          int MPI_Isend(void* buffer, int count, MPI_Datatype datatype, int destination,
              int tag, MPI_Comm comm, MPI_Request* request)


    and

         int MPI_Irecv(void* buffer, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request* request)

    The parameters that they have in common with MPI_Send and MPI_Recv have the same meaning. However, the semantics are different. Both calls only start the operation. For MPI_Isend this means that the system has been informed that it can start copying data out of the send buffer (either to a system buffer or to the destination). For MPI_Irecv, it means that the system has been informed that it can start copying data into the buffer. Neither the send nor the receive buffer should be modified until the operation is explicitly completed or cancelled.

    The purpose of the request parameter is to identify the operation started by the non-blocking call, so it contains information such as the source or destination, the tag, the communicator, and the buffer. When the non-blocking operation is completed, the request initialized by the call to MPI_Isend or MPI_Irecv is used to identify the operation to be completed.

    There are a variety of functions that MPI uses to complete nonblocking operations. The simplest of these is MPI_Wait. It can be used to complete any nonblocking operation.

           int MPI_Wait(MPI_Request* request,MPI_Status* status)

    The request parameter corresponds to the request parameter returned by MPI_Isend or MPI_Irecv. MPI_Wait blocks until the operation identified by request completes: if it was a send, either the message has been sent or it has been buffered by the system; if it was a receive, the message has been copied into the receive buffer. When MPI_Wait returns, request is set to MPI_REQUEST_NULL, which means that there is no pending operation associated with the request. If the call to MPI_Wait is used to complete an operation started by MPI_Irecv, the information returned in the status parameter is the same as the information returned in status by a call to MPI_Recv.

    Finally, it should be noted that it is perfectly legal to match blocking operations with non-blocking operations. For example, a message sent with MPI_Isend can be received by a call to MPI_Recv.
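
    A minimal sketch of the ghost-row exchange written with non-blocking calls is given below; the receives are posted first, computation that touches neither the ghost rows nor the boundary rows could be overlapped, and MPI_Waitall (a convenience that performs an MPI_Wait-style completion on the whole array of requests) finishes the four operations. The layout of u and the names m, n, top and bottom follow the earlier illustrative sketches.

        MPI_Request req[4];
        MPI_Status  stat[4];

        /* Post the receives for the two ghost rows first. */
        MPI_Irecv(&u[0][1],   n, MPI_DOUBLE, top,    0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[m+1][1], n, MPI_DOUBLE, bottom, 1, MPI_COMM_WORLD, &req[1]);

        /* Start sending my boundary rows to the two neighbours. */
        MPI_Isend(&u[1][1], n, MPI_DOUBLE, top,    1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[m][1], n, MPI_DOUBLE, bottom, 0, MPI_COMM_WORLD, &req[3]);

        /* ... computation on interior points that need no ghost data
           can be overlapped here ... */

        /* Complete all four operations before the ghost rows are used. */
        MPI_Waitall(4, req, stat);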

  • (5). Communication Modes :

    We discussed the idea of safety in MPI. A program is safe if it will produce correct results even if the system does not provide buffering. Most programmers of message-passing systems expect the system to provide some buffering and, as a consequence, they routinely write unsafe programs. If we are writing a program that must be portable to any system, we can guarantee the safety of our program in two ways. First, we can reorganize our communications so that the program will not deadlock even if sends cannot complete until a matching receive is posted.

    The other solution is to provide our own buffering. In either case, we are, effectively, changing the communication mode.
    There are four communication modes in MPI: standard, buffered, synchronous, and ready. They correspond to four different types of send operations. There is only a standard mode for receive operations. In standard mode, it is up to the system to decide whether messages should be buffered.

    Synchronous Mode

    In synchronous mode a send won't complete until a matching receive has been posted and the matching receive has begun receiving the data. To ensure that a program does not depend on buffering, MPI provides the routine MPI_Ssend. In many applications, it is possible to show that if the program runs successfully with no buffering, it will run with any amount of buffering. MPI provides a way to send a message such that the send does not return until the destination begins to receive the message. MPI provides three synchronous mode send operations:

         int MPI_Ssend(void* buffer, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm comm) 

         int MPI_Issend(void* buffer, int count, MPI_Datatype datatype, int destination,
              int tag, MPI_Comm comm, MPI_Request* request)

         int MPI_Ssend_init(void* buffer, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm comm, MPI_Request* request)


    Their effect is much the same as the corresponding standard mode sends. However, MPI_Ssend and the waits corresponding to MPI_Issend and MPI_Ssend_init will not complete until the corresponding receives have started. Thus, synchronous mode sends require no system buffering, and we can be assured that our program is safe if it runs correctly using only synchronous mode sends.
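
    One common way to exploit this property is to build a test version of the program in which every standard send is replaced by a synchronous send; if the program still completes correctly, it does not depend on buffering. A hedged sketch using an illustrative preprocessor switch:

        /* Compile with -DCHECK_SAFETY to replace the standard sends by
           synchronous sends; a run that still completes shows that the
           program does not rely on system buffering. */
        #ifdef CHECK_SAFETY
        #define EXCHANGE_SEND MPI_Ssend
        #else
        #define EXCHANGE_SEND MPI_Send
        #endif

        EXCHANGE_SEND(first_row, n, MPI_DOUBLE, top, 0, MPI_COMM_WORLD);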



    Ready Mode

    On some systems it's possible to improve the performance of a message transmission if the system knows, before a send has been initiated, that the corresponding receive has already been posted. For such systems, MPI provides the ready mode. The parameter lists of the ready sends are identical to the parameter lists for the corresponding standard sends:

         int MPI_Rsend(void* buffer, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm comm) 

         int MPI_Irsend(void* buffer, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm comm, MPI_Request* request)

         int MPI_Rsend_init(void* buffer, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm comm, MPI_Request* request)


    The only difference between the semantics of the various ready sends and the corresponding standard sends is that the ready sends are erroneous if the matching receive hasn't been posted, and the behavior of an erroneous program is unspecified.

    We have discussed in detail the use of various MPI point-to-point communication library calls for writing safe MPI programs. The different MPI library calls used in each iteration of the Jacobi method may give different performance on a given parallel computer.

    We emphasize that the Poisson problem is a useful example for describing the features of MPI that can be used in solving PDEs and other problems that involve decomposition across many processes. The parallel algorithm used in this model is not efficient and may give poor performance relative to more recent and sophisticated, freely available parallel solvers for PDEs that use MPI.

    This example was chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but because it illustrates the basic MPI send and receive operations and MPI topologies in the context of a fundamental type of parallel algorithm applicable in many situations.

  • Example 8.2:



    Write an MPI program to implement the solution of the Poisson equation (partial differential equation) by the Finite Difference Method using a two-dimensional decomposition of the mesh
    ( Download WinRAR ZIP archive:
    Poisson (Two Dim) (WinRAR ZIP archive)
  • Objective :

    Write an MPI program to implement the solution of the Poisson equation (partial differential equation) by the Finite Difference Method using a two-dimensional decomposition of the mesh.

  • Two Dimensional Decomposition of Mesh 

    We use a higher-dimensional decomposition to partition the domain and find the solution to the Poisson problem using Jacobi iteration. We let MPI compute the decomposition of the processes over the domain using MPI_Cart_create.

    In the earlier programs, the messages have consisted of contiguous areas in memory, so the basic datatypes such as MPI_INTEGER and MPI_DOUBLE_PRECISION, accompanied by a count, have been sufficient to describe the messages. Here we use MPI's derived datatypes, which allow us to describe noncontiguous data in memory.

    In the earlier iteration scheme, we used MPI_Send and MPI_Recv to exchange data between the processes above and below. This strategy is simple, but it may pose problems if the parallel system doesn't have adequate buffering. Consequently, it is not necessarily the best way to implement the exchange of ghost points. The simplest way to correct for a dependence on buffering is to order the sends and receives so that they are paired up. That is, the sends and receives are ordered so that if one process is sending to another, the destination will do a receive that matches that send before doing a send of its own.

    The approach of pairing the sends and receives is effective, but can be difficult to implement when the arrangement of processes is complex. An alternative is to use MPI_Sendrecv. This routine allows you to send and receive data without worrying about deadlock from a lack of buffering. Each process then sends to the process below it and receives from the process above it, as in the sketch below.
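
    A minimal sketch of setting up the two-dimensional decomposition is given below: MPI_Dims_create and MPI_Cart_create build the Cartesian virtual topology, MPI_Cart_shift finds the four neighbours, and a derived datatype built with MPI_Type_vector describes the non-contiguous column of ghost cells so that a column exchange is a single MPI_Sendrecv. The local block size nx x ny (stored with one layer of ghost cells as u[nx+2][ny+2]) and the variable names are illustrative assumptions.

        int dims[2] = {0, 0}, periods[2] = {0, 0};
        int nprocs, up, down, left, right;
        MPI_Comm cart;

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);              /* choose a 2-D process grid */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        MPI_Cart_shift(cart, 0, 1, &up,   &down);      /* neighbours along rows     */
        MPI_Cart_shift(cart, 1, 1, &left, &right);     /* neighbours along columns  */

        /* A local column is strided: nx elements, one every ny+2 entries. */
        MPI_Datatype column;
        MPI_Type_vector(nx, 1, ny + 2, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* Exchange one ghost column: send my first interior column to the
           left neighbour, receive the right ghost column from the right. */
        MPI_Status status;
        MPI_Sendrecv(&u[1][1],    1, column, left,  0,
                     &u[1][ny+1], 1, column, right, 0,
                     cart, &status);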

  • Output :

    The program prints the progress of the iterations, the final iteration count, the computed solution and the difference between successive solutions on the process with rank 0.

  • Example 8.3:


    Write an MPI program to implement the solution of the Poisson equation (partial differential equation) by the Finite Element Method using a graph partitioning method
  • Objective :

    You will develop an MPI program to solve the Poisson equation with Dirichlet boundary conditions in two space dimensions by the Finite Element (FE) method using triangular elements. You will use the conjugate gradient iterative method to solve the assembled system of equations and graph partitioning algorithms for unstructured mesh decomposition.

  • Background
  • A significant number of application problems in various scientific fields involve the solution of coupled partial differential equation (PDE) systems in complex geometries by the Finite Element (FE) method. In the FE discretization of a simple two-dimensional region, if the complete region is decomposed into the required subdomains, each group of cells within a subdomain can be processed in parallel. An optimal decomposition is one which minimizes the communication bandwidth of the problem. An optimal decomposition of the mesh can be obtained by graph partitioning algorithms. The graph partitioning algorithm is employed to create nearly load-balanced subdomains and to minimize interprocessor communication. In all FE computations, the computation of elementary matrices and the solution of the associated matrix systems are compute intensive. These computations are repeated for transient FE analysis. The performance of the application depends on how effectively these computations can be done on the parallel machine. Since many such systems evolve with time, time forms an additional dimension for these computations. Even for a small number of grid points, a three-dimensional coordinate system, and a reasonable discretized time step, most of the models involve trillions of operations.

    Also, suppose that a parallel explicit or explicit-like algorithm is to be implemented on an MIMD computer. It could be, for example, an iterative solver for the Poisson equation, or an explicit time integration algorithm for transient analysis. Typically, these computations involve matrix-vector products and inner products, which can be evaluated in parallel. This can give good performance and minimize storage requirements when solving large partial differential equation problems.

    On the other hand, if an implicit-like algorithm is used to solve the PDE, then one has to solve the matrix system of equations A x = b in parallel. In this case, a higher level of parallelism is obtained by treating the interface nodes as a separate entity and numbering the unknowns so that all subdomains can be processed in parallel after the interface problem is solved.

    The major difficulty is in the solution of the matrix system A x = b, which is due to the sparseness of the matrix A. Direct or iterative methods can be used to solve the matrix system of equations. Both have advantages and disadvantages from the parallel processing point of view. Direct methods often pose problems due to memory limitations for large matrices on parallel computers. Iterative methods require less memory and allow one to solve larger systems. However, the parallelisation of these algorithms for various types of matrices may pose additional difficulties.

  • The Poisson Problem :

    The Poisson problem is a simple partial differential equation (PDE) that is at the core of many applications. In this technique, an approximate solution to a PDE governing the behaviour of a physical system is obtained.

           ∇²u  =  f(x, y)  in the interior of the domain
           u(x, y)  =  g(x, y)  on the boundary of the domain

    One of the widely adopted techniques for unstructured grid generation is the automatic triangulation, which involves algorithmic division of the physical region into nearly triangular elements with specific properties by mesh generation algorithms.



    Figure 8.3(a). A typical finite element mesh in two dimensions.     Figure 8.3(b). Finite element discretization of a two-dimensional mesh.


    The given two-dimensional region is discretized into a collection of preselected finite elements, say triangular elements, as shown in Figure 8.3(a)-(b). The triangular element with three nodal points is considered for the finite element approximation. We use piecewise linear basis functions for the FE approximation and derive element equations for typical elements in the mesh. The Galerkin method has been employed to obtain the weak (integrated) formulation of the Poisson system of equations.

    The elementary matrix and elementary vector are computed for each element; then the assembly of the elementary matrices is performed and the boundary conditions are imposed at the appropriate grid points on the boundary of the domain, resulting in a non-singular matrix system of linear equations. The matrix system of equations is then solved to obtain the solution for the unknown variables in the entire domain.

  • Parallelisation strategy
  • Step 1 : We first generate the finite element mesh and construct the dual (geometrical) graph of the finite element mesh. In the dual graph, each node represents a triangular element and each edge represents a common side shared by two triangles.

    The graph partitioning algorithm is employed to create nearly load-balanced subdomains and to minimize interprocessor communication. The multilevel recursive bisection algorithm can be used for graph partitioning. The FE mesh decomposition problem can be modelled as a graph partitioning problem in which the vertices of a graph are divided into a specified number of subsets such that few edges join two vertices in different subsets. A partition of the graph into subgraphs leads to a decomposition of the data and/or tasks associated with a computational problem, and the subgraphs can then be mapped to the processors of a multiprocessor.

    Initially, the unstructured triangular mesh generation, followed by the graph partitioning algorithm, is performed on one of the processes, say 0, resulting in n subdomains for p (= n) processes. Figures 8.4(a)-(b) show a finite element mesh on an irregular region and the partitioning of that mesh.

    Figure 8.4(a). A typical Finite element mesh in two space dimensions       Figure 8.4(b). Mesh partitioning into eight subdomains    


    Step 2 : After the master process 0 sends the necessary mesh data and boundary data of the k-th subdomain to the k-th process using MPI_Bcast, the elementary matrices are computed in parallel.

    Step 3 : A partial assembly of elementary matrices in the interior of each subdomain is done on each process, and triangles near the interface are marked so that their computations can be done later. This requires interprocessor communication.

    Step 4 : Each subdomain selects a specified number of equations of the assembled matrix to perform the matrix solution. Each subdomain considers its internal nodes and chooses nodes on the interface of the subdomain if certain criteria are satisfied. The selection of criteria is very important in order to ensure load balance among the various subdomains. A simple criterion, such as identifying an independent set of nodes on the interface for each subdomain, can be implemented; this may cause load imbalance in the application. Point-to-point communication across the neighbouring processes takes place. Repeated communication with neighbours should be avoided.

    Now, every process has its set of required rows of the matrix. Also, a part of the right-hand-side vector is available on each process.

    Step 5 : Identify the boundary nodes on every processor and apply boundary conditions to obtain resultant nodal equations on every processor.

    Step 6 : Knowing the initial guess value of the solution variable, employ the parallel conjugate gradient (CG) method to solve the resultant matrix system of equations. In fact, every process has a block of the matrix which corresponds to its sub-domain. The matrix-vector and vector-vector computations in the CG method play a vital role for performance, and this depends on the matrix structure (i.e. dense or sparse). A sketch of the communication inside the CG iteration is given after these steps.

    Step 7 : Process 0 prints the computed solution of unknown variables in the entire mesh.
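
    The interprocess communication inside the CG iteration appears in the inner products (and in the distributed matrix-vector product, which needs the interface values discussed in Step 4). A hedged sketch of the distributed inner product is given below; local_n is the number of rows owned by this process and all names are illustrative.

        /* Distributed dot product: each process sums over its own rows,
           and MPI_Allreduce forms the same global value on every process. */
        double dot(const double *x, const double *y, int local_n)
        {
            double local = 0.0, global;
            for (int i = 0; i < local_n; i++)
                local += x[i] * y[i];
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
            return global;
        }

        /* Inside one CG iteration, with q = A*p formed from the local
           block plus interface values from neighbouring subdomains:
               alpha = dot(r, r, local_n) / dot(p, q, local_n);
           followed by the usual updates of x, r and p. */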

  • Remarks
  • More elaborate problems and algorithms often have the same communication structure that we use here to solve this problem. Thus, by studying how MPI can be used here, we are providing fundamentals on how communication patterns appear in more complex PDE problems. At the same time, we can demonstrate a wide variety of message-passing techniques and how MPI may be used to express them. The parallel algorithm used in this model is not efficient and may give poor performance relative to more recent and sophisticated, freely available parallel solvers for PDEs that use MPI. This example was chosen not because it illustrates the best way to parallelize this particular numerical computation (it doesn't), but because it illustrates the basic MPI send and receive operations and MPI topologies in the context of a fundamental type of parallel algorithm applicable in many situations.

  • Input
  • Finite element mesh input such as the mesh connectivity, coordinate values of the nodes, the number of Dirichlet boundary nodes and the initial guess solution is given on the master process. Process 0 reads the data from a data file. This data must strictly adhere to the following format. Lines one, two and three consist of the integers npoint, nelem and nbnode, which represent the number of nodes, triangles and boundary nodes. After a blank line, the next nelem lines contain the triangle number and the nodal connectivity of its three nodes in anti-clockwise order. The next npoint lines contain the node number and its x, y coordinate values. The next nbnode lines contain a global node number (integer) and its boundary value.

  • Output
  • The program should print the progress of the iterations, the final iteration count, and the computed solution with node numbers for the final iteration on the master process with rank 0. Process 0 prints the computed solution of the unknown variables for the entire mesh.
