C-DAC,Pune : High-Perf. Comp. Frontier Technologies Exploration Group and CMSD, University of Hyderabad, Technology Workshop hyPACK (October 15-18), 2013

The parallel conjugate gradient algorithm involves the following type of computations and communications

Partitioning of a matrix : The matrix A is obtained by discretization of partial differential equations by finite element, or finite difference method. In such cases, the matrix is either sparse or banded. Consequently, the partition of the matrix onto p processes play a vital role for performance. For, simplicity , we assume that A is symmetric positive definite and is rowwise block-striped partitioned.

Scalar Multiplication of a vector and addition of vectors :
Each of these computations can be performed sequentially regardless of the preconditioner and the type of coefficient matrix. If all vectors are distributed identically among the processes, these steps require no communication in a parallel implementation.

Vector inner products :
In some situations, partial vectors are available on each processes. MPI Collective library calls are necessary to perform vector inner products If the parallel computer supports fast reduction operations, such as optimized MPI, then the communication time for the inner-product calculations can be made minimum.

Matrix-vector multiplication :
The computation and the communication cost of the matrix-vector multiplication; depends on the structure of the matrix A. The parallel implementation of the PCG algorithm for three cases one in which A is a block-tridiagonal matrix of the type, two in which it is banded unstructured sparse matrix, and three in which the matrix is sparse give different performance on parallel computers. Various parts of the algorithm in each of the three cases dominate in terms of communication overheads.

Solving the preconditioned system :
The PCG algorithm solves system of linear equations in each iteration The preconditioner C is chosen so that solving the system modified system is in expensive compared to solving the original system of equations Ax = b. Nevertheless, preconditioning increases the amount of computation in each iteration. For good preconditioners, however, the increase is compensated by a reduction in the number of iterations required to achieve acceptable convergence. The computation and the communication requirements of this step depends on the type of preconditioner used. preconditioning method such as diagonal preconditioning, in which the preconditioning matrix C has nonzero elements only along the principle diagonal does not involve any communication Also, Incomplete Cholesky (IC) preconditioning, in which C is based on incomplete Cholesky factorization of A and it may involve different computations and communications in parallel implementation.

The convergence of CG method iterations performed by checking the error criteria i.e. eulicidean norm of the residual vector should be less than prescribed tolerance. This convergence check involves gathering of real value from all processes, which may be very costly operation.

We consider parallel implementations of the PCG algorithm using diagonal preconditioner for dense coefficient matrix type. As we will see, if C is a diagonal preconditioner, then solving the modified system does not require any interprocessor communication. Hence, the communication time in a CG iteration with diagonal preconditioning is the same as that in an iteration of the unpreconditioned algorithm.

Thus the operations that involve any communication overheads are computation of inner products, matrix-vector multiplication and, in case of IC preconditioner solving the system.