



hyPACK-2013 : Prog. on Heterogeneous Comp. Platforms : OpenCL

The Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the heterogeneous platform. OpenCL provides an opportunity for developers to effectively use the multiple heterogeneous compute resources on their CPUs, GPUs and other processors. Developers want to be able to use these resources together, not just offload specific tasks to the GPU, and the OpenCL programming environment addresses this need. OpenCL supports a wide range of applications, ranging from embedded and consumer software to HPC solutions, through a low-level, high-performance, portable abstraction. The technical content is developed using several technical reports, books and other web sites, as given in the References.

List of Programs :               (Download README & Makefile)

Courtesy : References & Web-Pages : GPGPU & GPU Computing Web-sites

The CLInfo program uses the clGetPlatformInfo() and clGetDeviceInfo() commands to print detailed information about the OpenCL-supported platforms and devices in a system. Hardware details such as memory sizes and bus widths are available using these commands. A snippet of the output from the CLInfo program is obtained using the command $ ./CLInfo
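A minimal host-side sketch of such a query is given below; it examines only the first platform and the first device, ignores error codes for brevity, and assumes the program is linked against the OpenCL library (e.g. with -lOpenCL).

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   device;
        char           name[256];
        cl_ulong       globalMemSize;

        /* Take the first available platform and the first device on it */
        clGetPlatformIDs(1, &platform, NULL);
        clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform : %s\n", name);

        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(globalMemSize), &globalMemSize, NULL);
        printf("Device   : %s, Global memory : %lu bytes\n",
               name, (unsigned long)globalMemSize);
        return 0;
    }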



Programs on Single GPU

Example 1.1 Write a simple OpenCL program for multiplication of a scalar value and a vector using global memory (using the keyword __global)
Example 1.2 Write a simple OpenCL program for multiplication of a scalar value and a matrix using global memory (using the keyword __global)
Example 1.3 Write an OpenCL program to compute vector-vector addition using global memory
Example 1.4 Write an OpenCL program to compute matrix-matrix addition using global memory
Example 1.5 Write an OpenCL program to find the total number of work-items in the x- and y-dimensions of the NDRange (assume that the OpenCL kernel is launched with a two-dimensional (2D) NDRange; use the APIs get_global_size(0), get_global_size(1))
Example 1.6 Write an OpenCL program to devise a query that returns the constant memory size supported by the device
Example 1.7 Write an OpenCL program to get the unique global index of each work-item using the OpenCL API function get_global_id(0)
Example 1.8 Write an OpenCL program to compute the value of PI
Example 1.9 Write an OpenCL program to find the prefix sum of an array using global memory
Example 1.10 Write an OpenCL program to perform vector-vector multiplication using global memory
Example 1.11 Write an OpenCL program to compute the infinity norm of a real square matrix using global memory
Example 1.12 Write an OpenCL program to compute the transpose of a square matrix using global memory & local memory
Example 1.13 Write an OpenCL program to compute matrix-vector multiplication using global & local memory
Example 1.14 Write an OpenCL program to solve the matrix system of linear equations Ax=b based on the Jacobi method on Multi-GPUs using OpenCL, after computing matrix-matrix multiplication using global & local memory
Example 1.15 Write an OpenCL program to solve the matrix system of linear equations Ax=b based on the Jacobi method on Multi-GPUs using OpenCL ( Assignment )
Example 1.16 Write an OpenCL program to implement the solution of the matrix system of linear equations [A]{x}={b} by the Conjugate Gradient (iterative) method ( Assignment )
Example 1.17 Write an OpenCL program to compute sparse matrix-vector multiplication ( Assignment )

Programs on Multiple GPUs
Example 1.18
Write an OpenCL program to compute vector-vector addition on Multi-GPU







Brief description of OpenCL Programs for Numerical (Matrix) Computations

Example 1.1: Write an OpenCL program to calculate the multiplication of a scalar value and a vector
(Download source code - Double Precision- Real : ScalarVectGlobalMemDP_kernel.cl    and    ScalarVectGlobalMemDP.c)
  • Objective

    Write an OpenCL program to calculate the multiplication of a scalar value and a vector

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the vector. The input vector, using Single Precision/Double Precision data, is generated on the host-CPU and transferred to the device-GPU for scalar-vector multiplication. In global memory, a simple kernel based on a one-dimensional index space of work-groups is used, in which each work-item is given a unique ID within its work-group. (A kernel sketch is given at the end of this example.)

  • Each work-item performs the multiplication of one element of the vector with the scalar using its work-item ID.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • The final resultant vector is generated on the device and transferred to the host.

    This code demonstrates the development of an OpenCL kernel for simple computations. It is to be noted that performance may increase by using local memory to cache data that will be used multiple times by a work-item or by multiple work-items in the same work-group. It is possible to achieve this with an explicit assignment from a global memory pointer to a local memory pointer. Once a work-item completes its execution, none of its state information or local memory storage is persistent. Any results that need to be kept must be transferred to global memory.


    Important Steps (Table 1.1) :
    Steps and Description
    1. Memory allocation on host and input data generation
    Allocate memory on the host-CPU and fill it with single or double precision data.

    2. Set up the OpenCL execution environment :
    Call the function setExeEnv, which sets up the execution environment for OpenCL and performs the following :
    - Get platform information
    - Get device information
    - Create a context for the GPU devices to be used
    - Create a program object.
    - Build the program executable from the program source.
    In other words, the function performs

    (a). Discover & initialise the platforms;
    (b). Discover & initialise the devices;
    (c). Create a context; and
    (d). Create a program object and build the program executable

    3. Create a command queue using clCreateCommandQueue() and associate it with the device you want to execute on.

    4. Create device buffers
    using the clCreateBuffer() API; these will contain the data from the host buffers.
    5. Write host-CPU data to the device buffers
    6. Kernel launch :

    (a). Create the kernel handle;
    (b). Set the kernel arguments;
    (c). Configure the work-item structure (define the global and local work sizes for execution on the device-GPU); and
    (d). Enqueue the kernel for execution
    7. Read the output buffer back to the host (copy the result from the device-GPU to the host-CPU) :
    Use the clEnqueueReadBuffer() API.
    8. Check correctness of the result on the host-CPU
    Perform the computation on the host-CPU and compare the CPU and GPU results.
    9. Release OpenCL resources (free the memory)
    Free the memory of the arrays allocated on the host-CPU & device-GPU

    Table 1.1
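    A condensed host-side sketch of steps 3 to 9 of Table 1.1 is given below for the scalar-vector example. It assumes that step 2 has already produced a context, a device and a built program containing a kernel named scalar_vect_mult (the name is illustrative), that hostVect holds n doubles, and it omits error checking and the host-side correctness check of step 8.

      cl_int err;

      /* 3. Create a command queue and associate it with the device            */
      cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

      /* 4. Create a device buffer that will hold the host data                */
      cl_mem devVect = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                      n * sizeof(double), NULL, &err);

      /* 5. Write host-CPU data to the device buffer                           */
      clEnqueueWriteBuffer(queue, devVect, CL_TRUE, 0, n * sizeof(double),
                           hostVect, 0, NULL, NULL);

      /* 6. Create the kernel handle, set its arguments, configure the
            work-item structure and enqueue the kernel for execution           */
      cl_kernel kernel = clCreateKernel(program, "scalar_vect_mult", &err);
      clSetKernelArg(kernel, 0, sizeof(cl_mem), &devVect);
      clSetKernelArg(kernel, 1, sizeof(double), &scalar);
      clSetKernelArg(kernel, 2, sizeof(int),    &n);

      size_t globalWorkSize[1] = { 128 };     /* see Table 1.2 for the choice  */
      size_t localWorkSize[1]  = { 64 };
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize,
                             localWorkSize, 0, NULL, NULL);

      /* 7. Read the output buffer back to the host                            */
      clEnqueueReadBuffer(queue, devVect, CL_TRUE, 0, n * sizeof(double),
                          hostVect, 0, NULL, NULL);

      /* 9. Release OpenCL resources                                           */
      clReleaseKernel(kernel);
      clReleaseMemObject(devVect);
      clReleaseCommandQueue(queue);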

    Important Steps (Table 1.2) : Kernels & OpenCL Execution Model :
    S.No. and Description
    1. Work-item
    The unit of concurrent execution in OpenCL C is a work-item.
    (For example, a single iteration of the for loop in a typical fragment of multi-threaded code can be mapped to one work-item.)

    The OpenCL runtime generates as many work-items as there are elements in the input and output arrays and maps those work-items onto the underlying hardware (CPU or GPU).

    (Conceptually, this is very similar to the parallelism inherent in a functional map operation or in a data-parallel for loop in a model such as OpenMP.)

    2. Identification of a work-item
    When an OpenCL device begins executing a kernel, it provides intrinsic functions that allow a work-item to identify itself. Calling get_global_id(0) gives the position of the current work-item, which in the example above recovers the loop counter.

    3. Execution in fine-grained work-items
    (N-dimensional range (NDRange))

    OpenCL describes execution in terms of fine-grained work-items and can dispatch a vast number of work-items on architectures with hardware support for fine-grained threading. Scalability is achieved through this support for a large number of work-items.
  • When a kernel is executed, the programmer specifies the number of work-items that should be created as an n-dimensional range (NDRange).
  • An NDRange is a one-, two- or three-dimensional index space of work-items that will often map to the dimensions of either the input or the output data.
  • The dimensions of the NDRange are specified in an N-element array of type size_t, where N represents the number of dimensions used to describe the work-items being created.

  • 4. Example
    Assume that 1024 elements are taken in each vector. The size can be specified as a one-, two- or three-dimensional vector. The host code to specify an NDRange for 1024 elements is as follows :

    size_t indexSpaceSize[3] = {1024, 1, 1};

    Most importantly, the work-items of an NDRange are divided into smaller, equal-sized work-groups, as shown in Figure 1.
  • An index space with N dimensions requires work-groups to be specified using N dimensions; thus, a three-dimensional index space requires three-dimensional work-groups.

  • 5. Example :
    Perform barrier operations & synchronization

    Work-items within a work-group can perform barrier operations to synchronize, and they have access to a shared (local) memory address space. Because work-group sizes are fixed, this communication does not need to scale and hence does not affect the scalability of a large concurrent dispatch.

    For Example 1.3, i.e., vector-vector addition, the work-group size can be specified as

    size_t workGroupSize[3] = {64, 1, 1};

    If the total number of work-items per array is 1024, this results in 16 work-groups (1024 work-items / 64 work-items per work-group).

    Most importantly, OpenCL requires that the index space sizes are evenly divisible by the work-group sizes in each dimension.

    For hardware efficiency, the work-group size is usually fixed to a favourable size, and the index space size is rounded up in each dimension to satisfy this divisibility requirement, e.g. globalSize = ((N + localSize - 1) / localSize) * localSize.

  • In the kernel code, the user can specify that the extra work-items in each dimension simply return immediately without outputting any data.
  • For many highly data-parallel computations with straightforward memory access (for example vector-vector addition), OpenCL allows the local work-group size to be ignored by the programmer and generated automatically by the implementation; in this case, the developer passes NULL instead.

  • Table 1.2
  • Input
  • Scalar value and size of the vector

  • Output
  • Execution time in seconds, Gflops achieved
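    A minimal kernel sketch consistent with the description above is given below; the kernel and argument names are illustrative, and the downloadable ScalarVectGlobalMemDP_kernel.cl may differ in detail.

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable

      /* Each work-item scales one element of the vector held in global memory */
      __kernel void scalar_vect_mult(__global double *vect,
                                     const double scalar,
                                     const int n)
      {
          int gid = get_global_id(0);   /* unique global index of this work-item */
          if (gid < n)                  /* extra work-items simply return         */
              vect[gid] = scalar * vect[gid];
      }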



Example 1.2: Write an OpenCL program to calculate the multiplication of a scalar value and a matrix
(Download source code - Double Precision- Real : ScalarMatrixGlobalMemDP_kernel.cl    and    ScalarMatrixGlobalMemDP.c)
  • Objective

    Write an OpenCL program to calculate the multiplication of a scalar value and a matrix

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the matrix. The input matrix, using Single Precision/Double Precision data, is generated on the host-CPU and transferred to the device-GPU for scalar-matrix multiplication. In global memory, each work-item is assigned a row of the matrix and multiplies that row with the scalar value. The final resultant matrix is generated on the device and transferred to the host.

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item is assigned a row of the matrix and multiplies that row with the scalar value.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Scalar value and size of the matrix

  • Output
  • Execution time in seconds, Gflops achieved



Example 1.3: Write an OpenCL program to compute vector-vector addition
(Download source code - Double Precision- Real : VectVectAddGlobalMemDP_kernel.cl    and    VectVectAddGlobalMemDP.c)
  • Objective

    Write an OpenCL program to compute the addition of two vectors

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the vectors. The input vectors, using Single Precision/Double Precision data, are generated on the host and transferred to the device for vector addition. In global memory, a simple kernel based on a one-dimensional index space of work-groups is used, in which each work-item is given a unique ID within its work-group. Each work-item performs the addition of one pair of vector elements using its work-item ID, and the final resultant vector is generated on the device and transferred to the host.

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Input
  • Two vectors of same size

  • Output
  • Execution time in seconds, Gflops achieved


Example 1.4: Write an OpenCL program to compute matrix-matrix addition
(Download source code - Double Precision- Real : MatMatAddGlobalMemDP_kernel.cl    and    MatMatAddGlobalMemDP.c)
  • Objective

    Write an OpenCL program to compute the addition of two matrices

  • Description

    We create a two-dimensional globalWorkSize array that is overlaid on the matrices. The input matrices, using Single Precision/Double Precision data, are generated on the host-CPU and transferred to the device-GPU for matrix addition. In global memory, a simple kernel based on a two-dimensional index space of work-groups is used, in which each work-item is given a unique ID within its work-group. Each work-item performs the addition of one element from each matrix using its work-item ID, and the final resultant matrix is generated on the device and transferred to the host. (A kernel sketch is given at the end of this example.)

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item, using its work-item ID, performs the addition of one element from each matrix.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 128, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Matrix Size

  • Output
  • Execution time in seconds, Gflops achieved
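    A minimal kernel sketch for the two-dimensional case is given below; names are illustrative and the downloadable MatMatAddGlobalMemDP_kernel.cl may differ in detail.

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable

      /* One work-item per matrix element; the 2D NDRange is overlaid on the matrices */
      __kernel void mat_mat_add(__global const double *A,
                                __global const double *B,
                                __global double *C,
                                const int rows, const int cols)
      {
          int i = get_global_id(0);     /* row index    */
          int j = get_global_id(1);     /* column index */
          if (i < rows && j < cols)
              C[i * cols + j] = A[i * cols + j] + B[i * cols + j];
      }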



Example 1.5: Write an OpenCL program to find the total number of work-items in the x- and y-dimensions of the NDRange (assume that the OpenCL kernel is launched with a two-dimensional (2D) NDRange; use the APIs get_global_size(0), get_global_size(1))
(Assignment - To be discussed in Lab. Session)

Example 1.6: Write an OpenCL program to devise a query that returns the constant memory size supported by the device
(Assignment - To be discussed in Lab. Session)

Example 1.7: Write an OpenCL program to get the unique global index of each work-item using the OpenCL API function get_global_id(0).
(Assignment - To be discussed in Lab. Session)

Example 1.8: Write an OpenCL program to calculate the value of PI using a numerical integration method
(Download source code : PieCalculationGlobalMemDP_kernel.cl    and    PieCalculationGlobalMemDP.c)
  • Objective

    Write an OpenCL program to calculate the value of PI using a numerical integration method.

  • Description

    This program computes the value of PI over a given interval using numerical integration. The given interval is distributed uniformly over the number of work-items on the GPU device. Each work-item calculates its part of the interval and puts its partial result in a global array cell which is allocated to that particular work-item only. After the execution of each work-item is over, work-item 0 is assigned the job of gathering all the values and producing the final result. (A kernel sketch is given at the end of this example.)

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item calculates its part of the interval and finally work-item 0 adds up all the values calculated by the individual work-items.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Number of intervals

  • Output
  • PI value, execution time in seconds
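    A possible kernel sketch for the partial sums is given below; it integrates 4/(1+x*x) over [0,1] at interval midpoints and assumes the partial sums are added afterwards (by work-item 0 in a second pass or on the host). The downloadable PieCalculationGlobalMemDP_kernel.cl may organise the reduction differently.

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable

      /* Each work-item handles a strided subset of the intervals and writes
         its partial sum to one cell of partialSums (one cell per work-item). */
      __kernel void pi_partial_sums(const int numIntervals,
                                    __global double *partialSums)
      {
          int gid    = get_global_id(0);
          int stride = get_global_size(0);
          double h   = 1.0 / (double)numIntervals;
          double sum = 0.0;

          for (int i = gid; i < numIntervals; i += stride) {
              double x = h * ((double)i + 0.5);   /* midpoint of interval i */
              sum += 4.0 / (1.0 + x * x);
          }
          partialSums[gid] = h * sum;
      }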



Example 1.9: Write OpenCL program to find prefix sum computations.
(Download source code : PrefixSumGlobalMemDP_kernel.cl    and    PrefixSumGlobalMemDP.c)
  • Objective

    Write OpenCL program to find prefix sum computations.

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the vector. This is a simple program which computes the prefix sum of a given array. The prefix sum of an element of an array is defined as the sum of all the elements preceding it plus the current element. In global memory, the input array using Single Precision data is generated on the host, and the prefix-sum kernel is based on the work-item ID within a work-group and a one-dimensional index space of work-groups. The output array is generated on the device and transferred back to the host.

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item performs the summation of the current element with the preceding elements of the array using its work-item ID.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Length of Input Array

  • Output
  • Prefix Sum Of elements of Input Array



Example 1.10: Write an OpenCL program to compute vector-vector multiplication
(Download source code - Double Precision- Real : VectVectMultGlobalMemDP_kernel.cl    and    VectVectMultGlobalMemDP.c)
  • Objective

    Write OpenCL program to compute vector-vector multiplication

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the vectors. The input vectors, using Single Precision/Double Precision data, are generated on the host-CPU and transferred to the device-GPU for vector multiplication. In global memory, a simple kernel based on a one-dimensional index space of work-groups is used, in which each work-item is given a unique ID within its work-group. Each work-item performs a partial multiplication of the two vectors, and the final resultant value is generated on the device and transferred to the host-CPU.

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item calculates its part of the multiplication and finally work-item 0 adds up all the values calculated by the individual work-items.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Size of Input vectors

  • Output
  • Execution time in seconds, Gflops achieved


Example 1.11: Write OpenCL program to find infinity norm of a square matrix.
(Download source code - Double Precision- Real : MatInfinityNormGlobalMemDP_kernel.cl    and    MatInfinityNormGlobalMemDP.c)
  • Objective

    Write OpenCL Program to find Infinity Norm of a matrix

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the matrix. Infinity norm of a matrix: the row-wise infinity norm of a matrix is defined to be the maximum of the sums of absolute values of the elements in a row, taken over all rows. The input matrix, using Single Precision/Double Precision data, is generated on the host and transferred to the device for calculating the infinity norm of the matrix. After the initial validity checks, each work-item is assigned a row using its ID and calculates the sum of its assigned row. After all calculations, work-item 0 finds the maximum among the row sums.

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item is assigned a row using its ID and calculates the sum of its assigned row. After all calculations, work-item 0 finds the maximum among the row sums.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Size of the input Matrix

  • Output
  • Execution time in seconds, Gflops achieved





Example 1.12: Write OpenCL program to compute transpose of a matrix using row-wise distribution.
(Download source code - Double Precision- Real - Global Memory : MatTransposeGlobalMemDP_kernel.cl    and    MatTransposeGlobalMemDP.c)

(Download source code - Double Precision- Real - Local Memory : MatTransposeLocalMemDP_kernel.cl    and    MatTransposeLocalMemDP.c)
  • Objective

    Write OpenCL program to compute transpose of a matrix using row-wise distribution.

  • Description

    We create a two-dimensional globalWorkSize array that is overlaid on the matrix. The input matrix, using Single Precision input data, is generated on the host-CPU and transferred to the device-GPU. In the global memory implementation, a simple kernel function on the device-GPU generates the output matrix, which is transferred back to the host-CPU. In the local memory implementation, the kernel computes the resulting matrix in multiple iterations. In each iteration, one work-group loads one tile of the matrix from global memory to local memory. After all iterations are done, the work-group stores the transposed tile into the global resultant matrix. The resultant matrix is transferred to the host-CPU. (A kernel sketch of the local memory version is given at the end of this example.)

    Refer Table 1.1 for OpenCL Important implementation Steps

  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 128, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Size of matrix

  • Output
  • Execution time in seconds
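    A sketch of the local memory (tiled) transpose kernel is given below; it assumes the work-group size is TILE_DIM x TILE_DIM, and the names are illustrative (the downloadable MatTransposeLocalMemDP_kernel.cl may differ).

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable
      #define TILE_DIM 16     /* work-group is assumed to be TILE_DIM x TILE_DIM */

      /* Each work-group stages one tile of the input matrix (height x width,
         row-major) in local memory and writes it back transposed.              */
      __kernel void mat_transpose_local(__global const double *in,
                                        __global double *out,
                                        const int width, const int height)
      {
          __local double tile[TILE_DIM][TILE_DIM];

          int gx = get_global_id(0), gy = get_global_id(1);
          int lx = get_local_id(0),  ly = get_local_id(1);

          if (gx < width && gy < height)
              tile[ly][lx] = in[gy * width + gx];

          barrier(CLK_LOCAL_MEM_FENCE);       /* all loads finish before stores */

          /* coordinates of this work-group's tile in the transposed matrix     */
          int tx = get_group_id(1) * TILE_DIM + lx;
          int ty = get_group_id(0) * TILE_DIM + ly;
          if (tx < height && ty < width)
              out[ty * height + tx] = tile[lx][ly];
      }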



Example 1.13: Write an OpenCL program for matrix-vector multiplication.
(Download source code - Double Precision- Real - Global Memory : MatVectMultGlobalMemDP_kernel.cl    and    MatVectMultGlobalMemDP.c)

(Download source code - Double Precision- Real - Local Memory : MatVectMultLocalMemDP_kernel.cl    and    MatVectMultLocalMemDP.c)
  • Objective

    Write an OpenCL program to perform matrix-vector multiplication.

  • Description

    We create a one-dimensional globalWorkSize array that is overlaid on the matrix and vector. The input matrix and vector, using Single Precision/Double Precision data, are generated on the host-CPU. A simple kernel based on a one-dimensional index space of work-groups is used, in which each work-item is given a unique ID within its work-group. In global memory, each work-item reads one row of the matrix and performs the computation with the vector to obtain one element of the resultant vector on the device-GPU. In local memory, the resultant vector is calculated in multiple iterations. In each iteration, one work-group handles some rows of the matrix; each work-item of the work-group reads a row of the matrix, performs the computation with the vector and calculates a partial dot product. This partial dot product is stored in local memory. The first work-item within the work-group computes the final result by adding up all the partial dot products. The resultant vector is transferred back to the host-CPU.

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item reads a row of the matrix, performs the computation with the vector elements and calculates a partial dot product. Work-item 0 computes the final result by adding up all the partial dot products.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 1, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Matrix Size

  • Output
  • Execution time in seconds, Gflops achieved



Example 1.14: Write an OpenCL program for matrix-matrix multiplication.
(Download source code - Double Precision- Real - Global Memory : MatMatMultGlobalMemDP_kernel.cl    and    MatMAtMultGlobalMemDP.c)

(Download source code - Double Precision- Real - Local Memory : MatMatMultLocalMemDP_kernel.cl    and    MatMatMultLocalMemDP.c)
  • Objective

    Write an OpenCL program to perform matrix-matrix multiplication.

  • Description

    We create a two-dimensional globalWorkSize array that is overlaid on the matrices. In the global memory implementation, each work-item reads one row of the first matrix, performs the computation with one column of the other matrix and computes the corresponding element of the resultant matrix. In the local memory implementation, the matrices are divided into 16 x 16 work-groups. One work-group loads one tile of both matrices from global memory into local memory. Each work-item within the work-group accumulates a partial result in local memory. After all work-items within the work-group have completed their part of the computation, the work-group stores the resultant tile into the global resultant matrix. (A kernel sketch of the local memory version is given at the end of this example.)

    Refer Table 1.1 for OpenCL Important implementation Steps

  • Each work-item reads one row of the first matrix, performs the computation with one column of the other matrix and computes the corresponding element of the resultant matrix.
  • The choice of work-items in the code is given as
    size_t globalWorkSize[3] = {128, 128, 1};
    and, for generalisation, refer to Table 1.2


  • Input
  • Matrix Size

  • Output
  • Execution time in seconds, Gflops achieved
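    A sketch of the local memory (tiled) matrix multiplication kernel is given below; it assumes square n x n matrices with n a multiple of BLOCK and a BLOCK x BLOCK work-group, and the names are illustrative (the downloadable MatMatMultLocalMemDP_kernel.cl may differ).

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable
      #define BLOCK 16        /* work-group is assumed to be BLOCK x BLOCK */

      /* C = A * B; each work-group streams BLOCK-wide tiles of A and B through
         local memory and each work-item accumulates one element of C.         */
      __kernel void mat_mat_mult_local(__global const double *A,
                                       __global const double *B,
                                       __global double *C,
                                       const int n)
      {
          __local double Asub[BLOCK][BLOCK];
          __local double Bsub[BLOCK][BLOCK];

          int row = get_global_id(1), col = get_global_id(0);
          int ly  = get_local_id(1),  lx  = get_local_id(0);
          double acc = 0.0;

          for (int t = 0; t < n / BLOCK; t++) {
              Asub[ly][lx] = A[row * n + (t * BLOCK + lx)];
              Bsub[ly][lx] = B[(t * BLOCK + ly) * n + col];
              barrier(CLK_LOCAL_MEM_FENCE);   /* tile fully loaded              */

              for (int k = 0; k < BLOCK; k++)
                  acc += Asub[ly][k] * Bsub[k][lx];
              barrier(CLK_LOCAL_MEM_FENCE);   /* done with this tile            */
          }
          C[row * n + col] = acc;
      }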



Example 1.15:

Write an OpenCL program to implement the solution of the matrix system of linear equations [A]{x} = {b} by the Jacobi method.
  • Objective

    Write an OpenCL program for solving the system of linear equations [A]{x} = {b} using the Jacobi method

  • Description

    The Jacobi iterative method is one of the simplest iterative techniques to solve a system of linear equations. The i-th equation of a system of linear equations [A]{x}={b} is

        SUM (j = 0 to n-1)  a_ij x_j  =  b_i     (1)

    If all the diagonal elements of A are nonzero (or are made nonzero by permuting the rows and columns of A), we can rewrite equation (1) as

        x_i  =  ( b_i  -  SUM (j != i) a_ij x_j ) / a_ii     (2)

    The Jacobi method starts with an initial guess x_0 for the solution vector x. This initial vector x_0 is used in the right-hand side of equation (2) to arrive at the next approximation x_1 to the solution vector. The vector x_1 is then used in the right-hand side of equation (2), and the process continues until a close enough approximation to the actual solution is found. A typical iteration step in the Jacobi method is

        x_i^(k+1)  =  ( b_i  -  SUM (j != i) a_ij x_j^(k) ) / a_ii     (3)

    We now express the iteration step of equation (3) in terms of the residual r_k = b - A x^(k). Equation (3) can be rewritten as

        x^(k+1)  =  x^(k)  +  D^(-1) r_k ,   where D is the diagonal of A     (4)

    Each process computes n/p values of the vector x in each iteration. These values are gathered by all the processes, and each process tests for convergence. If the values have been computed up to a certain accuracy, the iterations are stopped; otherwise the processes use these values in the next iteration to compute a new set of values. (A sketch of a device kernel for one Jacobi sweep is given at the end of this example.)

  • Input
  • Input Matrix A and Initial Solution Vector x

  • Output
  • solution of matrix system Ax = b
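    A sketch of one possible device kernel for a single Jacobi sweep is given below; the convergence test and the swapping of xOld/xNew between sweeps are assumed to be done on the host, and the names are illustrative.

      #pragma OPENCL EXTENSION cl_khr_fp64 : enable

      /* One Jacobi sweep (equation (3)): work-item i computes xNew[i] from the
         previous iterate xOld.                                                 */
      __kernel void jacobi_sweep(__global const double *A,   /* n x n, row-major */
                                 __global const double *b,
                                 __global const double *xOld,
                                 __global double *xNew,
                                 const int n)
      {
          int i = get_global_id(0);
          if (i >= n) return;

          double sigma = 0.0;
          for (int j = 0; j < n; j++)
              if (j != i)
                  sigma += A[i * n + j] * xOld[j];

          xNew[i] = (b[i] - sigma) / A[i * n + i];
      }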



    Example 1.16:

    Write an OpenCL program to implement the solution of the matrix system of linear equations [A]{x} = {b} by the Conjugate Gradient method.
    • Objective

      OpenCL implementation of the Conjugate Gradient method to solve the system of linear equations [A]{x}={b}. Assume that A is a symmetric positive definite matrix.

    • Description

      Description of the conjugate gradient method :

      The conjugate gradient (CG) method is an example of a minimization method. A real n x n matrix A is positive definite if x^T A x > 0 for any n x 1 real, nonzero vector x. For a symmetric positive definite matrix A, the unique vector x that minimizes the quadratic functional

      f(x) = (1/2) x^T A x - x^T b

      is the solution to the system Ax = b, where x and b are n x 1 vectors. The reason is that the gradient of the functional f(x) is Ax - b, which is zero when f(x) is at its minimum. The gradient of a function is an n x 1 vector. (The finite-termination property discussed below is not particularly relevant when n is very large, since the time for that many iterations is usually prohibitive, and the property does not hold in the presence of rounding errors.) We explain some important steps of the algorithm. An iteration of a minimization method is of the form

      x_(k+1) = x_k + tau_k d_k     (1)

      where tau_k is a scalar step size and d_k is the direction vector; d_k is a descent direction for f at x_k. We now consider the problem of determining tau_k, given x_k and d_k, so that f(x) is minimized on the line x = x_k + tau_k d_k. The function f(x_k + tau d_k) is quadratic in tau, and its minimization leads to the condition

      tau_k = g_k^T g_k / d_k^T A d_k ,     (2)

      where g_k = A x_k - b is the gradient (residual) vector after k iterations. The residual need not be computed explicitly in each iteration because it can be computed incrementally by using its value from the previous iteration. In the (k+1)-th iteration, the residual g_(k+1) can be expressed as follows :

           g_(k+1) = A x_(k+1) - b = A (x_k + tau_k d_k) - b
                   = A x_k - b + tau_k A d_k
                   = g_k + tau_k A d_k     (3)

      Thus, the only matrix-vector product computed in each iteration is A d_k, which is already required to compute tau_k in equation (2). If A is a symmetric positive definite matrix and d_1, d_2, ..., d_n are direction vectors that are conjugate with respect to A (that is, d_i^T A d_j = 0 for all 1 <= i, j <= n, i != j), then x_(k+1) in equation (1) converges to the solution of Ax = b in at most n iterations, assuming no rounding errors.

      In practice, however, the number of iterations that yields an acceptable approximation to the solution is much smaller than n. The choice of tau_k also makes the gradient at x_(k+1) orthogonal to the search direction, i.e. d_k^T g_(k+1) = 0. Now we suppose that the search directions are determined by an iteration of the form

      d_(k+1) = -g_(k+1) + beta_(k+1) d_k     (4)

      where d_0 = -g_0 and beta_1, beta_2, ... remain to be determined. We find the new search direction in the plane spanned by the gradient at the most recent point and the previous search direction. The parameter beta_(k+1) is determined by the following equation

      beta_(k+1) = g_(k+1)^T A d_k / d_k^T A d_k     (5)

      and one can derive the orthogonality relations

      g_k^T g_l = 0 (l != k);                 d_k^T A d_l = 0 (l != k)

      The derivation of equation (5) and of the orthogonality relations is beyond the scope of this document; for details please refer to [ ]. Using equation (3) and the orthogonality relations, equation (5) can be further reduced to

      beta_(k+1) = g_(k+1)^T g_(k+1) / g_k^T g_k     (6)

      The above equations (1) to (6) lead to the CG algorithm. The algorithm terminates when the square of the Euclidean norm of the gradient (residual) falls below a predetermined tolerance value. Although all of the versions of the conjugate gradient method obtained by combining the formulas for g_k, beta_k and tau_k in various ways are mathematically equivalent, their computer implementations are not. The following version compares well with respect to computational labour, storage requirements and accuracy, and this sequence of steps is widely accepted.

      1. tau_k       = g_k^T g_k / d_k^T A d_k
      2. x_(k+1)     = x_k + tau_k d_k
      3. g_(k+1)     = g_k + tau_k A d_k
      4. beta_(k+1)  = g_(k+1)^T g_(k+1) / g_k^T g_k
      5. d_(k+1)     = -g_(k+1) + beta_(k+1) d_k

      where k = 0, 1, 2, ... Initially we choose x_0, calculate g_0 = A x_0 - b, and put d_0 = -g_0.
      The computer implementation of this algorithm is explained as follows :

      /* Conjugate Gradient for a dense symmetric positive definite matrix A
         (n x n, row-major). On entry x holds the initial guess x0; on exit it
         holds the solution. g, d and temp are work vectors of length n.
         EPSILON and MAX_ITERATIONS are assumed to be defined elsewhere.       */
      void ConjugateGradient(int n, double *A, double *x, double *b,
                             double *g, double *d, double *temp)
      {
          double delta0, delta1, beta, tau, dot;
          int    i, j, iteration = 0;

          for (i = 0; i < n; i++) {            /* g = A*x - b ; d = -g          */
              g[i] = -b[i];
              for (j = 0; j < n; j++)
                  g[i] += A[i*n + j] * x[j];
              d[i] = -g[i];
          }
          delta0 = 0.0;
          for (i = 0; i < n; i++) delta0 += g[i] * g[i];
          if (delta0 <= EPSILON) return;

          do {
              iteration = iteration + 1;
              dot = 0.0;                       /* temp = A*d ; dot = d^T temp   */
              for (i = 0; i < n; i++) {
                  temp[i] = 0.0;
                  for (j = 0; j < n; j++)
                      temp[i] += A[i*n + j] * d[j];
                  dot += d[i] * temp[i];
              }
              tau = delta0 / dot;
              delta1 = 0.0;
              for (i = 0; i < n; i++) {        /* x = x + tau*d ; g = g + tau*temp */
                  x[i] += tau * d[i];
                  g[i] += tau * temp[i];
                  delta1 += g[i] * g[i];
              }
              if (delta1 <= EPSILON) break;
              beta = delta1 / delta0;
              delta0 = delta1;
              for (i = 0; i < n; i++)          /* d = -g + beta*d               */
                  d[i] = -g[i] + beta * d[i];
          } while (delta0 > EPSILON && iteration < MAX_ITERATIONS);
      }

      One-dimensional arrays of size n x 1 are required for temp, g, x and d. The storage requirement for the matrix A depends upon its structure (dense, banded, sparse). A two-dimensional n x n array is the simplest structure to store A. For a large sparse matrix this structure wastes a large amount of storage space, and a suitable sparse storage scheme should be used instead.

    • The preconditioned conjugate gradient algorithm
    • Let C be a positive definite matrix factored in the form C = E E^T, and consider the quadratic functional

      f(x) = (1/2) x^T A x - x^T b + c

      We define a second quadratic functional g(y) through the transformation y = E^T x,

      g(y) = f(E^(-T) y) = (1/2) y^T A* y - y^T b* + c* ,    where A* = E^(-1) A E^(-T), b* = E^(-1) b, c* = c.

      Here, A* is symmetric and positive definite. The similarity transformation

      E^(-T) A* E^T = E^(-T) E^(-1) A = C^(-1) A

      reveals that A* and C^(-1) A have the same eigenvalues. If C can be found such that the condition number of the matrix A* is less than the condition number of the matrix A, then the rate of convergence of the preconditioned method is better than that of the plain conjugate gradient method. We call C the preconditioning matrix and A* the preconditioned matrix. The matrix C = E E^T is positive definite, since E is nonsingular by assumption. If the coefficient matrix A has l distinct eigenvalues, the CG algorithm converges to the solution of the system Ax = b in at most l iterations (assuming no rounding errors). Therefore, if A has many distinct eigenvalues that vary widely in magnitude, the CG algorithm may require a large number of iterations to converge to an acceptable approximation to the solution.

      The speed of convergence of the CG algorithm can be increased by preconditioning A with the congruence transformation A* = E^(-1) A E^(-T), where E is a nonsingular matrix. E is chosen such that A* has fewer distinct eigenvalues than A. The CG algorithm is then used to solve A* y = b*, where x = (E^T)^(-1) y. The resulting algorithm is called the preconditioned conjugate gradient (PCG) algorithm. The steps performed in each iteration of the preconditioned conjugate gradient algorithm are as follows

      1. tau_k       = g_k^T h_k / d_k^T A d_k
      2. x_(k+1)     = x_k + tau_k d_k
      3. g_(k+1)     = g_k + tau_k A d_k
      4. h_(k+1)     = C^(-1) g_(k+1)
      5. beta_(k+1)  = g_(k+1)^T h_(k+1) / g_k^T h_k
      6. d_(k+1)     = -h_(k+1) + beta_(k+1) d_k

      where k = 0, 1, 2, ... Initially we choose x_0, calculate g_0 = A x_0 - b and h_0 = C^(-1) g_0, and put d_0 = -h_0. The multiplication by C^(-1) in step (4) is to be interpreted as solving a system of equations with coefficient matrix C. A source of preconditioning matrices is the class of stationary iterative methods for solving the system Ax = b.

    • Parallel implementations of the PCG algorithm
    • The parallel conjugate gradient algorithm involves the following  type of computations and communications

      Partitioning of the matrix : The matrix A is often obtained by discretization of partial differential equations by the finite element or finite difference method. In such cases, the matrix is either sparse or banded. Consequently, the partitioning of the matrix onto p processes plays a vital role for performance. For simplicity, we assume that A is symmetric positive definite and is partitioned in a row-wise block-striped manner.

      Scalar Multiplication of a vector and addition of vectors :
      Each of these computations can be performed sequentially regardless of the preconditioner and the type of coefficient matrix. If all vectors are distributed identically among the processes, these steps require no communication in a parallel implementation.

      Vector inner products :
      In some situations, partial vectors are available on each process. MPI collective library calls are necessary to perform the vector inner products. If the parallel computer supports fast reduction operations, such as an optimized MPI, then the communication time for the inner-product calculations can be kept to a minimum.

      Matrix-vector multiplication : The computation and communication costs of the matrix-vector multiplication depend on the structure of the matrix A. The parallel implementation of the PCG algorithm gives different performance on parallel computers for three cases: one in which A is a block-tridiagonal matrix, two in which it is a banded unstructured sparse matrix, and three in which the matrix is a general sparse matrix. Different parts of the algorithm dominate the communication overheads in each of the three cases.

      Solving the preconditioned system :
      The PCG algorithm solves a system of linear equations in each iteration. The preconditioner C is chosen so that solving the modified system is inexpensive compared to solving the original system of equations Ax = b. Nevertheless, preconditioning increases the amount of computation in each iteration. For good preconditioners, however, the increase is compensated by a reduction in the number of iterations required to achieve acceptable convergence. The computation and communication requirements of this step depend on the type of preconditioner used. A preconditioning method such as diagonal preconditioning, in which the preconditioning matrix C has nonzero elements only along the principal diagonal, does not involve any communication. Incomplete Cholesky (IC) preconditioning, in which C is based on an incomplete Cholesky factorization of A, may involve different computations and communications in a parallel implementation.

      The convergence of the CG iterations is checked using an error criterion, i.e. the Euclidean norm of the residual vector should be less than a prescribed tolerance. This convergence check involves gathering a real value from all processes, which may be a very costly operation.

      We consider parallel implementations of the PCG algorithm using a diagonal preconditioner for a dense coefficient matrix. As we will see, if C is a diagonal preconditioner, then solving the modified system does not require any interprocessor communication. Hence, the communication time in a CG iteration with diagonal preconditioning is the same as that in an iteration of the unpreconditioned algorithm.

      Thus the operations that involve communication overheads are the computation of inner products, the matrix-vector multiplication and, in the case of the IC preconditioner, solving the preconditioned system.

    • Input
    • Number of rows and columns of the input matrix

    • Output
    • Prints the solution x of the linear system of equations Ax = b




    Example 1.17: Write an OpenCL program to perform sparse matrix-vector multiplication
    (Assignment - To be discussed in Lab. Session)
    • Objective

      To write an OpenCL program for multiplication of a sparse matrix of size n x n with a vector of size n.

    • Efficient storage format for sparse matrix

      Dense matrices are stored in the computer memory by using two-dimensional arrays. For example, a matrix with n rows and m columns is stored using an n x m array of real numbers. However, using the same two-dimensional array to store sparse matrices has two very important drawbacks. First, since most of the entries in the sparse matrix are zero, this storage scheme wastes a lot of memory. Second, computations involving sparse matrices often need to operate only on the non-zero entries of the matrix. Use of the dense storage format makes it harder to locate these non-zero entries. For these reasons sparse matrices are stored using different data structures. The Compressed Row Storage (CRS) format is a widely used scheme for storing sparse matrices. In the CRS format, a sparse matrix A with n rows having k non-zero entries is stored using three arrays: two integer arrays rowptr and colind, and one array of real entries values. The array rowptr is of size n+1, and the other two arrays are each of size k. The array colind stores the column indices of the non-zero entries in A, and the array values stores the corresponding non-zero entries. In particular, the array colind stores the column indices of the first row, followed by the column indices of the second row, followed by the column indices of the third row, and so on. The array rowptr is used to determine where the storage of the different rows starts and ends in the arrays colind and values. In particular, the column indices of row i are stored starting at colind[rowptr[i]] and ending at (but not including) colind[rowptr[i+1]]. Similarly, the values of the non-zero entries of row i are stored starting at values[rowptr[i]] and ending at (but not including) values[rowptr[i+1]]. Also note that the number of non-zero entries of row i is simply rowptr[i+1] - rowptr[i].

    • Serial sparse matrix vector multiplication

      The following function performs a sparse matrix-vector multiplication [y] = {A}{b}, where the sparse matrix A is of size n x m, the vector b is of size m and the vector y is of size n. Note that the number of columns of A (i.e., m) is not explicitly specified as part of the input, as it is not required.

      void SerialSparseMatVec(int n, int *rowptr, int *colind, double *values,
                              double *b, double *y)
      {
            int i, j;
            for (i = 0; i < n; i++)
            {
              y[i] = 0.0;
              for (j = rowptr[i]; j < rowptr[i+1]; j++)
                y[i] += values[j] * b[colind[j]];
            }
      }

    • Description of parallel algorithm

      In the parallel implementation, each work-item picks a row from the matrix and multiplies it with the vector, so the computation for all rows is carried out in parallel (see the kernel sketch at the end of this example).

    • Implementation

      There are two implementations, one using OpenCL kernels and the other using BLAS library.

      OpenCL implementation

      Step 1: The matrix size (number of rows) and sparsity (fraction of non-zeros) are provided by the user on the command line.

      Step 2: A sparse matrix and a vector of the given size are allocated and initialized. The row_ptr and col_idx vectors are also created and assigned their appropriate values based on the sparse matrix.

      Step 3: The above vectors are also created and initialized on the device (GPU).

      Step 4: The sparse_matrix and vector are multiplied in the GPU to obtain the result.

    • Performance:

      The gettimeofday() function, which is part of <sys/time.h>, is used to measure the time taken for the computation.

    • Input
    • The input to the problem is given as arguments on the command line in the following format. Suppose that the number of rows of the sparse matrix is n (only square matrices are considered) and the sparsity, i.e. the proportion of zeros (given in the range 0 to 1), is m; then the program must be run as

      ./program_name n m

      CPU generates the sparse matrix, the vector to be multiplied using random values and the row_ptr and col_idx vectors based on the sparse matrix.

    • Output
    • The CPU prints the time taken for the computation.
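      A possible OpenCL kernel for the CRS matrix-vector product, mirroring the serial function above with one work-item per row, is sketched below; the kernel name and argument order are illustrative.

        #pragma OPENCL EXTENSION cl_khr_fp64 : enable

        /* CRS sparse matrix-vector product y = A*b : one work-item per row */
        __kernel void csr_spmv(const int n,
                               __global const int *rowptr,
                               __global const int *colind,
                               __global const double *values,
                               __global const double *b,
                               __global double *y)
        {
            int i = get_global_id(0);
            if (i >= n) return;

            double sum = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += values[j] * b[colind[j]];
            y[i] = sum;
        }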





    Example 1.18: Write an OpenCL program to compute vector-vector addition on Multi-GPU.
    (Download source code - : clMultiGPU-VectVectAdd.c   )
    • Objective

      Write an OpenCL program to compute the addition of two vectors

    • Description

      We create a one-dimensional globalWorkSize array that is overlaid on the vectors. The input vectors, using Single Precision/Double Precision data, are generated on the host, and each device receives its own portion of the vectors for vector addition. Each GPU computes the addition of its own elements of the two vectors. In global memory, a simple kernel based on a one-dimensional index space of work-groups is used, in which each work-item is given a unique ID within its work-group. Each work-item performs the addition of one pair of vector elements using its work-item ID, and the final resultant vector is assembled and transferred back to the host. (A host-side sketch of the multi-GPU distribution is given at the end of this example.)

      Refer Table 1.1 for OpenCL Important implementation Steps

    • Each work-item, using its work-item ID, performs the addition of one element from each vector.
    • The choice of work-items in the code is given as
      size_t globalWorkSize[3] = {128, 1, 1};
      and, for generalisation, refer to Table 1.2. The number of work-items per work-group on each GPU is specified accordingly.


    • Input
    • Two vectors of same size

    • Output
    • Execution time in seconds, Gflops achieved
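      A host-side sketch of one way to distribute the work across the GPUs is given below. It assumes a context created over numDevices devices, a program built for all of them, host arrays hostA/hostB/hostC of N doubles with N divisible by numDevices, pre-declared arrays queues/bufA/bufB/bufC/kernels, and a kernel named vect_vect_add taking the three buffers; error checking is omitted.

        size_t chunk = N / numDevices;              /* elements per GPU          */

        for (int d = 0; d < numDevices; d++) {
            size_t offset = d * chunk;
            size_t bytes  = chunk * sizeof(double);

            /* one command queue per device, buffers only for this device's slice */
            queues[d] = clCreateCommandQueue(context, devices[d], 0, &err);
            bufA[d] = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, &err);
            bufB[d] = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, &err);
            bufC[d] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

            clEnqueueWriteBuffer(queues[d], bufA[d], CL_FALSE, 0, bytes,
                                 hostA + offset, 0, NULL, NULL);
            clEnqueueWriteBuffer(queues[d], bufB[d], CL_FALSE, 0, bytes,
                                 hostB + offset, 0, NULL, NULL);

            kernels[d] = clCreateKernel(program, "vect_vect_add", &err);
            clSetKernelArg(kernels[d], 0, sizeof(cl_mem), &bufA[d]);
            clSetKernelArg(kernels[d], 1, sizeof(cl_mem), &bufB[d]);
            clSetKernelArg(kernels[d], 2, sizeof(cl_mem), &bufC[d]);

            clEnqueueNDRangeKernel(queues[d], kernels[d], 1, NULL, &chunk, NULL,
                                   0, NULL, NULL);
            clEnqueueReadBuffer(queues[d], bufC[d], CL_FALSE, 0, bytes,
                                hostC + offset, 0, NULL, NULL);
        }
        for (int d = 0; d < numDevices; d++)
            clFinish(queues[d]);                    /* wait for every GPU        */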


    References
    1. AMD Fusion
    2. APU
    3. All about AMD FUSION APUs (APU 101)
    4. AMD A6 3500 APU Llano
    5. AMD A6 3500 APU review
    6. AMD APP SDK with OpenCL 1.2 Support
    7. AMD-APP-SDKv2.7 (Linux) with OpenCL 1.2 Support
    8. AMD Accelerated Parallel Processing Math Libraries (APPML)
    9. AMD Accelerated Parallel Processing (AMD APP) Programming Guide OpenCL : May 2012
    10. MAGMA OpenCL
    11. AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with AMD APP Math Libraries (APPML); AMD Core Math Library (ACML); AMD Core Math Library for Graphic Processors (ACML-GPU)
    12. Getting Started with OpenCL
    13. Aparapi - API & Java
    14. AMD Developer Central - OpenCL Zone
    15. AMD Developer Central - SDKs
    16. ATI GPU Services (AGS) Library
    17. AMD GPU - Global Memory for Accelerators (GMAC)
    18. AMD Developer Central - Programming in OpenCL
    19. AMD GPU Task Manager (TM)
    20. AMD APP Documentation
    21. AMD Developer OpenCL FORUM
    22. AMD Developer Central - Programming in OpenCL - Benchmarks performance
    23. OpenCL 1.2 (pdf file)
    24. OpenCL Optimization Case Study: Fast Fourier Transform - Part 1
    25. AMD GPU PerfStudio 2
    26. Open Source Zone - AMD CodeAnalyst Performance Analyzer for Linux
    27. AMD ATI Stream Computing OpenCL - Programming Guide
    28. AMD OpenCL Emulator-Debugger
    29. GPGPU : http://www.gpgpu.org and Stanford BrookGPU discussion forum http://www.gpgpu.org/forums/
    30. Apple : Snowleopard - OpenCL
    31. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
    32. Khronos V1.0 Introduction and Overview, June 2010
    33. The OpenCL 1.1 Quick Reference card.
    34. OpenCL 1.2 Specification (Document Revision 15), Last Released November 15, 2011
    35. The OpenCL 1.2 Specification (Document Revision 15) Last Released November 15, 2011 Editor : Aaftab Munshi Khronos OpenCL Working Group
    36. OpenCL1.1 Reference Pages
    37. MATLAB
    38. OpenCL Toolbox v0.17 for MATLAB
    39. NAG
    40. AMD Compute Abstraction Layer (CAL) Intermediate Language (IL) Reference Manual. Published by AMD.
    41. C++ AMP (C++ Accelerated Massive Parallelism)
    42. C++ AMP for the OpenCL Programmer
    43. C++ AMP for the OpenCL Programmer
    44. MAGMA SC 2011 Handout
    45. AMD Accelerated Parallel Processing Math Libraries (APPML) MAGMA
    46. Heterogeneous Computing with OpenCL, Benedict R. Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa, Elsevier / Morgan Kaufmann Publishers, 2011
    47. Programming Massively Parallel Processors - A Hands-on Approach, David B. Kirk, Wen-mei W. Hwu, NVIDIA Corporation, Elsevier / Morgan Kaufmann Publishers, 2010
    48. OpenCL Programming Guide, Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg, Addison-Wesley, Pearson Education, 2012
    49. AMD gDEBugger
    50. The HSA (Heterogeneous System Architecture) Foundation

    Centre for Development of Advanced Computing