



hyPACK-2013 : Prog. on HPC GPU Cluster : OpenMP & OpenCL Prog.

A prototype Hybrid Adaptive Cluster can be made "adaptive" to the application it is running: it assigns the most effective resources in real time, as the application demands, without requiring modifications to the application. The goal of this mixed environment is total workflow optimization, so that applications that do not parallelize well on scalar processors can be optimized with a more appropriate computation model. The aim is to develop system software and to integrate state-of-the-art components such as reconfigurable FPGAs, stream accelerators (NVIDIA GPU computing, AMD Stream computing), IBM Cell Broadband Engine processors, and multi-core processors.

Example 1.1
Write an OpenMP-OpenCL program to compute vector-vector multiplication on Multi-CPU using the global memory features of OpenCL. (OpenCL events are used to synchronize the CPU and GPU devices.)
Example 1.2
Write an OpenMP-OpenCL program to compute the value of pi by numerical integration using OpenMP directives and OpenCL. ( Assignment )
Example 1.3
Write an OpenMP-OpenCL program to perform matrix-vector multiplication using the BLAS1 library. ( Assignment )
Example 1.4
Write an OpenMP-OpenCL program to perform matrix-matrix multiplication using the BLAS1 library on the device GPU.
Example 1.5
Write an OpenMP-OpenCL program to perform matrix-matrix multiplication on Multi-GPUs using global memory and local memory. ( Assignment )


Description of OpenMP - OpenCL Programs

Example 1.1: Write an OpenMP-OpenCL program to compute vector-vector multiplication based on OpenMP directives and OpenCL
(Download source code: Vect_Vect_Add_Openmp_Opencl.c, Vect-Vect-Add.cl)
  • Objective

    Write an OpenMP-OpenCL program to compute vector-vector multiplication using the global memory model of OpenCL.

  • Description

    This is an OpenMP implementation that uses p OpenMP threads on the host-CPU and performs the computation on the device-GPU. Vector-vector multiplication is implemented with the block-striped partitioning algorithm across host-CPU and device-GPU. Each OpenMP thread gets a block of entries of each vector and transfers its block from the host-CPU to the device-GPU. On each device-GPU, the partial elements of the two vectors are multiplied by calling the OpenCL kernel Vector-VectorMultiplication, which writes a partial scalar value. The resulting scalar is transferred from the device-GPU to the host-CPU.

  • Output

    The OpenMP master thread on the host-CPU prints the resultant scalar value.
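
The block-striped partitioning described above can be sketched on the host side as follows. This is only an illustration under stated assumptions: the function names dot_block_striped and partial_dot are hypothetical, and the device step (buffer transfer, kernel launch, read-back) is replaced by a plain C function so that the sketch runs without an OpenCL runtime. It is not the downloadable source.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the device kernel (assumption): multiplies one block of
 * the two vectors element by element and returns the partial scalar. */
static double partial_dot(const double *a, const double *b, size_t len)
{
    double s = 0.0;
    for (size_t i = 0; i < len; ++i)
        s += a[i] * b[i];
    return s;
}

/* Block-striped vector-vector multiplication over p host threads.
 * Hypothetical helper: thread t owns the index range [lo, hi). */
double dot_block_striped(const double *a, const double *b, size_t n, int p)
{
    assert(p > 0);
    double total = 0.0;

    #pragma omp parallel for reduction(+:total) num_threads(p)
    for (int t = 0; t < p; ++t) {
        size_t lo = (n * (size_t)t) / (size_t)p;
        size_t hi = (n * (size_t)(t + 1)) / (size_t)p;
        /* In an OpenMP-OpenCL program this is where the block would be
         * copied to the GPU, the kernel enqueued, and the partial
         * scalar read back to the host. */
        total += partial_dot(a + lo, b + lo, hi - lo);
    }
    return total;
}
```

The integer formula for lo and hi spreads any remainder across the threads, so block sizes differ by at most one element, and a thread simply receives an empty block when p exceeds n. The pragma is ignored when the file is compiled without OpenMP support, in which case the loop runs serially with the same result.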



Example 1.2: Write an OpenMP-OpenCL program to compute the value of pi by numerical integration using OpenMP directives and OpenCL. ( Assignment )
  • Objective

    Write an OpenMP-OpenCL program to compute the value of pi by numerical integration using OpenMP directives and OpenCL.

  • Description

    This is an OpenMP implementation that uses p OpenMP threads on the host-CPU and performs the computation on the device-GPU using OpenCL APIs. One approach is to partition the data among the threads: the interval of integration [0,1] is partitioned among the OpenMP threads, and each thread estimates the local integral over its share of the subintervals on the GPU using OpenCL APIs. The results computed on the GPU for each OpenMP thread are transferred back to the host-CPU, where they are combined to produce the final result. The host-CPU prints the result.

    To perform this integration numerically, divide the interval from 0 to 1 into n subintervals and add up the areas of the rectangles, as shown in Figure 1 (n = 5). Larger values of n give more accurate approximations of pi.

    Figure 1. Numerical integration used to approximate pi

    We assume that n is the total number of subintervals, p is the number of threads, and p < n. One simple way to distribute the subintervals is to divide n by p. Two kinds of mapping balance the load. A block mapping partitions the array elements into blocks of consecutive entries and assigns one block to each thread. A cyclic mapping assigns the first element to the first thread, the second element to the second thread, and so on; since n > p, the assignment wraps around to the first thread and repeats until all the elements are assigned. We have used a cyclic mapping to partition the interval [0,1] onto p threads.

  • Input

    The OpenMP master thread on the host-CPU reads the number of subintervals n.

  • Output

    The OpenMP master thread on the host-CPU prints the computed value of pi.
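
The cyclic mapping described above can be illustrated with a short host-only sketch. Two assumptions are made here that the text does not state: the integrand is taken to be 4/(1+x^2), whose integral over [0,1] equals pi, and each rectangle is evaluated at the midpoint of its subinterval. The function name pi_cyclic is hypothetical, and the per-thread loop stands in for the work that the assignment places on the GPU.

```c
#include <assert.h>

/* Integrand (assumption): the integral of 4/(1+x^2) over [0,1] is pi. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Approximates pi with n rectangles distributed cyclically over p threads.
 * Thread t handles subintervals t, t+p, t+2p, ... */
double pi_cyclic(long n, int p)
{
    assert(n > 0 && p > 0);
    const double h = 1.0 / (double)n;   /* width of one subinterval */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum) num_threads(p)
    for (int t = 0; t < p; ++t) {
        double local = 0.0;             /* this thread's local integral / h */
        for (long i = t; i < n; i += p)
            local += f(h * ((double)i + 0.5));   /* midpoint of subinterval i */
        sum += local;
    }
    return h * sum;
}
```

Calling pi_cyclic(100000, 4) gives a value that agrees with pi to several decimal places, while n = 5 (the case drawn in Figure 1) is only accurate to about two. The reduction clause plays the role of combining the per-thread results on the host-CPU.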



Example 1.3: Write an OpenMP-OpenCL program to perform matrix-vector multiplication based on OpenMP directives and OpenCL, using BLAS library function calls to compute the product of each block of the matrix with the vector. ( Assignment )
  • Objective

    Write an OpenMP-OpenCL program to perform matrix-vector multiplication based on BLAS1 library function calls that compute vector-vector multiplication.

  • Description

    This is an OpenMP implementation that uses p OpenMP threads on the host-CPU and performs the computation on the device-GPU. Matrix-vector multiplication is implemented with the block-striped partitioning algorithm across host-CPU and device-GPU. Each OpenMP thread gets a block of rows of the matrix and transfers its block of rows, together with the vector, from the host-CPU to the device-GPU. Each device-GPU then multiplies the corresponding block of the matrix with the vector by calling the OpenCL kernel VectorVectorMultiplication (a BLAS1 library call) and writes the partial product into the result vector. The result vector is transferred from the device-GPU to the output array on the host-CPU.

  • Input

    The OpenMP master thread on the host-CPU reads the input matrix and the vector of size n.

  • Output

    The OpenMP master thread on the host-CPU prints the resultant vector.
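
As a rough host-only illustration of the row-block scheme above, the sketch below gives each thread a contiguous block of rows and computes every output entry as one vector-vector product. The names matvec_block_rows and dot are hypothetical; dot is a hand-written stand-in for a BLAS1-style call, and no OpenCL or BLAS library is actually invoked.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written stand-in for a BLAS1-style vector-vector product. */
static double dot(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* y = A * x for a row-major rows-by-cols matrix A, with the rows split
 * into p contiguous blocks, one block per thread (hypothetical helper). */
void matvec_block_rows(const double *A, const double *x, double *y,
                       size_t rows, size_t cols, int p)
{
    assert(p > 0);

    #pragma omp parallel for num_threads(p)
    for (int t = 0; t < p; ++t) {
        size_t lo = (rows * (size_t)t) / (size_t)p;
        size_t hi = (rows * (size_t)(t + 1)) / (size_t)p;
        /* In the OpenMP-OpenCL version, rows [lo, hi) and the vector x
         * would be sent to the GPU and the partial result read back. */
        for (size_t i = lo; i < hi; ++i)
            y[i] = dot(A + i * cols, x, cols);
    }
}
```

Because each thread writes a disjoint slice of y, no reduction or locking is needed when the partial products are gathered into the result vector.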


Example 1.4: Write an OpenMP-OpenCL program to perform matrix-matrix multiplication based on OpenMP directives and OpenCL, using BLAS1 library function calls to compute the block matrix-matrix products.
(Download source code based on Device Global Memory: Mat_Mat_Mult_Openmp_Opencl_GlobalMemory.cl)
(Download source code based on Device Shared Memory: Mat_Mat_Mult_Openmp_Opencl_SharedMemory.cl)

clOmpMatMatMultShared.c / matrixMultShared.cl
  • Objective

    Write an OpenMP-OpenCL program to compute matrix-matrix multiplication based on OpenMP directives and OpenCL, using BLAS1 library function calls to compute vector-vector multiplication.

  • Description

    This is an OpenMP implementation of matrix-matrix multiplication that uses p OpenMP threads on the host-CPU and performs the computation on the device-GPU. Block-striped partitioning of both input matrices is performed on the host-CPU, and on the device-GPU a BLAS1 library call is used to compute the vector-vector multiplications. Each OpenMP thread gets a block of rows of input matrix A and a block of columns of input matrix B. The block-matrix computations are performed on the device-GPU using tuned BLAS1 mathematical libraries. The device-GPU then multiplies the corresponding block matrices by calling the OpenCL kernel MatrixMatrixMultiplication, which uses the BLAS1 library, and writes the partial output matrix C. The resulting block of matrix C is transferred from the device-GPU to the host-CPU.

  • Input

    The OpenMP master thread on the host-CPU reads both input matrices, each square of size n.

  • Output

    The OpenMP master thread on the host-CPU prints the resultant output matrix.
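
One way to picture matrix-matrix multiplication built from vector-vector products is sketched below: every entry of C is the product of a row of A and a column of B, where the column is read with a stride of n. The helper ddot_strided is written by hand here and merely imitates the increment arguments of a BLAS1 dot product; it and matmul_rows_by_dot are hypothetical names, and the GPU transfer and kernel call are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written strided dot product, imitating the (n, x, incx, y, incy)
 * argument style of a BLAS1 dot routine. Not a real library call. */
static double ddot_strided(size_t n, const double *x, size_t incx,
                           const double *y, size_t incy)
{
    double s = 0.0;
    for (size_t k = 0; k < n; ++k)
        s += x[k * incx] * y[k * incy];
    return s;
}

/* C = A * B for row-major n-by-n matrices. Thread t owns a block of
 * rows of A (and of C) and pairs each row with every column of B. */
void matmul_rows_by_dot(const double *A, const double *B, double *C,
                        size_t n, int p)
{
    assert(p > 0);

    #pragma omp parallel for num_threads(p)
    for (int t = 0; t < p; ++t) {
        size_t lo = (n * (size_t)t) / (size_t)p;
        size_t hi = (n * (size_t)(t + 1)) / (size_t)p;
        for (size_t i = lo; i < hi; ++i)
            for (size_t j = 0; j < n; ++j)
                /* row i of A has stride 1; column j of B has stride n */
                C[i * n + j] = ddot_strided(n, A + i * n, 1, B + j, n);
    }
}
```

The strided column access is simple but cache-unfriendly on a CPU; it is shown only because it maps directly onto the "row of A times column of B" description.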


Example 1.5: Write an OpenMP-OpenCL program to perform matrix-matrix multiplication using global and local memory.

  • Objective

    Write an OpenMP-OpenCL program to compute matrix-matrix multiplication based on OpenMP directives and OpenCL, using global and local memory.

  • Description

    This is an OpenMP implementation of matrix-matrix multiplication that uses p OpenMP threads on the host-CPU and performs the computation on the device-GPU. Block-striped partitioning of both input matrices is performed on the host-CPU. Each OpenMP thread gets a block of rows of input matrix A and a block of columns of input matrix B. The block-matrix computations are performed on the device-GPU: it multiplies the corresponding block matrices by calling the OpenCL kernel MatrixMatrixMultiplication and writes the partial output matrix C. The resulting block of matrix C is transferred from the device-GPU to the host-CPU.

  • Input

    The OpenMP master thread on the host-CPU reads both input matrices, each square of size n.

  • Output

    The OpenMP master thread on the host-CPU prints the resultant output matrix.
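
The idea behind using local memory is to stage small tiles of A and B in fast storage and reuse them for many multiply-adds. The sketch below mimics that pattern on the CPU only: the small arrays ta and tb play the role that OpenCL local memory would play inside a work-group, and the tile edge TILE = 2 is an arbitrary assumption chosen to keep the example small. The name matmul_tiled is hypothetical and this is not the kernel from the course material.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 2   /* hypothetical tile edge; a real kernel would use more */

/* C = A * B for row-major n-by-n matrices, computed tile by tile.
 * Tiles that stick out past the matrix edge are padded with zeros. */
void matmul_tiled(const double *A, const double *B, double *C, size_t n)
{
    assert(A && B && C);

    #pragma omp parallel for
    for (size_t bi = 0; bi < n; bi += TILE) {
        for (size_t bj = 0; bj < n; bj += TILE) {
            double acc[TILE][TILE] = {{0.0}};   /* one output tile of C */
            for (size_t bk = 0; bk < n; bk += TILE) {
                double ta[TILE][TILE], tb[TILE][TILE];
                /* Stage one tile of A and one tile of B (the step that
                 * would copy global memory into local memory). */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j) {
                        size_t ar = bi + i, ac = bk + j;
                        size_t br = bk + i, bc = bj + j;
                        ta[i][j] = (ar < n && ac < n) ? A[ar * n + ac] : 0.0;
                        tb[i][j] = (br < n && bc < n) ? B[br * n + bc] : 0.0;
                    }
                /* Multiply the staged tiles and accumulate. */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j)
                        for (size_t k = 0; k < TILE; ++k)
                            acc[i][j] += ta[i][k] * tb[k][j];
            }
            /* Write the finished tile back, skipping the padded part. */
            for (size_t i = 0; i < TILE; ++i)
                for (size_t j = 0; j < TILE; ++j)
                    if (bi + i < n && bj + j < n)
                        C[(bi + i) * n + (bj + j)] = acc[i][j];
        }
    }
}
```

Each staged element is reused TILE times before it is discarded, which is the reason the local-memory variant reduces global-memory traffic compared with the global-memory-only variant.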



Centre for Development of Advanced Computing