hyPACK-2013 : Prog. on HPC GPU Cluster : OpenMP & CUDA Prog.
|
A prototype Hybrid Adaptive Cluster can be made "adaptive" to the application it is running,
assigning the most effective resources in real time as the application demands, without
requiring modifications to the application. The goal of this mixed environment is total
workflow optimization: applications that do not parallelize well on scalar processors can be
optimized with a more appropriate computation model. The aim of the system is to develop
system software and to integrate state-of-the-art components such as reconfigurable FPGAs,
stream accelerators (NVIDIA GPU computing, AMD stream computing, and the IBM Cell Broadband
Engine processor), and multi-core processors.
|
Example 1.1
|
Write an OpenMP-CUDA program to compute the value of pi by numerical integration,
using OpenMP directives and CUDA.
|
Example 1.2
|
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS1 (BLAS level 1) library function calls to compute the
vector-vector multiplications.
( Assignment )
|
Example 1.3
|
Write an OpenMP-CUDA program to perform matrix-vector multiplication based on OpenMP and
CUDA directives, using CUBLAS2 (BLAS level 2) library function calls to compute the block
matrix-vector multiplications.
( Assignment )
|
Example 1.4
|
Write an OpenMP-CUDA program for matrix-matrix multiplication based on OpenMP and CUDA
directives, using CUBLAS3 (BLAS level 3) library function calls to compute the block
matrix-matrix multiplications.
( Assignment )
|
Example 1.5
|
Write an OpenMP program for matrix-matrix multiplication based on single-threaded OpenMP,
using vendor-supplied mathematical libraries on the host-CPU and the CUDA BLAS3 library on
the device-GPU.
|
Description of OpenMP-CUDA Programs
Example 1.1:
|
Write an OpenMP-CUDA program to compute the value of pi by numerical integration,
using OpenMP directives and CUDA.
|
-
Objective
Write an OpenMP-CUDA program to compute the value of pi by numerical integration,
using OpenMP directives and CUDA.
-
Description
This is an OpenMP implementation on the host-CPU using p OpenMP threads, with the
computation performed on the device-GPU.
One approach is to partition the data among the threads. That is, we partition the
interval of integration [0,1] among the OpenMP threads, and each thread estimates its
local integral over its subintervals on the GPU using CUDA APIs. The partial results
computed on the GPU by the individual OpenMP threads are transferred back to the
host-CPU, where they are combined to produce the final result, which the host-CPU
prints.
To perform this integration numerically, divide the interval from 0 to 1 into n
subintervals and add up the areas of the rectangles, as shown in Figure 1 (for n = 5).
Larger values of n give more accurate approximations of pi.
Figure 1: Numerical integration for pi
We assume that n is the total number of subintervals, p is the number of threads, and
p < n. One simple way to distribute the subintervals is to divide n by p and give each
thread an equal share. There are two kinds of mappings that balance the load. One is a
block mapping, which partitions the subintervals into blocks of consecutive entries and
assigns one block to each thread. The other is a cyclic mapping: it assigns the first
subinterval to the first thread, the second to the second, and so on; once every thread
has one, the assignment wraps around to the first thread, and this repeats until all
subintervals are assigned. We have used a cyclic mapping for partitioning the interval
[0,1] onto the p threads. A minimal sketch of this scheme is given below.
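The following is a minimal sketch of the scheme, not the course's reference code: the file
name, grid/block sizes, and the default interval count are assumptions. Each OpenMP thread
takes a block of subintervals and offloads it to the GPU, where the kernel covers the block
cyclically with a grid-stride loop. Double-precision atomicAdd requires compute capability
6.0 or later; older GPUs would need an explicit reduction instead.

/* pi_omp_cuda.cu (assumed name) -- compile e.g.: nvcc -Xcompiler -fopenmp pi_omp_cuda.cu */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

/* Each CUDA thread evaluates f(x) = 4/(1+x*x) at the midpoints of the
   subintervals assigned to it cyclically (grid-stride loop) and adds its
   partial sum into a single device accumulator. */
__global__ void integrate(double h, long lo, long hi, double *partial)
{
    long i = lo + (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)gridDim.x * blockDim.x;
    double sum = 0.0;
    for (; i < hi; i += stride) {             /* cyclic mapping of subintervals */
        double x = h * ((double)i + 0.5);     /* midpoint of subinterval i */
        sum += 4.0 / (1.0 + x * x);
    }
    atomicAdd(partial, sum);                  /* double atomicAdd: needs sm_60+ */
}

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 1000000;   /* number of subintervals */
    double h = 1.0 / (double)n;
    double pi = 0.0;

#pragma omp parallel reduction(+ : pi)
    {
        int t = omp_get_thread_num();
        int p = omp_get_num_threads();
        long lo = (long)t * n / p;                   /* this thread's block */
        long hi = (long)(t + 1) * n / p;

        double *d_partial, h_partial = 0.0;
        cudaMalloc((void **)&d_partial, sizeof(double));
        cudaMemcpy(d_partial, &h_partial, sizeof(double), cudaMemcpyHostToDevice);

        integrate<<<64, 256>>>(h, lo, hi, d_partial);    /* sizes illustrative */
        cudaMemcpy(&h_partial, d_partial, sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_partial);

        pi += h * h_partial;     /* local integrals combined by the reduction */
    }
    printf("pi ~ %.15f\n", pi);
    return 0;
}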
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
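For concreteness, here is a minimal, self-contained round trip through device memory using
exactly these calls; the array name and size are illustrative, and only the allocation is
error-checked.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 1024;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    float *d_a = NULL;
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) {  /* allocate on device-GPU */
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    /* host-CPU -> device-GPU */
    /* ... a kernel would operate on d_a here ... */
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    /* device-GPU -> host-CPU */
    cudaFree(d_a);                                          /* free device memory */
    free(h_a);
    return 0;
}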
-
Input
OpenMP master thread on the host-CPU reads the input parameter n, the number of
subintervals, on the command line.
-
Output
OpenMP master thread on the host-CPU prints the computed value of pi.
|
Example 1.2:
|
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS1 library function calls to compute the vector-vector
multiplications.
|
-
Objective
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS1 library function calls to compute the vector-vector
multiplications.
-
Description
This is an OpenMP implementation on the host-CPU using p OpenMP threads, with the
computation performed on the device-GPU.
It implements matrix-vector multiplication using the block-striped partitioning
algorithm on the host-CPU and device-GPU.
Each OpenMP thread gets a block of rows of the matrix and transfers its block of rows,
together with the vector, from the host-CPU to the device-GPU.
On the device-GPU, each row of the block matrix is multiplied with the vector by
calling the vector-vector multiplication (dot product) routine of the CUBLAS1 library,
and the partial products are written into the result vector. The result vector is
transferred from the device-GPU back to the host-CPU and copied into the output array.
For the multiplication of a local block of rows with the vector on the GPU, please
refer to the CUDA-enabled NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
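The sketch below follows this description under stated assumptions: the cublas_v2
handle-based API is used (one handle per OpenMP thread), the matrix is square of an assumed
size, and the data is made up for the run. Each thread copies its row stripe and the vector
to the GPU and computes one cublasSdot per row of its stripe.

/* compile e.g.: nvcc -Xcompiler -fopenmp mv_blas1.cu -lcublas (assumed file name) */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int n = 512;                                  /* assumed square size */
    float *A = (float *)malloc((size_t)n * n * sizeof(float));
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i)     x[i] = 1.0f;

#pragma omp parallel
    {
        int t = omp_get_thread_num(), p = omp_get_num_threads();
        int lo = t * n / p, hi = (t + 1) * n / p; /* block-striped rows */
        int rows = hi - lo;

        cublasHandle_t h;
        cublasCreate(&h);                         /* one handle per thread */

        float *dA, *dx;
        cudaMalloc((void **)&dA, (size_t)rows * n * sizeof(float));
        cudaMalloc((void **)&dx, n * sizeof(float));
        cudaMemcpy(dA, A + (size_t)lo * n, (size_t)rows * n * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

        for (int i = 0; i < rows; ++i)            /* one BLAS1 dot product per row */
            cublasSdot(h, n, dA + (size_t)i * n, 1, dx, 1, &y[lo + i]);

        cudaFree(dA); cudaFree(dx);
        cublasDestroy(h);
    }
    printf("y[0] = %f (expect %d)\n", y[0], n);
    free(A); free(x); free(y);
    return 0;
}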
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads the input matrix and the vector of size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant vector of size n.
|
Example 1.3:
|
Write an OpenMP-CUDA program to perform matrix-vector multiplication based on OpenMP and
CUDA directives, using CUBLAS2 library function calls to compute the block matrix-vector
multiplications.
|
-
Objective
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS2 library function calls to compute the block matrix-vector
multiplications.
-
Description
This is an OpenMP implementation on the host-CPU using p OpenMP threads, with the
computation performed on the device-GPU.
It implements matrix-vector multiplication using the block-striped partitioning
algorithm on the host-CPU and device-GPU.
Each OpenMP thread gets a block of rows of the matrix and transfers its block of rows,
together with the vector, from the host-CPU to the device-GPU.
Then the device-GPU multiplies the corresponding block matrix with the vector by
calling the matrix-vector multiplication routine of the CUBLAS2 library and writes the
partial product into the result vector. The result vector is transferred from the
device-GPU back to the host-CPU and copied into the output array.
For the multiplication of a local block of rows with the vector on the GPU, please
refer to the CUDA-enabled NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
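A minimal sketch under the same assumptions (cublas_v2 API, assumed sizes): because cuBLAS
stores matrices column-major, the row-major block of rows is handed to cublasSgemv as its
transpose. The OpenMP striping is omitted for brevity; the function below is what each
thread would call on its own block.

/* compile e.g.: nvcc mv_blas2.cu -lcublas (assumed file name) */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* y = A * x for a block of `rows` consecutive rows of a row-major n-column
   matrix, using the BLAS2 call cublasSgemv. */
static void block_gemv(const float *Ablock, const float *x,
                       float *yblock, int rows, int n)
{
    cublasHandle_t h;
    cublasCreate(&h);

    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, (size_t)rows * n * sizeof(float));
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, rows * sizeof(float));
    cudaMemcpy(dA, Ablock, (size_t)rows * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

    float alpha = 1.0f, beta = 0.0f;
    /* The column-major view of the row-major block is n x rows; its
       transpose (CUBLAS_OP_T) is the rows x n block we want. */
    cublasSgemv(h, CUBLAS_OP_T, n, rows, &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaMemcpy(yblock, dy, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    cublasDestroy(h);
}

int main(void)
{
    enum { N = 256 };
    static float A[N * N], x[N], y[N];
    for (int i = 0; i < N * N; ++i) A[i] = 1.0f;
    for (int i = 0; i < N; ++i)     x[i] = 1.0f;
    block_gemv(A, x, y, N, N);     /* whole matrix as one block for brevity */
    printf("y[0] = %f (expect %d)\n", y[0], N);
    return 0;
}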
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads the input matrix and the vector of size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant vector.
|
Example 1.4:
|
Write an OpenMP-CUDA program to perform matrix-matrix multiplication based on OpenMP and
CUDA directives, using CUBLAS3 library function calls to compute the block matrix-matrix
multiplications.
|
-
Objective
Write an OpenMP-CUDA program for matrix-matrix multiplication based on OpenMP and CUDA
directives, using CUBLAS3 library function calls to compute the block matrix-matrix
multiplications.
-
Description
This is an OpenMP implementation of matrix-matrix multiplication on the host-CPU using
p OpenMP threads, with the computation performed on the device-GPU. Block-striped
partitioning of both input matrices is performed on the host-CPU, and on the device-GPU
the CUDA BLAS3 library call is used to compute the block matrix-matrix multiplication.
Each OpenMP thread gets a block of rows of input matrix A and a block of columns of
input matrix B, and transfers its blocks of A and B from the host-CPU to the
device-GPU.
Then the device-GPU multiplies the corresponding block matrices by calling the
matrix-matrix multiplication routine of the CUBLAS3 library and writes the partial
output matrix C. The result block matrix C is transferred from the device-GPU back to
the host-CPU.
For block matrix-matrix multiplication on the GPU, please refer to the CUDA-enabled
NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
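A minimal single-block sketch (cublas_v2 API, assumed sizes): cuBLAS expects column-major
storage, so for row-major C buffers the standard trick is to compute C^T = B^T * A^T by
swapping the operand order in cublasSgemm. Per-thread block striping is omitted for brevity.

/* compile e.g.: nvcc mm_blas3.cu -lcublas (assumed file name) */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int n = 256;                                  /* assumed square size */
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A = (float *)malloc(bytes), *B = (float *)malloc(bytes), *C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    float alpha = 1.0f, beta = 0.0f;
    /* Swapped operands: the column-major result dC is C^T, i.e. row-major C. */
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dB, n, dA, n, &beta, dC, n);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expect %d)\n", C[0], n);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}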
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads both input matrices of square size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant output matrix.
|
Example 1.5:
|
Write an OpenMP program to perform matrix-matrix multiplication based on single-threaded
OpenMP, using vendor-supplied mathematical libraries on the host-CPU and the CUDA BLAS3
library on the device-GPU.
|
-
Objective
Write an OpenMP program for matrix-matrix multiplication based on single-threaded OpenMP,
using vendor-supplied mathematical libraries on the host-CPU and the CUDA BLAS3 library
on the device-GPU.
-
Description
This is an OpenMP implementation of matrix-matrix multiplication on the host-CPU using
p OpenMP threads, with part of the computation performed on the device-GPU.
Block-striped partitioning of both input matrices is performed on the host-CPU, and on
the device-GPU the CUDA BLAS3 library call is used to compute the block matrix-matrix
multiplication.
Each OpenMP thread gets a block of rows of input matrix A and a block of columns of
input matrix B. Some of the block-matrix computations are performed on the host-CPU
using tuned, vendor-supplied mathematical libraries; the other OpenMP thread transfers
the remaining blocks of A and B from the host-CPU to the device-GPU.
Then the device-GPU multiplies the corresponding block matrices by calling the
matrix-matrix multiplication routine of the CUBLAS3 library and writes the partial
output matrix C. The result block matrix C is transferred from the device-GPU back to
the host-CPU.
For block matrix-matrix multiplication on the GPU, please refer to the CUDA-enabled
NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
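The sketch below illustrates the host/GPU split under loud assumptions: cblas_sgemm from a
generic CBLAS header stands in for the vendor-supplied library (MKL, ACML, or similar on the
actual cluster), thread 0 is arbitrarily chosen as the GPU thread, and sizes are made up.
Thread 0 offloads its row stripe to cublasSgemm while the other threads compute their
stripes on the host-CPU.

/* compile e.g.: nvcc -Xcompiler -fopenmp mm_hybrid.cu -lcublas -lcblas (or vendor BLAS) */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cblas.h>          /* assumed CBLAS interface of the vendor library */
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int n = 256;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A = (float *)malloc(bytes), *B = (float *)malloc(bytes), *C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

#pragma omp parallel
    {
        int t = omp_get_thread_num(), p = omp_get_num_threads();
        int lo = t * n / p, rows = (t + 1) * n / p - lo;  /* this thread's row stripe */

        if (t == 0) {                                     /* GPU stripe via cuBLAS */
            float *dA, *dB, *dC, alpha = 1.0f, beta = 0.0f;
            cudaMalloc((void **)&dA, (size_t)rows * n * sizeof(float));
            cudaMalloc((void **)&dB, bytes);
            cudaMalloc((void **)&dC, (size_t)rows * n * sizeof(float));
            cudaMemcpy(dA, A + (size_t)lo * n, (size_t)rows * n * sizeof(float),
                       cudaMemcpyHostToDevice);
            cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
            cublasHandle_t h;
            cublasCreate(&h);
            /* column-major trick: C_stripe^T = B^T * A_stripe^T */
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, rows, n,
                        &alpha, dB, n, dA, n, &beta, dC, n);
            cudaMemcpy(C + (size_t)lo * n, dC, (size_t)rows * n * sizeof(float),
                       cudaMemcpyDeviceToHost);
            cublasDestroy(h);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
        } else {                                          /* CPU stripe via host BLAS */
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        rows, n, n, 1.0f, A + (size_t)lo * n, n,
                        B, n, 0.0f, C + (size_t)lo * n, n);
        }
    }
    printf("C[0] = %f (expect %d)\n", C[0], n);
    free(A); free(B); free(C);
    return 0;
}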
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads both input matrices of square size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant output matrix.
|