hyPACK-2013 : Prog. on HPC GPU Cluster : OpenMP & CUDA Prog.
|
A prototype Hybrid Adaptive Cluster can be made "adaptive" to the application it is running,
assigning the most effective resources in real time as the application demands, without
requiring modifications to the application. The goal of this mixed environment is total
workflow optimization: applications that do not parallelize well on scalar processors can be
optimized with a more appropriate computation model. The aim of the system is to develop
system software and to integrate state-of-the-art components such as reconfigurable FPGAs,
stream accelerators (NVIDIA GPU computing, AMD stream computing, and the IBM Cell Broadband
Engine processor), and multi-core processors.
|
Example 1.1
|
Write an OpenMP-CUDA program to compute the value of pi by numerical integration,
using OpenMP directives and CUDA.
|
Example 1.2
|
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS1 (BLAS level 1) library function calls to compute the
vector-vector multiplications.
( Assignment )
|
Example 1.3
|
Write an OpenMP-CUDA program to perform matrix-vector multiplication based on OpenMP and
CUDA directives, using CUBLAS2 (BLAS level 2) library function calls to compute the block
matrix-vector multiplications.
( Assignment )
|
Example 1.4
|
Write an OpenMP-CUDA program for matrix-matrix multiplication based on OpenMP and CUDA
directives, using CUBLAS3 (BLAS level 3) library function calls to compute the block
matrix-matrix multiplications.
( Assignment )
|
Example 1.5
|
Write an OpenMP program for matrix-matrix multiplication based on single-threaded OpenMP,
using vendor-supplied mathematical libraries on the host-CPU and the CUDA BLAS3 library on
the device-GPU.
|
Description of OpenMP-CUDA Programs
Example 1.1:
|
Write an OpenMP-CUDA program to compute the value of pi by numerical integration,
using OpenMP directives and CUDA.
|
-
Objective
Write an OpenMP-CUDA program to compute the value of pi by numerical integration,
using OpenMP directives and CUDA.
-
Description
This is an OpenMP implementation on the host-CPU using p OpenMP threads, with the
computation performed on the device-GPU.
One approach is to partition the data among the threads. That is, we partition the
interval of integration [0,1] among the OpenMP threads, and each thread estimates its
local integral over its subintervals on the GPU using CUDA APIs. The partial results
computed on the GPU by the individual OpenMP threads are transferred back to the
host-CPU, where they are combined to produce the final result, which the host-CPU
prints.
To perform this integration numerically, divide the interval from 0 to 1 into n
subintervals and add up the areas of the rectangles, as shown in Figure 1 (for n = 5).
Larger values of n give more accurate approximations of pi.
Figure 1: Numerical integration for pi
We assume that n is the total number of subintervals, p is the number of threads, and
p < n. One simple way to distribute the subintervals is to divide n by p and give each
thread an equal share. There are two kinds of mappings that balance the load. One is a
block mapping, which partitions the subintervals into blocks of consecutive entries and
assigns one block to each thread. The other is a cyclic mapping: it assigns the first
subinterval to the first thread, the second to the second, and so on; once every thread
has one, the assignment wraps around to the first thread, and this repeats until all
subintervals are assigned. We have used a cyclic mapping for partitioning the interval
[0,1] onto the p threads. A minimal sketch of this scheme is given below.
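The following is a minimal sketch of the scheme, not the course's reference code: the file
name, grid/block sizes, and the default interval count are assumptions. Each OpenMP thread
takes a block of subintervals and offloads it to the GPU, where the kernel covers the block
cyclically with a grid-stride loop. Double-precision atomicAdd requires compute capability
6.0 or later; older GPUs would need an explicit reduction instead.

/* pi_omp_cuda.cu (assumed name) -- compile e.g.: nvcc -Xcompiler -fopenmp pi_omp_cuda.cu */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

/* Each CUDA thread evaluates f(x) = 4/(1+x*x) at the midpoints of the
   subintervals assigned to it cyclically (grid-stride loop) and adds its
   partial sum into a single device accumulator. */
__global__ void integrate(double h, long lo, long hi, double *partial)
{
    long i = lo + (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)gridDim.x * blockDim.x;
    double sum = 0.0;
    for (; i < hi; i += stride) {             /* cyclic mapping of subintervals */
        double x = h * ((double)i + 0.5);     /* midpoint of subinterval i */
        sum += 4.0 / (1.0 + x * x);
    }
    atomicAdd(partial, sum);                  /* double atomicAdd: needs sm_60+ */
}

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 1000000;   /* number of subintervals */
    double h = 1.0 / (double)n;
    double pi = 0.0;

#pragma omp parallel reduction(+ : pi)
    {
        int t = omp_get_thread_num();
        int p = omp_get_num_threads();
        long lo = (long)t * n / p;                   /* this thread's block */
        long hi = (long)(t + 1) * n / p;

        double *d_partial, h_partial = 0.0;
        cudaMalloc((void **)&d_partial, sizeof(double));
        cudaMemcpy(d_partial, &h_partial, sizeof(double), cudaMemcpyHostToDevice);

        integrate<<<64, 256>>>(h, lo, hi, d_partial);    /* sizes illustrative */
        cudaMemcpy(&h_partial, d_partial, sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_partial);

        pi += h * h_partial;     /* local integrals combined by the reduction */
    }
    printf("pi ~ %.15f\n", pi);
    return 0;
}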
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
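For concreteness, here is a minimal, self-contained round trip through device memory using
exactly these calls; the array name and size are illustrative, and only the allocation is
error-checked.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 1024;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    float *d_a = NULL;
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) {  /* allocate on device-GPU */
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    /* host-CPU -> device-GPU */
    /* ... a kernel would operate on d_a here ... */
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);    /* device-GPU -> host-CPU */
    cudaFree(d_a);                                          /* free device memory */
    free(h_a);
    return 0;
}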
-
Input
OpenMP master thread on the host-CPU reads the input parameter n, the number of
subintervals, on the command line.
-
Output
OpenMP master thread on the host-CPU prints the computed value of pi.
|
Example 1.2:
|
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS1 library function calls to compute the vector-vector
multiplications.
|
-
Objective
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS1 library function calls to compute the vector-vector
multiplications.
-
Description
This is an OpenMP implementation on the host-CPU using p OpenMP threads, with the
computation performed on the device-GPU.
It implements matrix-vector multiplication using the block-striped partitioning
algorithm on the host-CPU and device-GPU.
Each OpenMP thread gets a block of rows of the matrix and transfers its block of rows,
together with the vector, from the host-CPU to the device-GPU.
On the device-GPU, each row of the block matrix is multiplied with the vector by
calling the vector-vector multiplication (dot product) routine of the CUBLAS1 library,
and the partial products are written into the result vector. The result vector is
transferred from the device-GPU back to the host-CPU and copied into the output array.
For the multiplication of a local block of rows with the vector on the GPU, please
refer to the CUDA-enabled NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
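The sketch below follows this description under stated assumptions: the cublas_v2
handle-based API is used (one handle per OpenMP thread), the matrix is square of an assumed
size, and the data is made up for the run. Each thread copies its row stripe and the vector
to the GPU and computes one cublasSdot per row of its stripe.

/* compile e.g.: nvcc -Xcompiler -fopenmp mv_blas1.cu -lcublas (assumed file name) */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int n = 512;                                  /* assumed square size */
    float *A = (float *)malloc((size_t)n * n * sizeof(float));
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i)     x[i] = 1.0f;

#pragma omp parallel
    {
        int t = omp_get_thread_num(), p = omp_get_num_threads();
        int lo = t * n / p, hi = (t + 1) * n / p; /* block-striped rows */
        int rows = hi - lo;

        cublasHandle_t h;
        cublasCreate(&h);                         /* one handle per thread */

        float *dA, *dx;
        cudaMalloc((void **)&dA, (size_t)rows * n * sizeof(float));
        cudaMalloc((void **)&dx, n * sizeof(float));
        cudaMemcpy(dA, A + (size_t)lo * n, (size_t)rows * n * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

        for (int i = 0; i < rows; ++i)            /* one BLAS1 dot product per row */
            cublasSdot(h, n, dA + (size_t)i * n, 1, dx, 1, &y[lo + i]);

        cudaFree(dA); cudaFree(dx);
        cublasDestroy(h);
    }
    printf("y[0] = %f (expect %d)\n", y[0], n);
    free(A); free(x); free(y);
    return 0;
}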
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads the input matrix and the vector of size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant vector of size n.
|
Example 1.3:
|
Write an OpenMP-CUDA program to perform matrix-vector multiplication based on OpenMP and
CUDA directives, using CUBLAS2 library function calls to compute the block matrix-vector
multiplications.
|
-
Objective
Write an OpenMP-CUDA program for matrix-vector multiplication based on OpenMP and CUDA
directives, using CUBLAS2 library function calls to compute the block matrix-vector
multiplications.
-
Description
This is an OpenMP implementation on the host-CPU using p OpenMP threads, with the
computation performed on the device-GPU.
It implements matrix-vector multiplication using the block-striped partitioning
algorithm on the host-CPU and device-GPU.
Each OpenMP thread gets a block of rows of the matrix and transfers its block of rows,
together with the vector, from the host-CPU to the device-GPU.
Then the device-GPU multiplies the corresponding block matrix with the vector by
calling the matrix-vector multiplication routine of the CUBLAS2 library and writes the
partial product into the result vector. The result vector is transferred from the
device-GPU back to the host-CPU and copied into the output array.
For the multiplication of a local block of rows with the vector on the GPU, please
refer to the CUDA-enabled NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
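A minimal sketch under the same assumptions (cublas_v2 API, assumed sizes): because cuBLAS
stores matrices column-major, the row-major block of rows is handed to cublasSgemv as its
transpose. The OpenMP striping is omitted for brevity; the function below is what each
thread would call on its own block.

/* compile e.g.: nvcc mv_blas2.cu -lcublas (assumed file name) */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* y = A * x for a block of `rows` consecutive rows of a row-major n-column
   matrix, using the BLAS2 call cublasSgemv. */
static void block_gemv(const float *Ablock, const float *x,
                       float *yblock, int rows, int n)
{
    cublasHandle_t h;
    cublasCreate(&h);

    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, (size_t)rows * n * sizeof(float));
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, rows * sizeof(float));
    cudaMemcpy(dA, Ablock, (size_t)rows * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

    float alpha = 1.0f, beta = 0.0f;
    /* The column-major view of the row-major block is n x rows; its
       transpose (CUBLAS_OP_T) is the rows x n block we want. */
    cublasSgemv(h, CUBLAS_OP_T, n, rows, &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaMemcpy(yblock, dy, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    cublasDestroy(h);
}

int main(void)
{
    enum { N = 256 };
    static float A[N * N], x[N], y[N];
    for (int i = 0; i < N * N; ++i) A[i] = 1.0f;
    for (int i = 0; i < N; ++i)     x[i] = 1.0f;
    block_gemv(A, x, y, N, N);     /* whole matrix as one block for brevity */
    printf("y[0] = %f (expect %d)\n", y[0], N);
    return 0;
}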
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads the input matrix and the vector of size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant vector.
|
Example 1.4:
|
Write an OpenMP-CUDA program to perform matrix-matrix multiplication based on OpenMP and
CUDA directives, using CUBLAS3 library function calls to compute the block matrix-matrix
multiplications.
|
-
Objective
Write an OpenMP-CUDA program for matrix-matrix multiplication based on OpenMP and CUDA
directives, using CUBLAS3 library function calls to compute the block matrix-matrix
multiplications.
-
Description
This is an OpenMP implementation of matrix-matrix multiplication on the host-CPU using
p OpenMP threads, with the computation performed on the device-GPU. Block-striped
partitioning of both input matrices is performed on the host-CPU, and on the device-GPU
the CUDA BLAS3 library call is used to compute the block matrix-matrix multiplication.
Each OpenMP thread gets a block of rows of input matrix A and a block of columns of
input matrix B, and transfers its blocks of A and B from the host-CPU to the
device-GPU.
Then the device-GPU multiplies the corresponding block matrices by calling the
matrix-matrix multiplication routine of the CUBLAS3 library and writes the partial
output matrix C. The result block matrix C is transferred from the device-GPU back to
the host-CPU.
For block matrix-matrix multiplication on the GPU, please refer to the CUDA-enabled
NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
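A minimal single-block sketch (cublas_v2 API, assumed sizes): cuBLAS expects column-major
storage, so for row-major C buffers the standard trick is to compute C^T = B^T * A^T by
swapping the operand order in cublasSgemm. Per-thread block striping is omitted for brevity.

/* compile e.g.: nvcc mm_blas3.cu -lcublas (assumed file name) */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int n = 256;                                  /* assumed square size */
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A = (float *)malloc(bytes), *B = (float *)malloc(bytes), *C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    float alpha = 1.0f, beta = 0.0f;
    /* Swapped operands: the column-major result dC is C^T, i.e. row-major C. */
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dB, n, dA, n, &beta, dC, n);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expect %d)\n", C[0], n);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}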
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads both input matrices of square size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant output matrix.
|
Example 1.5:
|
Write an OpenMP program to perform matrix-matrix multiplication based on single-threaded
OpenMP, using vendor-supplied mathematical libraries on the host-CPU and the CUDA BLAS3
library on the device-GPU.
|
-
Objective
Write an OpenMP program for matrix-matrix multiplication based on single-threaded OpenMP,
using vendor-supplied mathematical libraries on the host-CPU and the CUDA BLAS3 library
on the device-GPU.
-
Description
This is an OpenMP implementation of matrix-matrix multiplication on the host-CPU using
p OpenMP threads, with part of the computation performed on the device-GPU.
Block-striped partitioning of both input matrices is performed on the host-CPU, and on
the device-GPU the CUDA BLAS3 library call is used to compute the block matrix-matrix
multiplication.
Each OpenMP thread gets a block of rows of input matrix A and a block of columns of
input matrix B. Some of the block-matrix computations are performed on the host-CPU
using tuned, vendor-supplied mathematical libraries; the other OpenMP thread transfers
the remaining blocks of A and B from the host-CPU to the device-GPU.
Then the device-GPU multiplies the corresponding block matrices by calling the
matrix-matrix multiplication routine of the CUBLAS3 library and writes the partial
output matrix C. The result block matrix C is transferred from the device-GPU back to
the host-CPU.
For block matrix-matrix multiplication on the GPU, please refer to the CUDA-enabled
NVIDIA GPU material, gpu-comp-cublas-cuda-numerical.html.
A minimal sketch of this scheme is given below.
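The sketch below illustrates the host/GPU split under loud assumptions: cblas_sgemm from a
generic CBLAS header stands in for the vendor-supplied library (MKL, ACML, or similar on the
actual cluster), thread 0 is arbitrarily chosen as the GPU thread, and sizes are made up.
Thread 0 offloads its row stripe to cublasSgemm while the other threads compute their
stripes on the host-CPU.

/* compile e.g.: nvcc -Xcompiler -fopenmp mm_hybrid.cu -lcublas -lcblas (or vendor BLAS) */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cblas.h>          /* assumed CBLAS interface of the vendor library */
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int n = 256;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A = (float *)malloc(bytes), *B = (float *)malloc(bytes), *C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

#pragma omp parallel
    {
        int t = omp_get_thread_num(), p = omp_get_num_threads();
        int lo = t * n / p, rows = (t + 1) * n / p - lo;  /* this thread's row stripe */

        if (t == 0) {                                     /* GPU stripe via cuBLAS */
            float *dA, *dB, *dC, alpha = 1.0f, beta = 0.0f;
            cudaMalloc((void **)&dA, (size_t)rows * n * sizeof(float));
            cudaMalloc((void **)&dB, bytes);
            cudaMalloc((void **)&dC, (size_t)rows * n * sizeof(float));
            cudaMemcpy(dA, A + (size_t)lo * n, (size_t)rows * n * sizeof(float),
                       cudaMemcpyHostToDevice);
            cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
            cublasHandle_t h;
            cublasCreate(&h);
            /* column-major trick: C_stripe^T = B^T * A_stripe^T */
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, rows, n,
                        &alpha, dB, n, dA, n, &beta, dC, n);
            cudaMemcpy(C + (size_t)lo * n, dC, (size_t)rows * n * sizeof(float),
                       cudaMemcpyDeviceToHost);
            cublasDestroy(h);
            cudaFree(dA); cudaFree(dB); cudaFree(dC);
        } else {                                          /* CPU stripe via host BLAS */
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        rows, n, n, 1.0f, A + (size_t)lo * n, n,
                        B, n, 0.0f, C + (size_t)lo * n, n);
        }
    }
    printf("C[0] = %f (expect %d)\n", C[0], n);
    free(A); free(B); free(C);
    return 0;
}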
-
CUDA APIs used:
To allocate memory on the device-GPU:
cudaMalloc(void** array, size_t size)
To free memory allocated on the device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
-
Input
OpenMP master thread on the host-CPU reads both input matrices of square size n.
-
Output
OpenMP master thread on the host-CPU prints the resultant output matrix.
|