



hyPACK-2013 : Prog. on HPC GPU Cluster : Pthreads & CUDA Prog.

A High Performance Computing GPU Cluster (HPC GPU Cluster), which integrates different programming paradigms on the host-CPU and the device-GPU, is available for laboratory sessions. Programming paradigms on the host-CPU (MPI, OpenMP, Pthreads) are combined with CUDA-enabled NVIDIA GPUs and NVIDIA/AMD-APP OpenCL on the device-GPU, and several example programs for dense/sparse matrix computations and the solution of partial differential equations are included.

Example 1.1 : Write a Pthreads - CUDA program to compute the matrix-vector multiplication using block striped partitioning for uniform data distribution.

Description of Pthreads - CUDA Programs

Example 1.1: Write a Pthreads - CUDA program to compute the matrix-vector multiplication using block striped partitioning for uniform data distribution.
(Download source code : Mat_Vect_Mult_Pthreds_CUDA.cu)
  • Objective

    To write a Pthreads - CUDA program to compute the matrix-vector multiplication using block striped partitioning of a matrix for uniform data distribution.

  • Description

    This is an implementation of matrix-vector multiplication using the block striped partitioning algorithm. Each thread multiplies its portion of the matrix with the vector by calling the CUDA kernel MatrixVectorMultiplication and writes the result into the result vector; a mutex on the result vector guarantees atomicity of these writes. Each thread accesses its elements based on its id, which the main thread assigns in the order of thread creation. Since the number of threads and the number of elements are known, the elements each thread must access are easily computed.
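    A minimal sketch of such a kernel is given below. The kernel name MatrixVectorMultiplication comes from the program above; the one-CUDA-thread-per-row mapping and the parameter names are illustrative assumptions:

        __global__ void MatrixVectorMultiplication(float *MyMatrixA, float *VectorB,
                                                   float *MyResultVector,
                                                   int Rows, int Cols)
        {
            /* each CUDA thread computes one row of this pthread's matrix block */
            int row = blockIdx.x * blockDim.x + threadIdx.x;
            if (row < Rows) {
                float sum = 0.0f;
                for (int j = 0; j < Cols; j++)
                    sum += MyMatrixA[row * Cols + j] * VectorB[j];
                MyResultVector[row] = sum;
            }
        }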

  • Implementation of Matrix Vector Multiplication :


    Step 1 : Three arrays MatrixA , VectorB and ResultVector are declared on the host-CPU. Two additional arrays are used on the host-CPU for the computation: one to store the part of the matrix accessed by each thread (MyMatrixA) and the other to store the part of the result vector computed by each thread (MyResultVector).

    Step 2 : The main thread initializes the arrays MatrixA , VectorB and ResultVector : it fills MatrixA and VectorB with single precision real values and initializes ResultVector.

    Step 3 : ThreadPart, which determines how many elements of the matrix each thread is to access, is computed.
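    A minimal sketch of this computation, assuming a row-wise block-striped split and illustrative variable names (RowsA for the number of matrix rows, NumThreads for the number of threads):

        /* rows handled by each pthread; uniform data distribution
           assumes RowsA is divisible by NumThreads */
        int ThreadPart = RowsA / NumThreads;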

    Step 4 : The worker threads are created.
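    A sketch of the creation loop (requires <pthread.h>); the worker function name ThreadWork and passing the thread id through the argument pointer are illustrative assumptions:

        pthread_t threads[NumThreads];
        for (long i = 0; i < NumThreads; i++)
            pthread_create(&threads[i], NULL, ThreadWork, (void *)i);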

    Step 5 : Memory is allocated for MyMatrixA and MyResultVector by each thread on the host-CPU.

    Step 6 : MyMatrixA is constructed from MatrixA on the host-CPU, depending on the value of the thread id.
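    For example, a row-wise copy of this thread's block (MyThreadId, ThreadPart and ColsA are illustrative names; memcpy requires <string.h>):

        /* copy ThreadPart consecutive rows of MatrixA, starting at
           row MyThreadId * ThreadPart, into this thread's private block */
        memcpy(MyMatrixA, MatrixA + MyThreadId * ThreadPart * ColsA,
               ThreadPart * ColsA * sizeof(float));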

    Step 7 : Memory for the vectors is allocated on the device-GPU, and the values of the vectors on the host machine are copied to the vectors on the device machine.
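    A sketch of the per-thread device allocation and host-to-device copies; the device pointer names are illustrative:

        float *DeviceMatrixA, *DeviceVectorB, *DeviceResultVector;
        cudaMalloc((void**)&DeviceMatrixA, ThreadPart * ColsA * sizeof(float));
        cudaMalloc((void**)&DeviceVectorB, ColsA * sizeof(float));
        cudaMalloc((void**)&DeviceResultVector, ThreadPart * sizeof(float));
        cudaMemcpy(DeviceMatrixA, MyMatrixA, ThreadPart * ColsA * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(DeviceVectorB, VectorB, ColsA * sizeof(float),
                   cudaMemcpyHostToDevice);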

    Step 8 : The availability of a CUDA device is checked on the host-CPU.
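    One common way to perform this check is with cudaGetDeviceCount; the error handling shown is an illustrative choice:

        int DeviceCount = 0;
        cudaGetDeviceCount(&DeviceCount);
        if (DeviceCount == 0) {
            fprintf(stderr, "No CUDA-capable device found\n");
            pthread_exit(NULL);
        }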

    Step 9 : Each thread computes the matrix-vector multiplication by calling the CUDA kernel MatrixVectorMultiplication and copies the result back to MyResultVector on the host-CPU.
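    A sketch of the kernel launch and the device-to-host copy; the 256-thread block size is an illustrative choice:

        int ThreadsPerBlock = 256;
        int Blocks = (ThreadPart + ThreadsPerBlock - 1) / ThreadsPerBlock;
        MatrixVectorMultiplication<<<Blocks, ThreadsPerBlock>>>(DeviceMatrixA,
                DeviceVectorB, DeviceResultVector, ThreadPart, ColsA);
        cudaMemcpy(MyResultVector, DeviceResultVector, ThreadPart * sizeof(float),
                   cudaMemcpyDeviceToHost);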

    Step 10 : Each thread locks a mutex on the result vector and assigns its corresponding elements on the host-CPU.
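    A sketch of the mutex-protected update; ResultMutex is an illustrative global initialized with PTHREAD_MUTEX_INITIALIZER:

        pthread_mutex_lock(&ResultMutex);
        for (int i = 0; i < ThreadPart; i++)
            ResultVector[MyThreadId * ThreadPart + i] = MyResultVector[i];
        pthread_mutex_unlock(&ResultMutex);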

    Step 11 : The main thread waits for all threads to exit and then prints the resultant vector.
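    A sketch of the join-and-print sequence in the main thread (printf requires <stdio.h>):

        for (int i = 0; i < NumThreads; i++)
            pthread_join(threads[i], NULL);
        for (int i = 0; i < RowsA; i++)
            printf("%f\n", ResultVector[i]);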

  • CUDA API used :

    To allocate memory on device-GPU :
    cudaMalloc(void** array, size_t size)

    To free memory allocated on device-GPU :
    cudaFree(void* array)

    To transfer data from host-CPU to device-GPU :
    cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)

    To transfer data from device-GPU to host-CPU :
    cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
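    Taken together, a typical allocate / copy / compute / copy-back / free sequence for one device array might look like the following sketch; the names device_array, host_array and n are illustrative:

        size_t size = n * sizeof(float);
        float *device_array;
        cudaMalloc((void**)&device_array, size);
        cudaMemcpy(device_array, host_array, size, cudaMemcpyHostToDevice);
        /* launch the kernel on device_array here */
        cudaMemcpy(host_array, device_array, size, cudaMemcpyDeviceToHost);
        cudaFree(device_array);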

  • Input

    The input to the problem is given as command-line arguments in the following format, where the matrix has dimensions m x n, the vector has size n, and the number of threads is p:

          ./Mat_Vect_Mult_Pthreds_CUDA m n n p
    The main thread generates the matrix and the vector.
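    For instance, a 1024 x 512 matrix, a vector of length 512 and 4 threads (illustrative values) would be run as:

          ./Mat_Vect_Mult_Pthreds_CUDA 1024 512 512 4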

  • Output

    The main thread prints the resultant vector.
