C-DAC,Pune : High-Perf. Comp. Frontier Technologies Exploration Group and CMSD, University of Hyderabad, Technology Workshop hyPACK (October 15-18), 2013

Overview Venue : CMSD, UoH Key-Note/Invited Talks Faculty / Speakers Proceedings Downloads Past Tech. Workshops Target Audience Benefits Organisers Accommodation Local Travel Sponsors Feedback Acknowledgements Contact Home

Topics of Interest Tech. Prog. Schedule Topic : Multi-Core Topic : ARM Proc. Topic : Coprocessors Topic : GPGPUs Topic : HPC Cluster Topic : App. Kernels. Topic : Lab. Session Key-Note / Invited Talks Home

Mode-1 Multi-Core Memory Allocators OpenMP Intel TBB Pthreads Java - Threads Charm++ Prog. Message Passing (MPI) MPI - OpenMP MPI - Intel TBB MPI - Pthreads Compilers - Opt. Features Threads-Perf. Math. Lib. Threads-Prof. & Tools Threads - I/O Perf. PGAS : UPC / CAF/ GA Power & Perf. Home

Mode-2 ARM Prog. Env Benchmarks Power & Perf. Home

Mode-3 Coprocessors Arch. Software Compiler & Vect. Prog. Env. Benchmarks Power & Perf. Home

Mode-4 GPGPUs NVIDIA - CUDA/OpenCL AMD APP - OpenCL GPGPUs - OpenCL GPGPUs : Power & Perf. Home

Mode-5 HPC Cluster HPC MPI Cluster GPU Cluster - NVIDIA GPU Cluster - AMD APP Cluster - Intel Coprocessors Cluster- Power & Perf. Home

Mode-6 App. Kernels PDE Solvers : FDM/FEM Image Processing - FFT Monte Carlo Methods String Srch. Seq. Analy. Video Process. Intr. Detcn. Sys App. Power & Perf. Home

Reg. Overview Pvt. Sector Pub. Sector Govt. Acad. Staff Students Reg. On-line Reg. Accommodation Contact Home

• Mode-5 HPC Cluster • Cluster : Multi-Core - MPI • Cluster : GPU - NVIDIA CUDA/OpenCL • Cluster : GPU - AMD OpenCL • Cluster : Coprocessors -Intel Xeon Phi • Cluster:Power & Perf. • Home

hyPACK-2013 : Prog. on HPC GPU Cluster : Pthreads & OpenCL Prog.

High Performance computing GPU Cluster (HPC GPU Cluster) is based on integration of different programming paradigms on host-CPU and device-GPU is available for laboratory sessions. A prototype HPC GPU Cluster supports various heterogeneous programming features and itcan be made "adaptive" to the application it is running, assigning the most effective resources in real-time as per application demands, without requiring modifications to the application that is written using different progrmaming is used. he Open Computing Language is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. Some of the following example programs use Pthread Programming environment on host-CPU & and OpenCL on single/multiple GPUs in HPC GPU Cluster.

Example 1.1	To write a Pthreads - OpenCL program to opencl-pthread program to perform Vector-Vector addtition that executes on multiple GPUs.
Example 1.2	To write a Pthreads - OpenCL program to opencl-pthread program to perform Matrix-Vector Multiplication that executes on multiple GPUs.

The HPC GPU Cluster aim is to develop system software and integrate components of the OpenCL on GPUs and demonstrare some of the programs in Mode-2 & Mode-3. Different programming paradigms on host CPU i.e. MPI, OpenMP, Pthreads and OpenCL enabled NVIDIA/AMD-APP are integrated and several example programs for Matrix Computations, and Solution of Partial Differential equations are discussed.

Description of Pthreads - OpenCL Programs

(Download Readme & Makefile README-Opencl_Pthreads Makefile-Opencl_Pthreads )

Example 1.1:

Write a Pthreads - OpenCL program to perform Vector- Vector addition on multiple GPUs
(Download source code : Vect_Vect_Add_Pthreads_OpenCL_mGPU.c)

Objective

Write a Pthreads OpenCL program for Vector Vector multiplication on multiple GPUs

Description

The input data is paritioned on p threads and each P each hread work is transfered from host-cPU to a device-GPU. Each thread performs vector into vector addition on each GPU by called OpencL Kernel and writes the partial sum to a result vector.

Implementation of Vector Vector Multiplication :

Step 1 : The memory for Arrays VectorA , VectorB and ResultVector are allocted on host-CPU. Two additional vectors are used on host-CPU to store the partial result of Vector Vector Multiplication that is performed by each thread.

Step 2 : Main thread initializes the arrays VectorA , VectorB and ResultVector . Fill the arrays VectorA and VectorB with real single precsion values and initialize the array ResultVector.

Step 3 : ThreadPart which decides how many elements each thread is to access in the VectorA & VectorB is computed.

Step 4 : Memory is allocated for MyResultVector by each thread on host-CPU .

Step 5 : Input VectorA and VectorB are filled with required number of single real precsion vlues for each thread Id host-CPU

Step 6 : Checking for the availability of device is done on host-CPU

Step 7 : Each thread computes the vector vector addition by calling the OpenCL kernel VectorVectorAddtition and copies back the result to MyResultVector on host-CPU .

Step 8 : Each thread performs a mutex operation on the result vector and assigns the corresponding elements on host-CPU.

Step 9 : Main thread waits for all threads to exit and Main thread prints the resultant vector.

Input

Master thread on host-cpu reads the input parameter n , the size of the vector and the number of threads is p. The input vectors generated using random numbers

Output

Main thread prints the final vector

Example 1.2:

Write a Pthreads - OpenCL program to perform Matrix- Vector Multiplication on multiple GPUs
(Download source code : Mat_Vect_Mult_Pthreads_OpenCL_mGPU.c )

Objective

Write a Pthreads OpenCL program for vector vector multiplication on multiple GPUs

Description

The input matrix is partioned on p threads and each thread on host-CPU is transfered the partitioned matrix and vector from host-cPU to a device-GPU. Each GPU performs parttioned matrix into vector mutliplication corresponds to thrread Id by calling OpencL Kernel and accumaltes partial result vector.

Implementation of Matrix Vector Multiplication :

Step 1 : Three Arrays MatrixA , VectorB and ResultVector are declaed on host-CPU. Two additional vectors are used on host-CPU for computation. The purpose is to to store the part of the matrix that is accessible by each thread and the other is to store part of the result vector computed by each thread.

Step 2 : Main thread initializes the arrays MatrixA, VectorB and ResultVector . Fill arrays MatrixA and VectorB with single precsion real values and initialize the array ResultVector.

Step 3 : ThreadPart which decides how many elements each thread is to access in the matrix is computed.

Step 4 : Memory is allocated for MyMatrixA and MyResultVector by all the threads on host-CPU .

Step 5 : MyMatrixA is constructed from MatrixA depending on the value of thread Id host-CPU

Step6 : Memor for Vectors are allocated on host-CPU . The values of the vectors in the host machine are copied to the vectors on the device machine.

Step 7 :Checking for the availability of device is done on host-CPU

Step 8 : Each thread computes the Matrix vector multiplication by calling the CUDA kernel MatrixVectorMultiplication and copies back the result to MyResultVector on host-CPU .

Step 9 : Each thread performs a mutex operation on the result vector and assigns the corresponding elements on host-CPU.

Step 10 : Main thread waits for all threads to exit and Main thread prints the resultant vector.

Input

Master thread on host-cpu reads the input parameter n , the size of the vector and the number of threads is p.

Output

Main thread prints the final vector

Centre for Development of Advanced Computing