CUDA - NVIDIA GPU Cluster Configuration
Topics of Programs (Assignment Problems)
List of Programs on CUDA enabled NVIDIA GPUs
Module 1 : GPU Cluster : OpenMP - CUDA - Matrix Computations
Module 2 : GPU Cluster : Pthreads - CUDA - Dense Matrix Computations
Module 3 : MPI - CUDA - Dense Matrix Computations - Application Kernels
Module 4 : GPU Cluster Health Monitoring - Low-level Benchmarks
The implementation and programming issues of an integrated cluster of multi-core processors with GPU accelerators are discussed. The HPC GPU Cluster supports parallel programming models, including shared memory programming (POSIX Threads, OpenMP, Intel TBB) and the MPI 2.0 standard, on multi-core processors. A Linux programming environment is provided on the cluster, and the operating environment can be configured to run large, complex applications that make efficient use of the GPGPU / GPU computing accelerators attached to the multi-core processors.
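As a minimal illustration of this hybrid model (a sketch only, not cluster-specific code: the build line, MPI installation paths, and the kernel are assumptions, and error handling is omitted), one MPI process per host can bind itself to a local GPU and launch a CUDA kernel:

    /* hybrid_mpi_cuda.cu -- minimal MPI + CUDA sketch: one MPI process per
       host CPU, each selecting a local GPU and launching a kernel.
       Build (paths/flags are assumptions): nvcc -arch=sm_20 hybrid_mpi_cuda.cu
           -I${MPI_HOME}/include -L${MPI_HOME}/lib -lmpi -o hybrid */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void scale(double *x, double a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;               /* each GPU thread scales one element */
    }

    int main(int argc, char **argv)
    {
        int rank, nranks, ndev;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);         /* map each MPI rank to a local GPU */

        const int n = 1 << 20;
        double *d_x;
        cudaMalloc((void **)&d_x, n * sizeof(double));
        cudaMemset(d_x, 0, n * sizeof(double));

        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0, n);
        cudaDeviceSynchronize();

        printf("rank %d of %d used GPU %d of %d\n", rank, nranks, rank % ndev, ndev);
        cudaFree(d_x);
        MPI_Finalize();
        return 0;
    }

The same host-side structure carries over when CUDA is replaced by OpenCL, or when the per-rank host code is further threaded with Pthreads, OpenMP, or Intel TBB.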
Type 1 : Configuration of HPC GPU Cluster

Peak performance (in double precision) of the HPC GPU Cluster, with one node having a single CUDA enabled NVIDIA GPU, is 615 Gflop/s.
Host-CPU : Intel Xeon Quad Core; Device-GPU : NVIDIA Fermi Multi-GPUs

Host-CPU (Xeon)
- Dual-socket Intel Xeon 64-bit Quad Core node (X5450 series, Harpertown; 2 sockets x 4 cores = 8 cores) with two PCIe 2.0 x16 slots; RAM 16 GB; clock speed 3.0 GHz; CentOS 5.2; GCC version 4.1.2
- Intel MKL version 10.2, CUBLAS version 3.2, Intel icc 11.1
- Peak performance (CPU) : 96 Gflop/s (1 node, 8 cores)

Device-GPU (NVIDIA)
- One Tesla C2050 (Fermi) with 3 GB memory; clock speed 1.15 GHz; CUDA 3.2 Toolkit
- The reported theoretical peak performance of the Fermi (C2050) is 515 Gflop/s in double precision (448 cores x 1.15 GHz x one instruction per cycle), and the reported maximum achievable DGEMM performance on Fermi is up to 58% of that peak.
- The theoretical peak of the GTX 280 is 936 Gflop/s in single precision (240 cores x 1.30 GHz x 3 instructions per cycle), and the reported maximum achievable SGEMM performance is up to 40% of that peak.
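The component peaks quoted above follow from the clock rates and core counts (a back-of-the-envelope sketch; the figure of 4 double-precision flops per cycle per Xeon core is an assumption based on the 128-bit SSE units of the Harpertown core):

    \[
    P_{CPU} = 2\ \text{sockets} \times 4\ \text{cores} \times 3.0\ \text{GHz} \times 4\ \text{flops/cycle} = 96\ \text{Gflop/s (DP)}
    \]
    \[
    P_{GPU}^{C2050} = 448\ \text{cores} \times 1.15\ \text{GHz} \times 1\ \text{flop/cycle} \approx 515\ \text{Gflop/s (DP)}
    \]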
List of Programs based on HPC GPU Cluster
- Performance of Matrix Computations - NVIDIA-PGI Compiler Directives OpenACC on GPUs & comparison with CUDA enabled NVIDIA GPUs
- Demonstrate codes using different memory types of CUDA enabled NVIDIA GPUs for matrix computations (a shared-memory tiling sketch appears after this list).
- Example programs on Heterogeneous Programming - OpenCL based on CUDA enabled NVIDIA GPUs
- Application & System Benchmarks related to HPC GPU Cluster based on CUDA/OpenCL enabled NVIDIA GPUs
- Example programs based on the OpenACC Application Program Interface (a collection of compiler directives; the details are implicit in the programming model and are managed by OpenACC API-enabled compilers and runtimes) for matrix computations on NVIDIA GPUs (a directive-based sketch appears after this list).
- Example programs based on CUDA APIs to completely overlap CPU and GPU execution and I/O in an HPC GPU Cluster environment (a streams-based sketch appears after this list).
- Performance of pageable / pinned (page-locked) host memory & CUDA shared memory usage on CUDA enabled GPUs for application kernels.
- Develop test suites to launch multiple kernels on CUDA enabled NVIDIA single & multiple GPU devices.
- Programming exercises for Numerical Computations based on CUDA/OpenCL enabled NVIDIA GPUs, for Sparse Matrix Computations
- Special example programs using the CUDA Tool Chain on Multi-Core Processors with the NVIDIA GPU Computing CUDA SDK (CULA Tools, CUBLAS, CUFFT, CUSPARSE) and the NVIDIA Parallel Nsight tool
- Special example programs on matrix computations using Concurrent Asynchronous Execution APIs of CUDA 5.0 enabled NVIDIA GPUs (single/multiple devices).
- Demonstrate LLVM-based CUDA compiler and toolkit technologies for the CUDA enabled GPU Programming Model
- Tuning & Performance using CUDA enabled NVIDIA GPU libraries; memory optimization and data-access optimization for matrix computations
- Performance of Application Kernels (PDE solvers by FDM) using NVIDIA-PGI Compiler Directives OpenACC on GPUs and CUDA enabled NVIDIA GPUs
- Solution of Partial Differential Equations (Poisson equation in two-dimensional & three-dimensional regions) by the Finite Element Method (FEM) using CUDA/OpenCL enabled NVIDIA GPUs on HPC GPU Cluster.
- Matrix Computations : Matrix-Vector Multiplication and Matrix-Matrix Multiplication based on MPI and OpenCL/CUDA implementation on HPC GPU Cluster
- Application Kernels demonstration on HPC GPU Clusters (CUDA Programming & Intel TBB)
- Performance of Matrix Computations using vendor-supplied tuned mathematical libraries (CUBLAS, MAGMA on NVIDIA GPUs) on HPC GPU Cluster with GPU Accelerators
- Selective Numerical Computational kernels on Parallel Processing Systems with GPU accelerator devices using MPI & CUDA/OpenCL enabled NVIDIA GPUs on HPC GPU Cluster
- Special class of Application Kernels and Numerical Linear Algebra on Multi-Core Processors using Mixed-Mode Programming (TBB-CUDA, MPI-CUDA, Pthreads-CUDA) on HPC GPU Cluster.
- Special class of Application Kernels and Numerical Linear Algebra on Multi-Core Processors using Heterogeneous Programming (OpenMP-OpenCL, MPI-OpenCL, Pthreads-OpenCL) on HPC GPU Cluster.
- HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Image Processing - Edge Detection algorithms using OpenACC
- An Overview of Bio-Informatics : Sequence Analysis (Smith-Waterman algorithm) on HPC GPU Cluster - CUDA enabled NVIDIA GPUs & the Heterogeneous Programming environment - OpenCL
- Heterogeneous Programming (MPI on host-CPU, OpenCL on GPU & OpenACC) for String Search algorithms & Sequence Analysis applications
- Implementation of Image Processing applications (Edge Detection algorithms) on GPGPUs using CUDA/OpenCL enabled NVIDIA GPUs
- Implementation of String Search algorithms - CUDA/OpenCL enabled NVIDIA GPUs of HPC GPU Cluster
- HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Open-source software benchmarks - Solution of the matrix system Ax=b of linear equations (MAGMA on CUDA enabled GPUs & LINPACK solvers)
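For the item above on using different CUDA memory types for matrix computations, the following is a minimal sketch (not the course's reference code; n is assumed to be a multiple of the tile size and error checks are omitted) of a tiled matrix-matrix multiply that stages blocks of the global-memory operands in shared memory:

    /* matmul_shared.cu -- tiled matrix multiply using CUDA shared memory. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   /* tile of A staged in shared memory */
        __shared__ float Bs[TILE][TILE];   /* tile of B staged in shared memory */

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            /* each thread loads one element of the current A and B tiles */
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();               /* wait until both tiles are loaded */

            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               /* finish with tiles before reuse */
        }
        C[row * n + col] = sum;
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * n * sizeof(float);
        float *hA = (float *)malloc(bytes);
        float *hB = (float *)malloc(bytes);
        float *hC = (float *)malloc(bytes);
        for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        float *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
        matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

        printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * n);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }

Staging each TILE x TILE block in shared memory lets every loaded element be reused TILE times, which is the point of comparison against a naive global-memory-only kernel.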
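For the OpenACC items above, the following is a directive-based sketch in C (compiler name and flags are assumptions; it is not the course's reference code). The data clauses move the operands to the GPU and the result back, and the collapsed loop nest is offloaded by the OpenACC compiler:

    /* acc_matmul.c -- directive-based matrix multiply sketch with OpenACC.
       Build (flags are an assumption): pgcc -acc -Minfo=accel acc_matmul.c */
    #include <stdio.h>
    #include <stdlib.h>

    /* C = A * B for n x n row-major matrices */
    void matmul_acc(const float *A, const float *B, float *C, int n)
    {
        #pragma acc parallel loop collapse(2) \
            copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }

    int main(void)
    {
        const int n = 512;
        float *A = malloc(n * n * sizeof(float));
        float *B = malloc(n * n * sizeof(float));
        float *C = malloc(n * n * sizeof(float));
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

        matmul_acc(A, B, C, n);
        printf("C[0] = %f (expected %f)\n", C[0], 2.0f * n);

        free(A); free(B); free(C);
        return 0;
    }

The comparison suggested in the list is between this directive-based version, where memory placement and kernel launch are managed by the compiler and runtime, and a hand-written CUDA kernel such as the one above.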
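For the items above on overlapping CPU and GPU execution and on pinned versus pageable host memory, the following is a sketch (the kernel and problem size are placeholders, a device with copy/compute overlap is assumed, and error checks are omitted) that uses page-locked host memory and two CUDA streams so that transfers in one stream can overlap with kernel work in the other:

    /* overlap_streams.cu -- overlapping host-device transfers and kernel work
       using pinned (page-locked) host memory and CUDA streams. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void inc(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;        /* trivial kernel standing in for real work */
    }

    int main(void)
    {
        const int n = 1 << 22, nstreams = 2, chunk = n / nstreams;
        float *h, *d;

        cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault); /* pinned */
        cudaMalloc((void **)&d, n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 0.0f;

        cudaStream_t s[2];
        for (int i = 0; i < nstreams; ++i) cudaStreamCreate(&s[i]);

        /* each stream works on its own chunk, so the copy in one stream can
           overlap with the kernel running in the other */
        for (int i = 0; i < nstreams; ++i) {
            int off = i * chunk;
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[i]);
            inc<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[i]);
        }
        cudaDeviceSynchronize();
        printf("h[0] = %f, h[n-1] = %f (both expected 1.0)\n", h[0], h[n - 1]);

        for (int i = 0; i < nstreams; ++i) cudaStreamDestroy(s[i]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }

Pinned allocation via cudaHostAlloc is what makes cudaMemcpyAsync truly asynchronous; replacing it with ordinary malloc gives the pageable-memory baseline for the performance comparison named in the list.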