hyPACK-2013 HPC GPU Cluster - Heterogeneous Programming

HPC GPU Cluster (AMD Opteron Processors with AMD APP GPU Devices)

Two types of hybrid heterogeneous HPC GPU clusters are used in the laboratory sessions of the workshop. The first consists of Intel Xeon processor nodes as host CPUs with CUDA enabled NVIDIA GPUs as accelerator devices. The second consists of AMD Opteron processor nodes as host CPUs with AMD-ATI GPUs (AMD FireStream & AMD-ATI FirePro) as accelerator devices, together with AMD APUs. These clusters can address some heterogeneous computing workloads. The aim of the hybrid computing system is to develop system software and to integrate state-of-the-art components such as stream accelerators, NVIDIA GPU computing and the AMD-ATI SDK.


List of Programs :

Module 1 :	GPU Cluster : OpenMP - OpenCL - Matrix Computations
Module 2 :	GPU Cluster : Pthreads - OpenCL Dense Matrix Computations
Module 3 :	MPI - OpenCL - Dense Matrix Computations - Application Kernels
Module 4 :	GPU Cluster Health Monitoring - Low level Benchmarks


The implementation and programming issues of an integrated cluster of multi-core processors with GPU accelerators will be discussed. The HPC GPU Cluster supports parallel programming models on multi-core processors, including shared memory programming (POSIX Threads, OpenMP, Intel TBB) and the MPI 2.0 standard. A Linux programming environment is provided on the cluster.

Type 1 : Configuration of HPC GPU Cluster

Peak performance (in double precision) of the HPC GPU Cluster, with one node having an OpenCL enabled AMD-ATI GPU, is 4955 Gflop/s

Host-CPU : AMD Opteron X86 12 Core;
Device GPU : AMD FireStream 9350 & 9250; AMD FirePro V5900 & V7900

Host-CPU (AMD)

One AMD Opteron x86 multi-core processor system, dual socket with 12 cores per socket (24 cores), with two PCI-e 2.0 x16 slots; RAM : 48 GB; Clock Speed : 3.0 GHz; CentOS 5.2; GCC Version 4.1.2

Libraries : ACML version, OpenCL and BLAS; Programming Environment : AMD-APP with OpenCL; Peak Performance (CPU) : 144 Gflops (1 Node - 12 Cores)
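
The 144 Gflops CPU figure quoted above is consistent with the usual peak-performance product: cores x clock rate x floating-point operations per cycle per core. The following is a minimal check in Python, not a measurement. The value of 4 double-precision flops per cycle per core is an assumption (this page does not state it), chosen because it reproduces the quoted number, and the function name is illustrative only.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s: cores x clock (GHz) x flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

# Values from the configuration above: 12 cores at 3.0 GHz.
# ASSUMPTION: 4 double-precision flops per cycle per core (not given in the source).
cpu_peak = peak_gflops(cores=12, clock_ghz=3.0, flops_per_cycle=4)
print(f"CPU peak: {cpu_peak:.0f} Gflop/s")
```

Under the same assumption, counting all 24 cores of the dual-socket node would give 288 Gflop/s.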

GPUs (AMD-ATI)

AMD FireStream 9250 GPU Accelerator :
Double Precision Floating Point : The FireStream 9250 supports double precision floating point operations in hardware;
High Performance per Watt : Up to 8 GFLOPS per watt of single precision performance potential

Optimized for computation : The AMD FireStream product line provides the industry's first double-precision floating point capability on a GPU. The AMD FireStream 9250 is AMD's second generation DP-FP product, with 1 GB of GDDR3 memory on board and single-precision performance of 1 TFLOPS.

AMD FireStream 9350 GPU Accelerator :
Technology Need : AMD FireStream Computing Solution
High DPFP performance : 528 GFLOPS double precision
High performance per Watt : 2.4 GFLOPS / Watt
Open standards : OpenCL, Direct Compute
Performance optimization tools : OpenCL SDK
PCIe 2.1 Host Interface : 8 GB/s Host-GPU bandwidth

The FireStream 9350 offers maximum GPU performance with 4 GB of GDDR5 memory in a 2-slot configuration. The FireStream 9350 offers maximum performance per slot with 2 GB of GDDR5 memory in a 1-slot configuration.
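
The 8 GB/s host-GPU bandwidth listed above bounds how quickly a matrix can be copied to the device before a kernel runs. The sketch below estimates a lower bound on that copy time for an n x n double-precision matrix. It assumes the link sustains its full quoted bandwidth, takes 1 GB as 10^9 bytes, and ignores latency and driver overhead, so real transfers will be slower. The function name and the matrix sizes are illustrative only.

```python
BYTES_PER_DOUBLE = 8

def transfer_seconds(n, bandwidth_gb_s=8.0):
    """Lower bound on the time to copy an n x n double-precision matrix
    over a link with the given bandwidth (GB/s, with 1 GB = 1e9 bytes)."""
    n_bytes = n * n * BYTES_PER_DOUBLE
    return n_bytes / (bandwidth_gb_s * 1e9)

# ASSUMPTION: these sizes are arbitrary examples, not workshop problem sizes.
for n in (1024, 4096, 8192):
    print(f"n = {n:5d}: {transfer_seconds(n) * 1e3:8.2f} ms")
```

Comparing this estimate with the kernel's expected run time gives a first indication of whether a matrix computation is likely to be limited by data transfer rather than by arithmetic.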

AMD FirePro V5900 :
The AMD FirePro V5900 features 2 GB of GDDR5 memory, 512 stream processors, and support for three simultaneous monitor outputs from a single graphics card. It supports OpenCL, using its 512 stream processors for parallel processing, and is PCI Express 2.1 compliant.

AMD FirePro V7900 :
The AMD FirePro V7900 features 2 GB of GDDR5 memory and 1280 stream processors. It supports OpenCL, using its 1280 stream processors for parallel processing, and is PCI Express 2.1 compliant.

List of Programs based on HPC GPU Cluster

Demonstrate codes using different memory types of OpenCL Architectures on AMD APP GPU Cluster and AMD APUs
Incorporation of Error Checks on HPC GPU Cluster based on OpenCL for matrix computation test suites
Example programs on Heterogeneous Programming - OpenCL based on CUDA enabled NVIDIA GPUs
Tuning & Performance using OpenCL enabled AMD-APP Libraries; Memory Optimization, Data-access optimization for matrix computations
Matrix Computations : Matrix - Vector Multiplication, Matrix-Matrix Multiplication based on MPI and OpenCL Implementation on HPC GPU Cluster with AMD-ATI GPUs
Application Kernels demonstration on HPC GPU Clusters (Heterogeneous Programming & MPI, Pthreads & Intel TBB)
Performance of Matrix Computations using vendor supplied tuned mathematical libraries (OpenCL based BLAS on AMD-ATI GPUs) on HPC GPU Cluster with GPU Accelerators
Selective Numerical Computational kernels on Parallel Processing Systems with GPU Accelerator devices using MPI & OpenCL enabled AMD-ATI GPUs on HPC GPU Cluster
Numerical Linear Algebra on Multi-Core Processors using Mixed Mode of Programming (MPI-OpenCL, Pthreads-OpenCL) on HPC GPU Cluster.
Special Class of Application Kernels, and Numerical Linear Algebra on Multi-Core Processors using Heterogeneous Programming (OpenMP-OpenCL, MPI-OpenCL, Pthreads-OpenCL) on HPC GPU Cluster.
HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Solution of Partial Differential Equations
HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Image Processing - Edge Detection algorithms
Heterogeneous Programming (MPI on host-CPU & OpenCL on GPU) - String Search algorithms & Sequence Analysis Applications
Develop test suites on HPC GPU Cluster in which MPI programs on the host-CPU launch multiple kernels on the GPU devices of each node, in an MPI-OpenCL programming environment
HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Open source software Benchmarks - Solution of the linear system Ax=b (OpenCL based LINPACK solvers)
HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Open source software Benchmarks - LINPACK (Solution of the linear system Ax=b)
Performance of MAGMA (Numerical Linear Algebra Kernels) on CUDA enabled GPUs
HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Image Processing - Edge Detection algorithms using OpenACC
Bio-Informatics : Sequence analysis (Smith-Waterman algorithm) on HPC GPU Cluster - OpenCL enabled NVIDIA GPUs
Solution of Partial Differential Equations (Poisson Equation in two dimensional & three dimensional regions) by the Finite Element Method (FEM) using OpenCL AMD-APP on HPC GPU Cluster.
Image Processing - Face Detection and Image Inpainting algorithms on HPC GPU Cluster - AMD APP
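
Several of the programs listed above combine MPI on the host CPUs with OpenCL on the GPUs for matrix-vector multiplication. This page does not specify how the work is divided. One common choice, assumed here purely for illustration, is row-block partitioning: each MPI rank owns a contiguous block of rows, computes its part of y = Ax, and the partial results are gathered. The sketch below simulates the ranks with a plain Python loop and uses neither MPI nor OpenCL. All function names are hypothetical.

```python
def row_block(n_rows, n_ranks, rank):
    """Return the half-open row range [start, stop) owned by `rank`.
    The first (n_rows % n_ranks) ranks receive one extra row."""
    base, extra = divmod(n_rows, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def local_matvec(a_rows, x):
    """Multiply the locally owned rows by the full vector x.
    On the cluster, this is the part a GPU kernel would compute."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a_rows]

def simulated_parallel_matvec(a, x, n_ranks):
    """Run the ranks one after another and concatenate their results,
    standing in for an MPI gather of the partial vectors."""
    y = []
    for rank in range(n_ranks):
        start, stop = row_block(len(a), n_ranks, rank)
        y.extend(local_matvec(a[start:stop], x))
    return y

# ASSUMPTION: a small 5 x 3 example matrix chosen only to show uneven blocks.
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
x = [1, 1, 1]
print(simulated_parallel_matvec(a, x, n_ranks=2))
```

In an actual MPI-OpenCL program the loop over ranks disappears: each process would compute only its own row range, pass those rows to its GPU, and take part in a collective gather of the result.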


Centre for Development of Advanced Computing