



hyPACK-2013 HPC GPU Cluster - Heterogeneous Programming



HPC GPU Cluster (Intel Xeon Processors with CUDA enabled NVIDIA GPUs)

Two types of hybrid heterogeneous HPC GPU clusters are used in the laboratory sessions of the workshop: one cluster with Intel Xeon processor nodes as host CPUs and CUDA enabled NVIDIA GPUs as accelerator devices, and another cluster with AMD Opteron processor nodes as host CPUs and AMD-ATI GPUs (AMD FireStream & AMD-ATI FirePro) as accelerators. These clusters can address a range of heterogeneous computing workloads. The systems can be made "adaptive" to the application they are running, assigning the most effective resources in real time as per application demands, without requiring modifications to the application. The aim of the hybrid computing system is to develop system software and integrate components of state-of-the-art technology such as stream accelerators, NVIDIA GPU computing, and the AMD-ATI SDK.


The implementation and programming issues of an integrated cluster of multi-core processors with GPU accelerators are discussed. The HPC GPU Cluster supports parallel programming models, which include shared memory programming (POSIX Threads, OpenMP, Intel TBB) and the MPI 2.0 standard on multi-core processors. A Linux programming environment is provided on the cluster, and the operating environment can be designed to run large, complex applications that make efficient use of GPGPU / GPU computing accelerators attached to the multi-core processors. The Linux programming environment can be configured to match different cluster workloads as per application demands and to execute highly scalable customized applications.



Type 1 : Configuration of HPC GPU Cluster

Peak performance (in double precision) of the HPC GPU Cluster with one node having a single CUDA enabled NVIDIA GPU is 615 Gflop/s

Host-CPU : Intel Xeon Quad Core; Device-GPU : NVIDIA Fermi Multi-GPUs
Host-CPU (Xeon)
  • One Intel Xeon 64-bit Quad Core X5450 processor (Harpertown) with two PCIe 2.0 x16 slots; RAM : 16 GB; Clock Speed : 3.0 GHz; CentOS 5.2; GCC Version 4.1.2; Dual Socket Quad Core (8 processors or cores)

  • Intel MKL version 10.2, CUBLAS version 3.2, Intel icc 11.1; Peak Performance : CPU : 96 Gflop/s (1 Node - 8 Cores)

Device-GPU (NVIDIA)
  • One Tesla C2050 (Fermi) with 3 GB memory; Clock Speed 1.15 GHz, CUDA 3.2 Toolkit

  • The reported theoretical peak performance of the Fermi (C2050) is 515 Gflop/s in double precision (448 cores; 1.15 GHz; one instruction per cycle), and the reported maximum achievable performance of DGEMM on Fermi is up to 58% of that peak.

  • The theoretical peak of the GTX 280 is 936 Gflop/s in single precision (240 cores x 1.30 GHz x 3 instructions per cycle), and the reported maximum achievable performance of SGEMM is up to 40% of that peak.


Compute Nodes with Multiple GPUs

In a message-passing cluster, MPI processes are launched across several cluster nodes connected with a suitable interconnect. The assignment of processes to host nodes depends on the MPI implementation and the launch configuration (hostfile), which prevents reliable selection of a unique GPU by rank alone. To address this issue, one could implement a negotiation protocol among the MPI processes running on the same host (using the MPI_Get_processor_name() call), so that each one claims a unique GPU, as sketched below. Alternatively, it is possible to select GPUs by relying on a specific allocation of MPI processes to nodes (for example, allocating consecutive processes to the same node).
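
A minimal sketch of such a negotiation protocol follows (C with MPI and the CUDA runtime). It assumes that all ranks on a node enumerate the same set of CUDA devices and that the number of ranks per node does not exceed the number of GPUs; the round-robin assignment local_rank % device_count is an illustrative choice, not part of any particular MPI implementation.

    /* Hedged sketch: give each MPI rank on a node a unique GPU by counting
       how many lower-numbered ranks report the same processor name. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Every rank learns the host name of every other rank. */
        char name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        memset(name, 0, sizeof(name));
        MPI_Get_processor_name(name, &name_len);

        char *all_names = malloc((size_t)world_size * MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all_names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      MPI_COMM_WORLD);

        /* Local rank = number of lower-numbered ranks on the same host. */
        int local_rank = 0;
        for (int r = 0; r < world_rank; ++r)
            if (strcmp(all_names + (size_t)r * MPI_MAX_PROCESSOR_NAME, name) == 0)
                ++local_rank;

        /* Claim one GPU per process on this host. */
        int device_count = 0, device = -1;
        cudaGetDeviceCount(&device_count);
        if (device_count > 0) {
            device = local_rank % device_count;
            cudaSetDevice(device);
        }

        printf("Rank %d on %s uses GPU %d of %d\n",
               world_rank, name, device, device_count);

        free(all_names);
        MPI_Finalize();
        return 0;
    }

With MPI-3 implementations, MPI_Comm_split_type() with MPI_COMM_TYPE_SHARED gives the local rank more directly, but the name-matching approach above requires only the MPI 2.0 features available on the cluster.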




Interconnection Issues
  • Stand-alone multi-core processor systems with multiple GPUs are interconnected with an appropriate high-speed network, and their combined computational power can be applied to solve a variety of computationally intensive applications. System area networks move switched, low-latency, high-speed networks away from the backplanes and cabinets of massively parallel processors into the traditional territory of local area networks. In a typical HPC GPU cluster, all inter-node and inter-GPU communication takes place via the host nodes: a GPU and its controlling CPU thread communicate via memory copies, while CPU threads exchange data using the same methods as applications not accelerated with GPUs. Thus, best performance is achieved when one follows best practices for CPU-GPU communication as well as for CPU-CPU communication. Note that the two are independent and orthogonal.

    Pinned Memory : Communication between CPU and GPU is most efficient when using pinned memory on the CPU. Pinned memory enables asynchronous memory copies (allowing overlap with both CPU and GPU execution) and improves PCIe throughput on FSB systems; a sketch appears at the end of this section. Please refer to the CUDA C Programming Guide for more details and examples of pinned memory usage.

    Communication between Light-weight CPU Threads : Light-weight CPU threads exchange data most efficiently via shared memory. Note that in order for a pinned memory region to be viewed as pinned by CPU threads other than the one that allocated it, one must call cudaHostAlloc() with the cudaHostAllocPortable flag. A common communication pattern is for one CPU thread to copy data from its GPU to a shared host memory region, after which another CPU thread copies the data to its GPU (see the second sketch at the end of this section). Users of NUMA systems should follow the same best practices as for communication between non-GPU-accelerated CPU threads.

    Communication between Heavy-weight Processes : Communication between heavy-weight processes takes place via message passing, for example MPI. Once data has been copied from GPU to CPU, it is transferred to another process by calling one of the MPI functions. For example, one possible pattern when exchanging data between two GPUs is for a CPU thread to call a device-to-host cudaMemcpy(), then MPI_Sendrecv(), then a host-to-device cudaMemcpy() (see the last sketch at the end of this section). Note that the performance of the MPI function does not depend on whether the data originated at, or is destined for, a GPU. Since MPI provides several variations for most of its communication functions, the choice of a function should be dictated by the best practices guide for the MPI implementation as well as the system and network.

  • NVIDIA GPUDirect technology allows the sharing of CUDA pinned host memory with other devices. This allows accelerated transfers of GPU data to other devices, such as supported Infiniband network adapters. If GPUDirect support is not available for your network device, network transfer throughput can be reduced. A possible workaround is to disable RDMA. For the OpenMPI implementation, this can be achieved by passing the flag -mca btl_openib_flags 1 to mpirun. Please refer to the CUDA C Programming Guide for more details.
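
The pinned-memory usage described above ("Pinned Memory") can be illustrated with the following minimal sketch. The buffer size N, the single stream, and the dummy kernel scale_kernel are assumptions made for illustration; cudaHostAlloc(), cudaMemcpyAsync(), and CUDA streams are the standard runtime features involved.

    /* Hedged sketch: pinned (page-locked) host memory enables asynchronous
       copies that can overlap with CPU work and with GPU execution. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale_kernel(float *d, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= s;
    }

    int main(void)
    {
        const int N = 1 << 20;
        float *h_pinned, *d_data;

        cudaHostAlloc((void **)&h_pinned, N * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void **)&d_data, N * sizeof(float));
        for (int i = 0; i < N; ++i) h_pinned[i] = 1.0f;

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* The async copies and the kernel launch return immediately; the CPU
           is free to do other work until cudaStreamSynchronize(). */
        cudaMemcpyAsync(d_data, h_pinned, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        scale_kernel<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N, 2.0f);
        cudaMemcpyAsync(h_pinned, d_data, N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);

        /* ... independent CPU work could overlap here ... */
        cudaStreamSynchronize(stream);
        printf("h_pinned[0] = %f\n", h_pinned[0]);

        cudaStreamDestroy(stream);
        cudaFree(d_data);
        cudaFreeHost(h_pinned);
        return 0;
    }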
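
For the light-weight-thread pattern ("Communication between Light-weight CPU Threads"), the next hedged sketch has one POSIX thread copy data from its GPU into a host region allocated with the cudaHostAllocPortable flag, after which a second thread copies that region to its own GPU. The requirement of two devices, the device indices 0 and 1, and the buffer size are assumptions for illustration only.

    /* Hedged sketch: two POSIX threads exchange data through a portable
       pinned host buffer, each thread driving its own GPU. */
    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdio.h>

    #define N (1 << 20)

    static float *h_shared;            /* portable pinned buffer, visible to both threads */
    static pthread_barrier_t barrier;

    static void *producer(void *arg)
    {
        float *d_src;
        cudaSetDevice(0);
        cudaMalloc((void **)&d_src, N * sizeof(float));
        cudaMemset(d_src, 0, N * sizeof(float));
        /* GPU 0 -> shared pinned host memory */
        cudaMemcpy(h_shared, d_src, N * sizeof(float), cudaMemcpyDeviceToHost);
        pthread_barrier_wait(&barrier);   /* signal that the data is ready */
        cudaFree(d_src);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        float *d_dst;
        cudaSetDevice(1);
        cudaMalloc((void **)&d_dst, N * sizeof(float));
        pthread_barrier_wait(&barrier);   /* wait for the producer's copy */
        /* shared pinned host memory -> GPU 1; this is still a pinned transfer
           because the region was allocated with cudaHostAllocPortable */
        cudaMemcpy(d_dst, h_shared, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaFree(d_dst);
        return NULL;
    }

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev < 2) { printf("This sketch needs two GPUs.\n"); return 0; }

        cudaHostAlloc((void **)&h_shared, N * sizeof(float), cudaHostAllocPortable);
        pthread_barrier_init(&barrier, NULL, 2);

        pthread_t t0, t1;
        pthread_create(&t0, NULL, producer, NULL);
        pthread_create(&t1, NULL, consumer, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        pthread_barrier_destroy(&barrier);
        cudaFreeHost(h_shared);
        return 0;
    }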
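
For the heavy-weight-process pattern ("Communication between Heavy-weight Processes"), the last sketch shows the device-to-host cudaMemcpy() / MPI_Sendrecv() / host-to-device cudaMemcpy() sequence. It assumes exactly two MPI ranks exchanging equal-sized buffers; the buffer size is illustrative, and pinned staging buffers are used in line with the pinned-memory guidance above.

    /* Hedged sketch: each rank stages its GPU buffer through the host,
       exchanges it with a partner rank via MPI, and pushes the received
       data back to its GPU. Run with exactly two ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int partner = 1 - rank;                /* assumes exactly two ranks */

        const int N = 1 << 20;
        float *d_buf, *h_send, *h_recv;

        cudaMalloc((void **)&d_buf, N * sizeof(float));
        cudaMemset(d_buf, 0, N * sizeof(float));
        /* Pinned host staging buffers give the best PCIe throughput. */
        cudaHostAlloc((void **)&h_send, N * sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc((void **)&h_recv, N * sizeof(float), cudaHostAllocDefault);

        /* 1. device-to-host copy */
        cudaMemcpy(h_send, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

        /* 2. exchange with the partner rank; MPI does not care that the
              data originated on a GPU */
        MPI_Sendrecv(h_send, N, MPI_FLOAT, partner, 0,
                     h_recv, N, MPI_FLOAT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* 3. host-to-device copy of the received data */
        cudaMemcpy(d_buf, h_recv, N * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(h_send);
        cudaFreeHost(h_recv);
        MPI_Finalize();
        return 0;
    }

In a multi-GPU, multi-node run, each rank would first select its GPU with a scheme like the negotiation sketch shown earlier, and the program would be built with the MPI compiler wrapper together with nvcc.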


List of Programs based on HPC GPU Cluster
  • Performance of Matrix Computations - NVIDIA-PGI Compiler Directives OpenACC on GPUs & Comparison with CUDA enabled NVIDIA GPUs

  • Demonstrate codes using different memory types of CUDA enabled NVIDIA GPUs for matrix computations.

  • Example programs on Heterogeneous Programming - OpenCL based on CUDA enabled NVIDIA GPUs

  • Application & System Benchmarks related to HPC GPU Cluster based on CUDA/OpenCL NVIDIA

  • Example programs based on the OpenACC Application Program Interface (a collection of compiler directives; the details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes) for matrix computations on NVIDIA GPUs.

  • Example programs based on CUDA APIs to completely overlap CPU and GPU execution and I/O in HPC GPU Cluster environment.

  • Performance of pageable / pinned (page-locked) host memory & CUDA shared memory usage on CUDA enabled GPUs for application kernels.

  • Develop test suites to launch multiple kernels on CUDA enabled NVIDIA single & multiple GPU devices.

  • Programming exercises for Numerical Computations based on CUDA/OpenCL enabled NVIDIA GPUs, for Sparse Matrix Computations

  • Special example programs using CUDA Tool Chain on Multi-Core Processors with NVIDIA - GPU Computing CUDA SDK (CULA Tools, CUBLAS, CUFFT, CUSPARSE), NVIDIA Parallel Nsight tool

  • Special example programs on matrix computations using Concurrent Asynchronous Execution APIs of CUDA 5.0 enabled NVIDIA GPUs (Single/Multiple devices).

  • Demonstrate LLVM-based CUDA compiler and toolkit technologies for the CUDA enabled GPU Programming Model

  • Tuning & Performance using CUDA enabled NVIDIA GPU Libraries; Memory optimization and data-access optimization for matrix computations; Performance of Application Kernels (PDE Solvers by FDM) using NVIDIA-PGI Compiler Directives OpenACC on GPUs and CUDA enabled NVIDIA GPUs

  • Solution of Partial Differential Equations (Poisson Equation in two dimensional & three dimensional regions) by finite element Method (FEM) using CUDA/OpenCL enabled NVIDIA GPUs & OpenCL on HPC GPU Cluster.

  • Matrix Computations : Matrix - Vector Multiplication, Matrix-Matrix Multiplication based on MPI and OpenCL/CUDA Implementation on HPC GPU Cluster

  • Demonstration of Application Kernels on HPC GPU Clusters (CUDA Prog & Intel TBB); Performance of Matrix Computations using vendor-supplied tuned mathematical libraries (CUBLAS, MAGMA on NVIDIA GPUs) on HPC GPU Cluster with GPU Accelerators

  • Selective Numerical Computational kernels on Parallel Processing Systems with GPU Accelerator devices using MPI & CUDA & OpenCL enabled NVIDIA GPUs on HPC GPU Cluster

  • Special Class of Application Kernels, and Numerical Linear algebra on Multi-Core Processors using Mixed Mode of Programming ( TBB-CUDA, MPI-CUDA, Pthreads-CUDA) on HPC GPU Cluster.

  • Special Class of Application Kernels, and Numerical Linear algebra on Multi-Core Processors using Heterogeneous Programming ( OpenMP-OpenCL, MPI-OpenCL, Pthreads-OpenCL) on HPC GPU Cluster.

  • HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Image Processing - Edge Detection algorithms using OpenACC

  • An Overview of Bio-Informatics : Sequence analysis (Smith-Waterman Algorithm) on HPC GPU Cluster - CUDA enabled NVIDIA GPUs & Heterogeneous Programming environment - OpenCL

  • Heterogeneous Programming (MPI on host-CPU, OpenCL on GPU & OpenACC) for String Search algorithms & Sequence Analysis Applications

  • Implementation of Image Processing applications (Edge Detection algorithms) on GPGPUs using CUDA/OpenCL enabled NVIDIA GPUs

  • Implementation of String Search Algorithms - CUDA/OpenCL enabled NVIDIA GPUs of HPC GPU Cluster

  • HPC GPU Cluster (MPI on host-CPU & OpenCL on GPU) - Open source software Benchmarks - Solution of the Matrix system Ax=b of Linear Equations (MAGMA on CUDA enabled GPUs & LINPACK solvers)

Centre for Development of Advanced Computing