hyPACK-2013 : Topics of Interest & Technical Programme
Topics related to the performance of applications on HPC Clusters with coprocessors and accelerators have been identified. The focus is on understanding the practical aspects of solving application kernels using different programming paradigms. In the hyPACK-2013 workshop, power-consumption and performance issues of application kernels on heterogeneous HPC Clusters with coprocessors and accelerators will be discussed. The approach adopted to heterogeneous programming for application kernels and numerical linear algebra on hybrid computing systems (HPC Clusters with accelerators and coprocessors) is covered in the Mode-1, Mode-2, Mode-3, Mode-4, Mode-5, and Mode-6 modules of hyPACK-2013, described below.
Mode-1 : (Host-CPU : Multi-Cores)
Tuning & Performance of programs on Multi-Core Processors; Shared Address Space and Partitioned Global Address Space (PGAS) memory models.
       
      
  
Mode-2 :
ARM microprocessor technology addresses the performance, power, and cost requirements of almost all application markets, and ARM development platforms featuring NVIDIA Tegra processors are being used in HPC. ARM platforms with the CUDA parallel programming toolkit provide the foundation for developers to build out the ARM HPC application ecosystem. The CARMA DevKit features the NVIDIA Tegra 3 quad-core ARM Cortex-A9 CPU and the NVIDIA Quadro 1000M GPU with 96 CUDA cores, and offers HPC developers a simple way to create CUDA applications for GPU-accelerated systems with ARM processors. Topics such as Tuning and Performance Issues, Power Consumption of Application Kernels, Measurement of Power Consumption using an External Power-Off Meter, and Programming on ARM multi-core processor systems will be discussed.
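As a minimal sketch of how such measurements can be combined (not taken from the workshop material), the fragment below times an OpenMP kernel on the host so that the average power read from an external meter over the same window can be turned into an energy estimate; the kernel, array names, and problem size are placeholders.

    /* Hypothetical sketch: time an OpenMP kernel on a multi-core ARM CPU so that
     * readings from an external power meter (watts) taken over the same window
     * can be converted into an energy estimate (joules = watts x seconds).     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)                       /* placeholder problem size */

    int main(void)
    {
        double *a = (double *)malloc(N * sizeof(double));
        double *b = (double *)malloc(N * sizeof(double));
        double sum = 0.0;

        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        double t0 = omp_get_wtime();          /* start of measurement window */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)          /* placeholder kernel: dot product */
            sum += a[i] * b[i];
        double t1 = omp_get_wtime();          /* end of measurement window */

        /* Multiply the average power read from the external meter over [t0,t1]
         * by (t1 - t0) to estimate the energy consumed by the kernel.          */
        printf("sum = %f, elapsed = %f s\n", sum, t1 - t0);
        free(a); free(b);
        return 0;
    }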
    
    
Mode-3 : Intel Xeon Phi Coprocessors
Programming on Intel Xeon-Phi Coprocessors; Xeon-Phi Coprocessor usage models : MPI versus Offload; Compiler Vectorization features and Approaches to Vectorization - Compiler Directives; Programming Paradigms - OpenMP, Intel TBB, Intel Cilk Plus, Intel MKL; Intel Xeon-Phi Coprocessor Architecture, Linux OS on the Coprocessor, and Coprocessor System Software; Tuning Memory Allocation Performance - Huge Page Sizes; Profiling & Tuning Tools - PAPI & MPI tools; and Tuning and Performance Issues - Power Consumption for Application Kernels, with hands-on sessions on Intel Xeon-Phi Coprocessors.
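As an illustrative sketch (assuming the Intel C compiler; array names and sizes are placeholders), the fragment below contrasts a compiler vectorization directive with OpenMP threading on simple triad-style loops.

    /* Hypothetical sketch: vectorization via an Intel compiler directive and
     * threading via OpenMP, on placeholder triad-style loops.                */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        /* Vectorization through a compiler directive (Intel compiler) */
        #pragma simd
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];

        /* Threading across cores with OpenMP; the compiler may also
         * auto-vectorize the loop body.                                */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i] * c[i];

        printf("a[0] = %f\n", a[0]);
        return 0;
    }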
   
       
    
Mode-4 : (Accelerator : Device GPU - GPGPUs)
Multi-Core Processors with GPGPUs & GPU Computing Accelerators; CUDA-enabled NVIDIA GPUs - Kepler; NVIDIA-PGI Compiler Directives - OpenACC APIs; Efficient use of the different memory types; Libraries - CUBLAS, CUFFT, CUSPARSE - and CUDA-OpenACC APIs; AMD APP OpenCL and OpenCL on APUs; Use of NVML APIs to measure power and estimate power efficiency; An Overview of AMD Accelerated Parallel Processing (APP) Capabilities; AMD APUs - OpenCL.
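A minimal sketch of querying GPU board power with NVML is given below; it assumes a single NVIDIA GPU at device index 0 and linking against the NVML library (e.g. -lnvidia-ml).

    /* Hypothetical sketch: query the GPU board power draw with NVML while an
     * application kernel runs; device index 0 is an assumption.              */
    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        unsigned int power_mw;

        if (nvmlInit() != NVML_SUCCESS) {
            fprintf(stderr, "NVML initialisation failed\n");
            return 1;
        }
        nvmlDeviceGetHandleByIndex(0, &dev);      /* first GPU on the node */

        /* Board power draw is reported in milliwatts */
        nvmlDeviceGetPowerUsage(dev, &power_mw);
        printf("GPU 0 : %.1f W\n", power_mw / 1000.0);

        nvmlShutdown();
        return 0;
    }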
   
       
     
               
Mode-5 : HPC Cluster with Intel Xeon Phi Coprocessors and GPU Accelerators
Efficient use of Intel Xeon Phi Coprocessors and GPU Accelerators in a Cluster; Open Source Software using GPUs - MAGMA, and Top-500 Benchmarks; Performance Issues on an HPC Cluster with coprocessors; HPC GPU Cluster : programs based on the host-CPU and device GPUs (CUDA/OpenCL); Health Monitoring of a large GPU Cluster; MPI and CUDA on a GPU Cluster.
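A minimal MPI-plus-CUDA sketch of the host-CPU / device-GPU program structure on a GPU cluster is shown below; the rank-to-GPU mapping assumes the number of MPI ranks per node does not exceed the number of GPUs per node, and all names are placeholders.

    /* Hypothetical sketch: each MPI rank on a GPU cluster node selects one
     * CUDA device; per-rank GPU work and inter-node MPI exchange are elided. */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank, ndev = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaGetDeviceCount(&ndev);               /* GPUs visible on this node */
        if (ndev > 0)
            cudaSetDevice(rank % ndev);          /* simple rank-to-GPU mapping */

        printf("rank %d uses GPU %d of %d\n", rank, ndev > 0 ? rank % ndev : -1, ndev);

        /* ... per-rank CUDA work (cudaMalloc / kernel launches / cudaMemcpy),
         * with MPI used for halo exchange or reductions across nodes ...      */

        MPI_Finalize();
        return 0;
    }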
       
   
    
Mode-6 : Application Kernels
Mixed Programming for Numerical / Non-Numerical Computations on multi-core processors with Intel Xeon-Phi coprocessors, NVIDIA / AMD GPU accelerators, and ARM processor systems; Application & System Benchmarks & Performance; Image Processing Applications; Bio-Informatics - String Search Algorithms & Sequence Analysis; Dense / Sparse Matrix Computations on an HPC GPU Cluster; Solution of Partial Differential Equations (FDM & FEM); FFT Libraries; Invited Lectures on Information Sciences; Computational Physics.
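As one concrete instance of a dense matrix kernel on a GPU node (a sketch, not workshop code), the fragment below offloads a double-precision matrix-matrix multiply to CUBLAS; the matrix order and initial values are placeholders.

    /* Hypothetical sketch: C = alpha*A*B + beta*C on one GPU with CUBLAS;
     * the matrix order n and the fill values are placeholders.            */
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 1024;
        const double alpha = 1.0, beta = 0.0;
        size_t bytes = (size_t)n * n * sizeof(double);

        double *A = (double *)malloc(bytes), *B = (double *)malloc(bytes),
               *C = (double *)malloc(bytes);
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);   /* C = A * B on GPU */
        cublasDestroy(handle);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }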
 
    Challenges :  
An HPC cluster with coprocessors and accelerators uses MPI for data transfer across the network during execution. Besides the network transfer, data movement includes transferring data from the coprocessor or accelerator back to the CPU, and from the CPU down to the GPU or coprocessor for the next computation. From the application point of view, an HPC cluster with coprocessors and accelerators poses four main challenges:

  1. the application development process,
  2. job scheduling,
  3. resource management and health monitoring, and
  4. measurement of power consumption and performance of application kernels.

To address these challenges, the three principal components (host nodes, coprocessors or accelerators, and the interconnect) should be understood in detail.
 
PCI-Express allows multiple accelerators or coprocessors to be plugged into one host multi-core system. Since the Intel Xeon Phi coprocessor or GPU accelerator performs a substantial portion of the computation of application kernels, important characteristics such as host-CPU memory, PCIe bus, and network interconnect performance must be matched with the accelerator or coprocessor performance in order to maintain a well-balanced system.
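A minimal sketch of checking one of these characteristics, the effective host-to-device PCIe bandwidth, is given below; the 256 MB buffer size and the use of pinned host memory are assumptions made for illustration.

    /* Hypothetical sketch: measure effective host-to-device PCIe bandwidth
     * with CUDA events; buffer size and pinned memory are assumptions.     */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t bytes = 256UL * 1024 * 1024;     /* 256 MB transfer */
        float ms = 0.0f;
        void *h_buf, *d_buf;
        cudaEvent_t start, stop;

        cudaMallocHost(&h_buf, bytes);          /* pinned host memory */
        cudaMalloc(&d_buf, bytes);
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop); /* elapsed time in ms */

        printf("host-to-device : %.2f GB/s\n",
               (bytes / 1.0e9) / (ms / 1000.0));

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_buf); cudaFreeHost(h_buf);
        return 0;
    }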
 
In particular, high-end GPUs (NVIDIA Fermi or Kepler) require full-bandwidth PCIe Gen 2 x16 slots, which provide twice the bandwidth of x8 slots, a difference that matters when multiple GPUs are used. An interconnect such as InfiniBand QDR is highly desirable to match the amount of memory on the GPUs and enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective. On such a system, the development of message-passing based applications and the associated performance considerations can be studied. A mixed programming model, i.e., MPI, OpenMP, or Pthreads on the host CPU combined with CUDA on NVIDIA GPUs, OpenCL on device GPUs, or FPGA programming on RC-FPGA devices, is used for solving scientific and engineering applications.
The HPC cluster with Intel Xeon Phi coprocessors can also be used as a hybrid computing platform based on different programming paradigms such as OpenMP, MPI, Intel TBB, and OpenCL to solve applications. The Intel Xeon-Phi coprocessor offload pragmas can be used to port several applications in a message-passing environment. Understanding Intel's MIC architecture and the programming models for the Intel Xeon Phi coprocessor may enable programmers to achieve good performance with their applications. The workshop covers the hardware of the Intel Xeon Phi coprocessor, the basic programming models, porting of programs, and the tools and strategies used to analyze and improve application performance.
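A minimal sketch of the offload usage model is given below, assuming the Intel compiler's offload pragmas; the array names, size, and kernel are placeholders. Data is copied to the coprocessor, the loop runs there under OpenMP, and the result is copied back when the region ends.

    /* Hypothetical sketch: porting a loop to the Intel Xeon Phi coprocessor
     * with the Intel offload pragma; arrays and size are placeholders.      */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        double *a = (double *)malloc(N * sizeof(double));
        double *b = (double *)malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) a[i] = (double)i;

        /* Copy a[] to the coprocessor, run the loop there with OpenMP threads,
         * and copy b[] back to the host when the offload region completes.   */
        #pragma offload target(mic) in(a:length(N)) out(b:length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0 * a[i];
        }

        printf("b[10] = %f\n", b[10]);
        free(a); free(b);
        return 0;
    }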
 
The hybrid programming model on an HPC GPU cluster has two phases of computation: one on the host CPU and one on the accelerator (device GPU). The phases that exhibit little or no data parallelism are implemented on the host CPU, and the phases that exhibit abundant data parallelism are implemented in the device code. The data decomposition of the application kernel or numerical linear algebra computation is performed using MPI or Pthreads programming, taking into account the number of cores on the host multi-core processor system and the number of GPU devices available on it. Synchronization can be handled on the host CPU as well as on the device GPU.
 
The data required for computation on the device is transferred (host to device, device to host, and device to device) using the appropriate API calls in CUDA and OpenCL, as in the sketch below.
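A minimal CUDA runtime sketch of the three transfer directions follows; the buffer size and names are placeholders, and the OpenCL equivalents (clEnqueueWriteBuffer, clEnqueueReadBuffer, clEnqueueCopyBuffer) follow the same pattern.

    /* Hypothetical sketch: the three transfer directions expressed with the
     * CUDA runtime API; buffer size is a placeholder.                       */
    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t bytes = 1 << 20;                 /* 1 MB placeholder buffer */
        float *h = (float *)malloc(bytes);
        float *d1, *d2;

        memset(h, 0, bytes);
        cudaMalloc((void **)&d1, bytes);
        cudaMalloc((void **)&d2, bytes);

        cudaMemcpy(d1, h, bytes, cudaMemcpyHostToDevice);    /* host   -> device */
        cudaMemcpy(d2, d1, bytes, cudaMemcpyDeviceToDevice); /* device -> device */
        cudaMemcpy(h, d2, bytes, cudaMemcpyDeviceToHost);    /* device -> host   */

        cudaFree(d1); cudaFree(d2); free(h);
        return 0;
    }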
Currently, the Xeon Phi coprocessor is packaged as a separate PCIe device, external to the host processor. This PCIe packaging complicates the offload programming model, in which any thread can access any data in a shared memory system with some overhead. Achieving high offload computational performance with external coprocessors requires developers to (1) transfer data across the PCIe bus to the coprocessor and keep it there, (2) give the coprocessor enough work to do, and (3) focus on data reuse within the coprocessor(s) to avoid memory-bandwidth bottlenecks and repeated movement of data back and forth to the host processor.
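Points (1) and (3) can be expressed with the Intel offload data-persistence clauses; the sketch below (array name, size, and iteration count are placeholders) allocates the buffer on the coprocessor once, reuses it across several offloads, and frees it at the end.

    /* Hypothetical sketch: allocate data once on the coprocessor, reuse it
     * across several offloads, and copy it back and free it at the end.    */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        double *a = (double *)malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* First offload: copy a[] in, allocate it on the coprocessor, keep it */
        #pragma offload target(mic:0) in(a:length(N) alloc_if(1) free_if(0))
        {
            for (int i = 0; i < N; i++) a[i] *= 2.0;
        }

        /* Later offloads reuse the resident copy: no allocation, no transfer */
        for (int step = 0; step < 10; step++) {
            #pragma offload target(mic:0) nocopy(a:length(N) alloc_if(0) free_if(0))
            {
                for (int i = 0; i < N; i++) a[i] += 1.0;
            }
        }

        /* Copy the result back and free the coprocessor buffer */
        #pragma offload_transfer target(mic:0) out(a:length(N) alloc_if(0) free_if(1))

        printf("a[0] = %f\n", a[0]);
        free(a);
        return 0;
    }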