hyPACK-2013 Topics of Interest & Technical Programme

Topics related to the performance of applications on HPC clusters with coprocessors and accelerators have been identified. The focus is on understanding the practical aspects of solving application kernels using different programming paradigms. In the hyPACK-2013 workshop, power-consumption and performance issues of application kernels on heterogeneous HPC clusters with coprocessors and accelerators will be discussed. The approach adopted to heterogeneous programming of application kernels and numerical linear algebra on hybrid computing systems (HPC clusters with accelerators and coprocessors) is covered in the Mode-1 to Mode-6 modules of hyPACK-2013, described below.


  • Mode-1 : (Host-CPU : Multi-Cores) Tuning & performance of programs on multi-core processors & distributed shared address space (PGAS) memory models.

  • Mode-2 : (ARM Processors) ARM microprocessor technology addresses the performance, power and cost requirements of almost all application markets. ARM development platforms featuring NVIDIA Tegra processors are being used in HPC, and ARM platforms with the CUDA parallel programming toolkit provide the foundation for developers to build out the ARM HPC application ecosystem. The CARMA DevKit features the NVIDIA Tegra 3 quad-core ARM Cortex-A9 CPU and the NVIDIA Quadro 1000M GPU with 96 CUDA cores, offering HPC developers a simple way to create CUDA applications for GPU-accelerated systems with ARM processors. Topics such as tuning and performance issues, power consumption of application kernels, measurement of power consumption using an external power-off meter, and programming on ARM multi-core processor systems will be discussed.

  • Mode-3 : (Intel Xeon Phi Coprocessors) Programming on Intel Xeon Phi coprocessors; Xeon Phi coprocessor usage models - MPI versus offload; compiler vectorization features and approaches to vectorization - compiler directives; programming paradigms - OpenMP, Intel TBB, Intel Cilk Plus, Intel MKL; Intel Xeon Phi coprocessor architecture, Linux OS on the coprocessor and coprocessor system software; tuning memory-allocation performance - huge page sizes; profiling & tuning tools - PAPI & MPI tools; and tuning and performance issues - power consumption of application kernels. These topics will be discussed with hands-on sessions on Intel Xeon Phi coprocessors.

  • Mode-4 : (Accelerator : Device GPU - GPGPUs) Multi-core processors with GPGPUs & GPU computing accelerators; CUDA-enabled NVIDIA GPUs - Kepler; NVIDIA-PGI compiler directives - OpenACC APIs; efficient use of different memory types; libraries - CUBLAS, CUFFT, CUSPARSE - and CUDA-OpenACC APIs; use of NVML APIs to estimate power efficiency and performance (a minimal NVML sketch follows this list); an overview of AMD Accelerated Parallel Processing (APP) capabilities; AMD APUs and OpenCL on APUs.

  • Mode-5 : (HPC Cluster with Intel Xeon Phi Coprocessors and GPU Accelerators) Efficient use of Intel Xeon Phi coprocessors and GPU accelerators in a cluster; open-source software using GPUs - MAGMA - and Top-500 benchmarks; performance issues on an HPC cluster with coprocessors; HPC GPU cluster programs based on the host CPU and device GPUs (CUDA/OpenCL); health monitoring of a large GPU cluster; MPI and CUDA on a GPU cluster.

  • Mode-6 : (Application Kernels) Mixed programming for numerical / non-numerical computations on multi-core processors with Intel Xeon Phi coprocessors, NVIDIA / AMD GPU accelerators and ARM processor systems; application & system benchmarks & performance; image processing applications; bio-informatics - string search algorithms & sequence analysis; dense / sparse matrix computations on an HPC GPU cluster; solution of partial differential equations (FDM & FEM); FFT libraries; invited lectures on information sciences and computational physics.
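
As referenced in Mode-4 above, a minimal sketch of querying board power through NVML is given below. It is not workshop material: it assumes the NVML library shipped with the NVIDIA driver, a single GPU at index 0, and only samples instantaneous board power rather than carrying out a full power/performance study.

```cuda
// Minimal NVML power-sampling sketch (illustrative only).
// Build assumption: nvcc nvml_power.cu -lnvidia-ml -o nvml_power
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int power_mw = 0;
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML initialisation failed\n");
        return 1;
    }
    // Assumption: the GPU of interest is device index 0.
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetPowerUsage(dev, &power_mw) == NVML_SUCCESS) {
        // NVML reports board power in milliwatts.
        printf("Current board power: %.1f W\n", power_mw / 1000.0);
    }
    nvmlShutdown();
    return 0;
}
```

Repeated samples of this kind, taken while an application kernel runs and correlated with timing data, give a rough estimate of energy per run, which is the spirit of the power-measurement topics in Mode-2 and Mode-4.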


Challenges : An HPC cluster with coprocessors and accelerators uses MPI for data transfer across the network during execution. Besides network transfer, data movement includes upstreaming data from the coprocessor or accelerator to the CPU and downstreaming data from the CPU to the GPU or coprocessor for the next computation. From an application point of view, an HPC cluster with coprocessors and accelerators poses four main challenges:

  • application development process,
  • job scheduling,
  • resource management and health monitoring, and
  • measurement of power consumption and performance of application kernels.
To address these challenges, the three principal components (the host nodes, the coprocessors or accelerators, and the interconnect) must be understood in detail.
  • PCI-Express allows multiple accelerators or coprocessors to be plugged into one host multi-core system. Since the Intel Xeon Phi coprocessor or GPU accelerator performs a substantial portion of the computations of application kernels, important characteristics such as host-CPU memory, PCIe bus and network interconnect performance need to be matched with the accelerator or coprocessor performance in order to maintain a well-balanced system (a device-enumeration sketch follows this list).

  • In particular, high-end GPUs (NVIDIA Fermi or Kepler) require full-bandwidth PCIe Gen 2 x16 slots, which offer higher throughput than x8 speeds when multiple GPUs are used. An interconnect such as InfiniBand QDR is highly desirable to match the amount of memory on the GPUs and enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective, so that the development of message-passing based applications and their performance considerations can be studied. The mixed programming model, i.e., MPI, OpenMP or Pthreads on the host CPU combined with CUDA on CUDA-enabled NVIDIA GPUs, OpenCL programming on the device GPU, or FPGA programming on RC-FPGA devices, is used for solving scientific and engineering applications (a sketch of the MPI + CUDA rank-to-GPU binding follows this list). Also, an HPC cluster with Intel Xeon Phi coprocessors can be used as a hybrid computing platform based on different programming paradigms such as OpenMP, MPI, Intel TBB and OpenCL to solve applications. The Intel Xeon Phi coprocessor offload pragmas can be used to port several applications in a message-passing environment. Understanding Intel's MIC architecture and the programming models for the Intel Xeon Phi coprocessor may enable programmers to achieve good performance for their applications. The hardware of the Intel Xeon Phi coprocessor is described along with the basic programming models, and information about porting programs, together with tools and strategies to analyze and improve application performance, is discussed.
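
As mentioned in the first item above, several accelerators can sit on one host's PCIe complex. A minimal CUDA host-code sketch for enumerating the devices visible on a node, with their PCIe location and memory, is given below; the reported values are whatever the local system provides and nothing here is specific to the workshop cluster.

```cuda
// Sketch: enumerate CUDA devices on the host and report PCIe location,
// global memory and SM count for each (illustrative only).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
        printf("No CUDA devices found on this host\n");
        return 0;
    }
    for (int i = 0; i < ndev; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, PCIe bus %d device %d, %zu MB global memory, %d SMs\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID,
               prop.totalGlobalMem / (1024 * 1024), prop.multiProcessorCount);
    }
    return 0;
}
```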
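A hedged sketch of how the mixed MPI + CUDA model referred to above typically binds one MPI rank to one GPU is shown next. The rank-modulo-device mapping is an assumption about rank placement on each node, not something prescribed by the workshop.

```cuda
// Sketch: one MPI rank per GPU on each node, using a rank-modulo mapping.
// Build assumption: nvcc with the MPI headers and libraries on the paths.
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) {
        // Assumption: ranks are placed consecutively on each node, so
        // rank % ndev spreads the node-local ranks across its GPUs.
        cudaSetDevice(rank % ndev);
    }
    printf("Rank %d of %d bound to GPU %d of %d\n",
           rank, size, ndev > 0 ? rank % ndev : -1, ndev);

    // ... host phases use MPI/OpenMP/Pthreads; data-parallel phases run
    //     as CUDA kernels on the selected device ...

    MPI_Finalize();
    return 0;
}
```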

The hybrid programming model on an HPC GPU cluster has two phases of computation, i.e., on the host CPU and on the accelerator (device GPU). The phases that exhibit little or no data parallelism are implemented in host code, and the phases that exhibit a rich amount of data parallelism are implemented in device code.
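
A minimal sketch of this split is given below, assuming a simple vector-scale kernel as the data-parallel phase; the kernel, sizes and constants are illustrative, not an application kernel from the workshop.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Data-parallel phase: runs on the device GPU, one thread per element.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host phase (little data parallelism): setup, I/O, control flow.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Device phase (rich data parallelism).
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```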

The data decomposition of application kernels or numerical linear algebra computations is performed using MPI or Pthreads programming, keeping in mind the number of cores on the host multi-core processor system and the number of GPU devices attached to it. Synchronization can be handled on the host CPU as well as on the device GPU.
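
A short sketch of the block decomposition implied here is given below, assuming a one-dimensional array of n elements split across p workers (MPI ranks, Pthreads or GPUs); the remainder handling is the standard idiom rather than anything prescribed by the workshop.

```cuda
#include <stdio.h>

// Block decomposition of n elements across p workers: the first (n % p)
// workers get one extra element, so the pieces differ by at most one.
static void block_range(long n, int p, int rank, long *start, long *count)
{
    long base = n / p;
    long rem  = n % p;
    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}

int main(void)
{
    long n = 10;   // illustrative problem size
    int  p = 4;    // e.g. MPI ranks or GPUs on the node
    for (int r = 0; r < p; ++r) {
        long s, c;
        block_range(n, p, r, &s, &c);
        printf("worker %d: elements [%ld, %ld)\n", r, s, s + c);
    }
    return 0;
}
```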

The data required for computation is transferred between host and device (host to device, device to host and device to device) using the appropriate API calls in CUDA or OpenCL. Currently, the Xeon Phi coprocessor is packaged as a separate PCIe device, external to the host processor. This PCIe packaging complicates the offload programming model, in which any thread can access any data as in a shared-memory system, but with some overheads. Achieving high offload computational performance with external coprocessors requires developers to (1) transfer the data across the PCIe bus to the coprocessor and keep it there, (2) give the coprocessor enough work to do, and (3) focus on data reuse within the coprocessor(s) to avoid memory-bandwidth bottlenecks and moving data back and forth to the host processor.
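
A minimal sketch of the host-to-device, device-to-device and device-to-host transfers mentioned above is given below, using the CUDA runtime API; buffer names and sizes are illustrative. The OpenCL equivalents would use clEnqueueWriteBuffer, clEnqueueCopyBuffer and clEnqueueReadBuffer, and the Xeon Phi offload model expresses the same "move it once and keep it there" idea through its offload pragmas and data-persistence clauses.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_a = NULL, *d_b = NULL;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    // Host to device: move input data onto the accelerator once.
    cudaMemcpy(d_a, h_in, bytes, cudaMemcpyHostToDevice);

    // Device to device: keep intermediate results on the card instead of
    // bouncing them through host memory between computation steps.
    cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);

    // Device to host: bring back only the final result.
    cudaMemcpy(h_out, d_b, bytes, cudaMemcpyDeviceToHost);

    printf("h_out[42] = %f\n", h_out[42]);

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_in);
    free(h_out);
    return 0;
}
```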

Centre for Development of Advanced Computing