hyPACK-2013 : Topics of Interest & Technical Programme
Topics related to the performance of applications on HPC Clusters with coprocessors and accelerators have been identified. The focus is on understanding the practical aspects of solving application kernels using different programming paradigms. In the hyPACK-2013 workshop, power-consumption and performance issues of application kernels on heterogeneous HPC Clusters with coprocessors and accelerators will be discussed. The approach adopted to heterogeneous programming for application kernels and numerical linear algebra on hybrid computing systems (HPC Clusters with accelerators and coprocessors) is covered in the Mode-1, Mode-2, Mode-3, Mode-4, Mode-5, and Mode-6 modules of hyPACK-2013, described below.
Mode-1 : (Host-CPU : Multi-Cores)
Tuning & Performance of programs on Multi-Core Processors; Shared Address Space and Partitioned Global Address Space (PGAS) memory models.
       
      
  
Mode-2 :
ARM microprocessor technology addresses the performance, power, and cost requirements of almost all application markets, and ARM development platforms featuring NVIDIA Tegra processors are being used in HPC. ARM platforms with the CUDA parallel programming toolkit provide the foundation for developers to build out the ARM HPC application ecosystem. The CARMA DevKit features the NVIDIA Tegra 3 quad-core ARM Cortex-A9 CPU and the NVIDIA Quadro 1000M GPU with 96 CUDA cores, and offers HPC developers a simple way to create CUDA applications for GPU-accelerated systems with ARM processors. Topics such as Tuning and Performance Issues, Power Consumption of Application Kernels, Measurement of Power Consumption using an External Power-Off Meter, and Programming on ARM multi-core processor systems will be discussed.
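As a minimal sketch of how such measurements can be combined (not taken from the workshop material), the fragment below times an OpenMP kernel on the host so that the average power read from an external meter over the same window can be turned into an energy estimate; the kernel, array names, and problem size are placeholders.

    /* Hypothetical sketch: time an OpenMP kernel on a multi-core ARM CPU so that
     * readings from an external power meter (watts) taken over the same window
     * can be converted into an energy estimate (joules = watts x seconds).     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)                       /* placeholder problem size */

    int main(void)
    {
        double *a = (double *)malloc(N * sizeof(double));
        double *b = (double *)malloc(N * sizeof(double));
        double sum = 0.0;

        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        double t0 = omp_get_wtime();          /* start of measurement window */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)          /* placeholder kernel: dot product */
            sum += a[i] * b[i];
        double t1 = omp_get_wtime();          /* end of measurement window */

        /* Multiply the average power read from the external meter over [t0,t1]
         * by (t1 - t0) to estimate the energy consumed by the kernel.          */
        printf("sum = %f, elapsed = %f s\n", sum, t1 - t0);
        free(a); free(b);
        return 0;
    }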
    
    
Mode-3 : Intel Xeon Phi Coprocessors
Programming on Intel Xeon-Phi Coprocessors; Xeon-Phi Coprocessor usage models : MPI versus Offload; Compiler Vectorization features and Approaches to Vectorization - Compiler Directives; Programming Paradigms - OpenMP, Intel TBB, Intel Cilk Plus, Intel MKL; Intel Xeon-Phi Coprocessor Architecture, Linux OS on the Coprocessor, and Coprocessor System Software; Tuning Memory Allocation Performance - Huge Page Sizes; Profiling & Tuning Tools - PAPI & MPI tools; and Tuning and Performance Issues - Power Consumption for Application Kernels, with hands-on sessions on Intel Xeon-Phi Coprocessors.
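As an illustrative sketch (assuming the Intel C compiler; array names and sizes are placeholders), the fragment below contrasts a compiler vectorization directive with OpenMP threading on simple triad-style loops.

    /* Hypothetical sketch: vectorization via an Intel compiler directive and
     * threading via OpenMP, on placeholder triad-style loops.                */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N], c[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        /* Vectorization through a compiler directive (Intel compiler) */
        #pragma simd
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];

        /* Threading across cores with OpenMP; the compiler may also
         * auto-vectorize the loop body.                                */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i] * c[i];

        printf("a[0] = %f\n", a[0]);
        return 0;
    }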
   
       
    
Mode-4 : (Accelerator : Device GPU - GPGPUs)
Multi-Core Processors with GPGPUs & GPU Computing Accelerators; CUDA-enabled NVIDIA GPUs - Kepler; NVIDIA-PGI Compiler Directives - OpenACC APIs; Efficient use of the different memory types; Libraries - CUBLAS, CUFFT, CUSPARSE - and CUDA-OpenACC APIs; AMD APP OpenCL and OpenCL on APUs; Use of NVML APIs to measure power and estimate power efficiency; An Overview of AMD Accelerated Parallel Processing (APP) Capabilities; AMD APUs - OpenCL.
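A minimal sketch of querying GPU board power with NVML is given below; it assumes a single NVIDIA GPU at device index 0 and linking against the NVML library (e.g. -lnvidia-ml).

    /* Hypothetical sketch: query the GPU board power draw with NVML while an
     * application kernel runs; device index 0 is an assumption.              */
    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        unsigned int power_mw;

        if (nvmlInit() != NVML_SUCCESS) {
            fprintf(stderr, "NVML initialisation failed\n");
            return 1;
        }
        nvmlDeviceGetHandleByIndex(0, &dev);      /* first GPU on the node */

        /* Board power draw is reported in milliwatts */
        nvmlDeviceGetPowerUsage(dev, &power_mw);
        printf("GPU 0 : %.1f W\n", power_mw / 1000.0);

        nvmlShutdown();
        return 0;
    }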
   
       
     
               
Mode-5 : HPC Cluster with Intel Xeon Phi Coprocessors and GPU Accelerators
Efficient use of Intel Xeon Phi Coprocessors and GPU Accelerators in a Cluster; Open Source Software using GPUs - MAGMA, and Top-500 Benchmarks; Performance Issues on an HPC Cluster with coprocessors; HPC GPU Cluster : programs based on the host-CPU and device GPUs (CUDA/OpenCL); Health Monitoring of a large GPU Cluster; MPI and CUDA on a GPU Cluster.
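A minimal MPI-plus-CUDA sketch of the host-CPU / device-GPU program structure on a GPU cluster is shown below; the rank-to-GPU mapping assumes the number of MPI ranks per node does not exceed the number of GPUs per node, and all names are placeholders.

    /* Hypothetical sketch: each MPI rank on a GPU cluster node selects one
     * CUDA device; per-rank GPU work and inter-node MPI exchange are elided. */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank, ndev = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaGetDeviceCount(&ndev);               /* GPUs visible on this node */
        if (ndev > 0)
            cudaSetDevice(rank % ndev);          /* simple rank-to-GPU mapping */

        printf("rank %d uses GPU %d of %d\n", rank, ndev > 0 ? rank % ndev : -1, ndev);

        /* ... per-rank CUDA work (cudaMalloc / kernel launches / cudaMemcpy),
         * with MPI used for halo exchange or reductions across nodes ...      */

        MPI_Finalize();
        return 0;
    }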
       
   
    
Mode-6 : Application Kernels
Mixed Programming for Numerical / Non-Numerical Computations on multi-core processors with Intel Xeon-Phi coprocessors, NVIDIA / AMD GPU accelerators, and ARM processor systems; Application & System Benchmarks & Performance; Image Processing Applications; Bio-Informatics - String Search Algorithms & Sequence Analysis; Dense / Sparse Matrix Computations on an HPC GPU Cluster; Solution of Partial Differential Equations (FDM & FEM); FFT Libraries; Invited Lectures on Information Sciences; Computational Physics.
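As one concrete instance of a dense matrix kernel on a GPU node (a sketch, not workshop code), the fragment below offloads a double-precision matrix-matrix multiply to CUBLAS; the matrix order and initial values are placeholders.

    /* Hypothetical sketch: C = alpha*A*B + beta*C on one GPU with CUBLAS;
     * the matrix order n and the fill values are placeholders.            */
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 1024;
        const double alpha = 1.0, beta = 0.0;
        size_t bytes = (size_t)n * n * sizeof(double);

        double *A = (double *)malloc(bytes), *B = (double *)malloc(bytes),
               *C = (double *)malloc(bytes);
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);   /* C = A * B on GPU */
        cublasDestroy(handle);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }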
 
    Challenges :  
An HPC cluster with coprocessors and accelerators uses MPI for data transfer across the network during execution. Besides the network transfer, data movement includes transferring data from the coprocessor or accelerator back to the CPU, and from the CPU down to the GPU or coprocessor for the next computation. From the application point of view, an HPC cluster with coprocessors and accelerators poses four main challenges:

  1. the application development process,
  2. job scheduling,
  3. resource management and health monitoring, and
  4. measurement of power consumption and performance of application kernels.

To address these challenges, the three principal components (host nodes, coprocessors or accelerators, and the interconnect) should be understood in detail.
 
PCI-Express allows multiple accelerators or coprocessors to be plugged into one host multi-core system. Since the Intel Xeon Phi coprocessor or GPU accelerator performs a substantial portion of the computation of application kernels, important characteristics such as host-CPU memory, PCIe bus, and network interconnect performance must be matched with the accelerator or coprocessor performance in order to maintain a well-balanced system.
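A minimal sketch of checking one of these characteristics, the effective host-to-device PCIe bandwidth, is given below; the 256 MB buffer size and the use of pinned host memory are assumptions made for illustration.

    /* Hypothetical sketch: measure effective host-to-device PCIe bandwidth
     * with CUDA events; buffer size and pinned memory are assumptions.     */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t bytes = 256UL * 1024 * 1024;     /* 256 MB transfer */
        float ms = 0.0f;
        void *h_buf, *d_buf;
        cudaEvent_t start, stop;

        cudaMallocHost(&h_buf, bytes);          /* pinned host memory */
        cudaMalloc(&d_buf, bytes);
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop); /* elapsed time in ms */

        printf("host-to-device : %.2f GB/s\n",
               (bytes / 1.0e9) / (ms / 1000.0));

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_buf); cudaFreeHost(h_buf);
        return 0;
    }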
 
In particular, high-end GPUs (NVIDIA Fermi or Kepler) require full-bandwidth PCIe Gen 2 x16 slots, which provide twice the bandwidth of x8 slots, a difference that matters when multiple GPUs are used. An interconnect such as InfiniBand QDR is highly desirable to match the amount of memory on the GPUs and enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective. On such a system, the development of message-passing based applications and the associated performance considerations can be studied. A mixed programming model, i.e., MPI, OpenMP, or Pthreads on the host CPU combined with CUDA on NVIDIA GPUs, OpenCL on device GPUs, or FPGA programming on RC-FPGA devices, is used for solving scientific and engineering applications.
The HPC cluster with Intel Xeon Phi coprocessors can also be used as a hybrid computing platform based on different programming paradigms such as OpenMP, MPI, Intel TBB, and OpenCL to solve applications. The Intel Xeon-Phi coprocessor offload pragmas can be used to port several applications in a message-passing environment. Understanding Intel's MIC architecture and the programming models for the Intel Xeon Phi coprocessor may enable programmers to achieve good performance with their applications. The workshop covers the hardware of the Intel Xeon Phi coprocessor, the basic programming models, porting of programs, and the tools and strategies used to analyze and improve application performance.
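A minimal sketch of the offload usage model is given below, assuming the Intel compiler's offload pragmas; the array names, size, and kernel are placeholders. Data is copied to the coprocessor, the loop runs there under OpenMP, and the result is copied back when the region ends.

    /* Hypothetical sketch: porting a loop to the Intel Xeon Phi coprocessor
     * with the Intel offload pragma; arrays and size are placeholders.      */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        double *a = (double *)malloc(N * sizeof(double));
        double *b = (double *)malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) a[i] = (double)i;

        /* Copy a[] to the coprocessor, run the loop there with OpenMP threads,
         * and copy b[] back to the host when the offload region completes.   */
        #pragma offload target(mic) in(a:length(N)) out(b:length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0 * a[i];
        }

        printf("b[10] = %f\n", b[10]);
        free(a); free(b);
        return 0;
    }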
 
The hybrid programming model on an HPC GPU cluster has two phases of computation: one on the host CPU and one on the accelerator (device GPU). The phases that exhibit little or no data parallelism are implemented on the host CPU, and the phases that exhibit abundant data parallelism are implemented in the device code. The data decomposition of the application kernel or numerical linear algebra computation is performed using MPI or Pthreads programming, taking into account the number of cores on the host multi-core processor system and the number of GPU devices available on it. Synchronization can be handled on the host CPU as well as on the device GPU.
 
The data required for computation on the device is transferred (host to device, device to host, and device to device) using the appropriate API calls in CUDA and OpenCL, as in the sketch below.
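A minimal CUDA runtime sketch of the three transfer directions follows; the buffer size and names are placeholders, and the OpenCL equivalents (clEnqueueWriteBuffer, clEnqueueReadBuffer, clEnqueueCopyBuffer) follow the same pattern.

    /* Hypothetical sketch: the three transfer directions expressed with the
     * CUDA runtime API; buffer size is a placeholder.                       */
    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t bytes = 1 << 20;                 /* 1 MB placeholder buffer */
        float *h = (float *)malloc(bytes);
        float *d1, *d2;

        memset(h, 0, bytes);
        cudaMalloc((void **)&d1, bytes);
        cudaMalloc((void **)&d2, bytes);

        cudaMemcpy(d1, h, bytes, cudaMemcpyHostToDevice);    /* host   -> device */
        cudaMemcpy(d2, d1, bytes, cudaMemcpyDeviceToDevice); /* device -> device */
        cudaMemcpy(h, d2, bytes, cudaMemcpyDeviceToHost);    /* device -> host   */

        cudaFree(d1); cudaFree(d2); free(h);
        return 0;
    }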
Currently, the Xeon Phi coprocessor is packaged as a separate PCIe device, external to the host processor. This PCIe packaging complicates the offload programming model, in which any thread can access any data in a shared memory system with some overhead. Achieving high offload computational performance with external coprocessors requires developers to (1) transfer data across the PCIe bus to the coprocessor and keep it there, (2) give the coprocessor enough work to do, and (3) focus on data reuse within the coprocessor(s) to avoid memory-bandwidth bottlenecks and repeated movement of data back and forth to the host processor.
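Points (1) and (3) can be expressed with the Intel offload data-persistence clauses; the sketch below (array name, size, and iteration count are placeholders) allocates the buffer on the coprocessor once, reuses it across several offloads, and frees it at the end.

    /* Hypothetical sketch: allocate data once on the coprocessor, reuse it
     * across several offloads, and copy it back and free it at the end.    */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        double *a = (double *)malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* First offload: copy a[] in, allocate it on the coprocessor, keep it */
        #pragma offload target(mic:0) in(a:length(N) alloc_if(1) free_if(0))
        {
            for (int i = 0; i < N; i++) a[i] *= 2.0;
        }

        /* Later offloads reuse the resident copy: no allocation, no transfer */
        for (int step = 0; step < 10; step++) {
            #pragma offload target(mic:0) nocopy(a:length(N) alloc_if(0) free_if(0))
            {
                for (int i = 0; i < N; i++) a[i] += 1.0;
            }
        }

        /* Copy the result back and free the coprocessor buffer */
        #pragma offload_transfer target(mic:0) out(a:length(N) alloc_if(0) free_if(1))

        printf("a[0] = %f\n", a[0]);
        free(a);
        return 0;
    }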