hyPACK-2013 Topics of Interest & Technical Programme
Topics related to the performance of applications on HPC clusters with coprocessors and accelerators have been
identified. The focus is on understanding the practical aspects of solving application kernels using
different programming paradigms.
In this hyPACK-2013 workshop, power-consumption and performance issues of application kernels
on heterogeneous HPC clusters with coprocessors and accelerators will be discussed.
The approach adopted to heterogeneous programming for application
kernels and numerical linear algebra on hybrid computing systems
(HPC clusters with accelerators and coprocessors) is covered
in the Mode-1, Mode-2, Mode-3, Mode-4, Mode-5 and Mode-6 modules of
hyPACK-2013, given below.
|
-
Mode-1 :
(Host-CPU : Multi-Cores) Tuning & Performance of programs
on Multi-Core Processors & Partitioned Global Address Space (PGAS) Memory Models
-
Mode-2 :
ARM microprocessor technology addresses the performance, power and cost requirements of
almost all application markets. ARM development platforms featuring NVIDIA Tegra processors
are being used in HPC. ARM platforms with the CUDA parallel programming
toolkit provide the foundation for developers to build out the ARM HPC application ecosystem.
The CARMA DevKit features the NVIDIA Tegra 3 quad-core ARM Cortex-A9 CPU and the NVIDIA Quadro 1000M GPU
with 96 CUDA cores. It offers HPC developers a simple way to create CUDA applications for
GPU-accelerated systems with ARM processors. Topics such as
tuning and performance issues, power consumption for application kernels, measurement of power consumption using an external power-off meter, and programming on ARM multi-core processor systems will be discussed.
-
Mode-3 :
Intel Xeon Phi Coprocessors
Programming on Intel Xeon Phi Coprocessors; Xeon Phi Coprocessor usage models : MPI versus Offload;
Compiler Vectorization features, Approaches to Vectorization - Compiler Directives, Programming Paradigms - OpenMP, Intel TBB, Intel Cilk Plus, Intel MKL;
Intel Xeon Phi Coprocessor Architecture, Linux OS on the Coprocessor, Coprocessor System Software, Tuning Memory Allocation Performance - Huge Page Sizes, Profiling & Tuning Tools - PAPI & MPI tools; and
Tuning and Performance Issues - Power Consumption for Application Kernels - will be discussed with hands-on sessions on Intel Xeon Phi Coprocessors.
-
Mode-4 :
(Accelerator : Device GPU - GPGPUs)
Multi-Core Processors with GPGPUs & GPU Computing Accelerators
CUDA-enabled NVIDIA GPUs - Kepler; NVIDIA-PGI Compiler Directives - OpenACC APIs;
Efficient use of different memory types; Libraries - CUBLAS, CUFFT, CUSPARSE - and CUDA-OpenACC APIs;
AMD APP OpenCL; OpenCL on APUs; use of NVML power-management APIs to estimate power and
performance; An overview of AMD Accelerated Parallel Processing (APP) capabilities;
AMD APUs - OpenCL
-
Mode-5 :
HPC Cluster with Intel Xeon Phi Coprocessors and GPU Accelerators
Efficient use of Intel Xeon Phi Coprocessors and GPU Accelerators in a cluster;
Open-source software using GPUs - MAGMA - & Top-500 Benchmarks;
Performance issues on an HPC Cluster with coprocessors;
HPC GPU Cluster : programs based on the host CPU and device GPUs (CUDA/OpenCL);
Health monitoring of a large GPU Cluster - MPI and CUDA on a GPU Cluster
-
Mode-6 :
Application Kernels
Mixed programming for numerical/non-numerical computations on multi-core processors with
Intel Xeon Phi coprocessors, NVIDIA/AMD GPU accelerators and ARM processor systems;
Application & System Benchmarks & Performance; Image-Processing Applications; Bio-Informatics -
String-Search Algorithms & Sequence Analysis; Dense/Sparse Matrix Computations on an HPC GPU
Cluster; Solution of Partial Differential Equations (FDM & FEM); FFT libraries; Invited lectures on Information Sciences; Computational Physics
|
Challenges :
An HPC cluster with coprocessors and accelerators uses MPI for data transfer across the network during execution.
Besides network transfer, data movement includes upstreaming data from the coprocessor or accelerator to the CPU and
downstreaming data from the CPU to the GPU or coprocessor for the next computation. From the application point of view,
there are four main challenges for an HPC cluster with coprocessors and accelerators:
-
Application development process,
-
job scheduling,
-
resource management and health monitoring, and
-
measurement of power consumption and performance of application kernels.
To address these challenges, the three principal components - host nodes, coprocessors or accelerators, and the interconnect - should be understood in detail.
-
PCI-Express allows multiple accelerators or coprocessors to be plugged into one host multi-core system.
Since the Intel Xeon Phi coprocessor or GPU accelerator performs a substantial portion of the computation of
application kernels, important characteristics such as host-CPU memory, the PCIe bus,
and network interconnect performance need to be matched with the accelerator or coprocessor
performance in order to maintain a well-balanced system.
-
In particular, high-end GPUs (NVIDIA Fermi or Kepler) require full-bandwidth PCIe Gen 2 x16 slots,
which provide higher transfer speeds than x8 slots when multiple GPUs are used. An interconnect such as
InfiniBand QDR is highly desirable to match the amount of memory on the GPUs
in order to enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be
desirable from the software-development perspective.
That is, the development of mixed-programming-based applications and their performance considerations
can be studied. The mixed programming model - i.e., MPI, OpenMP or Pthreads on the host CPU combined with
CUDA on CUDA-enabled NVIDIA GPUs, OpenCL programming on the device GPU, or FPGA programming on RC-FPGA
devices - is used for solving scientific and engineering applications.
Also, an HPC cluster with Intel Xeon Phi coprocessors can be used as a hybrid computing platform
based on different programming paradigms such as OpenMP, MPI, Intel TBB and OpenCL to solve
applications. The Intel Xeon Phi coprocessor offload pragmas can be used to port several applications
in a message-passing environment.
Understanding Intel's MIC architecture and the programming models for the
Intel Xeon Phi coprocessor may enable programmers to achieve good performance from their applications.
The hardware of the Intel Xeon Phi coprocessor is described, together with the basic programming models.
Information on porting programs is also covered, along with tools and strategies for analyzing and improving application performance.
|
The hybrid programming model on an HPC GPU cluster has two phases of computation, i.e.,
on the host CPU and on the accelerator (device GPU). The phases that exhibit little or no data parallelism are
implemented in the host code, and the phases that exhibit a rich amount of data parallelism are
implemented in the device code.
The data decomposition of an application kernel or numerical linear algebra computation
is performed based on MPI or Pthreads programming,
keeping in view the number of cores on the host multi-core processor system and
the number of GPU devices available on it. Synchronization can be performed
on the host CPU as well as on the device GPU.
The data required for computation on the device is transferred - host to device, device to host
and device to device - using the appropriate API calls in CUDA and OpenCL.
Currently, the Xeon Phi coprocessor is packaged as a separate PCIe device, external to the host processor.
This PCIe packaging complicates the offload programming model, in which any thread can access any data
in a shared-memory system with some overhead. Achieving high offload computational performance with external
coprocessors requires developers to (1) transfer the data across the PCIe bus to the coprocessor and keep it there,
(2) give the coprocessor enough work to do, and (3) focus on data reuse within the coprocessor(s) to avoid
memory-bandwidth bottlenecks and moving data back and forth to the host processor.
|