



hyPACK-2013 Hands On Session


hyPACK-2013 Hands-on Sessions (HoS) will be conducted on an HPC cluster with coprocessors and accelerators, as well as on an ARM processor cluster, an ARM processor system with a CUDA-enabled NVIDIA CARMA board, and DSP multi-core processor systems, covering the Mode-1, Mode-2, Mode-3, Mode-4 & Mode-5 modules. The approach adopted for heterogeneous programming of application kernels and numerical linear algebra on hybrid computing systems (HPC GPU clusters) is discussed in the Mode-1, Mode-2, Mode-3 & Mode-4 modules of hyPACK-2013. The laboratory platforms are described below.

Mode-5 : HPC Cluster with Accelerators & Coprocessors


HPC Cluster with Coprocessors & Accelerators : Prog. Env.

Cluster with NVIDIA GPUs     Cluster with AMD GPUs     Cluster with Intel Xeon Phi Coprocessors



Type 1 : HPC Cluster with NVIDIA GPUs

Host-CPU : Intel Xeon Quad Core
Device GPU : Multiple NVIDIA Fermi GPUs
Prog. Env : CUDA/OpenCL for NVIDIA GPUs; CUDA SDK/APIs; PGI Accelerator compilers with OpenACC API directives on GPUs (see the directive sketch below)

Peak performance (in double precision) of the HPC GPU cluster, for one node with a single CUDA-enabled NVIDIA GPU, is 615 Gflop/s.
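
As a minimal illustration of the directive-based route listed under Prog. Env (PGI Accelerator compilers with OpenACC), the C sketch below offloads a simple SAXPY loop to the GPU. It is an illustrative example and not part of the hyPACK-2013 lab handouts; the array size and the build command are assumptions.

    /* saxpy_acc.c - illustrative OpenACC sketch (hypothetical file name).
     * Build, for example, with the PGI compiler:  pgcc -acc saxpy_acc.c   */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(void)
    {
        float *x = (float *)malloc(N * sizeof(float));
        float *y = (float *)malloc(N * sizeof(float));
        const float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Ask the compiler to offload the loop to the GPU and manage the
         * host-to-device copies of x and y. */
        #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f (expected 4.0)\n", y[0]);
        free(x);
        free(y);
        return 0;
    }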


Host-CPU (Xeon)
  • One Intel Xeon 64-bit quad-core X5450-series (Harpertown) processor with two PCIe 2.0 x16 slots; RAM : 16 GB; Clock Speed : 3.0 GHz; CentOS 5.2; GCC Version 4.1.2; dual-socket quad-core (8 cores)

  • Intel MKL version 10.2, CUBLAS version 3.2, Intel icc 11.1. Peak Performance : CPU : 96 Gflop/s (1 Node - 8 Cores)

Device-GPU (NVIDIA)
  • One Tesla C2050 (Fermi) with 3 GB memory; Clock Speed 1.15 GHz, CUDA 3.2 Toolkit

  • The reported theoretical peak performance of the Fermi (C2050) is 515 Gflop/s in double precision (448 cores X 1.15 GHz X 1 instruction per cycle), and the reported maximum achievable DGEMM performance on Fermi is up to 58% of that peak.

  • The theoretical peak of the GTX 280 is 936 Gflop/s in single precision (240 cores X 1.30 GHz X 3 instructions per cycle), and the reported maximum achievable DGEMM performance is up to 40% of that peak.
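
The DGEMM efficiency figures above can be checked on the lab node with a small host-side timing code. The C sketch below is illustrative (the matrix order, file name and build line are assumptions, not lab material); it uses the legacy CUBLAS interface that ships with the CUDA 3.2 toolkit and reports the achieved fraction of the 515 Gflop/s double-precision peak of the C2050.

    /* dgemm_peak.c - illustrative sketch: time one DGEMM through CUBLAS and
     * compare with the 515 Gflop/s DP peak. Build, e.g.:
     *   nvcc -o dgemm_peak dgemm_peak.c -lcublas                            */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas.h>

    int main(void)
    {
        const int n = 4096;                    /* illustrative matrix order */
        size_t bytes = (size_t)n * n * sizeof(double);

        double *hA = (double *)malloc(bytes);  /* same host data reused for A and B */
        for (int i = 0; i < n * n; i++) hA[i] = 1.0 / (double)(i + 1);

        cublasInit();

        double *dA, *dB, *dC;
        cublasAlloc(n * n, sizeof(double), (void **)&dA);
        cublasAlloc(n * n, sizeof(double), (void **)&dB);
        cublasAlloc(n * n, sizeof(double), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(double), hA, n, dA, n);
        cublasSetMatrix(n, n, sizeof(double), hA, n, dB, n);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cublasDgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gflops = 2.0 * n * n * n / (ms * 1.0e6);  /* 2*n^3 flops in ms milliseconds */
        printf("DGEMM (n = %d): %.1f Gflop/s, %.0f%% of the 515 Gflop/s DP peak\n",
               n, gflops, 100.0 * gflops / 515.0);

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
        free(hA);
        return 0;
    }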




Type 2 : Configuration of HPC Cluster with AMD GPUs

HP Pavilion AMD A8-4500M (Trinity) APU with an AMD Radeon HD 7640G graphics chip
Host-CPU : AMD Opteron x86 12-Core     AMD APUs (APU 101)
Device GPU : AMD FireStream 9350 & 9250; AMD FirePro V5900 & V7900
Prog. Env : OpenCL for AMD APP GPUs; OpenCL SDK/APIs on APUs (see the device-query sketch below)

Peak performance (in double precision) of the HPC GPU cluster, for one node with a single AMD FireStream 9350, is 415 Gflop/s.
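
Before running OpenCL kernels on the AMD nodes it is useful to confirm what the AMD APP runtime exposes. The C sketch below (illustrative; the file name and build line are assumptions) enumerates the OpenCL platforms and devices and prints the compute units and global memory of each device.

    /* cl_query.c - illustrative OpenCL device query.
     * Build, e.g.:  gcc cl_query.c -lOpenCL                                 */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint p = 0; p < nplat; p++) {
            char pname[128];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                              sizeof(pname), pname, NULL);

            cl_device_id devices[8];
            cl_uint ndev = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

            for (cl_uint d = 0; d < ndev; d++) {
                char dname[128];
                cl_uint cu = 0;
                cl_ulong gmem = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(dname), dname, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(cu), &cu, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE,
                                sizeof(gmem), &gmem, NULL);
                printf("%s / %s : %u compute units, %lu MB global memory\n",
                       pname, dname, cu, (unsigned long)(gmem >> 20));
            }
        }
        return 0;
    }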


Host-CPU (AMD)
  • One AMD Opteron x86 24-core multi-core processor system with two PCIe 2.0 x16 slots; RAM : 48 GB; Clock Speed : 3.0 GHz; CentOS 5.2; GCC Version 4.1.2; dual-socket 12-core (24 cores)

  • ACML, OpenCL and BLAS libraries; Peak Performance : CPU : 144 Gflop/s (1 Node - 12 Cores); AMD APP with OpenCL Prog. Env.

  • AMD FireStream 9250 GPU Accelerator :
    Double Precision Floating Point : The FireStream 9250 supports double precision floating point operations in hardware;
    High Performance per Watt : Up to 8 GFLOPS per watt of single precision performance potential

    Optimized for computation : The AMD FireStream product line provided the industry's first double-precision floating-point capability on a GPU; the AMD FireStream 9250 is AMD's second-generation DP-FP product, with 1 GB of GDDR3 memory on board and single-precision performance of 1 TFLOPS.

  • AMD FireStream 9350 GPU Accelerator :
    Technology Need : AMD FireStream Computing Solution
    High DPFP performance : 528 GFLOPS double precision
    High performance per Watt : 2.4 GFLOPS / Watt
    Open standards : OpenCL, Direct Compute
    Performance optimization tools : OpenCL SDK
    PCIe 2.1 Host Interface : 8 GB/s host-GPU bandwidth

    The FireStream 9350 offers maximum performance per slot, with 2 GB of GDDR5 memory in a single-slot configuration.

  • AMD FirePro V5900 :
    The AMD FirePro V5900 features 2 GB of fast GDDR5 memory, 512 stream processors, and support for three simultaneous monitor outputs from a single card with AMD Eyefinity technology. The V5900 supports OpenCL, offers the parallel processing capability of its 512 stream processors, and is PCI Express 2.1 compliant.

  • AMD FirePro V7900 :
    The AMD FirePro V7900 features 2 GB of ultra-fast GDDR5 memory and 1280 stream processors. The V7900 supports OpenCL, offers the parallel processing capability of its 1280 stream processors, and is PCI Express 2.1 compliant.

  • HP Pavilion AMD (Trinity) APU : the Pavilion dv6-7010 features an AMD A8-4500M APU with four cores, a 1.9 GHz clock frequency and a 2.8 GHz Turbo boost. Graphics are provided by a Radeon HD 7640G chip. Further specifications include 6 GB of memory, a 750 GB hard disk, Gigabit LAN, 802.11 b/g/n WiFi and Bluetooth. The 15.6-inch screen has a resolution of 1366x768 pixels.



Type 3 : HPC Cluster with Intel Xeon Phi Coprocessors (PARAM YUVA-II)

Host-CPU : Intel Xeon Multi-Core Processor
Device Accelerator : Two Intel Xeon Phi Coprocessors
Prog. Env. : Intel Prog. Tools & CDAC KSHIPRA

Host-CPU : Intel Xeon Processor (PARAM YUVA-II compute node) : dual-socket eight-core systems (16 cores : Intel(R) Xeon(R) CPU E5-2670 @ 2.60 GHz, Sandy Bridge architecture); RAM : 64 GB; cache : 20 MB; GCC 4.4.6; interconnects : PARAMNet-II and InfiniBand. Each node has two Intel Xeon Phi coprocessors. Peak Performance / Node : 2.35 TF; OS : Linux

Device Accelerator : Two Intel Xeon Phi Coprocessors : 60 cores; 8 GB GDDR5 RAM; 32 KB L1 cache per core; 512 KB L2 cache per core

Prog. Env. : Intel Development Tools : Intel MPI, OpenMP, Cilk Plus, public-domain MVAPICH2; MKL, NAG & CDAC KSHIPRA; and Varda Prog. Env. - RCS-FPGA Prog.
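
A first exercise on such a node is typically a hybrid MPI + OpenMP check that ranks and threads land where expected. The C sketch below is illustrative (the thread-support level, file name and build line are assumptions); it can be built with Intel MPI or MVAPICH2, e.g. mpicc -fopenmp hybrid_hello.c (or mpiicc -openmp with the Intel tools).

    /* hybrid_hello.c - illustrative hybrid MPI + OpenMP placement check. */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, size;

        /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        #pragma omp parallel
        {
            /* Serialize the prints so the output lines stay intact. */
            #pragma omp critical
            printf("rank %d of %d, thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }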



  • System 1 : Intel Xeon Phi Coprocessor : The pragma-based offload model, together with using the Intel Xeon Phi as an SMP processor, is one of the easiest approaches to writing programs similar to those for existing x86 systems (a minimal offload sketch is given after this list). The Intel Xeon Phi KNC processor is a 61-core SMP chip in which each core has a dedicated 512-bit wide SIMD vector unit. All the cores are connected via a 512-bit bidirectional ring interconnect. Currently, the Phi coprocessor is packaged as a separate PCIe device, external to the host processor. Each Phi contains 16 GB of RAM that provides all the memory and file-system storage used by user processes, the Linux operating system, and ancillary daemon processes. The theoretical maximum bandwidth of the Intel Xeon Phi memory system is 352 GB/s (5.5 GTransfers/s * 16 channels * 4 B/transfer). Each Intel Xeon Phi core is based on a modified Pentium processor design that supports hyperthreading and new x86 instructions created for the wide vector unit. Parallel threads must issue instructions to the wide vector units quickly enough to keep the vector pipeline full; the current generation of coprocessor cores supports up to four concurrent threads of execution via hyperthreading.

    For the laboratory session, the coprocessor is integrated with an Intel x86 Xeon (Sandy Bridge) host system.

  • System 2 : PARAM YUVA-II, a hybrid computing platform, is a message-passing cluster; the configuration of a compute node with coprocessors is given below. Compute Node : dual-socket eight-core systems (16 cores : Intel(R) Xeon(R) CPU E5-2670 @ 2.60 GHz, Sandy Bridge architecture); RAM : 64 GB; cache : 20 MB; GCC 4.4.6; interconnects : PARAMNet-II and InfiniBand. Each node has two Intel Xeon Phi Coprocessors.

    Intel Xeon Phi Coprocessor : 60 cores; 8 GB GDDR5 RAM; 32 KB L1 cache per core; 512 KB L2 cache per core
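
The pragma-based offload model described under System 1 can be illustrated with a short C sketch (the file name and array size are assumptions, not part of the lab handouts): the Intel compiler ships the marked region and the named arrays to coprocessor 0, where the loop runs under OpenMP.

    /* offload_sketch.c - illustrative Intel Xeon Phi offload example.
     * Build with the Intel compiler, e.g.:  icc -openmp offload_sketch.c    */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        float *a = (float *)malloc(N * sizeof(float));
        float *b = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) a[i] = (float)i;

        /* 'in' copies a to the coprocessor, 'out' copies b back to the host
         * when the offloaded region completes. */
        #pragma offload target(mic:0) in(a : length(N)) out(b : length(N))
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[10] = %f (expected 20.0)\n", b[10]);
        free(a);
        free(b);
        return 0;
    }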



Centre for Development of Advanced Computing