



hyPACK-2013 HPC GPU Cluster with Xeon Phi Co-processors - Heterogeneous Programming


The Message Passing programming paradigm is one of the most widely used approaches for programming parallel computers. The standard Message Passing Interface (MPI) library is commonly used for applications written in a variety of programming languages. In a message-passing cluster, MPI processes are launched across several cluster nodes connected by a suitable interconnect. Two key attributes characterize the message-passing programming paradigm: first, it assumes a partitioned address space; second, it supports only explicit parallelism. The logical view of a machine supporting the message-passing paradigm consists of p processes, each with its own exclusive address space.
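As an illustration of explicit message passing between partitioned address spaces, the following minimal sketch (not part of the original course material; names are illustrative) sends an integer from rank 0 to rank 1, which can only obtain it through an explicit receive:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 100;                      /* exists only in rank 0's address space */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 must receive explicitly into its own copy of 'data' */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}

The sketch assumes at least two MPI ranks, e.g. mpirun -n 2 ./a.out after compiling with one of the MPI compiler wrappers described below.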

Hybrid heterogeneous HPC clusters, in which accelerators such as GPUs, FPGAs and many-core coprocessors are used alongside multi-core CPUs, are becoming popular for solving complex heterogeneous workloads. The goal of this mixed environment is overall workflow optimization: applications that do not parallelize well on scalar processors can be optimized with the appropriate computation model. This section gives example programs that use compiler pragmas, directives, function calls and environment variables, and covers the compilation and execution of programs that measure memory bandwidth.

Xeon Phi :
MPI Env.     How to run Intel MPI on Xeon Phi
    Running natively on the Xeon Phi coprocessor (Coprocessor-only model)
    Running symmetrically on both the Xeon host & the Xeon Phi (Symmetric model)
    Running on the Xeon host & offloading computations to the Xeon Phi (Offload model)

Launch : Types of MPI-OpenMP Jobs     Compilation & Execution

Using MPI to Measure Memory Bandwidth on Xeon-Phi

Example 1 : Calculation of Memory Bandwidth using the OpenMP framework on Intel Xeon-Phi

Example 2 : Calculation of Memory Bandwidth using the MPI-OpenMP framework on Intel Xeon-Phi

Example 3 : Calculation of Memory Bandwidth using the OpenMP framework on a Xeon host multi-core system

References


MPI Prog. Env

MPI (Message Passing Interface) is a standard specification for message-passing libraries. MPI makes it relatively easy to write portable parallel programs. MPI provides message-passing routines for exchanging all the information needed to allow a single MPI implementation to operate in a heterogeneous environment. MPI-2 adds new areas to the message-passing model, such as parallel I/O, remote memory operations, and dynamic process management. In addition, MPI-2 introduces a number of features designed to make all of MPI more robust and convenient to use, such as external interface specifications, C++ and Fortran 90 bindings, support for threads, and mixed-language programming.
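The remote memory operations mentioned above can be illustrated with a minimal MPI-2 one-sided sketch (not from the original material, and assuming at least two ranks): rank 0 writes directly into a window exposed by rank 1, without rank 1 posting a receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank exposes one int as a one-sided communication window */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* put 'value' into rank 1's buf without a matching receive */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1: buf = %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}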

MPI 3.0 standardization efforts and research work on hybrid programming (treating threads as MPI processes, dynamic thread levels) are ongoing. Current multi-core and future many-core processors require extended MPI facilities for dealing with threads, and point-to-point and collective communications will be further tuned for multi-core and many-core processors. MPI supports the two commonly used programming paradigms, SPMD (Single Program Multiple Data) and MPMD (Multiple Program Multiple Data).
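The hybrid MPI-OpenMP style used by the examples on this page can be sketched as follows (a minimal, illustrative program, not the course source): MPI_Init_thread requests MPI_THREAD_FUNNELED because only the master thread makes MPI calls, and each rank spawns its own OpenMP threads.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, provided;

    /* MPI_THREAD_FUNNELED: only the thread that called MPI_Init_thread calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp master
        printf("rank %d runs %d OpenMP threads\n", rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Such a program is compiled with the MPI wrapper plus the -openmp flag, and the number of threads per rank is controlled with OMP_NUM_THREADS (or MIC_OMP_NUM_THREADS on the coprocessor, as shown later).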

On Intel Xeon-Phi coprocessors, the Intel MPI library is provided together with the Intel compiler environment. MPI implementations provide compiler wrappers (for example mpicc, mpiicc, and mpif90) to simplify the process of building MPI programs, and utilities such as mpiexec.hydra and mpirun to launch them.

More about How to run Intel MPI on Xeon Phi
Access to necessary libraries :

Make sure all MPI libraries are accessible from the Xeon Phi card. There are a couple of ways to do this:

  • Setup an NFS share between the Xeon host, where the Intel MPI Library is installed, and the Xeon Phi coprocessor card.

  • Manually copy all Xeon Phi-specific MPI libraries to the card. More details on which libraries to copy, and where, are available in the Intel MPI Library documentation.


1. Running natively on the Xeon Phi coprocessor (Coprocessor-only model )

Using MPI on Intel Xeon-Phi coprocessors : an Intel Xeon-Phi coprocessor can be considered an independent compute node and is capable of high-performance networking via the coprocessor communication link (CCL). The Intel Xeon-Phi coprocessor is IP addressable and has its own memory domain, separate from the host processors. In this model, MPI jobs are launched only on the coprocessors, which avoids the complex heterogeneity of the symmetric case. In the native model, all MPI ranks run on the Intel Xeon Phi coprocessor card.
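A minimal sketch (illustrative only, not part of the course material) can be used to verify native execution: each rank reports the processor it runs on, and in the coprocessor-only model the output should show names such as ycn209-mic0.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);     /* e.g. ycn209-mic0 in native mode */
    printf("rank %d running on %s\n", rank, name);
    MPI_Finalize();
    return 0;
}

The program is compiled with mpiicc -mmic and launched with mpirun and the MIC-only hostfile, exactly as described in the steps below.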

To run natively on the Xeon Phi coprocessor, the following steps are required.

  • (a). Set up the environment,
  • (b). Compile for the Xeon Phi coprocessor card

    # for the Xeon Phi coprocessor
    [hypack-01] $ mpiicc -mmic -o matrix_multiply.MIC   mult.c

  • (c). Copy the Xeon Phi executable to the cards (This step
            is not required if your host and Xeon-Phi card are NFS-shared.),
  • (d) Launch the application using mpirun.
The following step sets the I_MPI_MIC environment variable:

export I_MPI_MIC=enable

Then prepare the mpi_hosts file; in our examples it contains the following 8 host names, identifying the MIC cards on each node.
ycn209-mic0
ycn209-mic0
ycn210-mic1
ycn210-mic1
ycn211-mic0
ycn211-mic1
ycn212-mic0
ycn212-mic1

To launch the application binary, execute the following command

mpirun -f   mpi_hosts   -n   8   ./run



There are two ways to execute jobs on the Xeon Phi : (1) ssh to the Phi and run the job there, or (2) set the appropriate environment on the Phi and launch from the host.

mpiexec.hydra   -f   hostfile     -perhost   60   -np   120 \
                matrix_multiply.mic



2. Running symmetrically on both the Xeon host and the Xeon Phi coprocessor

Symmetric : an MPI program can be run on a mixture of hosts and coprocessors without source code modification. The program execution is symmetric: the complete program, including the main function, the computational kernel, and MPI, runs on both the host and the coprocessor, so the programming model is logically symmetric. The main idea is to utilize both the Xeon hosts in the cluster and the Xeon Phi coprocessor cards attached to them. Most importantly, the host and coprocessor cores do not have equivalent performance profiles, and their communication performance also differs. In the symmetric model, MPI ranks run on both the Xeon host and the Xeon Phi coprocessor card.
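Because the same source runs on both the host and the coprocessor in the symmetric model, it can be useful to detect at compile time which binary is executing. The following minimal sketch (illustrative, not part of the course material) relies on the __MIC__ macro, which the Intel compiler defines only when building with -mmic:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#ifdef __MIC__
    /* compiled with -mmic: this rank is running on the coprocessor */
    printf("rank %d is running on the Xeon Phi coprocessor\n", rank);
#else
    /* host build: this rank is running on the Xeon host */
    printf("rank %d is running on the Xeon host\n", rank);
#endif

    MPI_Finalize();
    return 0;
}

This kind of check can also be used to give the host and coprocessor ranks different shares of the work, since their performance profiles differ.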

To run symmetrically on both the Xeon host and the Xeon Phi coprocessor, the following steps are required.

  • (a). Set up the environment,
  • (b). Compile for the Xeon Phi coprocessor card and for the Xeon host
            and generate two different sets of binaries.

    # for the Xeon Phi coprocessor
    [hypack-01] $ mpiicc -mmic -o matrix_multiply.MIC   mult.c

    # for the Xeon host
    [hypack-01] $ mpiicc -o matrix_multiply   mult.c

  • (c). Copy the Xeon Phi executables to the cards (This step
            is not required if your host and Xeon-Phi card are NFS-shared.),

    [hypack-01] $ scp ./matrix_multiply.MIC \
                            ycn209-mic0:~/matrix_multiply


  • (d) Launch the application using mpirun.
The following step sets the I_MPI_MIC environment variable:

export I_MPI_MIC=enable

Then prepare the mpi_hosts file; in our examples it contains the following 8 host names, listing each node and its MIC card.
ycn209
ycn209-mic0
ycn210
ycn210-mic0
ycn211
ycn211-mic0
ycn212
ycn212-mic1

To launch the application on one node and its Xeon Phi card (2 MPI processes), execute the following command

mpirun -f   mpi_hosts   -perhost   1   -n   2   ./matrix_multiply




More about how to launch the application using Offload

Offload: in a message-passing cluster the compute nodes are equipped with one or more coprocessors, and each MPI process on each compute node can offload its computational kernel to a coprocessor. In this case, MPI calls cannot be made inside the offload region. In the offload model, all MPI ranks run on the Xeon host, and the application uses offload directives to run parts of the computation on the Intel Xeon Phi coprocessor card.

This is also called the MPI + Offload paradigm; all MPI communication occurs between Xeon hosts only, and offload is used to accelerate individual MPI ranks. The programmer designates code sections (using OpenMP, pthreads, TBB, or Cilk Plus) to run on the Xeon Phi via offload directives. Most importantly, calling MPI functions within an offload region is NOT allowed, and no direct file system access is needed on the Xeon Phi.
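A minimal sketch of the MPI + Offload paradigm (illustrative only; it assumes the Intel compiler with offload support, and the array size N is arbitrary): MPI calls are made on the host, and each rank offloads an OpenMP loop to the coprocessor.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(int argc, char *argv[])
{
    int rank, i;

    MPI_Init(&argc, &argv);                  /* MPI runs on the host only */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* the computational kernel is offloaded; no MPI calls are allowed in here */
    #pragma offload target(mic) in(a, b) out(c)
    {
        int k;
        #pragma omp parallel for
        for (k = 0; k < N; k++)
            c[k] = a[k] + b[k];
    }

    printf("rank %d: offloaded vector add done, c[N-1] = %f\n", rank, c[N - 1]);

    MPI_Finalize();
    return 0;
}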

To compile and run in MPI Offload mode, the first step is to compile the code with offload directives, just as for MPI-OpenMP hybrid applications. The Intel 13 compiler is the default compiler option for the offload build.

To compile the code with offload directives execute the following command

mpiifort -openmp   matrix_multiply.f   -o   matrix_multiply.offload

To explicitly request that no offloading be performed :
mpiifort -openmp   -no-offload   matrix_multiply.f   -o   \
                matrix_multiply.offload


mpiicc -openmp   -no-offload   matrix_multiply.c   -o   \
                matrix_multiply.offload


mpiifort  -openmp   -align array64byte   -offload-option,mic,compiler,"-O3 -vec-report3" \
                matrix_multiply.f   -o   matrix_multiply.offload


To launch the application, execute one of the following commands

mpirun   -host   ycn209   -n   1   ./matrix_multiply.offload

mpiexec.hydra   -f   mpi_hosts   -perhost   1   -np   2   ./matrix_multiply.offload


Resource Availability on the Xeon Phi Coprocessor : coordinating coprocessor resource usage among the MPI ranks is important from a performance point of view. Three commonly used methods are :

  • Running only one MPI rank per host, so there is no chance of multiple ranks offloading to the same Phi coprocessor.

  • Heterogeneous: running multiple MPI ranks per host, but arranging the processes so that only a single rank offloads to any given Phi.

  • Explicit Pinning: setting the offload target on a per-process basis to control where each rank's work is offloaded (see the sketch after this list).
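A minimal sketch of the explicit approach (illustrative only; _Offload_number_of_devices(), _Offload_get_device_number() and the target(mic:n) clause belong to the Intel offload support): each rank selects a coprocessor with a round-robin mapping, so two ranks on the same host do not offload to the same card.

#include <mpi.h>
#include <offload.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, num_devices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    num_devices = _Offload_number_of_devices();   /* coprocessors visible on this host */

    if (num_devices > 0) {
        int device = rank % num_devices;          /* map ranks round-robin onto cards */
        #pragma offload target(mic : device)
        {
            /* this rank's work runs on its own coprocessor */
            printf("offload region executing on coprocessor %d\n",
                   _Offload_get_device_number());
        }
    }

    MPI_Finalize();
    return 0;
}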


Launching Different Types of MPI Jobs

Launch MPI Job Type-1 : The following statements launch two MPI ranks on the host, with each rank offloading to four OpenMP threads on the coprocessor.
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=4
mpirun -n 2 ./vectvect_add

Launch MPI Job Type-2 : The following statements launch two MPI ranks on the host, with each rank offloading its part of the computation to the coprocessor and controlling thread affinity there.
export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY=balanced
mpirun -n 2 ./vectvect_add

Using MPI natively on the coprocessor

Launch MPI Job Type-3 : Compile a native coprocessor binary:  mpiicc  -mmic  -o  vectvect_add.mic  vectvect_add.c

Launch MPI Job Type-4 : The following statements launch two ranks on the host (ycn215) and two ranks on the coprocessor (ycn215-mic0)

export I_MPI_MIC=enable
mpirun  -host  ycn215  -n  2  ./vectvect_add  :  -host  ycn215-mic0  -n  2  ./vectvect_add.mic


Launch MPI Job Type-5 : The following wrapper script allows a job to be launched with a hostfile containing a mixture of Intel Xeon and Intel Xeon Phi nodes; each process selects the host or coprocessor binary according to the architecture it runs on

    #!/bin/sh
    EXE=$1
    shift
    ARCH=`uname -m`
    if [ "$ARCH" = "x86_64" ]; then
        # host
        exec ./$EXE "$@"
    elif [ "$ARCH" = "k1om" ]; then
        # coprocessor
        exec ./$EXE.mic "$@"
    fi


Example. 1 : Calculation of Memory Bandwidth using OpenMP framework on Xeon-Phi



(Download source code :
memory-bdw-openmp-xeon-phi.c;   memory_bdw_execution_openmp_xeon_phi_script.sh;
Makefile_bdw_mpi_openmp_xeon_phi.NATIVE )

  • Objective
  • Measure the memory bandwidth in GB/s and analyze the performance on the Intel Xeon-Phi coprocessor using the OpenMP framework

  • Description
  • Input data, ranging from a few bytes up to a few gigabytes, is copied from one array to another several times, performing a read/write memory access pattern in the main inner loop for the bandwidth calculation. The measured transfer rate is limited by the peak memory bandwidth. For the results, we use large arrays (several megabytes) when calculating bandwidth on the Intel Xeon Phi coprocessor in order to achieve maximum performance. In these computations it is necessary to read and write contiguous blocks of data, minimizing the number of access requests and thereby enabling very high throughput. The OpenMP framework is used to spread the work across threads and cores.

  • Input
  • Input array with Size in Megabytes.

  • Output
  • Prints the amount of data moved (GBytes), the time taken, the number of threads, and the bandwidth in GBytes per second.


// Memory Bandwidth Calculation using OpenMP framework

//
// A simple example that measures copy memory bandwidth on
// on Intel Xeon Phi Co-processors Using OpenMP to scale
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <omp.h>
  #include <sys/time.h>
  //
  //dtime (Wall Clock time ....)
  //

  //Utility routine to return the current wall clock time
  //
double dtime()
{
  double tseconds = 0.0;
  struct timeval mytime;
  gettimeofday(&mytime,(struct timezone*)0);
  tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
  return( tseconds);
}

// Set to float or double
#define REAL double

#define BW_ARRAY_SIZE (1000*1000*64)
#define BW_ITERS 1000
// Number of memory operations each iteration
// 1 read + 1 write
#define OPS_PERITER 2
// define some arrays
// make sure they are 64 byte aligned - for
// fastest cache access

REAL fa[BW_ARRAY_SIZE] __attribute__((aligned(64)));
REAL fb[BW_ARRAY_SIZE] __attribute__((aligned(64)));
REAL fc[BW_ARRAY_SIZE] __attribute__((aligned(64)));

//
// --------------------------------------------------

/* Main Program to Compute Bandwidth */

//
int main(int argc, char*argv[] )
  {
  int i,j,k;
  int numthreads;
  double tstart, tstop, ttime;
  double gbytes = 0.0f;
  REAL a = 1.1f;
  //
  // Initialize the compute arrays
  //
  printf(" Initializing \r\n") ;
  #pragma omp parallel for
  for(i=0; i<BW_ARRAY_SIZE; i++)
  {
    fa[i] = (REAL)i + 0.1f;
    fb[i] = (REAL)i + 0.2f;
    fc[i] = (REAL)i + 0.2f;
  }
  // print the # of threads to be used
  // just display from 1 thread - the "master"
  #pragma omp parallel
  #pragma omp master
  printf("Starting BW Test on %d threads \r\n ",omp_get_num_threads());

  tstart = dtime();
  // use omp to scale the test across
  // the threads requested. Need to set environment
  // variable OMP_NUM_THREADS and KMP_AFFINITY
  for (i=0; i<BW_ITERS; i++)
  {
      //
      // copy the arrays to/from memory (2 bw ops)
      // use openmp to scale and get aggregate bw
      //
        #pragma omp parallel for
          for(k=0; k<BW_ARRAY_SIZE; k++)
          {
          fa[k] = fb[k];
       }
        }
  tstop = dtime();
  // # of gigabytes we just moved
  gbytes = (double)( 1.0e-9 * OPS_PERITER * BW_ITERS *
  BW_ARRAY_SIZE*sizeof(REAL));
  // elapsed time
  ttime = tstop - tstart;
  //
  // Print the results
  //
  if( (ttime) > 0.0f)
  {
      printf("Gbytes = %10.31f,Secs = %10.31f," "Gbytes per sec - %10.31f \r\n ",
      gbytes, ttime, gbytes/ttime);
  }
  return( 0 );
}


Example. 2 : Calculation of Memory Bandwidth using MPI-OpenMP framework on Xeon-Phi
(Download source code :
memory-bdw-mpi-openmp-xeon-phi.c;   memory_bdw_execution_mpi_openmp_xeon_phi_script.sh;
mpi_hosts_xeon_phi;   Makefile_bdw_mpi_openmp_xeon_phi.NATIVE )
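Since the source for Example 2 is provided only as a download, the following minimal sketch (illustrative, with array size and iteration count borrowed from Example 1, not the downloaded code itself) shows the general structure: each MPI rank times the same OpenMP copy loop locally, and MPI_Reduce sums the per-rank bandwidth into an aggregate figure on rank 0.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <sys/time.h>

#define N (1000*1000*64)
#define ITERS 100

static double fa[N] __attribute__((aligned(64)));
static double fb[N] __attribute__((aligned(64)));

static double dtime(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + (double)t.tv_usec * 1.0e-6;
}

int main(int argc, char *argv[])
{
    int rank, i, k;
    double t0, t1, gbytes, local_bw, total_bw;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for
    for (i = 0; i < N; i++) { fa[i] = i + 0.1; fb[i] = i + 0.2; }

    t0 = dtime();
    for (i = 0; i < ITERS; i++) {
        #pragma omp parallel for
        for (k = 0; k < N; k++)
            fa[k] = fb[k];                      /* 1 read + 1 write per element */
    }
    t1 = dtime();

    gbytes = 1.0e-9 * 2.0 * ITERS * (double)N * sizeof(double);
    local_bw = gbytes / (t1 - t0);

    /* aggregate copy bandwidth over all MPI ranks */
    MPI_Reduce(&local_bw, &total_bw, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Aggregate copy bandwidth over all ranks : %.2f GB/s\n", total_bw);

    MPI_Finalize();
    return 0;
}

The sketch is launched in the same way as the other native runs, e.g. with mpirun -f mpi_hosts after compiling with mpiicc -mmic -openmp.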


Example. 3 : Calculation of Memory Bandwidth using OpenMP framework on Xeon Host Multi-Core System
(Download source code :
memory-bdw-mpi-openmp-xeon-host.c;   memory_bdw_execution_mpi_openmp_xeon_host_script.sh;
mpi_hosts_xeon_host;   Makefile_bdw_mpi_openmp_xeon.HOST )



Centre for Development of Advanced Computing