hyPACK-2013 HPC GPU Cluster with Xeon Phi Co-processors - Heterogeneous Programming
The Message Passing programming paradigm is one of the most widely used approaches for programming parallel computers. The standard Message Passing Interface (MPI) library is commonly used for such applications and provides bindings for several programming languages.
In a message passing cluster, MPI processes are launched across several cluster nodes connected by a suitable interconnect.
Two key attributes characterize the message-passing programming paradigm. The first is that it assumes a partitioned address space, and the second is that it supports only explicit parallelism. The logical view of a machine supporting the message-passing paradigm consists of p processes, each with its own exclusive address space.
Hybrid heterogeneous HPC clusters, in which accelerators based on CPUs, GPUs, and FPGAs are used, are becoming popular for solving complex heterogeneous workloads. These systems can address some of the heterogeneous computing workloads. The goal of this mixed environment is total workflow optimization: applications that do not parallelize well on scalar processors can be optimized with the appropriate computation model.
This section presents example programs using compiler pragmas, directives, function calls, and environment variables, along with the compilation and execution of programs for the calculation of memory bandwidth.
MPI Programming Environment
MPI (Message Passing Interface) is a standard specification for message passing libraries. MPI makes it relatively easy to write portable parallel programs, and it provides message-passing routines for exchanging all the information needed to allow a single MPI implementation to operate in a heterogeneous environment. MPI-2 added new areas to the message-passing model such as parallel I/O, remote memory operations, and dynamic process management. In addition, MPI-2 introduced a number of features designed to make all of MPI more robust and convenient to use, such as external interface specifications, C++ and Fortran 90 bindings, support for threads, and mixed-language programming.
MPI 3.0 standardization efforts and research work on hybrid programming (treating threads as MPI processes, dynamic thread levels) are ongoing. Current multi-core and future many-core processors require extended MPI facilities for dealing with threads, and point-to-point and collective communications will be further tuned for multi-core and many-core processors.
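Because hybrid MPI plus threads programs appear repeatedly below, the following minimal sketch (illustrative only, not part of the hyPACK material; the choice of MPI_THREAD_FUNNELED is an assumption for this example) shows how an MPI program requests a thread-support level with MPI_Init_thread before spawning OpenMP threads.

/* hybrid_init.c - illustrative sketch: request MPI thread support for a
   hybrid MPI + OpenMP program (MPI_THREAD_FUNNELED means only the master
   thread makes MPI calls). */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("Warning: requested thread level not available\n");

    #pragma omp parallel
    {
        /* Threaded compute region; MPI calls stay outside it. */
        #pragma omp master
        printf("Rank %d uses %d OpenMP threads\n",
               rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}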
MPI supports the two commonly used MPI programming paradigms: SPMD (Single Program Multiple Data) and MPMD (Multiple Program Multiple Data).
On Intel Xeon Phi coprocessors, the Intel MPI library is provided together with the Intel compiler environment. MPI implementations provide compiler wrappers (for example mpicc, mpiicc, and mpif90) to simplify the process of building MPI programs, and utilities such as mpiexec.hydra and mpirun to launch MPI programs.
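As a minimal illustration of the SPMD paradigm, the sketch below (illustrative only; the file name mpi_hello.c is an assumption) is an MPI C program in which every rank executes the same code; it can be built with one of the compiler wrappers above and launched with mpirun.

/* mpi_hello.c - minimal SPMD sketch (not part of the hyPACK codes).
   Every rank runs the same program and reports its rank and size. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

For example, it can be compiled for the host with mpiicc -o mpi_hello mpi_hello.c, for the coprocessor with mpiicc -mmic -o mpi_hello.MIC mpi_hello.c, and launched with mpirun -n 4 ./mpi_hello.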
More about How to run Intel MPI on Xeon Phi
Access to necessary libraries :
Make sure all MPI libraries are accessible from the Xeon Phi card. There are a couple of ways to do this:
- Set up an NFS share between the Xeon host where the Intel MPI Library is installed and the Xeon Phi coprocessor card.
- Manually copy all Xeon Phi-specific MPI libraries to the card. More details on which libraries to copy and where are available here.
1. Running natively on the Xeon Phi coprocessor (Coprocessor-only model)
Using MPI on Intel Xeon-Phi Coprocessors :
The Intel Xeon Phi coprocessor can be considered an independent compute node and is capable of high-performance networking via the coprocessor link (CCL). The Intel Xeon Phi coprocessor is IP-addressable and has its own memory domain, separate from the host processors. In this model, MPI jobs are launched only on the coprocessors, which avoids the complex heterogeneity of the symmetric case. In the native model, all MPI ranks run on the Intel Xeon Phi coprocessor card.
To run natively on the Xeon Phi coprocessor, the important steps given below are required.
- (a) Set up the environment.
- (b) Compile for the Xeon Phi coprocessor card:
# for the Xeon Phi coprocessor
[hypack-01] $ mpiicc -mmic -o matrix_multiply.MIC mult.c
- (c) Copy the Xeon Phi executable to the cards (this step is not required if your host and Xeon Phi card are NFS-shared).
- (d) Launch the application using mpirun.
The following step sets up the I_MPI_MIC environment variable:
export I_MPI_MIC=enable
Then prepare the mpi_hosts file; in our examples, write the following 8 host names indicating node and MIC card information:
ycn209-mic0
ycn209-mic0
ycn210-mic1
ycn210-mic1
ycn211-mic0
ycn211-mic1
ycn212-mic0
ycn212-mic1
To launch the application binary, execute the following command:
mpirun -f mpi_hosts -n 8 ./run
There are two ways to execute jobs on the Xeon Phi: (1) ssh to the Phi, or (2) set the appropriate environment on the Phi.
mpiexec.hydra -f hostfile -perhost 60 -np 120 \
matrix_multiply.mic
2. Running symmetrically on both the Xeon host and the Xeon Phi coprocessor
Symmetric :
An MPI program can be run on a mixture of hosts and coprocessors without source code modification. The program execution is symmetric: the complete program, including the main function, the computational kernel, and MPI, runs on both the host and the coprocessor, so the programming model is logically symmetric. The main idea is to utilize both the Xeon hosts in your cluster and the Xeon Phi coprocessor cards attached to them. Most importantly, the host and coprocessor cores do not have equivalent performance profiles, nor equivalent communication performance. In the symmetric model, MPI ranks run on both the Xeon host and the Xeon Phi coprocessor card.
To run symmetrically on both the Xeon host and the Xeon Phi coprocessor, the important steps given below are required.
- (a) Set up the environment.
- (b) Compile for the Xeon Phi coprocessor card and for the Xeon host, generating two different sets of binaries:
# for the Xeon Phi coprocessor
[hypack-01] $ mpiicc -mmic -o matrix_multiply.MIC mult.c
# for the Xeon host
[hypack-01] $ mpiicc -o matrix_multiply mult.c
- (c) Copy the Xeon Phi executables to the cards (this step is not required if your host and Xeon Phi card are NFS-shared):
[hypack-01] $ scp ./matrix_multiply.MIC \
ycn209-mic0:~/matrix_multiply
- (d) Launch the application using mpirun.
The following step sets up the I_MPI_MIC environment variable:
export I_MPI_MIC=enable
Then prepare the mpi_hosts file; in our examples, write the following 8 host names indicating node and MIC card information:
ycn209
ycn209-mic0
ycn210
ycn210-mic0
ycn211
ycn211-mic0
ycn212
ycn212-mic1
To launch the application binary on one node with one Xeon Phi card (2 MPI processes), execute the following command:
mpirun -f mpi_hosts -perhost 1 -n 2 ./matrix_multiply
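Because host and coprocessor ranks do not have equivalent performance profiles, a common pattern in the symmetric model is to let each rank discover where it is running and weight its share of the work accordingly. The sketch below is illustrative only; the test for the substring "mic" in the processor name is an assumption based on the host names used above.

/* symmetric_check.c - illustrative sketch for the symmetric model
   (not part of the hyPACK codes). Each rank checks whether it runs on a
   coprocessor and could size its workload accordingly. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen, on_mic;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    /* Heuristic: MIC cards appear with names like ycn209-mic0. */
    on_mic = (strstr(name, "mic") != NULL);

    printf("Rank %d of %d on %s (%s)\n", rank, size, name,
           on_mic ? "coprocessor" : "host");

    /* A real application would adjust its domain decomposition here. */

    MPI_Finalize();
    return 0;
}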
More about how to launch an application using Offload
Offload:
In a message passing cluster, the compute nodes are equipped with multiple coprocessors, and each MPI process on a compute node can offload its computational kernel to a coprocessor. In this case, MPI calls cannot be made in an offload region. In the offload model, all MPI ranks run on the Xeon host, and the application uses offload directives to run selected code on the Intel Xeon Phi coprocessor card. This is also called the MPI + Offload paradigm, and all MPI communication occurs between Xeon hosts only.
Offload is used to accelerate MPI ranks: the programmer designates (OpenMP, pthreads, TBB, or Cilk Plus) code sections to run on the Xeon Phi using offload directives. Most importantly, calling MPI functions within an offload region is NOT allowed, and no direct file system access is needed on the Xeon Phi.
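The sketch below illustrates this structure (illustrative only; it assumes the Intel compiler's offload pragmas and a hypothetical vector-add kernel): MPI calls stay on the host, and only the OpenMP kernel inside the offload region runs on the coprocessor.

/* mpi_offload_sketch.c - illustrative MPI + Offload structure
   (not part of the hyPACK codes). No MPI calls appear inside the
   offload region. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char *argv[])
{
    int i, rank;
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);               /* MPI runs on the host only */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Offload the compute kernel to the coprocessor. */
    #pragma offload target(mic) in(a, b : length(N)) out(c : length(N))
    {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("Rank %d: c[N-1] = %f\n", rank, c[N - 1]);

    MPI_Finalize();
    free(a); free(b); free(c);
    return 0;
}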
To compile and run in MPI Offload mode, the first step is to compile the code with offload directives, which is the same as for MPI-OpenMP hybrid applications. The Intel 13 compiler is the default compiler option for the offload build.
To compile the code with offload directives, execute the following command:
mpiifort -openmp matrix_multiply.f -o matrix_multiply.offload
To explicitly request no offloading:
mpiifort -openmp -no-offload matrix_multiply.f -o \
matrix_multiply.offload
mpiicc -openmp -no-offload matrix_multiply.c -o \
matrix_multiply.offload
mpiicc -openmp -align array64byte -offload-option,mic,compiler,\
"-O3 -vec-report3" matrix_multiply.c -o matrix_multiply.offload
To launch the application, execute one of the following commands:
mpirun -host ycn209 -n 1 ./matrix_multiply.offload
mpiexec.hydra -f mpi_hosts -perhost 1 -np 2 ./matrix_multiply.offload
Resource Availability on Xeon Phi Coprocessor :
Coordinating coprocessor resource usage among the MPI ranks is important from a performance point of view. Three commonly used methods are listed here (see the sketch after this list):
- Run only one MPI rank per host, so there is no chance of multiple ranks offloading to the same Phi coprocessor.
- Heterogeneous: run multiple MPI ranks per host, but arrange the processes so that only a single rank offloads to any given Phi.
- Explicit pinning: set the pinning on a per-process basis to control where each rank's work is offloaded.
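As an illustration of coordinating offload targets explicitly, the sketch below maps each MPI rank to a coprocessor in round-robin fashion. It is illustrative only: the number of coprocessors per node (num_mics) and the modulo mapping are assumptions, not part of the hyPACK material.

/* offload_target_sketch.c - illustrative per-rank coprocessor selection
   (not part of the hyPACK codes). Ranks sharing a host spread their
   offloads across the available cards. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, device;
    int num_mics = 2;            /* assumed number of cards per node */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    device = rank % num_mics;    /* pick a card for this rank */

    /* target(mic:device) selects the coprocessor to offload to. */
    #pragma offload target(mic:device)
    {
        printf("Rank %d offloaded work to coprocessor %d\n", rank, device);
    }

    MPI_Finalize();
    return 0;
}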
About Launching Different MPI Jobs
Launch MPI Job Type-1 :
The following statements launch two MPI ranks on the host, with each rank offloading to four OpenMP threads:
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=4
mpirun -n 2 ./vectvect_add
Launch MPI Job Type-2 :
The following statements launch two MPI ranks on the host, with each rank offloading its part of the computation to the coprocessor:
export MIC_ENV_PREFIX=MIC
export MIC_KMP_AFFINITY=4
mpirun -n 2 ./vectvect_add
Using MPI natively on the coprocessor
Launch MPI Job Type-3 :
mpicc -mmic -o vectvect.mic vectvect_add.c
Launch MPI Job Type-4 :
The following statements launch two ranks on the host (ycn215) and two ranks on the coprocessor (ycn215-mic0):
export I_MPI_MIC=enable
mpirun -host ycn215 -n 2 ./vectvect_add : -host ycn215-mic0 -n 2 ./vectvect_add
Launch MPI Job Type-5 :
The following statements launch a job using a hostfile containing a mixture of Intel Xeon and Intel Xeon Phi nodes. The wrapper script below selects the binary that matches the architecture it runs on:
#!/bin/sh
EXE=$1
shift
ARCH=`uname -m`
if [ "$ARCH" = "x86_64" ]; then
    # host
    exec ./$EXE "$@"
elif [ "$ARCH" = "k1om" ]; then
    # coprocessor
    exec ./$EXE.mic "$@"
fi
Example. 1 :
Calculation of Memory Bandwidth using OpenMP framework on Xeon-Phi
(Download source code :
memory-bdw-openmp-xeon-phi.c;
memory_bdw_execution_openmp_xeon_phi_script.sh;
Makefile_bdw_mpi_openmp_xeon_phi.NATIVE )
- Objective
Extract performance in GB/s for Memory Bandwidth
Calculation and analyze the performance on Intel Xeon-Phi coprocessor
using OpenMP framework
- Description
Input data ranging from 1 byte to a few GB is copied from one array to another several times, performing a read/write memory access pattern in the main inner loop for the bandwidth calculation. The memory transfer rate is limited by the peak memory bandwidth. For the results, we use large data sizes (a few MB) for the bandwidth calculation on the Intel Xeon Phi coprocessor in order to achieve the maximum performance.
In these computations, it is necessary to read and write contiguous blocks of data, minimizing access requests and thereby enabling very high throughput. The OpenMP framework is used to distribute the work across threads and cores.
- Input
Input array with Size in Megabytes.
- Output
Prints the Bandwidth (Gbytes), time taken, no. of threads, Gbytes per sec.
// Memory Bandwidth Calculation using OpenMP framework
//
// A simple example that measures copy memory bandwidth on
// on Intel Xeon Phi Co-processors Using OpenMP to scale
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>
//
//dtime (Wall Clock time ....)
//
//Utility routine to return the current wall clock time
//
double dtime()
{
double tseconds = 0.0;
struct timeval mytime;
gettimeofday(&mytime,(struct timezone*)0);
tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);
return( tseconds);
}
// Set to float or double
#define REAL double
#define BW_ARRAY_SIZE (1000*1000*64)
#define BW_ITERS 1000
// Number of memory operations each iteration
// 1 read + 1 write
#define OPS_PERITER 2
// define some arrays
// make sure they are 64 byte aligned - for
// fastest cache access
REAL fa[BW_ARRAY_SIZE] __attribute__((aligned(64)));
REAL fb[BW_ARRAY_SIZE] __attribute__((aligned(64)));
REAL fc[BW_ARRAY_SIZE] __attribute__((aligned(64)));
//
// --------------------------------------------------
/* Main Program to Compute Bandwidth */
//
int main(int argc, char*argv[] )
{
int i, k;
double tstart, tstop, ttime;
double gbytes = 0.0;
//
// Initialize the compute arrays
//
printf(" Initializing \r\n");
#pragma omp parallel for
for(i=0; i<BW_ARRAY_SIZE; i++)
{
fa[i] = (REAL)i + 0.1f;
fb[i] = (REAL)i + 0.2f;
fc[i] = (REAL)i + 0.2f;
}
// print the # of threads to be used
// just display from 1 thread - the "master"
#pragma omp parallel
#pragma omp master
printf("Starting BW Test on %d threads \r\n ",omp_get_num_threads());
tstart = dtime();
// use omp to scale the test across
// the threads requested. Need to set environment
// variable OMP_NUM_THREADS and KMP_AFFINITY
for (i=0; i<BW_ITERS; i++)
{
//
// copy the arrays to/from memory (2 bw ops)
// use openmp to scale and get aggregate bw
//
#pragma omp parallel for
for(k=0; k<BW_ARRAY_SIZE; k++)
{
fa[k] = fb[k];
}
}
tstop = dtime();
// # of gigabytes we just moved
gbytes = (double)( 1.0e-9 * OPS_PERITER * BW_ITERS *
BW_ARRAY_SIZE*sizeof(REAL));
// elapsed time
ttime = tstop - tstart;
//
// Print the results
//
if( (ttime) > 0.0f)
{
printf("Gbytes = %10.31f,Secs = %10.31f," "Gbytes per sec - %10.31f \r\n ",
gbytes, ttime, gbytes/ttime);
}
return( 0 );
}
Example. 2 :
Calculation of Memory Bandwidth using MPI-OpenMP framework on Xeon-Phi
(Download source code :
memory-bdw-mpi-openmp-xeon-phi.c;
memory_bdw_execution_mpi_openmp_xeon_phi_script.sh;
mpi_hosts_xeon_phi
Makefile_bdw_mpi_openmp_xeon_phi.NATIVE )
Example. 3 :
Calculation of Memory Bandwidth using OpenMP framework on Xeon Host Multi-Core System
(Download source code :
memory-bdw-mpi-openmp-xeon-host.c;
memory_bdw_execution_mpi_openmp_xeon_host_script.sh;
mpi_hosts_xeon_host
Makefile_bdw_mpi_openmp_xeon.HOST )