Prog.on Multi-Core Processors with Intel Xeon Phi Coprocessors
|
MPI Prog. Env
MPI (Message Passing Interface) is a standard specification for message
passing libraries. MPI makes it relatively easy to write portable parallel programs. MPI does provide
message-passing routines for exchanging all the information needed to allow a single MPI implementation
to operate in a heterogeneous environment. The MPI-2 has new areas for message-passing model such as parallel I/O, remote memory operations, and
dynamic process management. In addition, MPI-2 introduces a number of features designed to make all
of MPI more robust and convenient to use, such as external interface specifications, C++ and fortran-90
bindings, support for threads, and mixed-language programming.
MPI 3.0 Standardization efforts and research work on hybrid programming (treating threads as MPI Processes,
Dynamic thread levels) is going on. The current multi- and future many-core processors require extended
MPI facilities for dealing with threads. The efforts on point-to-point and collective communications
will be further tuned on multi-core and many-core processors.
MPI support two types of commonly used MPI programming Paradigms such as SPMD (Single Program Multiple
Data) and MPMD (Multiple Program Multiple Data).
On Intel Xeon-Phi coprocessors, Intel MPI library with Intel Compiler environment is provided and
MPI implementattions provide compiler wrappers (for example
mpicc ,
mpiicc , and
mpif90 ) to simplify
the process of binding MPI programs and a utilities ush as
mpiexec.hydra &
mpirun
to launch MPI program.
|
|
More about How to run Intel MPI on Xeon Phi
|
Access to necessary libraries :
Make sure all MPI libraries are accessible from the Xeon Phi card. There are a couple of ways to
do this:
-
Setup an NFS share between the Xeon host where the Intel MPI Library is installed, and the Xeon Phi corprossesor card.
-
Manually copy all Xeon Phi-specific MPI libraries to the card. More details on which libraries to copy and where are available here.
|
|
|
|
1. Running natively on the Xeon Phi coprocessor (Coprocessor-only model )
|
Using MPI on Intel Xeon-Phi Coprocessors :
Intel Xeon-Phi coprocessor can be considered as a independent compute node and is capable of high
performance networking via the coprocessor link (CCL). Intel Xeon-Phi coprocessor is IP addressable
and have own memory domain, also seperate from the host processors.
In this, MPI jobs are launched only on the coprocessors and it
avoids the complex heterogenity of the symmetric case.
The native model where all MPI ranks are run on the Intel Xeon Phi coprocessor card.
To run natively on the Xeon Phi Coprocessor, given below important step are required.
-
(a). Set up the environment,
-
(b).
Compile for the Xeon Phi coprocessor card
# for the Xeon Phi coprocessor
[hypack-01] $
mpiicc -mmic -o matrix_multiply.MIC mult.c
-
(c). Copy the Xeon Phi executable to the cards (This step
is not required if your host and Xeon-Phi card are NFS-shared.),
-
(d) Launch the application using mpirun.
Following step is required for setting up the I_MPI_MIC environment variable.
export I_MPI_MIC = enable
and prepare the mpi_hosts file and in our examples
write the following 8 host names indicating node and mic cards information.
ycn209-mic0
ycn209-mic0
ycn210-mic1
ycn210-mic1
ycn211-mic0
ycn211-mic1
ycn212-mic0
ycn212-mic1
To launch the application binary run, execute the following command
mpirun -f mpi_hosts -n 8 ./run
There are two ways to execute jobs on Xeon Phi : (1) is ssh to Phi and (2) Set approriate environment
on the Phi.
mpiexec.hydra -f hostfile -perhost 60 -np 120 \
matrix_multiply.mic
|
|
|
|
2. Running symmetrically on both the Xeon host and the Xeon Phi coprocessor
|
Symmetric :
MPI program can be run on a mixture of hosts and coprocessors without source code modification. The program execution is symmetric - the complete program, the main function,
the computational kernel,
and MPI, runs on both the host and the coprocessor. The programming model is logically
symmetric. The main idea is to utilize both the Xeon hosts on your cluster, and the Xeon Phi coprocessor
cards attached to them.
Most importantly, the host and coprocessor cores do not have equivalent performance
profiles
as well as the communication performance. The symmetric model where MPI ranks are run on both the Xeon host and the Xeon Phi coprocessor card.
To run ymmetrically on both the Xeon host and the Xeon Phi coprocessor,
given below important step are required.
-
(a). Set up the environment,
-
(b). Compile for the Xeon Phi coprocessor card and for the Xeon host
and generate two different sets of binaries.
# for the Xeon Phi coprocessor
[hypack-01] $
mpiicc -mmic -o matrix_multiply.MIC mult.c
# for the Xeon host
[hypack-01] $
mpiicc -o matrix_multiply mult.c
-
(c). Copy the Xeon Phi executables to the cards (This step
is not required if your host and Xeon-Phi card are NFS-shared.),
[hypack-01] $
scp ./matrix_multiply.MIC \
ycn209-mic0:~/matrix_multiply
-
(d) Launch the application using mpirun.
Following step is required for setting up the I_MPI_MIC environment variable.
export I_MPI_MIC = enable
and prepare the mpi_hosts file and in our examples
write the following 8 host names indicating node and mic cards information.
ycn209
ycn209-mic0
ycn210
ycn210-mic0
ycn211
ycn211-mic0
ycn212
ycn212-mic1
To launch the application binary run on one node with one xeon card
(2 MPI process),
execute the following command
mpirun -f mpi_hosts -perhost 1 -n 2 ./matrix_multiply
|
|
|
More about How to run launch application using Offload
|
Offload:
In a message passing cluster, the compute nodes are equipped with multiple
coprocessors and each MPI process on each compute node can offload the computational kernel
to the coprocessor.
In this case, MPI calls cannot be made in offload region.
The offload model where all MPI ranks are run on the main Xeon host, and the application utilizes
Offload directives to run on the Intel Xeon Phi corpocessor card.
This is also called as MPI + Offload paradigm and all MPI communications occur between
Xeon Hosts only.
Offload used to accelerate MPI ranks. Programmers designates (OpenMP, pthreads, TBB, or CilkPlus)
code sessions to run on Xeon Phi using offload directives.
Most importantly, calling MPI functions within an offload region is NOT allowed and No direct
file system access needed on Xeon Phi.
To Compile and Run in MPI Offload mode, first step is compile the code with offload directives, which
is same as with MPI-OpenMP hybrid applications. The Intel13 Compiler is the default complier
option for the offload build.
To compile the code with offload directives
execute the following command
mpiifort -openmp matrix_multiply.f -o matrix_multiply.offload
To request explciit no offloading :
mpiifort -openmp -no-offload matrix_multiply.f -o \
matrix_multiply.offload
mpiicc -openmp -no-offload matrix_multiply.f -o \
matrix_multiply.offload
mpiicc -openmp -align array64byte -offload-option,mic,compiler, \
"-O3 -vec-report3" -o matrix_multiply.offload
To launch application
execute the following command
mpirun -host ycn209 -n 1
./matrix_multiply.offload
mpiiexec.hydra -f -perhost 1 -np 2
./matrix_multiply.offload
Resource Availability on Xeon Coprocessor :
To coordinate coprocessor resource usage among the MPI
ranks is an important from performance point of view. Three methods
commonly useful are :
-
Only running one MPI rank per host, so there is no chance of
multiple ranks offload to the same Phi coprocessor
-
Heterogeneous: Running multiple MPI ranks per host but
arranging processors to allow only single rank offload to the same
Phi.
-
Explicit Pinning: Setting the pinning on a per-process basis to
allow control of where each thread is offloaded.
|
|
|
About Launching of Different MPI Jobs
Launch MPI Job Type-1 :
The following statements describe launch of two MPI ranks on the host with each rank offloaing
for OpenMP threads
export MIC_ENV_PREFIX = MIC
export MIC_OMP_NUM_THREADS = 4
mpirun -n 2 ./vectvect_add
Launch MPI Job Type-2 :
The following statements describe launch of two MPI ranks on the host with each rank
offloading its part of the computations to the coprocessor.
export MIC_ENV_PREFIX = MIC
export MIC_KMP_AFFINITY = 4
mpirun -n 2 ./vectvect_add
Using MPI natively on the coprocessor
Launch MPI Job Type-3 :
mpicc -mmic -o vectvect.mic vectvect_add.c
Launch MPI Job Type-4 :
The following statements launches to ranks on the host (ycn216) and two ranks on the coporcessor (ycn215-mic0)
# export I_MPI_MIC = enable
mpirun -host yn215 -n 2 ./vectvect_add : -host
ycn215 -mic0 n 2 ./vectvect_add
Launch MPI Job Type-5 :
The following statements launch a job using a hostfile containing a mixture of Intel Xeon
and Intel Xeon Phi nodes
#!/bin/sh
EXE=$1
shift
ARCH= uname -m
if [ $ARCH == 'x86_64' ] : then
#host
exec $(EXE) $ =
elif [ $ARCH == 'klpm']; then
#coprocessor
exec $ [EXE].mic $=
fi
|
|
|
|
Example. 1 :
Matrix Matrix Multiplication using MPI -OpenMP framework on Xeon-Phi (Offload)
(Download source code :
matrix-matrix-multiply-mpi-openmp-offload.c;
env_var_setup_mpi_openmp_offload.sh ;
mpi_hosts_xeon_phi_offload
Makefile_mat_mat_mpi_openmp,Offload )
Objective
Input
Description
Output
|
- Objective
Extract performance in G/flops for Matrix Matrix Multiply
and analyze the performance on Intel Xeon system with Xeon Phi Coprocessor
based on MPI-OpenMP.
- Description
Two input matrices are filled with real data and matrix-matrix Multiply
is performed using Xeon-Phi Offload pragma with MPI - OpenMP framework.
This example demonstrates the use of MPI on Xeon Host and offload computation of
matrix-matrix multiply using OpenMP on Xeon-Phi Programming features
to obtain the maximum achievable
performance.
In implementation, every thread will work its own row array section of input Matrix_A and multiply with
each column of Matrix_B, resulting to give an output array of resulting matrix Matrix_C. It is
assumed that the size of the two input square matrices is divisible by number of threads.
The optimisation features on Intel Xeon-Phi are not explored.
- Input
Number of threads , Size of the Matrices.
- Output
Prints the time taken for computation of Output matrix and G/flops and the number of threads.>
|
|
|
Compilation and Execution :
The compilation and execution of a program to run in Offline mode is
is shown below.
# Compile to run Offload on the Xeon Phi :
make -f Makefile.OFFLOAD
(Download source code : OFFLOAD):
Makefile_mat_mat_mpi_openmp.Offload )
# Execution on the Xeon Phi :
mpirun -machinefile mpi_hosts_xeon_phi -np 8 ./run 16000 236 4
|
|
|
|
Example. 2 :
Matrix Matrix Multiplication using MPI -OpenMP framework on Xeon-Phi (NATIVE)
(Download source code :
matrix-matrix-multiply-mpi-openmp-native.c;
env_var_setup_mpi_openmp_native.sh ;
mpi_hosts_xeon_phi_native
Makefile_mat_mat_mpi_openmp,native )
|
|
|
|
Example.3 :
Matrix Matrix Multiplication using MPI -OpenMP framework (PREFIX Option)
(Download source code :
mat_mat_multiply_mpi_openmp_prefix.zip)
Using Xeon Phi prefixes and extensions for Intel MPI jobs in NFS shared environment
|
|
To run natively on the Xeon Phi Coprocessor with Prefix option, below few things are required,
Saving all Intel Xeon Phi coprocessor executables in a sub-directory
-
1. Create all executables which will be run on the Xeon Hosts
-
2. Creat a special directory where all Xeon Phi executables will liave and pouplate it :
# Creat s special directory
[hypack-01] $
mkdir MIC
mpirun -f mpi_hosts -perhost 1 -n 2 ./test.exe
-
3. Launch applicaiton by setting the
I_MPI_MIC_PREFIX environment variable.
This will automatically adds a prefix (e.g.,
./MIC to the executable when the
mpirun script tuns the MPI job on the Xeon Phi coporcessor cards.
[hypack-01] $
export I_MPI_MIC = enable
[hypack-01] $
export I_MPI_MIC_PREFIX = ./MIC/
mpirun -f mpi_hosts -perhost -perhost 1 -n 2 ./test.exe
|
|
|
|
Example.4 :
Matrix Matrix Multiplication using MPI -OpenMP framework (POSTFIX Option)
(Download source code :
mat_mat_multiply_mpi_openmp_postfix.zip)
Using a Xeon Phi-specifc extension (POSTFIX) for your Xeon Phi-specific executables
To run natively on the Xeon Phi Coprocessor with POSTFIX option, below few things are required,
-
1. Create all executables which will be run on the Xeon Hosts
-
2. No need of creating a special directory. Attach an extension to the Xeon Phi
coprocessor executables during creation
# Creat s special directory
[hypack-01] $
mkdir MIC
mpirun -mmic test.c -o ./test.exe.mic
-
3. Launch applicaiton by setting the
I_MPI_MIC_POSTFIX environment variable.
This will automatically attach the sepficed postfix (e.g.,
.MIC to the executable when the
mpirun script tuns the MPI job on the Xeon Phi coporcessor cards.
[hypack-01] $
export I_MPI_MIC = enable
[hypack-01] $
export I_MPI_MIC_POSTFIX = .mic
mpirun -f mpi_hosts -perhost -perhost 1 -n 2 ./test.exe
|
|
|
|
| | | | |