



Prog. on Intel Xeon Phi Coprocessors (OpenMP 4.0)

Example programs using compiler pragmas, directives, function calls, and environment variables; compilation and execution of OpenMP 4.0 programs; and programs for numerical and non-numerical computations are discussed.

OpenMP 4.X on Intel Xeon Phi             Document : OpenMP Version 4.0 July 2013 (pdf)

Compilation :
    (Sequential) : Vectorize & No-Vectorize    

OpenMP Compilation :     OpenMP

OpenMP Execution :     Set Up Run time Prog. Env.     Execution     Script    

Offload Information : Compiler Offload Pragma & Report     Compiler Offload Clauses

Tuning & Performance : KMP Thread Affinity     Memory Alignment     KMP-Script

Summary

Matrix Computation Codes

Example 1 : Reduction Operation for one dimensional integer array

Example 2 : Reduction Operation for two dimensional integer array

Example 3 : Vector-Vector Multiply

Example 4 : Matrix-Matrix Addition

Example 5 : Matrix-Matrix Multiply

Example 6 : Performance of Vector-Vector Addition

Example 7 : Infinity Norm of a Square Matrix

Example 8 : Reduction Operation for one dimensional integer array using the OpenMP distribute clause

References :     Xeon Phi Coprocessor


The key specifications of the Intel Xeon Phi Coprocessor

Feature                 Specification
Clock Frequency         1.091 GHz
No. of Cores            61
Memory Size / Type      8 GB / GDDR5
Memory Speed            5.5 GT/sec
Peak DP/SP              1.065 / 2.130 teraFLOP/s
Peak Memory Bandwidth   352 GB/s
An Overview of OpenMP 4.0

One non-shared memory model is supported on the Intel Xeon host and the Intel Xeon Phi coprocessor using offload pragmas in the application. The compiler handles the application data at runtime and sends data buffers over to the coprocessor by logically ordering the data parameters as provided by the application. The other model is the virtual shared memory model, in which a system-level runtime support library is used to maintain coherency between the host and coprocessor shared memory address spaces. The non-shared memory model is supported by OpenMP 4.0. The Intel compiler supports the non-shared programming model, and Intel's OpenMP 4.0 implementation sticks to the OpenMP 4.0 syntax; previously, the Intel compiler had proprietary language extensions that were implemented to support the non-shared programming model.

In the non-shared programming model, the main program starts on the host, and data and computation can be sent to the Intel Xeon Phi through offload pragmas when the data exchanged between the host and the coprocessor is bit-wise copyable. Bit-wise copyable data includes scalars, arrays, and structures without indirections.

At runtime, the compiler takes care of copying data back and forth between the host and the coprocessor around the offload block indicated by the pragmas, since the coprocessor memory space is separate from the host memory space.

The data selected for offload may be implicitly copied if used inside the offload code block, provided the variables are in the lexical scope of the code block performing the offload, or explicitly listed as part of the OpenMP 4.0 pragmas.

In OpenMP 4.0, the same code can run on the host processor with or without a coprocessor; the compiler runtime determines whether a coprocessor is present on the system. If the coprocessor is not available or is inactive during the offload call, the code fragment inside the offload code block executes on the host instead.

To understand the new language extensions of OpenMP 4.0 for non-shared memory programming, please refer to the OpenMP 4.0 Technical Report (TR), i.e. www.openmp.org/mp-documents/TR1_167.pdf . In order to understand the language extensions defined in OpenMP 4.0 to deal with the coprocessor, we need to understand some terminology associated with the specification; these terms are explained below.

Terminologies & Notations

  • Device: The coprocessor with its own memory. A device could have one or more coprocessors or a host. A host device is the device executing the main thread. A target device executes the offloaded code segment.
  • Offload:  The process of sending computation from host to target. 
  • Data Environment:  The variables associated with a given execution environment. 
  • Device data environment :  A data environment associated with a target data or target construct. 
  • Mapping Variable : Mapping of a variable in a data environment to a variable in a device data environment. The original and corresponding variable may share storage.
  • Mappable type:  A valid ‘data type’ for a mapped variable. 

Offload Function and Data Declaration Constructs

Target Data Declarations:  Declares a device data environment for the scope of the target code block. This allows creation of versions of the specified function or data that can be used inside a target region executing on the coprocessor.
 
C/C++  
#pragma omp declare target new-line  

        [function-definition-or-declaration] 

#pragma omp end declare target new-line  
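
A minimal sketch (the function name and body are illustrative, not from the source) of marking a routine so that it is compiled for, and callable on, the coprocessor:

#pragma omp declare target
/* Compiled for both the host and the coprocessor */
int square(int x)
{
    return x * x;
}
#pragma omp end declare target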



Function Offload and Execution Constructs:

Target Construct :  In order to execute on a coprocessor, a “target” device must be available on the host system running the code. For the non-shared memory model, the host application executes the initial program, which spawns the master execution thread called the “initial device thread”. The OpenMP pragma “target” provides the capability to offload computations to coprocessor(s).

  • A target region begins as a single thread of execution and executes sequentially, as if enclosed in an implicit task region, called the initial device task region.

  • When a target construct is encountered, the target region is executed by the implicit device task.

  • The task that encounters the target construct waits at the end of the construct until execution of the region completes. If a coprocessor does not exist, or is not supported by the implementation, or cannot execute the target construct, then the target region is executed by the host device.

  • The data environment is created at the time the construct is encountered, if needed. Whether a construct creates a data environment is defined in the description of the construct.

C/C++  
#pragma omp target [clause[[,] clause],...] new-line  

        structured-block  

Where Clauses are :

    device(scalar-integer-expression) 
The integer expression must be a positive number to differentiate the various coprocessors available on a host. If no device is specified, the default device is determined by the internal control variable (ICV) named device-num-var.

    devicemap(alloc | to | from | tofrom : list), scratch(list) : 
These are data motion clauses that allow copying and mapping of variables or common blocks between host scope and target device scope. (In the final OpenMP 4.0 specification this clause is named map.)

    if(scalar-expr) : 
If the scalar expression evaluates to false, the new device data environment is not created and the region executes on the host.

    num_threads(list) : 
Sets the nthreads-var ICV in the device data environment to the value list.
 

The “target” directive creates a device data environment and executes the code block on the target device. The target region binds to the enclosing parallel or task region. It provides a superset of the “target data” construct described later, and describes the data as well as the code block to be executed on the target device. The master task waits for the coprocessor to complete the target region at the end of the construct. When an “if” clause is present and the logical expression inside it evaluates to false, the target region is not executed by the device. When a num_threads clause is present, the nthreads-var in the coprocessor environment is assigned the value list.
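
A minimal sketch (the array name and size are illustrative, not from the source) of offloading a parallel loop with the target construct; the mapped array is copied to the device before the region and back after it:

#define SIZE 4096

void scale(float *a)
{
    /* Offload the loop; a[0:SIZE] is mapped tofrom the device */
    #pragma omp target map(tofrom: a[0:SIZE])
    #pragma omp parallel for
    for (int i = 0; i < SIZE; i++)
        a[i] *= 2.0f;
}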

  • If a target, target update, or target data construct appears within a target region, then the construct is ignored.
  • At most one device clause may appear on the directive. The device expression must evaluate to a positive integer value.
  • At most one “if” clause can appear on the directive.
  • The result of an omp_set_default_device, omp_get_default_device, or omp_get_num_devices routine called within a target region is unspecified. The effect of an access to a threadprivate variable in a target region is unspecified.
  • A variable referenced in a target construct that is not declared in the construct is implicitly treated as if it had appeared in a map clause with a map-type of tofrom.
  • A variable referenced in a target region, but not in the target construct, that is not declared in the target region must appear in a declare target directive.
  • C/C++ specific: A throw executed inside a target region must cause execution to resume within the same target region, and the same thread that threw the exception must catch it.
The extension also provides routines to set and get runtime environment settings (referred to as internal control variables, ICVs, in the OpenMP spec). These are:

ICV Name         Routines
device-num-var   omp_set_device_num(), omp_get_device_num()


Target Data Construct

Target Data Declarations : Creates the device data environment for the scope of the target code block. If there is no “device” clause, the default device is determined by device-num-var.

Syntax
  C/C++ 

#pragma omp target data [clause[[,] clause],...] new-line  

        structured-block  

Where Clauses are :

    device(scalar-integer-expression) 

The integer expression must be a positive number to differentiate the various coprocessors available on a host. If no device is specified, the default device is determined by the internal control variable (ICV) named device-num-var.

    devicemap(alloc | to | from | tofrom : list), scratch(list) : 
These are data motion clauses that allow copying and mapping of variables or common blocks between host scope and target device scope. (In the final OpenMP 4.0 specification this clause is named map.)

    if(scalar-expr) : 
If the scalar expression evaluates to false, the new device data environment is not created.

    structured-block :    No branching into or out of the block of code is allowed.
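
A minimal sketch (array names and sizes are illustrative, not from the source) of using target data to keep arrays resident on the device across two target regions, avoiding a second host-device copy:

#define N 2048

void pipeline(float *a, float *b)
{
    /* One device data environment spanning both offloaded loops */
    #pragma omp target data map(to: a[0:N]) map(from: b[0:N])
    {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = a[i] * 2.0f;

        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] += 1.0f;
    }
}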
 

Target Update Constructs

Target Update Declarations : This construct synchronizes the list items in the device data environment, making them consistent with their corresponding original list items.

Syntax
  C/C++ 

#pragma omp target update [clause[[,] clause],...] new-line  

 

Where Clauses are :

    device(scalar-integer-expression) 

The integer expression must be a positive number to differentiate the various coprocessors available on a host. If no device is specified, the default device is determined by the internal control variable (ICV) named device-num-var.

    to(list), from(list) : 
These are data motion clauses: to(list) copies the listed variables from the host to the target device, and from(list) copies them from the device back to the host.

    if(scalar-expr) : 
If the scalar expression evaluates to false, the update is not performed.

Note that target update is a stand-alone directive; it does not enclose a structured block.
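
A minimal sketch (names and sizes are illustrative, not from the source) of using target update inside a target data region to refresh the host and device copies mid-region:

#define N 1024

void refresh(float *v)
{
    #pragma omp target data map(tofrom: v[0:N])
    {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            v[i] *= 2.0f;

        /* Bring the device copy back to the host for inspection */
        #pragma omp target update from(v[0:N])

        v[0] = 0.0f;   /* host-side change */

        /* Push the host copy back to the device */
        #pragma omp target update to(v[0:N])
    }
}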


Compilation of Sequential Programs : Compiler to Vectorize / No-Vectorize

Using command line arguments (Vectorization & No Vectorization )

The compilation and execution of a program for an Intel Many Integrated Core (MIC) architecture coprocessor (-mmic), also known as the Intel Xeon Phi Coprocessor, are given below.

Compilation :

To compile the program : Using Intel C Compiler with Vectorization

# icc   -mmic  -vec-report=3   -O3  <program name>  -o  <name of executable>  

For example, to compile a simple seq-matrix-matrix-multiply.c program, the user can type on the command line

# icc   -mmic  -vec-report=3   -O3  seq-matrix-matrix-multiply.c  -o  seq-matrix-matrix-multiply  

To compile the program : Using Intel C Compiler without Vectorization

The user can ask the compiler not to vectorize the code with the  -no-vec  option and execute the code; lower performance is likely.

# icc   -mmic  -no-vec  -vec-report=3   -O3  seq-matrix-matrix-multiply.c  -o  seq-matrix-matrix-multiply

To compile the program using the Makefile utility with the Intel C compiler and vectorization :

make

Note: If the Makefile has some extension like Makefile_C then user is required to type

make -f Makefile_C (instead of simply typing make)

make -f Makefile.OFFLOAD (Compile using OFFLOAD mode)

make -f Makefile.NATIVE (Compile using NATIVE mode)

make -f Makefile.OFFLOAD clean (Clean the Object files & Binaries )
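
A minimal sketch (file names and flags are illustrative, not taken from the course materials) of what such a Makefile.NATIVE might contain:

# Makefile.NATIVE : build an OpenMP program for native execution on the coprocessor
CC     = icc
CFLAGS = -mmic -openmp -vec-report=3 -O3

all: openmp-matrix-matrix-multiply

openmp-matrix-matrix-multiply: openmp-matrix-matrix-multiply.c
	$(CC) $(CFLAGS) $< -o $@

clean:
	rm -f openmp-matrix-matrix-multiply *.o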


The details of the syntax of the command to compile the program on Intel Xeon Phi are given in the following table.

Compiler flag                   Purpose
-mmic                           Requests code generation for an Intel MIC architecture processor, also known as the Intel Xeon Phi Coprocessor
-vec-report=3                   Generates a vectorization report
-no-vec                         Disables vectorization of the code
-O3                             Standard optimisation level
-o seq-matrix-matrix-multiply   Writes the resulting executable file to seq-matrix-matrix-multiply
-o run                          Writes the resulting executable file to run


Compilation of Programs : OpenMP - Compiler to Vectorize


Using command line arguments

The compilation and execution of a program for an Intel Many Integrated Core (MIC) architecture coprocessor (-mmic) with OpenMP framework is given below.

# icc   -openmp   -mmic  -vec-report=3   -O3  <program name>  -o  <name of executable>  

For example, to compile a simple openmp-matrix-matrix-multiply.c program, the user can type on the command line

# icc   -openmp   -mmic  -vec-report=3   -O3  openmp-matrix-matrix-multiply.c  -o  openmp-matrix-matrix-multiply  

Using command line arguments

The compilation of a program for the MIC architecture coprocessor using offload compilation is given below. (Note that the -mmic flag is used for native compilation, not offload; an offload executable runs on the host and offloads the marked regions.)

# icc   -openmp   -vec-report=2   -O3  <program name> \
                      -o  <name of executable>
 


OpenMP Set Up Run time Prog. Env.

Setting Up the Prog. Environment : OpenMP programs to scale up to 60 cores

The user can set the number of threads and the affinity from within the program using API calls

omp_set_num_threads(32);
kmp_set_defaults("KMP_AFFINITY=compact");


or, equivalently, using environment variables in the coprocessor's Linux operating environment. That is

export OMP_NUM_THREADS=32
export KMP_AFFINITY=compact

The environment variables given below should be set before execution of the program, as per application requirements.

# set environment variables
export MKL_NUM_THREADS=32
export KMP_AFFINITY="granularity=fine,compact"
export MIC_ENV_PREFIX=PHI
export PHI_MKL_NUM_THREADS=236
export PHI_KMP_AFFINITY=granularity=fine,compact
export OFFLOAD_REPORT=2
export MKL_MIC_ENABLE=0

Un-setting Up the Prog. Environment :

#unset env variables
unset MKL_NUM_THREADS
unset KMP_AFFINITY
unset MIC_ENV_PREFIX
unset PHI_MKL_NUM_THREADS
unset PHI_KMP_AFFINITY
unset MKL_MIC_ENABLE


Execution of Programs : Sequential & OpenMP

To execute the application on the coprocessor : Log-in to the Xeon Phi Coprocessor

To execute the program on the coprocessor, the user logs in to the coprocessor, then simply types the name of the executable on the command line. 

./< Name of executable>

For example, to execute a simple seq-matrix-matrix-multiply.c application, the user types the command  

./seq-matrix-matrix-multiply 

For example, to execute a simple openmp-matrix-matrix-multiply.c OpenMP application, the user types 

./openmp-matrix-matrix-multiply  

The expected output:

Initializing the Vectors
Computation started
gigaFLOPs = ****
Time = ****
gigaFLOPs per Sec = *****


Execution - Script

Script to run on Xeon Phi in Native Mode :

  export MKL_MIC_ENABLE=0
  export KMP_AFFINITY="granularity=thread,balanced"
  export LD_LIBRARY_PATH=/tmp
  nThreads=240
  i=200
  while [ $i -le 1000 ]
  do
    echo -n "mic "
    ./openmp_matrix_matrix_multiply $i $nThreads 8
    let i+=100
  done


Compiler Offload Pragma & Report


Details of Code : Intel compiler's offload pragmas :

On the Xeon host, the code to transfer the data to the Xeon Phi coprocessor is automatically created by the Intel compiler. When offload pragmas are added to C/C++ or Fortran code and the Intel compiler encounters an offload pragma, it generates code for both the coprocessor and the host. The programmer's responsibility is to include appropriate offload pragmas with suitable data clauses. Details can be found under "Offload Using a Pragma" in the Intel compiler documentation as given in the references.

  • Using #pragma offload target(mic) : In this example, how to offload the matrix computation to the Intel Xeon Phi coprocessor using #pragma offload target(mic) is shown.

  • Choose the target MIC out of Multiple Coprocessors : The user could also specify the Intel Xeon-Phi Coprocessor Number_Id in a system with multiple coprocessors (Ex. PARAM YUVA Compute Nodes ) by using #pragma offload target(mic:Number_Id).

Other Information about Intel compiler's offload :
  • Use -no-offload : Offloading is enabled by default for the Intel compiler. Use -no-offload to disable the generation of offload code.

  • Vec Report : Using the compiler option -vec-report2, one can see which loops have been vectorized on the host and on the MIC coprocessor.

  • Printing Data transfer (OFFLOAD_REPORT) : By setting the environment variable OFFLOAD_REPORT one can obtain information about performance and data transfers at runtime:

    hypack-01: ~offload_c> export OFFLOAD_REPORT=2

Intel Xeon Phi Coprocessor Compiler Offload Clauses
  • Using #pragma offload target(mic) : In all the examples given below, the important information related to offloading the matrix computations to the Intel Xeon Phi coprocessor using #pragma offload target(mic) is discussed.

The Intel Xeon-Phi coprocessor programming environment provides an "offload pragma" that supplies additional annotation so the compiler can correctly move data to and from the external Xeon Phi card. Note that single or multiple OpenMP loops can be contained within the scope of the offload directive. The clauses are interpreted as follows:

Offload: The offload pragma keyword specifies different clauses that contain information relevant to offloading to the target device.

target(mic:MIC_DEV) is the target clause that tells the compiler to generate code for both the host processor and the specified offload device, i.e., the Xeon Phi coprocessor. The constant parameter MIC_DEV is an integer associated with the Xeon-Phi device. Note that the offload performs different operations as per requirement.

  • The offload runtime will schedule offload work within a single application in a round-robin fashion, which can be useful to share the workload amongst multiple devices.

  • The offload runtime will utilize the host processor when no coprocessors are present and no device number is specified (for example, target(mic)).

  • programmers can use _Offload_to to specify a device in their code.

  • It is the responsibility of the programmer to ensure that any persistent data resides on all the devices. During round-robin scheduling, keeping persistent data resident on all the devices is important from a performance point of view and to avoid PCIe bottlenecks. In general, only use persistent data when the device number is specified.

  • Choose the target MIC out of Multiple Coprocessors : The user could also specify the Intel Xeon-Phi Coprocessor MIC_DEV in a system with multiple coprocessors (Ex. PARAM YUVA Compute Nodes ) by using #pragma offload target(mic:MIC_DEV).

  • Using #pragma offload target(mic) : To offload the matrix computation to the Intel Xeon Phi coprocessor using #pragma offload target(mic), the following clauses are required.

in(Matrix_A:length(size*size)): The in(var-list [modifiers]) clause explicitly copies data from the host to the coprocessor. Note that:

  • The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements.

  • By default, memory will be allocated on the device and deallocated on exiting the scope of the directive.

  • The alloc_if(condition) and free_if(condition) modifiers can change the default behavior.

out(Matrix_A:length(size*size)): The out(var-list [modifiers]) clause explicitly copies data from the coprocessor to the host. Note that:

  • The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. By default, memory will be deallocated on exiting the scope of the directive.

  • The free_if(condition) modifier can change the default behavior.
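
A minimal sketch (the loop body is illustrative, not from the source) combining the clauses above into a single offload of a matrix multiply:

#define MIC_DEV 0

void offload_multiply(float *Matrix_A, float *Matrix_B,
                      float *Matrix_C, int size)
{
    /* A and B are copied in, C is copied out; device memory is
       allocated and freed around the offload block by default */
    #pragma offload target(mic:MIC_DEV) \
            in(Matrix_A, Matrix_B : length(size*size)) \
            out(Matrix_C : length(size*size))
    {
        #pragma omp parallel for
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++)
                    sum += Matrix_A[i*size + k] * Matrix_B[k*size + j];
                Matrix_C[i*size + j] = sum;
            }
    }
}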

Tuning & Performance : KMP Thread Affinity
  • On a single core, we can employ four threads. To ensure maximum performance, the core needs to execute FMA calculations on every clock cycle. To do this, the core must be running more than one thread, i.e., two, three, or four threads. The code is re-written to execute on multiple threads, and eventually multiple cores.

  • OpenMP, a widely used standard supported by compilers for Intel Xeon Phi coprocessors, provides a standard API and programming model for shared-memory multiprocessing in HPC. The OpenMP parallel for directive is used to enable the desired threading: the loop iterations are divided among the available threads and run in parallel.

  • In the OpenMP implementation, each thread works on a separate set of row elements of the matrix, and offset variables are included in the code. To achieve performance close to the theoretical peak, two OpenMP calls are used :

    omp_set_num_threads(32);
    kmp_set_defaults("KMP_AFFINITY=compact");

    The first OpenMP call sets the number of threads to use while running the code; the second, setting the affinity to compact, controls how the thirty-two requested OpenMP threads are pinned to cores.

  • A subset of main memory is available to an application, and main memory is organised into pages. The two-dimensional arrays Matrix_A and Matrix_B are accessed along the rows. On a system that supports virtual memory, the memory addresses for different applications are virtualised: they are given logical addresses, assigned into virtual pages. A typical page size is 4 or 8 Kbytes, but larger ones exist. In fact, the physical pages that are available to a program may be spread out in memory, so the virtual pages must be mapped to physical ones. The page addresses are stored in the page table, and the translation-lookaside buffer (TLB), a special cache, stores the recently accessed entries in the page table. The TLB is critical to performance; in this example, accessing the arrays along rows helps to avoid cache reloads and a large number of TLB misses.

  • Loop optimisation can give significant improvement in performance, and loop unrolling can help to improve cache-line utilisation by improving data reuse. Inner-loop unrolling with appropriate OpenMP pragmas may further improve the performance. Loop unrolling can also help to increase the instruction-level parallelism, or ILP. Compiler switches can help to perform loop unrolling.

  • Use of Pointers and Contiguous Memory in the C Language : The pointer aliasing problem exists in many application codes and prevents a compiler from performing many program optimizations, since it cannot determine that they are safe; it must assume that all pointers may reference any memory address. Better performance can be obtained if pointers are guaranteed to point to portions of non-overlapping memory. The restrict keyword is a feature of the C99 Standard that is supported by the compiler and may improve the performance (see the restrict sketch after this list).

  • The number of threads in an OpenMP environment and the mapping of threads to cores on the Intel Xeon-Phi coprocessor play an important role in achieving maximum performance of developer code. The KMP_AFFINITY environment variable specifies the thread-to-core affinity. There are three preset schemes: compact, scatter, and balanced, and the user can explicitly define the affinity that works best for their application. The choice of affinity scheme depends upon the memory access, data sharing, and workload of each thread affined to a core. The default runtime thread affinity can also be used, but it may change between software releases; for consistent application performance across software releases, do not rely on the default affinity scheme.

  • Compact tries to use the minimum number of cores by pinning four threads to a core before filling the next core.

  • Scatter tries to evenly distribute threads across all cores.

  • Balanced tries to equally scatter threads across all cores such that adjacent threads (sequential thread numbers) are pinned to the same core. One caveat: "all cores" here refers to the total number of cores minus one, because one core is reserved for the operating system during an offload. Interested readers can find more about the affinitization schemes in the Intel compiler documentation.
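
A minimal sketch (function and array names are illustrative, not from the source) of the restrict keyword mentioned in the pointer-aliasing bullet above; compile with -std=c99:

/* restrict promises the compiler that a, b and c never overlap,
   so the loop can be vectorized without runtime aliasing checks */
void vector_add(int n, float * restrict a,
                const float * restrict b, const float * restrict c)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}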


Script : Using KMP Thread Affinity

Script to run on Xeon Phi in Native Mode :

  export MKL_MIC_ENABLE=0
  export KMP_AFFINITY="granularity=thread,balanced"
  export LD_LIBRARY_PATH=/tmp
  nThreads=240
  i=1
  while [ $i -lt 100 ]
  do
    echo -n "mic "
    ./openmp_matrix_matrix_multiply 1000 $i 10
    let i++
  done


Tuning & Performance : Memory Alignment for Vectorization
  • Memory Alignment for Vectorization : The matrices are dynamically allocated using posix_memalign(), and their sizes must be specified via the length() clause. Using in, out and inout, one can specify which data has to be copied to the Intel Xeon-Phi coprocessor from the host (see the alignment sketch after this list).

  • Data alignment - 64-Byte : It is recommended that data for Intel Xeon Phi is 64-byte (512-bit) aligned, as required by the MIC architecture.

  • Alignment using #pragma vector aligned : For proper alignment of data to get performance using Intel compiler vectorization, #pragma vector aligned is used. This tells the compiler that all array data accessed in the loop is properly aligned.

  • In addition, the -std=c99 command-line option tells the compiler to allow use of the restrict keyword and C99 VLAs. Note that the C99 restrict keyword specifies that the vectors do not overlap. (Compiling with -std=c99 is required for efficient vectorization.)

  • Data should be aligned to 16 Bytes (128 bits) for SSE, 32 Bytes (256 bits) for AVX, and 64 Bytes (512 bits) for the MIC architecture.
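
A minimal sketch (sizes and names are illustrative, not from the source) of 64-byte aligned allocation with posix_memalign() combined with the #pragma vector aligned hint:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define N 1024

int main(void)
{
    float *a, *b;
    /* 64-byte (512-bit) alignment as recommended for the MIC architecture */
    posix_memalign((void **)&a, 64, N * sizeof(float));
    posix_memalign((void **)&b, 64, N * sizeof(float));
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Tell the compiler that every array access in this loop is aligned */
    #pragma vector aligned
    for (int i = 0; i < N; i++)
        a[i] += b[i];

    free(a);
    free(b);
    return 0;
}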


Tuning & Performance : Summary on Xeon Phi

Compiler Optimisation & Vectorization
  • Performance : The Intel Xeon-Phi coprocessor delivers peak performance greater than 1 teraFLOP/s for double-precision (DP) floating-point calculations and greater than 2 teraFLOP/s for single-precision calculations.

  • The effective use of compiler pragmas with vectorization and scaling to achieve the best performance in terms of gigaFLOP/s for a given code is important. That is, the code needs to use the appropriate compiler flags, vectorization, and scaling to achieve the best performance. For more information on this, visit
    hypack13-mode03-coprocessor-compiler-vect.html

  • To parallelize more loops in an application on the coprocessor than on an Intel Xeon based host, the new MIC instructions such as fused multiply-add, masked vector instructions, and gather/scatter can be used to improve performance, thanks to the large 64-byte SIMD vectorization width that the MIC architecture supports.

  • The coprocessor supports a key performance-enhancing capability: executing both a multiplication and an addition, known as fused multiply-add (FMA), in a single instruction without precision loss, enabling two floating-point operations in one instruction. Use FMA in Numerical Linear Algebra (NLA) computations.

  • The options used for running the code give information about the performance of the code. The details of the program, i.e., the inner loop of the matrix-matrix multiply, and the various compiler options do not indicate the number of threads and the number of cores used.

  • To achieve the maximum performance of a single core, out of the peak performance of a single core of approximately 35 gigaFLOP/s, it is necessary to take full advantage of the coprocessor's optimisation features and programming paradigms such as OpenMP and the Intel MKL libraries. Only a single thread is used in this example, and the maximum number of threads the Intel Xeon Phi Coprocessor can directly manage per core is four.

  • The option -vec-report=3 gives the user information about the compiler's choices for vectorizing portions of the code. (Note that this option may also be written -vec-report3.) For vectorization, loops and array processing are required.

  • In the matrix-matrix multiply examples, the compiler was able to arrange chunks of arrays to be loaded into machine registers or directly from memory via caches. Thus all 16 single-precision floating-point lanes are used for simultaneous calculation.

  • The inner and outer loops are effectively vectorized for the code.


Example Programs on Matrix Computations

Example. 1 : Reduction Operation for one dimensional integer array
(Download source code : openmp4x-reduce-1darray.c,
                                  env-setup.sh ,
                                  Makefile_openmp4x )

A simple example of how the constructs discussed above for OpenMP 4.0 can be used to offload computation to the Intel Xeon Phi coprocessor. This example shows a sum reduction operation being run on a Xeon Phi coprocessor. The offload computation “reduce” is first declared with the “#pragma omp declare target” pragma. This causes the routine or its components to be executable on the coprocessor. The offload pragma,


#pragma omp target map(inarray[0:SIZE]) map(sum)

that causes the specific code block to be sent to the coprocessor for computing. In this case it computes the reduction of an array of numbers and returns the computed value through the “sum” variable to the host. The “inarray” and “sum” variables are copied in and out of the coprocessor before and after the computation.
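
A condensed sketch of what such a program might look like (the downloadable openmp4x-reduce-1darray.c is authoritative; the details here are illustrative):

#include <stdio.h>

#define SIZE 1000

#pragma omp declare target
int reduce(int *inarray, int size)
{
    int sum = 0;
    for (int i = 0; i < size; i++)
        sum += inarray[i];
    return sum;
}
#pragma omp end declare target

int main(void)
{
    int inarray[SIZE], sum = 0;
    for (int i = 0; i < SIZE; i++)
        inarray[i] = i;

    /* inarray is copied in; sum is copied in and back out */
    #pragma omp target map(to: inarray[0:SIZE]) map(tofrom: sum)
    sum = reduce(inarray, SIZE);

    printf("sum = %d\n", sum);
    return 0;
}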

Example. 2 : Reduction Operation for two dimensional integer array
(Download source code : openmp4x-reduce-2darray.c)

Example. 3 : Vector Vector Multiplication
(Download source code : openmp4x-vect-vect-multiply.c)

Example. 4 : Matrix Matrix Addition
(Download source code : openmp4x-mat-mat-addition.c)

Example. 5 : Matrix Matrix Multiplication
(Download source code : openmp4x-mat-mat-multiply.c)

Example. 6 : Vector Vector Addition - Scaling to more cores & Performance
(Download source code : openmp4x-vect-vect-addition-perf.c)

Example. 7 : Infinity Norm of a Square Matrix
(Download source code : openmp4x-infinity-norm-matix.c)

Example. 8 : Reduction Operation for one dimensional integer array using openmp distribute clause
(Download source code : openmp4-0-distribute-1Darray-sum.c)


Centre for Development of Advanced Computing