Prog. on Intel Xeon Phi Coprocessors (OpenMP 4.0)
|
Example programs using
compiler pragmas, directives, function calls, and environment variables, Compilation and execution
of OpenMP 4.0 programs, programs numerical and non-numerical computations
are discussed.
OpenMP 4.X on Intel Xeon Phi
Document :
OpenMP Version 4.0 July 2013 (pdf)
Compilation :
(Sequential ): Vectorize & Not to Vectorize
OpenMP Compilation :
OpenMP
OpenMP Execution :
Set Up Run time Prog. Env.
Execution
Script
Offload Information :
Compiler Offload Pragma & Report
Compiler Offload Clauses
Tuning & Performance :
KMP Thread Affinity
Memory Alignment
KMP-Script
Summary
Matrix Computation Codes
Example 1 : Reduction Operation for one dimensional integer array
Example 2 : Reduction Operation for two dimensional integer array
Example 3 : Vector-Vector Multiply
Example 4. Matrix - Matrix Addition
Example 5. Matrix - Matrix Multiply
Example 6 : Performance of Vector-Vector Multiply
Example 7. Infini Norm of A Square Matrix
Example 8. Reduction Operation for one dimensional integer array using openmp distribute clause
References :
Xeon Phi Coprocessor
|
The key specifications of Intel Xeon Coprocessor
|
Specification |
Feature |
Clock Frequency |
1.091 GHz |
No. of Cores |
61 |
Memory Size / Type |
8 GB/GDDR5 |
Memory Speed |
5.5 GT/sec |
Peak DP/SP |
1.065/2.130 teraFLOP/s |
Peak Memory Bandwdith |
352 GB/s |
|
|
|
An Overview of OpenMP 4.0
|
One of the non-shared memory model is supported On Intel Xeon host and Intel Xeon Phi Coprocessor, using offload pragmas in the application. The compiler handled the application data at runtime and sends data buffer over to coprocessors by logically ordering the data parameters as provided by the application. The other non-shared memory model is virtual shared memory model in which system level runtime support library is used to maintain the coherency between the host and coprocessor shared memory address space.
The non-shared memory model is supported by OpenMP 4.0. Intel Compiler supports non-shared programming model and Intel OpenMP 4.0 Programming sticks to OpenMP 4.0 syntax. Intel compiler had proprietary language extensions that were implemented to support non-shared programming model.
In the non-shared programming model, the main program starts on the host and the data and computation can be sent to Intel Xeon Phi through Offload pragmas when the data exchange between the host and the coprocessor are bit-wise copyable. The bitwise copyable includes scalars, arrays and structures without indirections.
At runtime, the compiler takes care-off copying data back and forth between the host and the coprocessor around the offload block indicated by the pragmas in view of Coprocessor memory space is separate from the host memory space.
The data selected for offload may be implicitly copied if used inside the offload code block provided the variables are in the lexical scope of the code block performing offload and variable listed explicitly as part of the OpenMP 4.0 pragmas.
In OpenMP 4.0, the same code can run on host processor with or without a coprocessor, the compiler runtime determines whether the coprocessor is present on the system or not. So if the coprocessor is not available or inactive during the offload call, the code fragment inside the offload code block may execute on the host as well.
To understand the new language extensions of OpenMP 4.0, for non-shared memory programming, please refer to OpenMP 4.0 Technical Report (TR)
i.e.
www.openmp.org/mp-documents/TR1_167.pdf .
In order to understand the language extensions defined in Open4.0 to deal with coprocessor, we need to
understand some terminologies associated with the specification to explain the coprocessor interaction and these are explined below.
Terminoloies & Notations
-
Device:
The coprocessor with its own memory. A device could have one or more coprocessors or host. A host device is the device executing the main thread.A target device executes the offloaded code segment
-
Offload: The process of sending computation from host to target.
-
Data Environment:
The variables associated with a given execution environment.
-
Device data environment : A data environment associated with a target data or target construct.
-
Mapping Variable : Mapping of a variable in a data environment to a variable in a device data environment. The original and corresponding variable may share storage.
-
Mappable type: A valid ‘data type’ for a mapped variable.
Offload Function and Data Declaration Constructs
Target Data Declarations: Declares device data environment for the scope of the target code block.
This allows creation of versions of specified function or data that can be used inside a target region
executing on the coprocessor.
C/C++
#pragma omp declare target new-line
[function-definition-or-declaration]
#pragma omp end target new-line
Function Offload and Execution Constructs:
Target Data Declarations:
In order to execute on a coprocessor, a “target” device must be available on the host
system running the code. For non-shared memory model, the host application executes the initial program which
spawns the master execution thread called “initial device thread”. The OpenMP pragma
“target” provides the capability to offload computations to coprocessor(s).
-
A target region begins as a single thread of execution and executes sequentially, as if enclosed in an implicit task region, called the initial device task region.
-
When a target construct is encountered, the target region is executed by the implicit device task
-
The task that encounters the target construct waits at the end of the construct until execution of the region completes. If a coprocessor does not exist, or is not supported by the implementation, or cannot execute the target construct then the target region is executed by the host device.
-
The data environment is created at the time the construct is encountered if needed. Whether a construct creates a data environment is defined in the description of the construct
C/C++
#pragma omp target [clause[[,] clause],...] newline
parallelloopconstruct | parallelsectionsconstruct
Where Clauses are :
device(scalar-integer-expression)
The integer expression must be a positive number to differentiate various coprocessor
available on a host. If no device is specified, the default device is determined by internal
control variable (ICV) named num-var.
devicemap(alloc | to | from | tofrom: list), Scratch(list):
These are data motion clauses that allow copying and mapping of variables or common
block to/from host scope to target device scope
if(scalar-expr):
If the scalar-expression evaluates to false, the new device environment is not created.
Num_threads( list )
The “ target ” directive creates a device data environment and executes the
code block on the target device. The target region binds to the enclosing parallel or
task region. It provides a superset of “ target data ” constructs described later and describes
data as well as the code block to be executed on the target device.
The master task waits for the coprocessor to complete the target region at the end of the
constructs. When an “ if ” clause is present and if the logical expression inside the if
clause evaluates to false, the target region is not executed by the device. When
a num_threads clause is present, the nthreads-var in the coprocessor
environment is assigned the value list.
-
If a target, target update, or target data construct appears within a target region ten the construct is ignored.
-
At most one device clause may appear on the directive. The device expression must evaluate to a positive integer value.
-
At most one “ if ” clause can appear on the directive.
-
The result of an omp_set_default_device,
omp_get_default_device,
or
omp_get_num_devices
routine called within a target region is unspecified. The effect of an access to a
threadprivate variable in a target region is unspecified
-
A variable referenced in a target construct
that is not declared in the construct is implicitly treated as if it had appeared in a map clause with a map-type of
tofrom.
-
A variable referenced in a target region but not the target construct that is not declared in the target region, must appear in a declare target directive.
-
C/C++ Specific: A throw executed inside a target
region must cause execution toresume within the same target region, and the same thread
that threw the exception must catch it.
|
ICV Name |
Routines |
device-num-var |
omp_set_device_num(),
omp_get_device_num(); |
|
The extension also provides routines to set and get runtime environment settings
(referred to as internal control variables ICV in the OpenMP spec). These are:
Target Data Construct Constructs
Target Data Declarations : Creates the device data environment for the scope of the target code block. If there are no “ device ” clause, the default device is determined by
device-num-var.
Syntax
C/C++
#pragma omp target data [clause[[,] clause],...] newline
structured-block | parallelsectionsconstruct
Where Clauses are :
device(scalar-integer-expression)
The integer expression must be a positive number to differentiate various coprocessor
available on a host. If no device is specified, the default device is determined by internal
control variable (ICV) named num-var.
devicemap(alloc | to | from | tofrom: list), Scratch(list):
These are data motion clauses that allow copying and mapping of variables or common
block to/from host scope to target device scope
if(scalar-expr):
If the scalar-expression evaluates to false, the new device environment is not created.
Strucrtured-block : No branching in and out of the block of code is allowed.
Target Update Constructs
Target Update Declarations :
This construct synchronizes the list items in the device data
environment consistent with their corresponding original list items.
device-num-var.
Syntax
C/C++
#pragma omp target update [clause[[,] clause],...] newline
Where Clauses are :
device(scalar-integer-expression)
The integer expression must be a positive number to differentiate various coprocessor
available on a host. If no device is specified, the default device is determined by internal
control variable (ICV) named num-var.
devicemap(alloc | to | from | tofrom: list), Scratch(list):
These are data motion clauses that allow copying and mapping of variables or common
block to/from host scope to target device scope
if(scalar-expr):
If the scalar-expression evaluates to false, the new device enviornment is not created.
Strucrtured-block : No branching in and out of the block of code is allowed.
|
Using command line arguments (Vectorization & No Vectorization )
The compilation and execution of a program for an Intel Many Integrated Core (MIC)
architecture coprocessor (-mmic)
also known as Intel Xeon Phi Coprocessor are given below.
Compilation :
|
|
Compilation of Sequential Programs : Compiler to Vectorize / NoVectorize
|
Using command line arguments (Vectorization & No Vectorization )
The compilation and execution of a program for an Intel Many Integrated Core (MIC)
architecture coprocessor (-mmic)
also known as Intel Xeon Phi Coprocessor are given below.
Compilation :
To compile the program : Using Intel C Compiler with Vectorization
# icc -mmic -vec -report=3 -O3 <program name> -O
<Name of executable>
For example to compile a simple seq-matrix-matrix-multiply.c program user can type on the command line
# icc -mmic -vec -report=3 -O3 <seq-matrix-matrix-multiply.c > -O
<seq-matrix-matrix-multiply>
To compile the program : Using Intel C Compiler without Vectorization
User can ask the complier not to vectorze the code with
-no -vec option and execute the code. It is possible to get less
performance.
# icc -mmic -no-vec -vec -report=3 -O3 <seq-matrix-matrix-multiply.c > -O
<seq-matrix-matrix-multiply>
To compile the program using Makefile Utility, Using Intel C Compiler with Vectorization
make
Note: If the Makefile has
some extension like Makefile_C
then user is required to type
make -f Makefile_C
(instead of simply typing make)
make -f make -f Makefile.OFFLOAD
(Compile using OFFLOAD mode)
make -f make -f Makefile.NATIVE
(Compile using NATIVE mode)
make -f Makefile.OFFLOAD clean
(Clean the Object files & Binaries )
The details ofr syntax of the command to compile the program on Intel Xeon Phi are given in the
following table.
|
Compiler flag |
Purpose |
-mmic
|
requests the code to be generated for an Intel MIC architecture processor
(
-mmic ) also known as Intel Xeon Phi Coprocessor.
|
-vec -report=3 |
Generate a Vector Report |
-no-vec |
Not to Vectorize the code |
-O3 |
Standard Optimisation technique |
seq-matrix-matrix-multiply |
Writes the resulting excutable file to
seq-matrix-matrix-multiply
|
run |
Writes the resulting excutable file to
run
|
|
|
|
|
Compilation of Programs : OpenMP - Compiler to Vectorize
|
Using command line arguments
The compilation and execution of a program for an Intel Many Integrated Core (MIC)
architecture coprocessor (-mmic)
with OpenMP framework is given below.
# icc -openp -mmic -vec-report=3 -O3 <program name> -O
<Name of executable>
For example to compile a simple openmp-matrix-matrix-multiply.c program user can type on the command line
# icc -openmp -mmic -vec-report=3 -O3 <seq-matrix-matrix-multiply.c > -O
<openmp-matrix-matrix-multiply>
Using command line arguments
The compilation and execution of a program for MIC
architecture coprocessor using offload
and compliation is given below.
# icc -openmp -mmic -vec-report2 -O3 <program name> \
-O
< Name of executable >
|
|
|
OpenMP Set Up Run time Prog. Env.
|
Setting Up the Prog. Environment : OpenMP programs to Scale upto 60 Cores
User can set the number of threads and the affinity using environment variables
omp_set_num_threads(32);
kmp_set_defaults("KMP_AFFINITY = compact");
in the coprocessor's Linux Operating environment. That is
export OMP_SET_THREADS=32
export KMP_AFFINITY = compact
Below given enviornment varaibles should be declared before execution of the program
as per application requirements.
# set environment variables
export MKL_NUM_THREADS=32
export KMP_AFFINITY = granularity=fine,compact
export MIC_ENV_PREFIX=PHI
export PHI_MKL_NUM_THREADS=236
export PHI_KMP_AFFINITY=granularity=fine,compact
export OFFLOAD_REPORT=2
export MKL_MIC_ENABLE=0
Un-setting Up the Prog. Environment :
#unset env variables
unset MKL_NUM_THREADS
unset KMP_AFFINITY
unset MIC_ENV_PREFIX
unset PHI_MKL_NUM_THREADS
unset PHI_KMP_AFFINITY
unset MKL_MIC_ENABLE
|
|
|
|
Execution of Programs : Sequential & OpenMP
|
To execute the applicaiton on coprocessor : Log-in to Xeon-Phi Coprocessor
To execute the program on coprocessor, the user Log-in to the
coprocessor, then simply type the name of executable
on command line.
./< Name of executable>
For example to execute a simple
seq-matrix-matrix-multiply.c application, user
types the command
./seq-matrix-matrix-multiply
For example to execute a simple
openmp-matrix-matrix-multiply.c OpenMP application, user types
./openmp-matrix-matrix-multiply
The expected output:
Initializing the Vectors
Computation startd
gigaFLOPs = ****
Time = ****
gigaFLOPs per Sec = *****
|
|
|
|
Execution - Script
|
Script to run on Xeon Phi in Native Mode :
export MKL_MIC_ENABLE=0
export KMP_AFFINITY=
"granularity=thread,balanced"
export LD_LIBRARY_PATH=/tmp
nThreads=240
i=200
while[ $i –le 1000 ]
do
echo-n
"mic"
./openmp_matrix_matrix_multiply $i $nThreads 8
let i+=100
done
|
|
|
|
Compiler Offload Pragma & Report
|
Details of Code : Intel compiler's offload pragmas :
On Xeon-host, the code to transfer the data to the Xeon Phi coprocessor is automatically
created by the Intel complier. When the code is written using OpenMP pragmas to C/CC+ Fortran
code, the Intel complier encounters an offload pragma, it generates code for both the coprocessor
and the host.
The programmer responsibility is to include appropriate offload pragmas by adding data clauses.
Details can be found under "Offload Using a Pragma" in the Intel compiler documentation as
given in the references.
-
Using #pragma offload taret(mic) :
In this example, how to offload the matrix computation to the Intel Xeon Phi
coprocessor using #pragma offload target(mic)
is shown
-
Choose the target MIC out of Multiple Coprocessors :
The user could also specify the Intel Xeon-Phi Coprocessor
Number_Id in a system with multiple coprocessors (Ex. PARAM YUVA
Compute Nodes ) by using
#pragma offload target(mic:Number_Id).
|
Other Information about Intel compiler's offload :
-
Use -no-offload :
Offloading is enabled per default for the Intel compiler. Use
-no-offload to
disable the generation of offload code.
-
Vec Report :
Using the compiler option -vec-report2 one can see which loops have been vectorized
on the host and the MIC coprocessor:
-
Printing Data transfer (OFFLOAD_REPORT) :
By setting the environment variable OFFLOAD_REPORT
one can obtain information about
performance and
data transfers at runtime:
hypack-01: ~offload_c>
export OFFLOAD_REPORT=2
|
|
|
Intel Xeon Phi Coprocessor Compiler Offload Clauses
|
The Intel Xeon-Phi coprocessor programming environment provides " offload pragma" which
provides additional annotation so the compiler can correctly move data to and from
the external Xeon Phi Card. Note that single or multiple OpenMP loops can be contained
within the scope of the offload directive. The clauses are interpreted as follows:
Offload: The offload pragma keyword specifies different clauses that contain
information relevant to offloading to the target device.
target(mic:MIC_DEV)
is the target clause that tells the compiler to generate code for both the host
processor and the specified offload device i.e., Xeon Phi Coprocessor.
The constant parameter MIC_DEV is an an integer associated with Xeon-Phi device.
Note that the offload performs different operations as per requirement.
-
The offload runtime will schedule offload work within a single application in
a round-robin fashion, which can be useful to share the workload amongst
multiple devices.
-
The offload runtime will utilize the host processor when no coprocessors are present
and no device number is specified (for example, target(mic)).
-
programmers can use _Offload_to to specify a device in their code.
-
It is the responsibility of the programmer to ensure that any persistent
data resides on all the devices.
During the round-robin scheduling, the persistent data resides on all the devices
is important from performance point of view and to avoid PCIe bottlenecks. In general,
only use persistent data when the device number is specified.
-
Choose the target MIC out of Multiple Coprocessors :
The user could also specify the Intel Xeon-Phi Coprocessor
MIC_DEV in a system with multiple coprocessors (Ex. PARAM YUVA
Compute Nodes ) by using
#pragma offload target(mic:MIC_DEV).
in(Matrix_A:length(size*size)):
The in(var-list modifiersopt) clause explicitly copies
data from the host to the coprocessor. Note that:
-
The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements.
-
By default, memory will be allocated on the device and deallocated on exiting the scope of the directive.
-
The
alloc_if(condition) and
free_if(condition) modifiers can change the default behavior.
out(Matrix_A:length(size*size)):
The in(var-list modifiersopt) clause explicitly clause explicitly copies data from the coprocessor to the host. Note that:
-
The length(element-count-expr) specifies the number of elements
to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. By default, memory will be deallocated on exiting the scope of the directive.
-
The
free_if(condition) modifier can change the default behavior.
|
|
Tuning & Performancre : KMP Thread Affinity
|
-
On Single a core, we can employ four threads. To ensure maximum performance, core needs to execute FMA calculations on evey
clock cycle. To do this, core must be running more thatn one thread. i.e., two or three or four threads. The code is re-written
to execute on multiple threads, and eventually multiple cores.
-
OpenMP Framework, a widley used standard support by compilers for Intel Xeon Phi coprocessors with
standard API and programming model for shared memory multiprocessing for HPC. OpenMP
parallel for) directive to enable the desired threading is used
and the loop iteratons are divided among the available threads and run in parallel.
-
In OpenMP implementation, each thread will work on a separate set of row array elements of Matrix and
offset variables are included in the code.
To achieve performance close to theoretical peak performance, two openMP calls are used :
omp_set_num_threads(32);
kmp_set_defaults("KMP_AFFINITY = compact");
The first OpeMP call sets the number of threads to use while running the code and the second one
setting AFFINITY to
compact will ensure that thirty two requested OpenMP
threads execute
on different cores.
-
A subset of main memory is available to an application and main memory is organised into pages.
The two dimensional array Matrix_A and Matrix_B are accessed
along the rows. On a system that supports virtual memory, the memory addresses for different
applications
are virtualised : they are given logical addresses, assigned into virutal pages. A typical page
size is 4 or 8 Kbytes, but laret ones exist.
In fact, the physical pages that are available to program may be spread out in memory, and so the
virtual
pages must be mapped to physical ones. The page addresses are stored in page table and
translation-lookside buffer (TLB) a spcial cache that
stores the recently accessed entries in page table. TLB is crticial to performance and in this
exmaple cache reloads plus a large number of
TLB misses may not result.
-
Loop Optimisation can give sinificant improvement in performance and loop unrolling can
help to improve cache line utilization by improving data reuse. Inner loop unrolling with
approriate OpenMP pragmas may further improve the performance. Loop unrolling can also
help to increase the instruction-level parallelism, or ILP. The compiler switches can
help to perform loop unrolling.
-
Use of Pointers and Contiuous Memory in C Lanuguage : The pointer aliasing problem
exists in many application codes and it prevents a compiler from performing many progrm optmizations
since it can nto determine that they are safe. It is assumed that all pointers may reference any
memory address. The performance can be obtained, if pointers are guaranteed to point to portions
of nonoverlapping memory. The restrict key-word is a feature
of the C99 Standard which is supported by compiler and it may improve the performance.
-
The number of threads in an OpenMP enviornment and the mapping of cores on Intel Xeon-Phi Coprocessor
play an important role to achieve maximum performance of developer code.
The KMP_AFFINITY environment variable specifies the thread-to-core affinity.
There are three preset schemes: compact, scatter, and balanced
and the user explicitly define the affinity that works best for their application. The choice
of affinity schemes depends upon the memory access, data sharing and work-load for each thread,
affinte to core. The default runtime thread affinity can also be used and it may change
between software releases. For consistent application performance across software releases,
do not rely on the default affinity scheme.
-
Compact tries to use minimum number of cores by pinning four threads to a
core before filling the next core
-
Scatter tries to evenly distribute threads across all cores
-
Balanced tries to equally scatter threads across all cores such that adjacent threads (sequential thread numbers) are pinned to the same core. One caveat being that all cores refers to the total number of cores -1 because one core is reserved for the operating system during an offload
Interested readers can find more about the affinitization schemes
|
|
|
|
Script : Using KMP Thread Affinity
|
Script to run on Xeon Phi in Native Mode :
export MKL_MIC_ENABLE=0
export KMP_AFFINITY=
"granularity=thread,balanced"
export LD_LIBRARY_PATH=/tmp
nThreads=240
i=1
while[ $i –lt 100 ]
do
echo-n
"mic"
./openmp_matrix_matrix_multiply 1000 $i 10
leti++
done
|
|
|
|
Tuning & Performancre : Memory Alignment for Vectorizatio
|
-
Memory Alignment for Vectorization :
The matrices size is dynamically allocated using posix_memalign(), and their sizes
must be specified via the length() clause. Using in, out and inout one can specify
which data has to be copied in on Intel Xeon-Phi Coprocessor from host.
-
Data alignment - 64-Byte : It is recommended that for Intel Xeon Phi data is 64-byte (512 Bits) aligned as
given in the MIC architecture.
-
Alignment using #pragma vector alignment :
For proper alignment of data to get performance using Intel compiler vectorization,
#pragma vector aligned is used. This tells the compiler that all array data accessed
in the loop is properly aligned.
-
In addition, the –std=c99 directive command-line option tells the compiler to allow use
of the restrict keyword and C99 VLAs. Note that C99 restrict keyword that specifies that
the vectors do not overlap. (Compile with -std=c99 is required for efficient vectorization)
-
For SSE, the data is aligned to 32 Bytes (256 Bits) for AVX and 16 Bytes (128 Bits) for
SSE and for MIC architecture, data should be aligned to 64 Bytes (512 Bits) for the
MIC architecture.
|
|
|
|
Tuning & Performance : Summary on Xeon Phi
|
Compiler Optimisation & Vectorization
-
Performance :
The Intel Xeon-Phi coprocessor deliver peak performance greater than 1 teraFLOP/s for
double precision (DP) floating point calculations and greater than 2 teraFLOP/s for
single precision calculations.
-
The effective use of compiler pragmas with vectorization and scale to achieve the best
performance results in terms of gigaFLOP/s for a given code is important. That is, the
code needs to use automatic compiler flags, vectorization & scale to achieve the
best performance. For more information on this, visit
hypack13-mode03-coprocessor-compiler-vect.html
-
To paralleize more loops in an application on the coprocessor than on an Intel Xeon based host,
the MIC new instructions such as fused multiply-add, masked vector instructions, gather/scatter etc.
can be used to improve the performance due to large SIMD width of 64 Bytes vectorization that
MIC architecture supports
-
The coprocessor supports a key performance enhancing capability of executing both a
multiplication and an addition, known as performance enhancing capability of
executing both a multiplicaiton and an addition, known as fused multiply and add (FMA) in
a single instruction without precision loss, enabling two floating point in one instruction.
Use FMA in Numerical Linear Algebra (NLA) Computations
-
The options for running the code will give information about the performance of
code. The details of the program i.e., inner loop of matrix-matrix Multiply
and the various compiler options do not indicate number of threads & the number of cores
used.
-
To achieve maximum performance of singel core out of maximum peak performance of single core
i.e., 35 giaFLOP/s (approximate), it is necessary to take full advantage of coprocessor's
optimisation features and programming paradigms such as OpenMP & Intel MKL libraries.
Only Single thread is used in this example and the maximum number of threads the Intel Xeon Phi Coprocessor
can directly manage per core is four.
-
The option -vec -report=3 gives to user, the information
about the compiler's choice for vectorizin portions of the code. (Note that the sign
(-) is optional in this compiler option.) For Vectorization, the
Loops and array processing are required.
-
In matrix-matrix multiply examples, the compiler was able to arrange chunks of arrays to be loaded into machine
registers or directly from memory via caches. Thus all 16 Single precision point point lanes
are used for simulataneous calculation.
-
The inner and outer loops are effectively vectorized for the code.
|
|
|
EXample Programs on Matrix Matrix Multiplication
|
Example. 1 :
Reduciton Operation for one dimensional integer array
(Download source code :
openmp4x-reduce-1darray.c,
env-setup.sh ,
Makefile_openmp4x
A simple example of how the constructs discussed below for OpenMP 4.0 can
be used to offload computation to Intel Xeon Phi coprocessor. This example shows a sum reduction operation being run on a Xeon Phi processor.
The offload computation “ reduce ” is first declared with “#pragma omp declare target”
clause.This causes the routine or its component to be executable on the coprocessor. The offload pragma,
“#pragma omp target map(inarray[0:SIZE]) map(sum)”
that causes specific code block lines of the code to be sent to coprocessor, for computing.
In this case it is computing the reduction of an array of numbers and returning the computing value
through the “sum” variable to the host. The
&lduo; inarray ” and the “sum” is copied in and out of the
coprocessor before and after the computation.
|
Example. 2 :
Reduciton Operation for two dimensional integer array
(Download source code :
openmp4x-reduce-2darray.c)
|
Example. 3 :
Vector Vector Multiplication
(Download source code :
openmp4x-vect-vect-multiply.c)
|
Example. 4 :
Matrix Matrix Addition
(Download source code :
openmp4x-mat-mat-addition.c)
|
Example. 5 :
Matrix Matrix Multiplication
(Download source code :
openmp4x-mat-mat-multiply.c)
|
Example. 6 :
Vector Vector Addtion - Scaling to more cores & Performance
(Download source code :
openmp4x-vect-vect-addition-perf.c)
|
Example. 7 :
Infinity Norm of a Square Matrix
(Download source code :
openmp4x-infinity-norm-matix.c)
|
Example. 8 :
Reduction Operation for one dimensional integer array using openmp distribute clause
(Download source code :
openmp4-0-distribute-1Darray-sum.c)
|
|
|
| |
|