



hyPACK-2013 : Mode-3 (Coprocessors) : Intel Xeon Phi Coprocessor Programming Environment

Understanding the Intel MIC architecture, the compiler and vectorization features, and the programming models for the Intel Xeon Phi coprocessor may enable programmers to achieve good performance for their applications. The Xeon Phi coprocessor can deliver over one teraflop of floating-point performance, and several paths, listed below, can be taken to reach that speed.

  • Offload work from the host processor to the Intel Xeon Phi coprocessor(s) using pragmas to augment existing codes

  • Use the coprocessor as a separate many-core Linux SMP compute node, recompiling source code to run directly on the coprocessor

  • Access the coprocessor as an accelerator through optimized libraries such as the Intel MKL (Math Kernel Library), and use the MKL thread affinity features

  • Use the OpenMP framework on the coprocessor together with the compiler vectorization features, expressing sufficient parallelism and vector capability to achieve high floating-point performance.


A description of the Intel Xeon Phi coprocessor hardware, together with information about the basic programming models, may assist the developer in porting applications with little effort. The pragma-based offload model and the use of the Intel Xeon Phi as an SMP processor are among the easiest approaches, since writing a program is similar to programming existing x86 systems. The challenge lies in expressing sufficient parallelism and vector capability to achieve high floating-point performance, as the Intel Xeon Phi coprocessors provide more than an order of magnitude increase in core count over current-generation dual-core and quad-core processors.

Topics dealing with the practical and experimental aspects of the various compiler and vector features covered in hyPACK-2013 are considered on Intel Xeon Phi coprocessors in order to achieve the best sustained performance of NLA (numerical linear algebra) and application kernels. The example programs are made available to the participants in the laboratory session. The hyPACK-2013 programme aims at understanding the practical aspects of performance enhancement through software multi-threading with the compiler and vector technology features of the Intel Xeon Phi coprocessor. Participants will get an opportunity to walk through and execute some of the programs designed for Mode-3 of this workshop. Information about porting codes, and strategies for analyzing and improving the performance of applications, is also discussed.


Intel Xeon Phi Compiler & Vectorization
( To know more about the usage of compilers & vectorization for codes, refer to the links above. )

The aggregate computational performance of the Intel Xeon Phi coprocessor is high, but each core is slow and has limited floating-point performance compared with modern multi-core processors such as the Intel Sandy Bridge processor. Most importantly, the high performance can be achieved only when a large number of parallel threads (from a minimum of 120 to a maximum of 240) are utilized, issuing instructions to the wide vector units quickly enough to keep the vector pipeline full. The current generation of coprocessor cores supports up to four concurrent threads of execution via hyper-threading.

The Intel Xeon Phi compiler technology assists developers in implementing vectorization in data parallel codes. For such codes, the compiler recognizes the independent chunks of computation and issues the special Intel Xeon Phi wide vector instructions to the per-core vector units. It is also possible to access the vector units through compiler intrinsic operations or assembly language. In general, the best floating-point performance for a data parallel application is realized when each core runs two threads that actively issue instructions to the vector unit; whether two threads per core keep the vector unit busy and improve performance depends on the data access pattern and on the type and amount of work performed by each thread before it issues a vector operation.
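As a minimal illustration (the function name and the reporting option are illustrative, not taken from the workshop material), a data parallel loop that the compiler can auto-vectorize for the wide vector units might look like this:

    /* saxpy-style loop: restrict asserts that x and y do not alias, so the
     * compiler is free to issue wide vector instructions for the loop body.
     * Compiling with a vectorization report (e.g., icc -vec-report2) shows
     * whether the loop was vectorized. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }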

The key to Intel Xeon Phi floating-point performance for data parallel applications is the efficient use of the per-core vector unit. To access the vector unit, the compiler must be able to recognize SSE-compatible constructs so that it can generate the special Intel Xeon Phi vector instructions. Most importantly, the code should have vectorizable features at different granularities. A quick way to gauge whether a data parallel application will benefit from the Xeon Phi floating-point capability is simply to tell the compiler to utilize the SSE instructions on a current x86 processor (through the GNU -msse or another compiler switch).

Applications that run faster with SSE (or, conversely, slow down when the use of SSE instructions is disabled) will likely benefit from the Intel Xeon Phi wide vector unit. Applications that do not benefit from the SSE instruction set and vector operations can still gain from the coprocessor: for such applications, the Intel Xeon Phi can be used as a support device that provides many-core parallelism and high memory bandwidth.


Intel Xeon Phi Prog. Env. : Thread Affinity - Tuning & Performance

Xeon Phi Programming Environment

A brief summary of the various programming paradigms is given below. To extract the maximum achievable performance out of the 60 cores of the Intel Xeon Phi coprocessor, tuning and optimisation with the help of software threading are required.

( To know more about the usage of compilers & vectorization for applications, visit
Prog. Env. : Matrix Computations on Xeon-Phi using OpenMP framework )

  • OpenMP is a familiar model that uses pragmas to annotate developer code so that it can be parallelized by an OpenMP-compliant compiler. A simple matrix multiply algorithm using the OpenMP framework demonstrates the average native runtime performance (see the sketch after this list). The thread count can be varied to obtain maximum performance with appropriate OpenMP optimisation pragmas. The Intel Xeon Phi coprocessor is used as a Linux SMP computer.

  • The number of threads in an OpenMP environment and their mapping onto the cores of the Intel Xeon Phi coprocessor play an important role in achieving the maximum performance of developer code. The KMP_AFFINITY environment variable specifies the thread-to-core affinity. There are three preset schemes (compact, scatter, and balanced), and the user can also explicitly define the affinity that works best for the application. The choice of affinity scheme depends upon the memory access, the data sharing, and the workload of each thread affinitized to a core. The default runtime thread affinity can also be used, but it may change between software releases; for consistent application performance across software releases, do not rely on the default affinity scheme.

  • Compact tries to use the minimum number of cores by pinning four threads to a core before filling the next core

  • Scatter tries to evenly distribute threads across all cores

  • Balanced tries to evenly scatter threads across all cores such that adjacent threads (sequential thread numbers) are pinned to the same core. One caveat: here "all cores" refers to the total number of cores minus one, because one core is reserved for the operating system during an offload. Interested readers can find more about the affinitization schemes in the Intel compiler documentation.
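A minimal sketch of the native matrix multiply experiment mentioned above (the matrix size and loop structure are illustrative, not taken from the workshop hand-outs); the affinity scheme is selected at run time through KMP_AFFINITY:

    /* Build natively for the coprocessor (e.g., icc -mmic -openmp) and run with:
     *   export OMP_NUM_THREADS=240
     *   export KMP_AFFINITY=balanced     (or compact / scatter)
     */
    #include <stdio.h>
    #include <omp.h>

    #define N 512

    static float A[N][N], B[N][N], C[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = 1.0f; B[i][j] = 2.0f; C[i][j] = 0.0f;
            }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)    /* unit-stride inner loop vectorizes */
                    C[i][j] += A[i][k] * B[k][j];
        printf("%d threads: %.3f s\n", omp_get_max_threads(), omp_get_wtime() - t0);
        return 0;
    }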

Performance variations for a given code may occur due to system daemons and multiple user processes using a large number of cores.


Intel Xeon Phi : Offload Programming Tuning & Performance

The Intel Xeon Phi coprocessor programming environment provides the "offload" pragma, which supplies additional annotation so that the compiler can correctly move data to and from the external Xeon Phi card. Note that single or multiple OpenMP loops can be contained within the scope of the offload directive.

( To know more about the usage of compilers & vectorization for applications, visit
Prog. Env. : Matrix Computations on Xeon-Phi using OpenMP framework )
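A minimal sketch of an offloaded OpenMP loop (the array names and sizes are illustrative; the device number MIC_DEV is assumed to be 0):

    #include <stdio.h>

    #define MIC_DEV 0
    #define N 1024

    int main(void)
    {
        float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* in() copies a and b to the card; out() copies c back afterwards. */
        #pragma offload target(mic:MIC_DEV) in(a, b) out(c)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }

        printf("c[0] = %f\n", c[0]);
        return 0;
    }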

The clauses are interpreted as follows:

Offload: The offload pragma keyword specifies different clauses that contain information relevant to offloading to the target device.

target(mic:MIC_DEV) is the target clause that tells the compiler to generate code for both the host processor and the specified offload device, i.e., the Xeon Phi coprocessor. The constant parameter MIC_DEV is an integer associated with a Xeon Phi device. Note that the offload runtime performs different operations as required:

  • The offload runtime will schedule offload work within a single application in a round-robin fashion, which can be useful to share the workload amongst multiple devices.

  • The offload runtime will utilize the host processor when no coprocessors are present and no device number is specified (for example, target(mic)).

  • Programmers can also use _Offload_to to specify a device in their code.

It is the responsibility of the programmer to ensure that any persistent data resides on all the devices. During round-robin scheduling, keeping the persistent data resident on all the devices is important from a performance point of view and to avoid PCIe bottlenecks. In general, use persistent data only when the device number is specified.
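A minimal persistence sketch (the function and buffer names are illustrative): the alloc_if/free_if modifiers keep a matrix resident on the card across offloads, so only the smaller vectors cross the PCIe bus in the second offload:

    #define MIC_DEV 0

    void reuse_matrix(float *A, float *b1, float *b2, float *c, int n)
    {
        /* First offload: copy A to the card and keep it there (free_if(0)). */
        #pragma offload target(mic:MIC_DEV) \
                in(A:length(n*n) alloc_if(1) free_if(0)) \
                in(b1:length(n)) out(c:length(n))
        {
            /* ... first operation using A and b1 ... */
        }

        /* Second offload: A is already resident, so nocopy skips the transfer;
         * free_if(1) releases the device copy when this offload completes. */
        #pragma offload target(mic:MIC_DEV) \
                nocopy(A:length(n*n) alloc_if(0) free_if(1)) \
                in(b2:length(n)) out(c:length(n))
        {
            /* ... second operation using the same A ... */
        }
    }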

Scaling : The OpenMP source code can be compiled without modification by the Intel compiler to run in the following modes:

  • Native: The entire application runs on the Intel Xeon Phi.

  • Offload: The host processor runs the application and offloads compute intensive code and associated data to the device as specified by the programmer via pragmas in the source code.

  • Host: Run the code as a traditional OpenMP application on the host.

The following command-line arguments are required to set the operating mode:

  • -no-offload: Ignore any offload directives

  • -offload-build: Create offload regions according to the directives in the source code.

  • -mmic: Build the executable for MIC.

  • In native mode, linking against the libiomp5 library is also required. In addition, the -mkl command-line option tells the compiler to utilize MKL, while the -std=c99 option allows use of the restrict keyword and C99 VLAs.
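For example, illustrative compiler invocations for the three modes (the source file name is hypothetical, and exact option spellings may vary between compiler releases):

    icc -openmp -no-offload -std=c99 mmul.c -o mmul_host             (host-only OpenMP run)
    icc -openmp -offload-build -mkl -std=c99 mmul.c -o mmul_offload  (host run with offload regions)
    icc -openmp -mmic -mkl -std=c99 mmul.c -o mmul_native            (native build; links libiomp5)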

HPC Cluster with Intel Xeon Phi Coprocessors:
  • Write your own program for NLA kernel codes using the auto-parallelisation features of the Xeon-Phi coprocessor. Analyze the compiler-generated optimization reports for various problem sizes of typical matrix-matrix multiplication algorithms and obtain the maximum achievable performance.

  • Write your own program for NLA kernel codes, with or without use of the Intel MKL libraries, using the Intel compiler (loop optimization pragmas/directives) with automatic offload and compiler-assisted offload.

  • Write your own software modules for NLA kernels using the compiler auto-parallelization features of the Intel Xeon-Phi and analyze the GAP-generated optimization reports. Summarize the performance and scalability issues for various problem sizes of your code.

  • Write your own Matrix Multiply Code using OpenMP Pragmas based on OpenMP thread affinity on Intel Xeon Phi Coprocessor.

  • Write your own Matrix Multiply Code using Intel MKL Thread Affinity on Intel Xeon-Phi Coprocessors

  • Write your own software modules for NLA kernels using the various clauses of SIMD directives. Analyze the vectorization reports and summarize the performance issues for different problem sizes.

  • Write your own suite of programs for NLA kernels (vector-vector addition, matrix-matrix addition) using the vector-aligned data features of the Intel Xeon-Phi via declspec(align(*)). Analyze the vectorization reports and summarize the performance issues for different problem sizes of your code. You can use SIMD and IVDEP directives/pragmas to assist vectorization.

  • Obtain the performance for vector-vector multiplication and matrix-matrix multiplication using the Intel MKL libraries on Intel Xeon-Phi coprocessors, with automatic offload and compiler-assisted offload.

  • Write your own software modules for NLA kernels using Intel MKL with (a) compiler-assisted offload and (b) reuse of data that already exists in the memory of the coprocessor, which helps to reduce data transfer, for example by performing multiple operations on a single set of input matrices.

  • Write your own program for NLA kernels with and without array operations using vectorization features

  • Write your own program for Matrix-Matrix Multiplication based on block-partitioning of the input matrices and use Xeon-Phi programming environment features such as (a) allocated persistent storage on the coprocessor, (b) asynchronous data transfer from the coprocessor to the processor, and (c) double-buffered inputs to an offload.

  • Write your own program to perform large scale I/O operations and quantify the overheads.

  • Write your own program to measure copy-memory bandwidth using OpenMP or Pthreads on 8/16/32 cores of the Intel Xeon-Phi with different work-loads, and analyze the performance.

  • Obtain the performance of the STREAM OpenMP benchmark on the Intel Xeon-Phi and compare it with the output of the previous example under different programming paradigms.

  • Write your own program to measure latency, bandwidth and quantify overheads using MPI point-to-point and Collective communications on Intel Xeon-Phi Coprocessors in a Message Passing Cluster with different message sizes & analyze the performance

  • Write your own software modules for NLA (SGEMM/DGEMM) kernel codes using OpenMP with memory allocated on the heap aligned to a 64-byte boundary, and analyze the performance and scalability issues (use "#pragma vector aligned", "#pragma ivdep", and posix_memalign for dynamic memory alignment; see the sketch after this list).

  • Write your own program to analyze the CPU time, the Xeon-Phi time, the CPU-to-Xeon-Phi data transfer time, and the Xeon-Phi-to-CPU data transfer time; quantify the time taken for different problem sizes with respect to the number of OpenMP threads used, and understand data transfers over the PCIe bus from the host to the accelerator and vice versa.

  • Write your own codes for NLA kernels & a PDE solver using MPI-OpenMP (with and without Collapse) and loop un-rolling (nested loops) with vectorization (ivdep and vector aligned); use the four different kinds of loop scheduling supported by OpenMP.

  • Write your own program for an implementation of a PDE solver using the Finite Difference Method (FDM) with OpenMP and MPI. The computations are performed on the host and the coprocessor.

  • Write your own program for an implementation of a PDE solver using the Finite Element Method (FEM) in two-dimensional regions with MPI-OpenMP, in which the computations are performed on the host and the coprocessor. Use features such as overlapping computation and communication, asynchronous transfer, and double buffering.

  • Write your own program for NLA kernels and an implementation of a PDE solver by FDM in 2D regions using MPI-OpenMP, in which the computations are performed using MIC_KMP_AFFINITY=verbose with granularity=fine and the scatter, compact, and balanced schemes.

  • Write your own program for NLA kernels and an implementation of a PDE solver by FDM in 2D regions, tuning the performance of the OpenMP codes on the Xeon-Phi by modifying the stack size.

  • Write your own program for an implementation of a PDE solver using the Finite Difference Method (FDM) with MPI, OpenMP, and the combination of MPI-OpenMP, performing the computations on the host and the coprocessor. The software module should use larger 2MB pages; larger pages matter for this floating-point dominated FDM application because it performs array operations.

  • Resource Availability on the Xeon Phi Coprocessor : Coordinating coprocessor resource usage among the MPI ranks is important from a performance point of view. Three commonly useful methods are:

  • Running only one MPI rank per host, so there is no chance of multiple ranks offloading to the same Phi coprocessor

  • Heterogeneous: Running multiple MPI ranks per host but arranging the processes so that only a single rank offloads to the same Phi.

  • Explicit Pinning: Setting the pinning on a per-process basis to allow control of where each thread is offloaded.
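A minimal sketch of the 64-byte aligned heap allocation referred to in the NLA kernel exercise above (the sizes and the triad-style loop body are illustrative):

    #define _POSIX_C_SOURCE 200112L   /* for posix_memalign under -std=c99 */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1024;
        double *x, *y;

        /* 64-byte alignment matches the coprocessor's 512-bit vector width. */
        posix_memalign((void **)&x, 64, n * sizeof(double));
        posix_memalign((void **)&y, 64, n * sizeof(double));

        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        #pragma ivdep              /* assert: no loop-carried dependences */
        #pragma vector aligned     /* assert: x and y are 64-byte aligned */
        for (int i = 0; i < n; i++)
            y[i] += 3.0 * x[i];

        printf("y[0] = %f\n", y[0]);
        free(x);
        free(y);
        return 0;
    }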


Centre for Development of Advanced Computing