-
Code Swapping: If a code tries to use more physical memory than the system has,
the OS takes data that is in memory and copies it to disk. When the OS needs that data again, it copies
it back into memory and moves something else from memory out to disk. Swapping kills performance: typically
the wall-clock time can increase by a factor of 10 if the code swaps a great deal. So, if your code seems slow,
check whether the processes are swapping.
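A rough way to check this from inside the program itself is to look at the major page fault count reported by getrusage(); major faults require disk I/O, so a rapidly growing count during a run is a strong hint that the node is swapping. The following is a minimal sketch, not part of the hyPACK material:

    /* Report page-fault counts for the calling process; a large and growing
       ru_majflt value while the code runs suggests swapping. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        /* ... run a phase of the computation here ... */

        if (getrusage(RUSAGE_SELF, &ru) == 0)
            printf("major page faults: %ld, minor page faults: %ld\n",
                   ru.ru_majflt, ru.ru_minflt);
        return 0;
    }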
-
Measure Wall clock time:
Most current CPUs and chipsets can measure many aspects of the system. Quantities such as the number of
floating-point operations, the number of L1 cache misses, etc. are counted in hardware and made available to the OS.
There are patches to the Linux kernel that give the OS access to these counters, and software packages
such as PAPI (visit hyPACK 2013 software tools:
hypack13-mode01-multicore-software-tools.html)
can be used to report this information to the user.
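As a hedged illustration, assuming a PAPI installation that still provides the classic high-level calls PAPI_start_counters/PAPI_stop_counters (deprecated in recent PAPI releases), counters such as floating-point operations and L1 data-cache misses can be read around a kernel like this:

    /* Count FP operations and L1 data-cache misses around a small kernel.
       Build with something like: gcc papi_demo.c -lpapi */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_FP_OPS, PAPI_L1_DCM };
        long long values[2];
        double a = 0.0;

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        for (int i = 1; i <= 1000000; i++)      /* kernel being measured */
            a += 1.0 / i;

        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        printf("result=%g  FP ops=%lld  L1 data-cache misses=%lld\n",
               a, values[0], values[1]);
        return 0;
    }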
-
Using All the Cores? : The command used to run a multi-core Pthreads code lets you specify the number of cores
to run on. Thread affinity (thread binding) on multi-core processors, and the scripts used by a cluster's
queuing/scheduling packages, play an important role in performance.
The most common problem is that all of the processes end up running on a single node of the cluster rather than
being distributed across its nodes; the multi-core node that is actually running all the processes will then start
to swap. Dual-socket quad-core, quad-socket quad-core, and many-core systems have their own overheads arising from
how threading is handled at the system and hardware levels.
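A minimal sketch of thread binding on Linux is given below; it uses the GNU extension pthread_attr_setaffinity_np, and the thread count and core numbering are illustrative assumptions:

    /* Pin each Pthread to its own core so that all cores are used and
       threads do not migrate between sockets. Build with: gcc -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 4   /* assumed: at least NTHREADS cores available */

    static void *worker(void *arg)
    {
        long id = (long) arg;
        /* ... compute kernel for this thread ... */
        printf("thread %ld running on CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        pthread_attr_t attr;
        cpu_set_t set;

        for (long i = 0; i < NTHREADS; i++) {
            CPU_ZERO(&set);
            CPU_SET((int) i, &set);          /* bind thread i to core i */
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&tid[i], &attr, worker, (void *) i);
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }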
Tuning Options (Multi-Core Processors & Clusters)
-
Hardware:
Current Processors (Intel, AMD, IBM, Cray, SGI, Sun)
1. Processor Choice
2. Interconnect Choice
3. Network Card Choice (in some cases)
4. IO Subsystem
Several issues need to be addressed from the multi-core point of view as well as the cluster point of view.
Most importantly, the following play an important role for multi-core processors:
1. Selection of Processor Type
2. IO Subsystem
-
Software:
1. Compiler Choice
2. Compiler Options
3. MPI Choice
4. MPI Tuning
5. Tools
6. Mixed Programming Environment
Remarks: Many choices ↦ many test cases exist for tuning and performance,
and the right combination often depends upon the characteristics of the application. Many combinations exist
for tuning; some of the options to consider are given below:
-
Multi-core Processors
Processors: Intel Xeon, AMD Opteron, IBM Power5/Power6 (memory per node; L1, L2, L3 cache)
Compilers: Opteron: PGI, Pathscale, Intel;
EM64T (Extended Memory 64 Technology, also known as Intel 64 or x64) and AMD64: PGI, Intel
Compile Options: five options per compiler (-O0, -O1, -O2, -O3, aggressive)
Thread Affinity on Many-Socket Many-Core processors
Tuned Mathematical Libraries
-
Cluster of Multi-core Processors
1. Processors: Intel Xeon, AMD Opteron, IBM Power5/Power6
2. Compilers and Compiler Options
3. Interconnects + MPI Options
   - GigE (MPICH1, MPICH2, LAM, Scali MPI Connect)
   - Myrinet (MPICH-GM, Scali MPI Connect)
   - Infiniband (MVAPICH, Scali MPI Connect)
4. MPI tuning: more than 20 options
5. Range of Processor Count: 2 to 4/8/16 CPUs on multi-socket/multi-core processors
Compilers and Compiler Options :
Open-source Compilers
-
There are a number of compilers available for Linux. Some are open source, courtesy of the GNU compiler
series: gcc (C compiler), g77 (older Fortran 77 compiler), g95 (Fortran 95 compiler that uses the gcc backend),
and gfortran (the newer Fortran 95 compiler in the GNU compiler series).
-
The major commercial compilers include the IBM, PGI, Pathscale EKO, Intel, and Absoft compilers.
Their optimization levels usually range from -O0 to -O3, with -O2 and -O3 being the most commonly used.
Most compilers also allow even more aggressive levels of optimization, usually referred to as
"aggressive" optimization.
-
Vendors use various ways of defining compiler optimization levels.
For example, one vendor's -O2 may include optimizations that other vendors only enable at -O3.
Consequently, it is very difficult to compare the same optimization levels across compilers.
Remark: For many classes of applications, the highest levels of compiler optimization are not always the best.
Tuning on Multi-core Processors
-
Use compiler switches and explore the performance; localize the data; improve cache utilization
Use a profiler to understand the behaviour of the code and find the hotspots in the code
Use the Linux tool "top" to check CPU and memory utilization, as well as scalability
with respect to varying problem sizes
Check the threading APIs used; watch for lock and heap contention (a sketch of reducing lock contention follows this list)
Set thread affinity and explore the performance
Optimize the sequential code; use tuned libraries
Check for swapping (is the code swapping?) using tools such as "top"
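One common way to reduce the lock contention mentioned above is to give each thread its own accumulator and combine the results once at the end, instead of having every thread update a single mutex-protected variable. The sketch below is illustrative only (thread count, padding size, and the work loop are assumptions):

    /* Per-thread partial sums with cache-line padding to avoid false
       sharing; the combine step at the end needs no locks. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N        1000000

    struct padded { double sum; char pad[64 - sizeof(double)]; };
    static struct padded partial[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long) arg;
        for (long i = id; i < N; i += NTHREADS)   /* each thread owns a stride */
            partial[id].sum += 1.0 / (i + 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *) i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++)        /* lock-free combine step */
            total += partial[i].sum;
        printf("total = %g\n", total);
        return 0;
    }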
Choice of MPI
After selecting a compiler, and to some degree a processor, the next step is to try various MPI
implementations.
There are a large number of MPI implementations, both open source and commercial. The full list is quite
long, but the important MPI libraries are listed below.
Open-source MPI :
MPICH1, MPICH2, LAM, Open-MPI, GAMMA-MPI, FT-MPI, LA-MPI, PACX-MPI, MVAPICH
OOMPI, MPICH-GM, MVICH, MP_Lite
Commercial MPI :
MPI/Pro, Scali MPI Connect, HP-MPI, Intel MPI, IBM MPI
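A simple way to compare these implementations on a given cluster is a ping-pong micro-benchmark that can be compiled unchanged with each library's mpicc (MPICH, LAM, OpenMPI, MVAPICH, a vendor MPI, ...). The message size and repetition count below are illustrative assumptions:

    /* Ping-pong between ranks 0 and 1: measures round-trip time and
       bandwidth. Run with at least 2 MPI processes. */
    #include <stdio.h>
    #include <mpi.h>

    #define NBYTES 1048576   /* 1 MB message */
    #define REPS   100

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg round trip: %g s, bandwidth: %g MB/s\n",
                   (t1 - t0) / REPS,
                   2.0 * NBYTES * REPS / (t1 - t0) / 1e6);

        MPI_Finalize();
        return 0;
    }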
-
MPICH1 :
This is, in effect, the "reference" implementation of MPI. It is a very easy MPI to build and use.
Many other MPIs, such as MPICH-GM, HP-MPI, and Intel MPI, are based on this implementation.
-
MPICH2 :
This is an improved version of MPICH1 that is faster and also adds most of the features of the
MPI-2 standard.
-
LAM :
LAM is an alternative to the MPICH line of MPI libraries. It uses a daemon-based method for starting
MPI codes.
-
OpenMPI :
OpenMPI is an interesting project because it combines the best features of LAM, FT-MPI, LA-MPI,
and PACX-MPI. It supports TCP, Myrinet (gm and mx), and Infiniband networks. An interesting
feature of OpenMPI is the addition of the fault-tolerance capability of FT-MPI, which allows an MPI
code to lose a node and then add a new node to finish the computation without loss of data.
-
MVAPICH :
MVAPICH is being developed at Ohio State University. It basically is a port of MPICH to Infiniband
with some changes to take advantage of Infiniband.
-
Scali MPI Connect :
Scali MPI Connect is a commercial MPI implementation with a large number of features. A single
binary built with Scali MPI Connect can run on TCP networks, Myrinet, and Infiniband without
recompilation. Scali MPI Connect also has a network failover capability: if you are running on
a high-speed network such as Myrinet and you lose a network connection, Scali switches the MPI
code over to an alternative TCP network so that your code continues to run without any loss of data.
This is the only MPI that I'm aware of that can do this.
-
MP_Lite :
MP_Lite is not a full-fledged MPI library, but rather a subset. The designers of MP_Lite realized that
the vast majority of MPI codes use only a small subset of the available MPI functions, so they wrote
MP_Lite to focus on this small subset. This allows them to concentrate on performance
(low latency and high bandwidth) for these functions. MP_Lite is also one of the few MPI libraries
that take advantage of channel bonding across multiple GigE networks.
The recommended MPI implementations to test for several classes of applications are as follows:
MPICH1
MPICH2
LAM
MPICH-GM (for Myrinet)
MVAPICH (for Infiniband)
At least one commercial MPI