



Tuning and Performance/Benchmarks on Multi-Core Processors

Tuning the performance of application programs on multi-core processors using compiler optimization techniques and code-restructuring techniques is challenging. Understanding the programming paradigms (MPI, OpenMP, Pthreads), making effective use of the right compiler optimization flags, and obtaining correct results for a given application are all important. Enhancing performance and scalability on multi-core processors as the problem size increases requires serious effort. Several optimization techniques are discussed below.


Tips for Tuning & Performance



It is important to understand the issues that affect tuning and performance of an application or commercial code. Some of these are listed below.

  • Code Swapping: If the code tries to use more physical memory than the system has, the OS moves data from memory out to disk; when that data is needed again, it is copied back into memory and something else is pushed out to disk. Swapping kills performance: wall-clock time can easily increase by a factor of 10 if the code swaps heavily. So, if your code seems slow, check whether the processes are swapping (a simple in-code check is sketched after this list).

  • Measure Wall-Clock Time and Hardware Counters: Most current CPUs and chipsets can measure many aspects of the system. Quantities such as the number of floating-point operations and the number of L1 cache misses are counted and made available to the OS. There are patches to the Linux kernel that give the OS access to these counters, and software packages such as PAPI (see the hyPACK 2013 software tools page, hypack13-mode01-multicore-software-tools.html) can report this information to the user (a small PAPI sketch is also given after this list).

  • Using All the Cores? : The command used to run a multi-core Pthreads code lets you specify the number of cores to run on. Thread affinity (thread binding) on multi-core processors, and the scripts used by cluster queuing/scheduling packages, play an important role in performance. The most common problem is that all of the processes end up running on a single node of the cluster instead of being distributed across its nodes; the multi-core node that is actually running all the processes then starts to swap. Dual-socket quad-core, quad-socket quad-core, and many-core systems also have their own overheads arising from how threading is handled by the system software and by the hardware (a thread-binding sketch follows this list).
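
A minimal sketch of the swapping check mentioned in the first tip above, assuming a Linux system and written in C: getrusage() reports the number of major page faults (faults that required disk I/O), which is a strong hint that the process is swapping. The array size used here is purely illustrative.

    /* Sketch: detect paging/swapping from inside the code via getrusage(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Touch a large array; shrink n if the machine has little memory. */
        size_t n = (size_t)1 << 27;              /* ~1 GB of doubles (assumed size) */
        double *a = malloc(n * sizeof(double));
        if (a == NULL) { perror("malloc"); return 1; }

        for (size_t i = 0; i < n; i++)
            a[i] = (double)i;

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        /* ru_majflt counts page faults that needed disk I/O, i.e. paging/swapping. */
        printf("major page faults (paging to disk): %ld\n", ru.ru_majflt);
        printf("minor page faults                  : %ld\n", ru.ru_minflt);

        free(a);
        return 0;
    }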
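
For the hardware-counter tip, a small sketch using the PAPI low-level API is given below, assuming PAPI is installed and the code is linked with -lpapi. The preset events PAPI_FP_OPS and PAPI_L1_DCM may not be available on every processor, and the kernel being measured is just an example.

    /* Sketch: count FP operations and L1 data-cache misses with PAPI. */
    #include <stdio.h>
    #include <papi.h>

    #define N 1000000
    static double a[N], b[N], c[N];

    int main(void)
    {
        long long counts[2] = { 0, 0 };
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_FP_OPS);     /* floating-point operations */
        PAPI_add_event(evset, PAPI_L1_DCM);     /* L1 data-cache misses      */

        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        long long t0 = PAPI_get_real_usec();    /* wall-clock time in microseconds */
        PAPI_start(evset);

        for (int i = 0; i < N; i++)             /* the kernel being measured */
            c[i] = a[i] * b[i] + a[i];

        PAPI_stop(evset, counts);
        long long t1 = PAPI_get_real_usec();

        printf("wall-clock time : %lld us\n", t1 - t0);
        printf("FP operations   : %lld\n", counts[0]);
        printf("L1 data misses  : %lld\n", counts[1]);
        return 0;
    }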
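
For the thread-affinity tip, the following sketch binds each Pthread to one core using the GNU extension pthread_attr_setaffinity_np(). The one-thread-per-core mapping and the thread count are illustrative assumptions, not part of the original material; the code is Linux-specific.

    /* Sketch: pin each Pthread to its own core before it starts running. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>
    #include <sched.h>

    #define NTHREADS 4

    static void *work(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld running on core %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (long t = 0; t < NTHREADS; t++) {
            /* Build a CPU set containing only core t: one thread per core. */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)t, &set);

            pthread_attr_t attr;
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &set);

            pthread_create(&tid[t], &attr, work, (void *)t);
            pthread_attr_destroy(&attr);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }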

Tuning Options (Multi-Core Processors & Clusters)
  • Hardware:
    Current Processors (Intel, AMD, IBM, Cray, SGI, Sun)
    1. Processor Choice
    2. Interconnect Choice
    3. Network Card Choice (in some cases)
    4. I/O Subsystem
    Several issues need to be addressed from the multi-core point of view as well as the cluster point of view. Most importantly, the following play an important role for multi-core processors:

    1. Selection of Processor Type
    2. I/O Subsystem

  • Software:
    1. Compiler Choice
    2. Compiler Options
    3. MPI Choice
    4. MPI Tuning
    5. Tools
    6. Mixed Programming Environment
Remarks: Many choices mean many test cases exist for tuning and performance, and the right combination often depends upon the characteristics of the application. Many combinations exist for tuning; some of the options to consider are given below:

  • Multi-core Processors

    Processors: Intel Xeon, AMD Opteron, IBM Power 5 /Power 6 (Memory per node, L0, L1, L2, L3 Cache)

    Compilers: AMD Opteron (AMD64): PGI, Pathscale, Intel; Intel EM64T (Extended Memory 64 Technology, also called Intel 64 or x64): PGI, Intel

    Compile Options: five (5) options for each compiler (-O, -O1, -O2, -O3, aggressive)

    Thread Affinity on Many-Socket Many-Core processors

    Tuned Mathematical Libraries

  • Cluster of Multi-core Processors

    1. Processors: Intel Xeon, AMD Opteron, IBM Power 5/Power 6

    2. Compilers and Compiler Options

    3. Interconnects + MPI Options:
       GigE (MPICH1, MPICH2, LAM, Scali MPI Connect)
       Myrinet (MPICH-GM, Scali MPI Connect)
       Infiniband (MVAPICH, Scali MPI Connect)

    4. MPI Tuning: more than 20 options

    5. Range of Processor Counts: 2 to 4/8/16 CPUs on multi-socket/multi-core processors

Compilers and Compiler Options: Open-Source and Commercial Compilers

  • There are a number of compilers available for Linux. Some are open source, courtesy of the GNU series: gcc (C compiler), g77 (the older Fortran 77 compiler), g95 (a Fortran 95 compiler that uses the gcc back end), and gfortran (the newer Fortran 95 compiler in the GNU compiler series).

  • The major commercial compilers are the IBM, PGI, Pathscale EKO, Intel, and Absoft compilers. Their optimization levels usually go from -O0 to -O3; the most commonly used levels are -O2 and -O3. Most compilers also allow even more aggressive levels of optimization, often referred to simply as "aggressive" optimization.

  • Vendors use various ways of defining their compiler options. For example, one vendor's -O2 may include optimizations that other vendors enable only at -O3. Consequently, it is very difficult to compare the same optimization levels among compilers.

  • Remark: For many classes of applications the highest levels of compiler optimization are not always the best (an illustrative kernel for comparing optimization levels is sketched below).
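
As an illustration of the remark above, the small kernel below can be compiled at different optimization levels and timed. The file name and the compile lines in the leading comment are assumed examples (adapt them to the compiler in use), and actual speed-ups depend on the compiler and the processor.

    /* Illustrative triad-style kernel (assumed example) for comparing levels:
     *     gcc -O0 triad.c -o triad                  (baseline, no optimization)
     *     gcc -O2 triad.c -o triad                  (common "safe" level)
     *     gcc -O3 -funroll-loops triad.c -o triad   (more aggressive)
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) return 1;
        for (int i = 0; i < N; i++) { b[i] = 1.5; c[i] = 2.5; }

        clock_t t0 = clock();
        for (int i = 0; i < N; i++)      /* candidate loop for vectorization/unrolling */
            a[i] = b[i] + 3.0 * c[i];
        clock_t t1 = clock();

        printf("a[N-1] = %f, time = %.3f s\n", a[N - 1],
               (double)(t1 - t0) / CLOCKS_PER_SEC);
        free(a); free(b); free(c);
        return 0;
    }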

Tuning on Multi-core Processors

  • Use Compiler Switches & explore performance - Localize the Data - Cache Utilization

    Use a Profiler to understand the behaviour of the code and the hotspots in the code

    Use the Linux tool "top" to observe CPU & Memory Utilization, as well as Scalability with respect to varying problem sizes

    Check the Threading APIs used; look for Lock & Heap contention (see the sketch after this list)

    Thread Affinity - Explore the performance

    Sequential code Optimization - Use tuned libraries

    Check for Swapping (is the code Swapping?) - Use the "top" tool
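
As a concrete example of the lock & heap contention item in the list above, the following sketch contrasts a contended OpenMP critical section with a reduction clause. The array size and the "slow"/"fast" naming are assumed examples; compile with an OpenMP-enabled compiler (e.g. gcc -fopenmp -O2).

    /* Sketch: removing lock contention by giving each thread a private partial sum. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        if (a == NULL) return 1;
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Contended version: every iteration takes the same lock. */
        double slow = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            #pragma omp critical
            slow += a[i];
        }

        /* Tuned version: per-thread partial sums, combined once at the end. */
        double fast = 0.0;
        #pragma omp parallel for reduction(+:fast)
        for (int i = 0; i < N; i++)
            fast += a[i];

        printf("slow = %g, fast = %g, threads = %d\n",
               slow, fast, omp_get_max_threads());
        free(a);
        return 0;
    }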


Choice of MPI

After selecting a compiler and, to some degree, a processor, the next step is to try various MPI implementations. There are a large number of MPI implementations, both open source and commercial. The list can be quite long, but the important MPI libraries that are available are listed below.

Open-source MPI :

MPICH1, MPICH2, LAM, Open-MPI, GAMMA-MPI, FT-MPI, LA-MPI, PACX-MPI, MVAPICH, OOMPI, MPICH-GM, MVICH, MP_Lite
Commercial MPI :

MPI/Pro, Scali MPI Connect, HP-MPI, Intel MPI, IBM MPI

  • MPICH1 : This is, in effect, the "reference" implementation of MPI. It is a very easy MPI to build and use. Many other MPIs, such as MPICH-GM, HP-MPI, and Intel MPI, are based on this implementation.

  • MPICH2 : This is an improved version of MPICH1 that is faster and also adds most of the features of the MPI-2 standard.

  • LAM : LAM is an alternative to the MPICH line of MPI libraries. It uses a daemon-based method for starting MPI codes.

  • OpenMPI : OpenMPI is an interesting project because it combines the best features of LAM, FT-MPI, LA-MPI, and PACX-MPI. It supports TCP, Myrinet (gm and mx), and Infiniband networks. An interesting feature of OpenMPI is the addition of the fault-tolerance capability of FT-MPI, which allows an MPI code to lose a node and then add a new node to finish the computation without loss of data.

  • MVAPICH : MVAPICH is being developed at Ohio State University. It basically is a port of MPICH to Infiniband with some changes to take advantage of Infiniband.

  • Scali MPI Connect : Scali MPI Connect is a commercial MPI implementation with a large number of features. A single binary built with Scali MPI Connect can be run on TCP networks, Myrinet, and Infiniband without recompiling. Scali MPI Connect also has a network-failover capability: if you are running on a high-speed network such as Myrinet and lose a network connection, Scali switches the MPI code to an alternative TCP network so that it continues to run without any loss of data. This is the only MPI that I'm aware of which can do this.

  • MP_Lite : MP_Lite is not a full-fledged MPI library but rather a subset. The designers of MP_Lite realized that the vast majority of MPI codes use only a small subset of the available MPI functions, so they wrote MP_Lite to focus on this subset. This allows them to concentrate on performance (low latency and high bandwidth) for these functions. MP_Lite is also one of the few MPI libraries that take advantage of channel bonding across multiple GigE networks.

The recommended MPI implementations to test for several classes of applications are as follows (a simple ping-pong latency sketch for comparing them is given after this list):

  • MPICH1

  • MPICH2

  • LAM

  • MPICH-GM (for Myrinet)

  • MVAPICH (for Infiniband)

  • At least one commercial MPI
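
A simple way to compare these MPI implementations is a ping-pong latency test. The sketch below is an assumed example (file name, message size, and iteration count are illustrative), not part of the original material; run it with exactly two ranks, e.g. mpicc -O2 pingpong.c -o pingpong && mpirun -np 2 ./pingpong.

    /* Sketch: ping-pong test for comparing small-message latency across MPI libraries. */
    #include <stdio.h>
    #include <mpi.h>

    #define NITER   1000
    #define MSGSIZE 8            /* bytes per message (assumed) */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[MSGSIZE] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("average one-way latency: %.2f microseconds\n",
                   (t1 - t0) * 1e6 / (2.0 * NITER));

        MPI_Finalize();
        return 0;
    }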


Centre for Development of Advanced Computing