-
Code Swapping: If a code tries to use more physical memory than the system has,
the OS takes data that is in memory and copies it to disk. When the OS needs that data again, it copies
it back into memory and moves something else from memory out to disk. Swapping kills performance: typically
the wall-clock time can increase by a factor of 10 if the code swaps a great deal. So, if your code seems slow,
check whether the processes are swapping.
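A rough way to check this from inside the program itself is to look at the major page fault count reported by getrusage(); major faults require disk I/O, so a rapidly growing count during a run is a strong hint that the node is swapping. The following is a minimal sketch, not part of the hyPACK material:

    /* Report page-fault counts for the calling process; a large and growing
       ru_majflt value while the code runs suggests swapping. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        /* ... run a phase of the computation here ... */

        if (getrusage(RUSAGE_SELF, &ru) == 0)
            printf("major page faults: %ld, minor page faults: %ld\n",
                   ru.ru_majflt, ru.ru_minflt);
        return 0;
    }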
-
Measure Wall clock time:
Most current CPUs and chipsets can measure many aspects of the system. Quantities such as the number of
floating-point operations, the number of L1 cache misses, etc. are counted in hardware and made available to the OS.
There are patches to the Linux kernel that give the OS access to these counters, and software packages
such as PAPI (visit hyPACK 2013 software tools:
hypack13-mode01-multicore-software-tools.html)
can be used to report this information to the user.
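As a hedged illustration, assuming a PAPI installation that still provides the classic high-level calls PAPI_start_counters/PAPI_stop_counters (deprecated in recent PAPI releases), counters such as floating-point operations and L1 data-cache misses can be read around a kernel like this:

    /* Count FP operations and L1 data-cache misses around a small kernel.
       Build with something like: gcc papi_demo.c -lpapi */
    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_FP_OPS, PAPI_L1_DCM };
        long long values[2];
        double a = 0.0;

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        for (int i = 1; i <= 1000000; i++)      /* kernel being measured */
            a += 1.0 / i;

        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        printf("result=%g  FP ops=%lld  L1 data-cache misses=%lld\n",
               a, values[0], values[1]);
        return 0;
    }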
-
Using All the Cores? : The command used to run a multi-core Pthreads code lets you specify the number of cores
to run on. Thread affinity (thread binding) on multi-core processors, and the scripts used by a cluster's
queuing/scheduling packages, play an important role in performance.
The most common problem is that all of the processes end up running on a single node of the cluster rather than
being distributed across its nodes; the multi-core node that is actually running all the processes will then start
to swap. Dual-socket quad-core, quad-socket quad-core, and many-core systems have their own overheads arising from
how threading is handled at the system and hardware levels.
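A minimal sketch of thread binding on Linux is given below; it uses the GNU extension pthread_attr_setaffinity_np, and the thread count and core numbering are illustrative assumptions:

    /* Pin each Pthread to its own core so that all cores are used and
       threads do not migrate between sockets. Build with: gcc -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 4   /* assumed: at least NTHREADS cores available */

    static void *worker(void *arg)
    {
        long id = (long) arg;
        /* ... compute kernel for this thread ... */
        printf("thread %ld running on CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        pthread_attr_t attr;
        cpu_set_t set;

        for (long i = 0; i < NTHREADS; i++) {
            CPU_ZERO(&set);
            CPU_SET((int) i, &set);          /* bind thread i to core i */
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            pthread_create(&tid[i], &attr, worker, (void *) i);
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }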
Tuning Options (Multi-Core Processors & Clusters)
-
Hardware:
Current Processors (Intel, AMD, IBM, Cray, SGI, Sun)
1. Processor Choice
2. Interconnect Choice
3. Network Card Choice (in some cases)
4. IO Subsystem
Several issues need to be addressed from the multi-core point of view as well as the cluster point of view.
Most importantly, the following play an important role for multi-core processors:
1. Selection of Processor Type
2. IO Subsystem
-
Software:
1. Compiler Choice
2. Compiler Options
3. MPI Choice
4. MPI Tuning
5. Tools
6. Mixed Programming Environment
Remarks: Many choices ↦ many test cases exist for tuning and performance,
and the right combination often depends upon the characteristics of the application. Many combinations exist
for tuning; some of the options to consider are given below:
-
Multi-core Processors
Processors: Intel Xeon, AMD Opteron, IBM Power5/Power6 (memory per node; L1, L2, L3 cache)
Compilers: Opteron: PGI, Pathscale, Intel;
EM64T (Extended Memory 64 Technology, also known as Intel 64 or x64) and AMD64: PGI, Intel
Compile Options: five options per compiler (-O0, -O1, -O2, -O3, aggressive)
Thread Affinity on Many-Socket Many-Core processors
Tuned Mathematical Libraries
-
Cluster of Multi-core Processors
1. Processors: Intel Xeon, AMD Opteron, IBM Power5/Power6
2. Compilers and Compiler Options
3. Interconnects + MPI Options
   - GigE (MPICH1, MPICH2, LAM, Scali MPI Connect)
   - Myrinet (MPICH-GM, Scali MPI Connect)
   - Infiniband (MVAPICH, Scali MPI Connect)
4. MPI tuning: more than 20 options
5. Range of Processor Count: 2 to 4/8/16 CPUs on multi-socket/multi-core processors
Compilers and Compiler Options :
Open-source Compilers
-
There are a number of compilers available for Linux. Some are open source, courtesy of the GNU compiler
series: gcc (C compiler), g77 (older Fortran 77 compiler), g95 (Fortran 95 compiler that uses the gcc backend),
and gfortran (the newer Fortran 95 compiler in the GNU compiler series).
-
The major commercial compilers include the IBM, PGI, Pathscale EKO, Intel, and Absoft compilers.
Their optimization levels usually range from -O0 to -O3, with -O2 and -O3 being the most commonly used.
Most compilers also allow even more aggressive levels of optimization, usually referred to as
"aggressive" optimization.
-
Vendors use various ways of defining compiler optimization levels.
For example, one vendor's -O2 may include optimizations that other vendors only enable at -O3.
Consequently, it is very difficult to compare the same optimization levels across compilers.
Remark: For many classes of applications, the highest levels of compiler optimization are not always the best.
Tuning on Multi-core Processors
-
Use compiler switches and explore the performance; localize the data; improve cache utilization
Use a profiler to understand the behaviour of the code and find the hotspots in the code
Use the Linux tool "top" to check CPU and memory utilization, as well as scalability
with respect to varying problem sizes
Check the threading APIs used; watch for lock and heap contention (a sketch of reducing lock contention follows this list)
Set thread affinity and explore the performance
Optimize the sequential code; use tuned libraries
Check for swapping (is the code swapping?) using tools such as "top"
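One common way to reduce the lock contention mentioned above is to give each thread its own accumulator and combine the results once at the end, instead of having every thread update a single mutex-protected variable. The sketch below is illustrative only (thread count, padding size, and the work loop are assumptions):

    /* Per-thread partial sums with cache-line padding to avoid false
       sharing; the combine step at the end needs no locks. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N        1000000

    struct padded { double sum; char pad[64 - sizeof(double)]; };
    static struct padded partial[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long) arg;
        for (long i = id; i < N; i += NTHREADS)   /* each thread owns a stride */
            partial[id].sum += 1.0 / (i + 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *) i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++)        /* lock-free combine step */
            total += partial[i].sum;
        printf("total = %g\n", total);
        return 0;
    }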
Choice of MPI
After selecting a compiler, and to some degree a processor, the next step is to try various MPI
implementations.
There are a large number of MPI implementations, both open source and commercial. The full list is quite
long, but the important MPI libraries are listed below.
Open-source MPI :
MPICH1, MPICH2, LAM, Open-MPI, GAMMA-MPI, FT-MPI, LA-MPI, PACX-MPI, MVAPICH
OOMPI, MPICH-GM, MVICH, MP_Lite
Commercial MPI :
MPI/Pro, Scali MPI Connect, HP-MPI, Intel MPI, IBM MPI
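A simple way to compare these implementations on a given cluster is a ping-pong micro-benchmark that can be compiled unchanged with each library's mpicc (MPICH, LAM, OpenMPI, MVAPICH, a vendor MPI, ...). The message size and repetition count below are illustrative assumptions:

    /* Ping-pong between ranks 0 and 1: measures round-trip time and
       bandwidth. Run with at least 2 MPI processes. */
    #include <stdio.h>
    #include <mpi.h>

    #define NBYTES 1048576   /* 1 MB message */
    #define REPS   100

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg round trip: %g s, bandwidth: %g MB/s\n",
                   (t1 - t0) / REPS,
                   2.0 * NBYTES * REPS / (t1 - t0) / 1e6);

        MPI_Finalize();
        return 0;
    }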
-
MPICH1 :
This is, in effect, the "reference" implementation of MPI. It is a very easy MPI to build and use.
Many other MPIs, such as MPICH-GM, HP-MPI, and Intel MPI, are based on this implementation.
-
MPICH2 :
This is an improved version of MPICH1 that is faster and also adds most of the features of the
MPI-2 standard.
-
LAM :
LAM is an alternative to the MPICH line of MPI libraries. It uses a daemon-based method for starting
MPI codes.
-
OpenMPI :
OpenMPI is an interesting project because it combines the best features of LAM, FT-MPI, LA-MPI,
and PACX-MPI. It supports TCP, Myrinet (gm and mx), and Infiniband networks. An interesting
feature of OpenMPI is the addition of the fault-tolerance capability of FT-MPI, which allows an MPI
code to lose a node and then add a new node to finish the computation without loss of data.
-
MVAPICH :
MVAPICH is being developed at Ohio State University. It basically is a port of MPICH to Infiniband
with some changes to take advantage of Infiniband.
-
Scali MPI Connect :
Scali MPI Connect is a commercial MPI implementation with a large number of features. A single
binary built with Scali MPI Connect can run on TCP networks, Myrinet, and Infiniband without
recompilation. Scali MPI Connect also has a network failover capability: if you are running on
a high-speed network such as Myrinet and you lose a network connection, Scali switches the MPI
code over to an alternative TCP network so that your code continues to run without any loss of data.
This is the only MPI that I'm aware of that can do this.
-
MP_Lite :
MP_Lite is not a full-fledged MPI library, but rather a subset. The designers of MP_Lite realized that
the vast majority of MPI codes use only a small subset of the available MPI functions, so they wrote
MP_Lite to focus on this small subset. This allows them to concentrate on performance
(low latency and high bandwidth) for these functions. MP_Lite is also one of the few MPI libraries
that take advantage of channel bonding across multiple GigE networks.
The recommended MPI implementations to test for several classes of applications are as follows:
MPICH1
MPICH2
LAM
MPICH-GM (for Myrinet)
MVAPICH (for Infiniband)
At least one commercial MPI