



hyPACK-2013 Mode 1 : Performance Visualization Tools : Multi-Core Processors

Performance visualization tools such as Intel Thread Checker, Intel VTune Performance Analyzer, Intel Thread Debugger, Sun Studio, Etnus TotalView Debugger, Upshot, Jumpshot, PAPI (public domain tools) and IBM tools are used to help the programmer understand the behavior of parallel (MPI/Pthreads) programs and the performance of applications. Example programs based on Pthreads, OpenMP, MPI, and Intel TBB, using different APIs and covering both numerical and non-numerical computations, are chosen to illustrate performance issues using Intel software tools and open source tools on parallel processing technologies.

Performance Visualization Tools :

Introduction             An Overview             Characteristics    

Performance Issues : Algorithm Design         Multi Threaded Debugging
C-DAC Hands-on - Using Intel Software Development Tools :
List of Programs       List of Assignment Programs

Intel Hands-on : Software Tools Demonstration and Presentation
Intel Tools Presentation         List of Programs & Demonstration

Open Source/Commercial Tools & Examples
MPI   PAPI     Commonly Used Tools   Compilation & Execution : MPI

  • Example Programs - MPI Tools on Clusters


  • Example Programs - Using PAPI      


  • Web Page References : Tools - Web Pages

    References : Multi-threading     OpenMP     Java Threads     Books     MPI   Benchmarks  

    Introduction

    Performance is a critical issue on current clusters of workstations. However, the delivery of adequate performance is not automatic, and performance evaluation tools are required to help the programmer understand the behavior of a parallel program. Performance evaluation and visualization is an important and useful technique that helps the user understand and improve complex parallel performance phenomena. In order to produce MPI applications that perform well on today's parallel architectures, programmers need effective tools for collecting and analyzing performance data. Because programmers typically work on more than one platform, cross-platform tools are highly desirable. A variety of such tools, both commercial and research, are becoming available.

    Performance visualization tools such as Intel Thread Checker, Intel VTune Performance Analyzer, Intel Thread Debugger, Sun Studio, Etnus TotalView Debugger, Upshot, Jumpshot (public domain), IBM tools and C-DAC HPCC tools (C-DAC, 1998) have been developed to help the programmer understand the behavior of a parallel program and the complex performance processes involved in producing MPI applications that perform well on today's parallel architectures. Performance evaluation and visualization is an important and useful technique that helps the user understand and improve complex parallel performance phenomena. Performance is a critical issue on current massively parallel processors. However, the delivery of adequate performance is not automatic, and performance tools are required to help the programmer understand the behavior of a parallel program.


    Performance visualization is the use of graphical display techniques to present an analysis of performance data for an improved understanding of complex performance phenomena. Performance visualization systems for parallel programs have been helpful in the past and are commonly used to improve parallel program performance. However, despite the advances that have been made in visualizing scientific data, techniques for visualizing the performance of parallel programs remain ad hoc, and performance visualization becomes more difficult as the parallel system becomes more complex.

    Performance visualization tools have been developed to help the programmer understand the behavior of a parallel program and the complex performance processes involved. Massively parallel processors can generate a huge amount of performance data, and sophisticated methods for representing and displaying this data are required. Visualization must be carefully used in order to get useful results. In general, current performance views are not scalable and do not represent an abstraction of the performance data. If performance visualization is to become an integral tool in parallel performance evaluation, it must be based on a formal foundation that relates abstract performance behavior to visual representations. Parallel performance visualization offers a wide range of research opportunities, one of which is the design and development of new scalable performance data representations. These data representations should have the following features:

      (a). Scalability (i.e., number of processors, problem size)
      (b). Higher level of abstraction than that of existing views
      (c). Usefulness: a performance view must be useful in the process of understanding parallel program behavior.
      (d). Incorporation and integration of state-of-the-art technologies, including application domain expertise, visual perception, and graphic design.

    Parallel performance visualization has already been identified as a useful technique, and work related to abstract performance views and scalability has been carried out by several authors. Not only visual but also aural methods are being explored to represent parallel performance data. Visual and aural portrayals of parallel program execution are used to gain insight into how a program is working. The combination of portrayals in a coordinated performance environment provides the user with multiple perspectives and stimuli to comprehend complex, multidimensional run-time information. A programmer should begin the investigation into causes of performance degradation with high-level and abstract views, proceeding from the highest level to the lowest level. While a collection of views, animations and general performance data can help uncover performance bugs, some guidelines or strategies are needed to direct the order in which views are examined.


    An Overview of Performance Analysis

    Three basic steps in the performance analysis (data collection, data transformation, and data visualization) process need to be distinguished. Data collection is the process by which the data about program performance are obtained from an executing program. Data is normally collected in a file, either during or after execution, although in some situations it may be presented to the user in real time. Three basic data collection techniques can be distinguished:

    • Profiles: record the amount of time spent in different parts of a program. This information, though minimal, is often invaluable for highlighting performance problems. Profiles typically are gathered automatically.
    • Counters : record either frequencies of events or cumulative times. The insertion of counters may require some intervention from the programmer.
    • Event traces : record each occurrence of various specified events, thus typically producing a large amount of data. Traces can be produced either automatically or with programmer intervention.

    The raw data produced by profiles, counters, or traces are rarely in the form required to answer performance questions. Hence, data transformations are applied, often with the goal of reducing total data volume. Transformations can be used to determine mean values or other higher-order statistics or to extract profile and counter data from traces. For example, a profile recording the time spent in each subroutine on each processor might be transformed to determine the mean time spent in each subroutine across processors, and the standard deviation from this mean. Similarly, a trace can be processed to produce a histogram giving the distribution of message sizes. Each of the various performance tools described in subsequent sections incorporates some set of built-in transformations; more specialized transformations can also be coded by the programmer.
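
    As a minimal sketch of such a transformation (the function and variable names here are illustrative, not part of any particular tool), the following C fragment reduces the per-processor times recorded for one subroutine to a mean and a standard deviation:

        #include <math.h>

        /* Illustrative transformation: reduce the per-processor times recorded
           for one subroutine to a mean and a standard deviation.              */
        void reduce_profile(const double *time_per_proc, int nprocs,
                            double *mean, double *stddev)
        {
            double sum = 0.0, sumsq = 0.0;
            int p;
            for (p = 0; p < nprocs; p++) {
                sum   += time_per_proc[p];
                sumsq += time_per_proc[p] * time_per_proc[p];
            }
            *mean   = sum / nprocs;
            *stddev = sqrt(sumsq / nprocs - (*mean) * (*mean));
        }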

    Parallel performance data are inherently multidimensional, consisting of execution times, communication costs, and so on, for multiple program components, on different processors, and for different problem sizes. Although data reduction techniques can be used in some situations to compress performance data to scalar values, it is often necessary to be able to explore the raw multidimensional data. As is well known in computational science and engineering, this process can benefit enormously from the use of data visualization techniques. Both conventional and more specialized display techniques can be applied to performance data. Next, we examine in more detail the techniques used to collect performance data. We consider in turn profiling, counters, and event traces, focusing in each case on the principles involved.

    Profiles

    The concept of a profile should be familiar from sequential computing. Typically, a profile shows the amount of time spent in different program components. This information is often obtained by sampling techniques, which are simple but not necessarily highly accurate. The value of the program counter is determined at fixed intervals and used to construct a histogram of execution frequencies. These frequencies are then combined with compiler symbol table information to estimate the amount of time spent in different parts of a program. This profile data may be collected on a per-processor basis and may be able to identify idle time and communication time as well as execution time.

    Profiles have two important advantages. They can be obtained automatically, at relatively low cost, and they can provide a high-level view of program behavior that allows the programmer to identify problematic program components without generating huge amounts of data. (In general, the amount of data associated with a profile is both small and independent of execution time.) Therefore, a profile should be the first technique to be considered when seeking to understand the performance of a parallel program.

    A profile can be used in numerous ways. For example, a single profile on a moderate number of processors can help identify the program components that are taking the most time and that hence may require further investigation. Similarly, profiles performed for a range of processor counts and problem sizes can identify components that do not scale.

    Profiles also have limitations. In particular, they do not incorporate temporal aspects of program execution. For example, consider a program in which every processor sends to each other processor in turn. If all processors send to processor 0, then to processor 1, and so on, overall performance may be poor. This behavior would not be revealed in a profile, as every processor would be shown to communicate the same amount of data.
    Profilers are available on most parallel computers but vary widely in their functionality and sophistication. The most basic do little more than collect sequential profile data on each processor; the most sophisticated provide various mechanisms for reducing this data, displaying it, and relating it to source code. Because efficient profiling requires the assistance of a compiler and runtime system, most profiling tools are vendor supplied and machine specific.

    Counters

    As its name suggests, a counter is a storage location that can be incremented each time a specified event occurs. Counters can be used to record the number of procedure calls, total number of messages, total message volume, or the number of messages sent between each pair of processors. Counter updates may be generated by compiler-generated code, by code incorporated in communication libraries, or by user-inserted calls to counter routines. Counters complement profilers by providing information that is not easily obtainable using sampling techniques. For example, they can provide the total number and volume of messages, information that can be combined with communication time data from a profile to determine the efficiency of communication operations. A useful variant of a counter is an interval timer: a timer used to determine the length of time spent executing a particular piece of code. This information can be accumulated in a counter to provide an accurate determination of the total time spent executing that program component. A disadvantage of interval timers is that the logic required to obtain a timer value can be expensive.
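
    A minimal sketch of an interval timer, assuming an MPI program and using the standard MPI_Wtime routine (the counter names are illustrative), might look as follows:

        #include <mpi.h>

        /* Illustrative interval timer: accumulate the time spent in one code
           section into a counter, together with the number of executions.    */
        static double section_time  = 0.0;   /* cumulative time in seconds  */
        static long   section_count = 0;     /* number of executions        */

        void timed_section(void)
        {
            double t0 = MPI_Wtime();

            /* ... code section being measured ... */

            section_time  += MPI_Wtime() - t0;
            section_count += 1;
        }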

    Traces

    An execution trace is the most detailed and low-level approach to performance data collection. Trace-based systems typically generate log files containing time-stamped event records representing significant occurrences in a program's execution, such as calling a procedure or sending a message. Trace records may include information such as the type of event and the procedure name or destination task, and can be generated either automatically or under programmer control.
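
    A trace record might be laid out roughly as follows; this is only an illustrative sketch, as real tools define their own binary or text formats:

        #include <stdio.h>
        #include <mpi.h>

        /* Illustrative time-stamped trace record and logging routine. */
        typedef struct {
            double timestamp;   /* e.g. value returned by MPI_Wtime()   */
            int    rank;        /* process that generated the event     */
            int    event_type;  /* e.g. 1 = send, 2 = receive, 3 = call */
            int    partner;     /* destination/source rank, or -1       */
        } trace_record;

        void log_event(FILE *logfile, int rank, int event_type, int partner)
        {
            trace_record r = { MPI_Wtime(), rank, event_type, partner };
            fwrite(&r, sizeof(r), 1, logfile);
        }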

    Trace-based approaches support a particularly broad study of program behavior. They can be used to examine causal relationships between communications, to localize sources of idle time, and to identify transient hot spots. For example, an execution trace could be used to determine that all processors are sending to the same processor at the same time. An execution trace can also be postprocessed to obtain profile, count, and interval timer information; to compute higher-order statistics such as the means and variances of these values; and to obtain other data such as mean message queue length in a message-passing system.

    The disadvantages of trace-based approaches stem primarily from the huge volume of data that can be generated. Particularly when a program is executing on large numbers of processors, it is easy to generate tens, hundreds, or even thousands of megabytes of data. This large data volume has three unwelcome consequences.

    • Logging of this data tends to perturb performance, leading to what is called the probe effect, in which the act of measuring performance changes the characteristics of the program being measured.
    • The sheer volume of data makes post processing difficult. Frequently, sophisticated analysis is required to extract relevant information.
    • The programmer, in order to combat the problems caused by volume, may have to spend considerable effort tuning the data collection process so that only relevant events are recorded while the phenomenon of interest is retained.

    Tracing then becomes a labor-intensive process. For these reasons, tracing should be used with care and only if other data collection techniques are not available or do not provide sufficient information. Many parallel programming tools provide some automatic tracing capabilities, for example by generating a trace record for every message generated or received. These capabilities are invoked by linking with a specialized version of a communication library and/or by a runtime flag. Mechanisms for generating user-defined events may also be provided. In principle, event traces can be interpreted in various ways by using different tools. A stumbling block here is a lack of standards for event log records.

    Characteristics of Visualization Tools



    The reasons for poor performance of parallel message-passing codes can be varied and complex, and users need to be able to understand and correct performance problems. Performance tools can help by monitoring a program's execution and producing performance data that can be analyzed to locate and understand areas of poor performance. There are a number of performance tools, both research and commercial, available for monitoring and/or analyzing the performance of MPI message-passing parallel programs. A tool must have good characteristics in order to assist the user in understanding performance anomalies in a parallel program. When selecting a tool for a particular task, the following issues should be considered:

    Accuracy: In general, performance data obtained using sampling techniques are less accurate than data obtained using counters or timers. In the case of timers, the accuracy of the clock must be taken into account.

    Simplicity: The best tools in most circumstances are those that collect data automatically, with little or no programmer intervention, and that provide convenient analysis capabilities.

    Flexibility: A flexible tool can be extended easily to collect additional performance data or to provide different views of the same data. Flexibility and simplicity are often opposing requirements.

    Intrusiveness: Unless a computer provides hardware support, performance data collection inevitably introduces some overhead. We need to be aware of this overhead and account for it when analyzing data.

    Abstraction: A good performance tool allows data to be examined at a level of abstraction appropriate for the programming model of the parallel program. For example, when analyzing an execution trace from a message-passing program, we probably wish to see individual messages, particularly if they can be related to send and receive statements in the source program. However, this presentation is probably not appropriate when studying a data-parallel program, even if compilation generates a message-passing program. Instead, we would like to see communication costs related to data-parallel program statements.

    A number of commercial and research tools are available for performance analysis of threaded and MPI programs. The most prevalent approach taken by these tools is to collect performance data during program execution and then provide post-mortem analysis and display of performance information. Some tools do both steps in an integrated manner, while other tools or tool components provide just one of these functions. A few tools also have the capability for run-time analysis, either in addition to or instead of post-mortem analysis. A similar procedure for testing each tool and using a common set of evaluation criteria is required.

    In the first step, the software is installed using the instructions provided in the manual. After the software has been installed successfully, the user works through the specific tutorials or examples that are provided to become familiar with the tool. Finally, the tool is used to analyze a number of test programs. Evaluation criteria based on the following are investigated.

    Robustness           Usability           Scalability           Portability           Versatility

    Robustness : For robustness, we expect the tool to crash infrequently and its features to work correctly as claimed by the developer. Errors should be handled by displaying appropriate diagnostic messages, indicated in a file or on the standard output device. The tool should not leave the user stuck after a wrong action taken by mistake; it should prompt the user and give detailed feedback, and such error messages should be documented in the technical manual. Research tools are not expected to be as robust as commercial tools, but if the tool has been released for public use, considerable effort should still have been invested in debugging it and in error handling. The installation procedure for the tool should be as simple as possible so that the user can easily link the tool at compile or run time.

    Usability : To be useful, a tool should have adequate documentation and support, and should have an intuitive, easy-to-use interface. On-line help and manual pages also aid usability. Although research tool developers cannot provide extensive support for free, we consider an email address for questions and bug reporting to be a minimum requirement for a tool that has been released for public use. Adequate functionality should be provided to accomplish the intended task without putting undue burden on the user to carry out low-level tasks, such as manual insertion of calls to trace routines, or sorting and merging of per-process trace files without assistance. A set of questions and answers should be made available in the form of an FAQ so that the user can see the importance and use of the tool. The tool should support both SPMD and MPMD programming paradigms.

    Scalability : For scalability, we look for the ability to handle large numbers of processes and large or long-running programs. The problem size and the machine size are increased to study performance anomalies, keeping Amdahl's law and Gustafson's law in mind for scalability. Scalability is important both for data collection and for data analysis and display. For data collection, desirable features are scalable trace file formats, a mechanism for turning tracing on and off, and filtering of which constructs should be instrumented. For data analysis and display, important scalability features include the ability to zoom in and out, aggregation of performance information, and filtering. Also, the memory and CPU overheads of each tool are analyzed as the problem size and machine size increase.

    Portability : Because of the short lifespan of high performance computing platforms and because many applications are developed and run in a distributed heterogeneous environment, most parallel programmers will work on a number of platforms simultaneously or over time. Portability is a relative measure, which depends on both the language used and the target machine. Programmers are understandably reluctant to learn a new performance tool every time they move to a new platform. Thus, we consider portability to be an important feature. For portability, we look for whether the tool works on most major platforms and for what MPI implementations and languages it can handle. The tool must be easily usable when compilers, system area networks, and the message passing libraries are changed and should produce correct behavior, keeping certain computational characteristics in mind.

    Versatility : For versatility, we look for the ability to analyze performance data in different ways and to display performance information using different views. One example is providing a unified view of the profiling data generated by each process by modifying the time stamps of events on different processes so that all processes start and end at the same time, and presenting the profiling data in a convenient form such as Gantt charts, histograms and graphs. Another feature of versatility is the ability to interoperate with other trace formats and tools. Tools should also have special features to provide an abstract representation of performance; sophisticated methods are required for representing and displaying the data in visual and aural forms, and some tools can represent a particular program fragment in the form of a 3D surface display. To provide useful information for the user, the important goals are to provide an abstract representation of program performance that allows detection of unknown bottlenecks, for example using 3D displays, and to provide scalable views for programs running on a large number of processors.

    An emerging parallel programming paradigm is to use message passing between nodes and shared memory (e.g., OpenMP) within a node. Most performance analysis tools were originally developed to support only one of these models, but tools that support both models simultaneously are highly desirable.

    Characteristics: Performance Issues In Algorithm Design

    In writing efficient message-passing programs, one must take into consideration a number of factors. Many times, these factors are at odds with each other, and a careful balance needs to be struck. Key to successful message-passing programming (as well as ordinary sequential programming) is the selection of the appropriate parallel algorithm. Different scientific problems exhibit different types and amounts of concurrency in their parallel algorithms. Concurrency (or parallelism, as it is most frequently called) is of paramount importance in parallel programming, and its detection and specification is one of the two key steps in the design of parallel algorithms (data and task distribution is the other). Message-passing programs need to effectively distribute the various data structures that are manipulated during the execution of the algorithm, as well as the computation, among the processors. By properly distributing the data structures and the computation, a program can reduce the communication overhead and increase the utilization of the parallel computer.

    Thus, in order to be suitable for programming using message passing, parallel algorithms must also specify data and/or task distribution. In the rest of this section, we discuss the various properties that a parallel algorithm must have so that a message-passing program implementing it will be able to attain high performance on a wide range of clusters. Tools can provide good insight into the performance issues of such parallel programs. These properties can be used both as a guide in evaluating the potential performance of competing parallel algorithms and as a framework for developing effective message-passing parallel algorithms. The tools can assist the user by giving accurate information on the performance and scalability issues of the programs.


    Concurrency     Decomposition     Load Balance     Overheads          

    Maximize Concurrency     Equal Workload       Maximize Data Locality      

    Minimize Frequency of Data Transfers

    Concurrency: In general, given a single task, it is possible to decompose and map it to processors in many different ways. Some of these yield better performance than others. Furthermore, different decompositions and mappings may yield good performance on different computers for a given problem. It is therefore crucial for programmers to understand the relationship between the underlying machine model (SIMD/MIMD) and the parallel program to develop efficient programs. Decomposition methods help us discover tasks that can be done concurrently; mapping methods assign these tasks to processors so that the processors are efficiently utilized; and interaction-handling methods reduce the interactions among tasks so that the processors are all doing useful work most of the time. The type of parallelism that results from identical operations being applied concurrently to different data items is called data parallelism. Besides data parallelism, there are task parallelism and stream parallelism. Combinations of task and data parallelism often allow us to combine the coarse granularity inherent in task parallelism with the fine granularity of data parallelism to effectively utilize a large number of processors.
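
    A minimal sketch of data parallelism in C with OpenMP (the routine name is illustrative): the same operation is applied concurrently to different elements of the arrays.

        #include <omp.h>

        /* Data parallelism: identical operations applied concurrently
           to different data items.                                    */
        void vector_add(const double *a, const double *b, double *c, int n)
        {
            int i;
            #pragma omp parallel for
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }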

    Decomposition : For many problems, the underlying task graph naturally contains a sufficient degree of concurrency. Given such a graph, tasks can be scheduled on multiple processors to solve the problem in parallel. Unfortunately, there are many problems for which the task graph consists of only one task, or multiple tasks that need to be executed sequentially. For such problems, we need to split the overall computation into tasks that can be performed concurrently. The process of splitting the computations in a problem into a set of concurrent tasks is referred to as decomposition. A good decomposition should have the following properties:

  • It should lead to a high degree of concurrency.
  • The interaction among tasks should be as small as possible.

    Decomposing a problem effectively is of paramount importance in parallel computing. Without a good decomposition, we may not be able to achieve a high degree of concurrency, making it impossible to really exploit parallelism and reduce the time required to solve a problem. Decomposition techniques are broadly classified as recursive decomposition, data decomposition, exploratory decomposition, and speculative decomposition. Data decomposition is a powerful method for deriving concurrency in algorithms that operate on large data structures. In the first step, the data (or the domain) on which the computations are performed is partitioned, and in the second step, this data partitioning is used to induce a partitioning of the computation into tasks. These tasks generally perform similar operations on different data elements; hence, data decomposition often leads to data-level parallelism. The partitioning of data can be performed in three ways: partitioning output data, partitioning intermediate data, and partitioning input data.
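
    As an illustrative sketch of partitioning output data (the routine names are assumptions, not from any particular library), each process can own a contiguous block of rows of the result of a matrix-vector product:

        /* Illustrative output-data partitioning: each of nprocs processes
           owns a contiguous block of rows of the result vector y = A*x.  */
        void my_row_range(int n, int nprocs, int rank, int *first, int *last)
        {
            int rows = n / nprocs;      /* assumes n is divisible by nprocs */
            *first = rank * rows;
            *last  = *first + rows;     /* exclusive upper bound            */
        }

        void local_matvec(const double *A, const double *x, double *y,
                          int n, int first, int last)
        {
            int i, j;
            for (i = first; i < last; i++) {   /* only the owned rows */
                y[i] = 0.0;
                for (j = 0; j < n; j++)
                    y[i] += A[i * n + j] * x[j];
            }
        }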

    Load-balance : Given a set of tasks and a set of processors, there are many ways of mapping the tasks onto the processors. In deciding which mapping is better, we must consider how well each achieves the following two objectives.

    • The amount of computation assigned to each processor is balanced, so that no processor idles while others are executing tasks.
    • The interactions among the different processors are minimized, so that the processors spend most of their time doing work that is essential for solving the problem (work that would also be performed by a serial program).


    Load balancing algorithms can be classified into two categories: static and dynamic. Static load balancing techniques distribute the work among processors prior to the execution of the algorithm, while dynamic load balancing techniques distribute the work among processors during the execution of the algorithm. Algorithms that can be load-balanced statically are in general easier to design and program; algorithms that require dynamic load balancing are somewhat more complicated. However, there are problems in which we cannot statically partition the work among the processors. Three general classes of problems fall into this category. The first class consists of problems in which all the tasks are available at the beginning of the computation, but the amount of time required by each task is different and cannot be determined easily. The second class consists of problems in which the tasks are available at the beginning, but as the computation progresses the amount of time required by each task changes. The third class consists of problems in which tasks are generated dynamically. In these classes of problems, a static work partitioning is either impossible (the first class) or can potentially lead to serious load imbalance (the second and third classes). The only way to develop efficient parallel programs for these classes of problems is to allow dynamic load balancing: during the execution of the program, work is dynamically transferred to processors that have little or no work.
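
    A minimal master/worker sketch of dynamic load balancing in MPI is given below; the tags, the do_task routine and the task representation are assumptions made for illustration only.

        #include <mpi.h>

        #define TAG_WORK 1
        #define TAG_STOP 2

        extern double do_task(int task_id);   /* cost varies from task to task */

        /* Rank 0 hands out task indices one at a time; a worker asks for
           more work whenever it finishes its current task.                */
        void dynamic_schedule(int ntasks, int rank, int nprocs)
        {
            MPI_Status st;
            if (rank == 0) {                            /* master */
                int next = 0, stopped = 0;
                double result;
                while (stopped < nprocs - 1) {
                    MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                             MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                    if (next < ntasks) {
                        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                                 TAG_WORK, MPI_COMM_WORLD);
                        next++;
                    } else {
                        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE,
                                 TAG_STOP, MPI_COMM_WORLD);
                        stopped++;
                    }
                }
            } else {                                    /* worker */
                int task;
                double result = 0.0;                    /* first send is a dummy */
                for (;;) {
                    MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
                    MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                             MPI_COMM_WORLD, &st);
                    if (st.MPI_TAG == TAG_STOP) break;
                    result = do_task(task);
                }
            }
        }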

    Overheads : (Idling Overheads & Data Sharing Overheads) Given a set of tasks and a set of processors, there are many ways of mapping the tasks onto the processors, and the choice of mapping affects performance on the parallel computer. One must focus on the mapping that yields the highest performance. Ideally, the user would like to develop a parallel program that achieves linear speedup. The time that is required by the parallel algorithm but is not essential to the serial algorithm is referred to as overhead due to parallelization. Minimizing these overheads is critical in developing efficient parallel algorithms. Parallel algorithms can potentially incur three types of overheads. The first is due to idling, the second is due to data sharing and synchronization, and the third is due to extraneous computations. We describe some of these overheads in detail.

    Idling Overheads : Processors can become idle for three main reasons. The first reason is due to load imbalances, the second is due to computational dependencies, and the third is due to interprocessor information exchange and synchronization. Computations in which the work is equally distributed among the processors are called load balanced. If the work is not load balanced, then the processors with less work will finish first and stay idle while waiting for the processors with more work to finish with their computations.

    For some problems, we can easily develop parallel algorithms that are load-balanced; however, for other problems designing load-balanced algorithms is significantly more challenging. In the simplest case, the amount of work performed by each task is the same. However, there are some problems in which the amount of computation per task differs significantly and is not known a priori. Also, for many problems, distributing the computations equally among the processors is not sufficient to ensure that idling is eliminated. This is particularly true for computations in which there are dependencies between tasks.

    In fact, the loads before and after synchronization may be different which results in significant overheads. Minimizing idling among the processors is critical, since every time a processor becomes idle, it does absolutely nothing towards the solution of the problem.

    Data Sharing Overheads : The sharing of data often leads to additional overheads that are inherent to parallel algorithms; they must be minimized, as they directly affect the performance achieved by the algorithm. Depending on the underlying architecture, data sharing can take different forms. On distributed memory machines, data is initially partitioned among the processors, and data sharing is achieved by explicitly sending and receiving messages. In this framework, the cost of data sharing is rather explicit, and minimizing the data sharing overheads directly translates to reducing the communication overhead.

    On shared address-space machines, data sharing is often performed implicitly. Data is initially partitioned among the processors. In this type of architecture, the data is stored in the shared memory and every processor can read and write to it. Even though the data is accessible by all processors, data sharing overheads still exist. They become worse when shared data must be modified by only one processor at a time, with no processor reading the data while the modification is taking place. Developing algorithms that minimize data sharing overheads often involves intelligent task decomposition, task distribution, initial data distribution and proper scheduling of the data sharing operations, all of which play an important role in the performance of a parallel program.
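
    On a shared address-space machine, this overhead typically shows up as synchronization; a minimal Pthreads sketch (the variable names are illustrative) protects a shared accumulator so that only one thread modifies it at a time:

        #include <pthread.h>

        /* Only one thread at a time may modify the shared accumulator. */
        static double          shared_sum = 0.0;
        static pthread_mutex_t sum_lock   = PTHREAD_MUTEX_INITIALIZER;

        void add_local_contribution(double local_value)
        {
            pthread_mutex_lock(&sum_lock);    /* data sharing overhead is paid here */
            shared_sum += local_value;
            pthread_mutex_unlock(&sum_lock);
        }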

    Extraneous Computation Overheads : A parallel algorithm can potentially perform two types of extraneous computations. The first type consists of non-essential computations that are performed by the parallel algorithm but not required by the serial program. In a parallel algorithm, non-essential computations often arise either because of speculative execution or because of exploratory execution among the processors. In some applications, the processors may end up doing redundant work. In general, performing frequent processor interactions to check whether or not similar work has already been generated can eliminate redundant computations.

    Maximize Concurrency : Concurrency provides an abstraction for implementing software algorithms or applications that are naturally parallel. When multiple software threads of execution are running in parallel, the active threads are running simultaneously on different hardware resources or processors of the computing system. Parallel algorithms, irrespective of the programming language used to implement them, should be able to effectively utilize a large number of processors. Many times the algorithm used can limit the amount of parallelism that can be exploited. Often, simply having more processors perform the computations previously performed by a single processor can increase concurrency. Concurrency in multi-threaded programs depends upon both hardware and software. Designing an optimized way of managing the shared resources in use at a given time is important. In other words, efficient resource utilization is the key to maximizing the performance of computer resources.

    Equal Workload : Parallel algorithms should effectively utilize the hardware resources to be able to solve the given problem in the least amount of time. This directly implies that each processor must perform an equal amount of work. In parallel computing terminology, this is referred to as the computation being load-balanced. A computation is load-balanced when each processor performs the same amount of computation and communication, requires the same amount of memory, and spends a minimal amount of time idling while waiting for messages. If different processors do different amounts of work, then all processors have to wait until the processor with the most work finishes.

    Maximize Data Locality : Research in parallel algorithm design has developed a variety of techniques that can often be used to reduce communication and increase data locality. One commonly used technique is to replicate data that is frequently accessed but not changed on all the processors. Data replication is very powerful and can significantly reduce the amount of communication. However, care must be taken to ensure that data replication does not significantly increase the aggregate memory requirements of the algorithm. Another technique that often decreases the amount of time spent in communication is to increase the amount of work performed by each processor by using fewer processors. This approach essentially builds data locality by simply assigning more data elements to each processor. Note that this approach is in direct conflict with the objective of using a large number of processors and exploiting as much concurrency as possible. Parallel algorithms and programs should be developed so that they can operate effectively on a wide range of processor counts.
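
    A minimal MPI sketch of data replication (the routine name is illustrative): a frequently read, never modified table is broadcast once so that all later accesses are local.

        #include <mpi.h>
        #include <stdlib.h>

        /* Replicate a read-only lookup table on every process. */
        double *replicate_table(int n, int rank)
        {
            double *table = malloc(n * sizeof(double));
            if (rank == 0) {
                /* ... rank 0 initializes the table ... */
            }
            MPI_Bcast(table, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            return table;    /* every process now holds its own copy */
        }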

    Minimize Data Transfer : Selecting a parallel algorithm that minimizes the frequency of data transfers is preferable, even if it does not reduce the overall volume of data being transferred. This is usually achieved by first gathering the data that needs to be transferred and then sending it in fewer, larger messages, thereby amortizing the fixed startup portion of the time required to transfer the data. Minimizing the frequency of data transfers is also desirable in sequential algorithms, where the analogous goal is minimizing traffic between main memory and cache. On a serial computer, every time a word not residing in the cache is accessed, a predetermined number of words equal to the size of the cache line are transferred from main memory to cache memory. Effective sequential algorithms must use most of these words; otherwise significant bandwidth is wasted.
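
    The following illustrative MPI fragment contrasts the two approaches; the buffer and loop are assumptions made only to show the idea of aggregating many small transfers into one.

        #include <mpi.h>

        /* Frequent transfers: nvals separate one-element messages,
           each paying the full message start-up cost.              */
        void send_one_by_one(const double *vals, int nvals, int dest)
        {
            int i;
            for (i = 0; i < nvals; i++)
                MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        }

        /* Aggregated transfer: the same data in a single message,
           paying the start-up cost only once.                      */
        void send_aggregated(const double *vals, int nvals, int dest)
        {
            MPI_Send(vals, nvals, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        }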

    Multi-Threaded Debugging



    Debugging Multi-Threaded Applications - Overview: Multi-threaded applications are inherently more complicated than single-threaded applications. The bugs may or may not surface when running under the debugger. Multi-threaded bugs are very sensitive to the timings of events in an application.

    Commercial multi-threaded tools such as Intel Thread Checker, Intel VTune Performance Analyzer, Sun Studio and the Etnus TotalView Debugger, and open source tools such as MPI performance analyzers, may provide many clues to the application developer about the reasons for the failure of a program, as well as help enhance its performance on multi-core processors. Common debugging tools and thread checkers can assist the user in understanding the abnormal behavior of a multi-threaded program. Obtaining correct results from a multi-threaded program is difficult due to common errors such as data races and synchronization issues.

    Debugging : The common coding errors that cause a multi-threaded program to fail should be investigated. Two categories of bugs found in multi-threaded applications are synchronization bugs and performance bugs. Synchronization bugs include race conditions and deadlocks that cause unexpected and incorrect behavior. Performance bugs arise from unnecessary thread overhead due to thread creation or context switching, and from memory access patterns that are suboptimal for a given processor's memory hierarchy. When debugging synchronization bugs that cause an application to fail, information such as which threads were accessing shared resources at the time of failure is needed.

    In most multi-threaded applications, the sequence of events that leads to a race condition or deadlock is critical in determining the root cause of a multi-threaded bug. Using a trace buffer, which is a mechanism for logging events that the developer is interested in monitoring, may help to identify the reasons for the failure of the code. Tracepoints, available as part of the debugger, are also used; a tracepoint is similar to a breakpoint, except that instead of stopping program execution when the application's program counter reaches that point, the debugger takes some other action, such as printing a message or running another function. For POSIX threads, debugging can be accomplished using the GNU Project Debugger (GDB). GDB provides a number of capabilities for debugging threads: thread-specific breakpoints, listing of all threads in the system, the ability to switch between threads, the ability to apply commands to a group of threads, and automatic notification when new threads are created.
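
    As an illustrative GDB session (the program name, file name, line number and thread numbers are assumptions), these capabilities map onto commands such as:

        gdb ./threaded_app
        (gdb) break worker.c:42 thread 3      # thread-specific breakpoint
        (gdb) run
        (gdb) info threads                    # list all threads in the system
        (gdb) thread 2                        # switch to thread 2
        (gdb) thread apply all bt             # apply a command (backtrace) to all threads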

    Thread Checker: It is necessary to use proper compiler flags to produce thread-safe code and to ensure that the results are sequentially correct before using the thread checker. Most commonly used thread checkers provide the following capabilities.

    • The thread checker usually collects data and analyzes the common errors of a multi-threaded program. The results of the analysis focus on errors observed and warning messages based on trace data. It also provides suggestions of possible causes for the threading errors, and suggested solutions with one-click diagnostic help.
    • The thread checker is used with compiler support on the target system. Some thread checkers can analyze applications that rely on dynamically linked libraries (DLLs) for which the source code is often unavailable.
    • When the thread checker is linked to the threaded program, it relies on instrumentation, and the program usually runs slower than it does without instrumentation.
    • The thread checker gives an even better understanding of the specific variables on each line, functions, and context.
    • It identifies common errors in the multi-threaded programming environment, such as data races and deadlocks that may be due to synchronization issues.
    • It identifies stalled threads and critical-section problems such as abandoned locks.
    • It verifies results by comparing the non-threaded and threaded results in the debugging environment of multi-threaded programs.
    • It creates diagnostic messages for places in a program where its behavior in a multi-threaded environment is potentially nondeterministic.
    • It traces threading bugs to the source code line where the bug occurs.
    • It supports analysis of threaded programs that use OpenMP or POSIX threads. OpenMP programs are threaded programs and can suffer from the same errors and performance problems as explicitly threaded applications.

    List of MPI Performance Tools

    Upshot (Public Domain Tool )  

     

    URL : http://www-fp.mcs.anl.gov/~lusk/upshot/
    Supported Languages : Fortran 77/90/95, C, mixed Fortran and C
    Supported Platforms : Linux x86, SGI IRIX, Sun Solaris Sparc, IBM RS/6000 AIX, Windows NT x86
    Requirements for Installation : A cluster of workstations running Linux/Unix, with Fast Ethernet or any other proprietary interconnection network and system software, and mpich-1.2.3

    Upshot is a parallel program performance analysis tool that comes bundled with the public domain MPICH implementation of MPI. Upshot is a trace analysis and visualization package. By linking our source code with the appropriate libraries we can obtain information on the time spent by our program in each MPI function; trace events can be generated automatically by linking with the MPE library option -mpilog, which comes with the freely available MPICH implementation and provides a number of useful facilities, including debugging, logging, graphics, and some common utility routines, to create logfiles in the SLOG/ALOG/CLOG formats. Alternatively, the programmer can insert event-logging calls manually, as in the sketch below. Upshot is used to display ALOG (event-based logging format) logfiles.
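
    A minimal sketch of manual event logging with the MPE library is shown below; the state name, color and logfile name are illustrative, and the exact call signatures may differ between MPE versions.

        #include "mpi.h"
        #include "mpe.h"

        int main(int argc, char *argv[])
        {
            int ev_start, ev_end;

            MPI_Init(&argc, &argv);
            MPE_Init_log();

            /* define a user state with a start and an end event */
            ev_start = MPE_Log_get_event_number();
            ev_end   = MPE_Log_get_event_number();
            MPE_Describe_state(ev_start, ev_end, "compute", "red");

            MPE_Log_event(ev_start, 0, "start compute");
            /* ... computation being traced ... */
            MPE_Log_event(ev_end, 0, "end compute");

            MPE_Finish_log("Hello_World");   /* writes the logfile */
            MPI_Finalize();
            return 0;
        }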

    Features of Upshot:

    • Provides a unified view of the profiling data generated by each process
    • Modifies the time stamps of events on different processes so that all the processes start and end at the same time.
    • Provides a convenient form for visualizing the profiling data in a Gantt chart.
    • Upshot's display tool shows all the state data derived from logged events.
    • A state is defined by a starting and ending event of a processor in a single timeline.
    • A histogram facility allows the use of histograms to summarize information about state duration.

    Jumpshot  (Public Domain Tool )

    URL : http://www-unix.mcs.anl.gov/mpi/mpich/
    Version : MPICH 1.2.1, Jumpshot-3
    Supported Languages : Fortran, C, C++
    Supported Platforms : AIX, Compaq Tru64 UNIX, HP-UX, IRIX, LINUX, Solaris, Windows NT/2000


    Jumpshot is a Java-based visualization tool for doing postmortem performance analysis. Using Java instead of Tcl/Tk (which was used in some of the older visualization tools) improves the portability, maintainability and functionality of the tool. There are several iterations of the tool. It uses the MPE (Multi-Processing Environment) library, which is distributed with the freely available MPICH implementation and provides a number of useful facilities, including debugging, logging, graphics, and some common utility routines, to create SLOG/ALOG/CLOG logfiles. Jumpshot is used to display SLOG (scalable logfile format) logfiles, which address the data-scalability issue of the logfile for visualization.

    Alog : An event-based logging format. Its corresponding viewer is either Upshot or Nupshot. The ALOG package is distributed as part of MPE, which is in turn distributed with MPICH.
    Clog : Another event-based logging format and a descendant of ALOG and BLOG. Its corresponding viewer is Jumpshot-2. The CLOG package is distributed as part of MPE, which is in turn distributed with MPICH.

    Slog : A scalable logfile format. SLOG addresses the data-scalability issue of the logfile. It is very easy to generate a large CLOG file that renders the performance of Jumpshot-2 slow or even useless. SLOG allows the viewer to read only a portion (or portions) of the logfile for visualization. One of the main goals of SLOG is to help users locate interesting portions of the logfile for analysis.

    Features of Jumpshot:

    • Provides a Gantt chart display, which represents the various MPI calls and user-defined states for each process and arrows, and messages exchanged between those states.
    • Filtering of states is possible, so that only those states that are of interest are displayed.
    • The time-line view can be zoomed and scrolled.
    • Clicking on a rectangle or an arrow brings up a box that displays more detailed information about the state or message, respectively.

    Compiling and Linking MPI programs ( Makefile to use UPSHOT tools)

    C     Makefile.upshot_C

    Fortran     Makefile.upshot_F  



    MPI Programs Using Tools on Message Passing Cluster


    Upshot

    Performance visualization of MPI program  
    Hello_World.c that prints message "Hello World" using Upshot.

    a) Compile the Program using Makefile

         make  Hello_World

    b)  Create and Execute the program ( .pg file )

         Hello_World.pg

         local 0
        <machine name> 1  <path of the executable>
        <machine name> 1  <path of the executable>
        <machine name> 1  <path of the executable>

          ./Hello_World     /* execution of the program */

    c)  Convert clog to alog file.

          /home/betatest/mpich-1.2.4/mpe/bin/clog2alog Hello_World

    d) View the log file

         /home/betatest/mpich-1.2.4/mpe/viewers/upshot/bin/upshot Hello_World.alog

    This will popup a Setup window as shown in Figure 1.


    Figure 1. Setup window of Upshot

    Click " Setup" to load the selected log file, this will open a window displaying process operations in a gantt chart as shown in Figure 2.


    Figure 2. Gantt Chart of Upshot

    This window displays events which are aligned on the parallel time lines of individual processes, with processes along one axis and time along the other. Each state is represented in a different color. In this example, the process with rank 0 sends NumberOfIntervals to all processes using MPI_Send. Each process calculates part of the value of PI and sends the result to the process with rank 0. The process with rank 0 receives all the partial sums from the other processes using MPI_Recv, adds them, and displays the final result. The above display shows that, in this example, the process with rank 2 takes more time to receive NumberOfIntervals from the root, and the process with rank 0 takes more time to receive the partial sum from the process with rank 3.

    To visualize each state individually, click on the color of the state beside the state name on the taskbar at the top.

    This gives individual state displays in a histogram showing state duration by all processes as shown in Figures 3 and 4. 

       

    Figure 3. MPI_Send state
    Figure 4. MPI_Recv state.

    This also gives the total time taken by all processes for that particular call, along with the start state length and end state length.

    Xprofiler

    Profiler visualization of the MPI program Pi_Collective.c using Xprofiler.

    a) Compile the Program 

    mpcc  -pg  -o  pi  pi_collective.c

    b ) Execute the Program

        poe  pi  -procs  4  -hfile  ./hosts

    c ) Load profile files using Xprofiler

       xprofiler  pi  -s   gmon.out.0    gmon.out.1   gmon.out.2   gmon.out.3

    This will pop up the main GUI of Xprofiler as shown in Figure 5.



    Figure 5. Xprofiler main GUI

    In Figure 5, functions are represented by green, solid-filled boxes in the function call tree. The size and shape of each function box indicate its CPU usage: the height of each function box represents the amount of CPU time spent executing the function itself, while the width represents the amount of CPU time spent executing the function itself plus its descendant functions.

    The Flat Profile report, as in Figure 6, shows the total execution times and call counts for each function (including shared library calls) within your application.


    Figure 6. Flat Profile 

    The Call Graph Profile, as shown in Figure 7, lists the functions of your application, sorted by the percentage of total CPU usage that each function and its descendants consumed.



    Figure 7. Call Graph Profile


    Other Tools Used on Parallel Processing Platforms

    Intel Thread Checker

    Intel VTune Performance Analyzer

    Intel Thread Profiler

    Sun Studio

    IBM Tools

    Etnus TotalView Debugger

    MPI- Upshot, & Jumpshot

    PAPI (Public domain tools)

    Google Perf Tool


    Centre for Development of Advanced Computing