The Tau Parallel Performance System
Summary
Introduction
- The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving.
- Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed.
- This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
- Lack of portable performance evaluation environments forces users to adopt different techniques on different systems, even for common performance analysis.
2 A General Computation Model for Parallel Performance Technology
- To address the dual goals of performance technology for complex systems – robust performance capabilities and widely available performance problem solving methodologies – the authors need to contend with problems of system diversity while providing flexibility in tool composition, configuration, and integration.
- In the model, a node is defined as a physically distinct machine with one or more processors sharing a physical memory system (i.e. a shared memory multiprocessor (SMP)).
- A context is a distinct virtual address space within a node providing shared memory support for parallel software execution.
- The computation model above is general enough to apply to many high-performance architectures as well as to different parallel programming paradigms.
- When the authors consider a performance system to accommodate the range of instances, they can look to see what features are common and can be abstracted in the performance tool design.
3 TAU Performance System Architecture
- The TAU framework architecture is organized into three layers – instrumentation, measurement, and analysis – where within each layer multiple modules are available and can be configured in a flexible manner under user control.
- The instrumentation layer is used to define events for performance experiments.
- The performance measurement part supports two measurement forms: profiling and tracing.
- Also distributed with TAU is the PerfDMF (Huck et al. 2005) tool providing multi-experiment parallel profile management.
4 Instrumentation
- In order to observe performance, additional instructions or probes are typically inserted into a program.
- As events execute, they activate the probes which perform measurements.
- Thus, instrumentation exposes key characteristics of an execution.
- In this section the authors describe the instrumentation options supported by TAU.
4.1 Source-Based Instrumentation
- TAU provides an API that allows programmers to manually annotate the source code of the program.
- Thus, language specific features (e.g. runtime type information for tracking templates in C++) can be leveraged.
- TAU’s API can be broadly classified into the following five interfaces: the interval event interface, the atomic event interface, the query interface, the control interface, and the sampling interface.
- There are several ways to identify interval events and performance tools have used different techniques.
- Control of interrupt period and selection of system properties to track are provided.
4.2 Preprocessor-Based Instrumentation
- This approach typically involves parsing the source code to infer where instrumentation probes are to be inserted.
- Preprocessor-based instrumentation is also commonly used to insert performance measurement calls at interval entry and exit points in the source code.
- PDT is comprised of commercial-grade front-ends that emit an intermediate language (IL) file, IL analyzers that walk the abstract syntax tree and generate a subset of semantic entities in program database (PDB) ASCII text files, and a library interface to the PDB files that allows us to write static analysis tools.
- The instrumented source code is then compiled and linked with the TAU measurement library to produce an executable code.
- Opari inserts POMP (Mohr et al. 2002) annotations and rewrites OpenMP directives in the source code.
4.3 Compiler-Based Instrumentation
- A compiler can add instrumentation calls in the object code that it generates.
- The compiler has full access to source-level mapping information.
- It has the ability to choose the granularity of instrumentation and can include fine-grained instrumentation.
- The compiler strips the instrumentation calls from the source code and optimizes the compiled source code.
- The code then executes a branch to the instruction following the original instruction to continue execution.
4.4 Wrapper Library-Based Instrumentation
- A common technique to instrument library routines is to substitute the standard library routine with an instrumented version which in turn calls the original routine.
- The challenge is to do this without having to develop a different library just to alter the calling interface.
- The advantage of this approach is that librarylevel instrumentation can be implemented by defining a wrapper interposition library layer that inserts instrumentation calls before and after calls to the native routines.
- The authors developed a TAU MPI wrapper library that intercepts calls to the native library by defining routines with the same name, such as MPI_Send.
- In addition, TAU’s performance grouping capabilities allows MPI events to be presented with respect to high-level categories such as send and receive types.
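The name-shifted interposition idea can be sketched in a few lines. The real TAU MPI wrapper relies on the MPI profiling interface, in which every MPI_Xxx routine has a PMPI_Xxx alias; the sketch below emulates that with plain Python functions, so the function bodies are stand-ins:

```python
# Sketch of wrapper-library interposition in the spirit of TAU's MPI
# wrapper: a routine with the same name intercepts the call, performs
# measurement, and forwards to the native routine.
events = []

def PMPI_Send(data, dest):
    """Stands in for the native (uninstrumented) library routine."""
    return len(data)

def MPI_Send(data, dest):
    """Instrumented version with the same name as the native routine."""
    events.append(("enter", "MPI_Send"))     # measurement before the call
    result = PMPI_Send(data, dest)           # forward to the native routine
    events.append(("exit", "MPI_Send"))      # measurement after the call
    return result

sent = MPI_Send([1, 2, 3], dest=0)
print(sent, events)
```

The application links against the wrapper without any source changes: its calls to MPI_Send resolve to the instrumented version, which delegates to the original implementation.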
4.5 Binary Instrumentation
- TAU uses DyninstAPI (Buck and Hollingsworth 2000) for instrumenting the executable code of a program.
- The authors’ approach for TAU uses the DyninstAPI to construct calls to the TAU measurement library and then insert these calls into the executable code.
- Using the list of routines and their names, unique identifiers are assigned to each routine.
- Dynaprof is another tool that uses DyninstAPI for instrumentation.
- An interval event timer is defined to track the time spent in un-instrumented code.
4.6 Interpreter-Based Instrumentation
- Interpreted language environments present an interesting target for TAU integration.
- TAU has been integrated with Python by leveraging the Python interpreter’s debugging and profiling capabilities to instrument all entry and exit calls.
- A TAU interval event is created when a call is dispatched for the first time.
- Since shared objects are used in Python, instrumentation from multiple levels sees the same runtime performance data.
- Python is particularly interesting since it can be used to dynamically link and control multi-language executable modules.
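A minimal sketch of this interpreter-level approach uses Python’s standard `sys.setprofile` hook, the same mechanism TAU’s Python bindings build on; here the hook simply counts how often each function is dispatched:

```python
import sys

# Interpreter-based instrumentation: the profiler hook fires on every
# function entry, so no changes to the target source code are needed.
counts = {}

def hook(frame, event, arg):
    if event == "call":                      # Python-level function entry
        name = frame.f_code.co_name
        counts[name] = counts.get(name, 0) + 1

def target():
    pass

sys.setprofile(hook)                         # install the instrumentation
target()
target()
sys.setprofile(None)                         # remove it again
print(counts.get("target"))
```

Because the hook is installed in the interpreter itself, every entry and exit call in every loaded module is visible to the measurement layer.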
4.7 Component-Based Instrumentation
- Component technology extends the benefits of scripting systems and object-oriented design to support reuse and interoperability of component software, transparent of language and location (Szyperski 1997).
- Components are compiled into shared libraries and are loaded in, instantiated and composed into a useful code at runtime.
- There are two ways to instrument a component based application using TAU.
- A proxy component implements a port interface and has a provides and a uses port.
- The provides port is connected to the caller’s uses port and its uses port is connected to the callee’s provides port.
4.8 Virtual Machine-Based Instrumentation
- Support of performance instrumentation and measurement in language systems based on virtual machine (VM) execution poses several challenges.
- JVMPI provides profiling hooks into the virtual machine and allows a profiler agent to instrument the Java application without any changes to the source code, bytecode, or the executable code of the JVM.
- TAU maintains a per-thread performance data structure that is updated when a method entry or exit takes place.
- Since this is maintained on a per thread basis, it does not require mutual exclusion with other threads and is a low-overhead scalable data structure.
- When it receives a JVM shutdown event, it flushes the performance data for all running threads to the disk.
4.9 Multi-Level Instrumentation
- As the source code undergoes a series of transformations in the compilation, linking, and execution phases, it poses several constraints and opportunities for instrumentation.
- Instead of restricting the choice of instrumentation to one phase in the program transformation, TAU allows multiple instrumentation interfaces to be deployed concurrently for better coverage.
- It taps into performance data from multiple levels and presents it in a consistent and a uniform manner by integrating events from different languages and instrumentation levels in the same address space.
- TAU maintains performance data in a common structure for all events and allows external tools access to the performance data using a common interface.
4.10 Selective Instrumentation
- In support of the different instrumentation schemes TAU provides, a facility for selecting which of the possible events to instrument has been developed (Malony et al. 2003).
- The file is then used during the instrumentation process to restrict the event set.
- The basic structure of the file is a list of names separated into include and exclude lists.
- The selective instrumentation mechanism is being used in TAU for all automatic instrumentation methods, including PDT source instrumentation, DyninstAPI executable instrumentation, and component instrumentation.
- It has proven invaluable as a means to both weed out unwanted performance events, such as high frequency, small routines that generate excessive measurement overhead, and provide easy event configuration for customized performance experiments.
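The include/exclude mechanism can be illustrated with a simplified filter. The real selection file format also supports wildcards and file-based rules; this sketch matches exact names only:

```python
# Sketch of selective instrumentation: an event is instrumented only if
# it passes the include/exclude lists read from the selection file.
exclude = {"fast_inner_loop", "tiny_helper"}  # high-frequency small routines
include = set()                               # empty include list = allow all

def should_instrument(name):
    if name in exclude:                       # exclude list wins
        return False
    return not include or name in include     # otherwise honor include list

print(should_instrument("MPI_Send"), should_instrument("tiny_helper"))
```

Excluding a handful of tiny, frequently called routines in this way is often enough to bring measurement overhead down dramatically without losing the events of interest.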
4.11 TAU_COMPILER
- To simplify the integration of the source instrumentor and the MPI wrapper library in the build process, TAU provides a tool, tau_compiler.sh that can be invoked using a prefix of $(TAU_COMPILER) before the name of the compiler.
- In an application makefile, the variable F90=mpxlf90 is modified to F90=$(TAU_COMPILER) mpxlf90.
- It can distinguish between object code creation and linking phases of compilation and during linking, it inserts the MPI wrapper library and the TAU measurement library in the link command line.
- A user can easily integrate TAU’s portable performance instrumentation in the code generation process.
- Optional parameters can be passed to all four compilation phases.
5 Measurement
- All TAU instrumentation code makes calls to the TAU measurement system through an API that provides a portable and consistent set of measurement services.
- Again, the instrumentation layer is responsible for defining the performance events for an experiment, establishing relationships between events (e.g. groups, mappings), and managing those events in the context of the parallel computing model being used.
- Using the TAU measurement API, event information is passed in the probe calls to be used during measurement operations to link events with performance data.
- It is in the measurement system configuration and usage where all choices for what performance data to capture and in what manner are made.
- It is highly robust, scalable, and has been ported to all HPC platforms.
5.1 Performance Data Sources
- TAU provides access to various sources of performance data.
- Time is perhaps the most important and ubiquitous data type, but it comes in various forms on different system platforms.
- Through TAU configuration, all of the linkages to these packages are taken care of.
- Within the measurement system, TAU allows for multiple sources of performance data to be concurrently active.
- That is, it is possible for both profiling and tracing to work with multiple performance data sources.
5.2 Profiling
- Profiles are typically represented as a list of various metrics (such as wall-clock time) and associated statistics for all performance events in the program.
- There are different statistics kept for interval events (such as routines or statements in the program) versus atomic events.
- Typically one metric is measured during a profiling run.
- Internally, the TAU measurement system maintains a profile data structure for each node/context/thread.
- When the program execution completes, a separate profile file is created for each.
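The per node/context/thread bookkeeping described above can be sketched as follows. The `profile.<node>.<context>.<thread>` file-naming convention matches TAU’s output; the statistics kept here are simplified:

```python
from collections import defaultdict

# One profile table per (node, context, thread) triple, each holding
# per-event statistics and flushed separately when execution completes.
profiles = defaultdict(dict)   # (node, context, thread) -> {event: stats}

def record(node, context, thread, event, metric_value):
    stats = profiles[(node, context, thread)].setdefault(
        event, {"calls": 0, "total": 0.0})
    stats["calls"] += 1
    stats["total"] += metric_value

record(0, 0, 0, "main", 1.5)
record(0, 0, 1, "main", 2.0)
record(0, 0, 0, "main", 0.5)

# A separate profile file per node/context/thread, e.g. profile.0.0.0
filenames = ["profile.%d.%d.%d" % key for key in sorted(profiles)]
print(filenames)
```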
5.3 Flat Profiling
- The TAU profiling system supports several profiling variants.
- Trace analysis can then easily calculate callpath profiles.
- Thus, a parallel profile that showed how performance data was distributed at different levels of an unfolding event call tree could help to understand the performance better.
- When TAU is configured with the -PROFILEPHASE option, TAU will effectively generate a separate profile for each phase in the program’s execution.
- This top level phase contains other routines and phases that it directly invokes, but excludes routines called by child phases.
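Callpath profiling qualifies each event by the chain of interval events currently on the callstack, so the same routine invoked from different parents accumulates into distinct profile entries. A minimal sketch (the delimiter and counter are illustrative):

```python
# Sketch of callpath profiling: the profile key is the full chain of
# interval events on the callstack, not just the current event name.
callstack = []
callpath_profile = {}

def enter(name):
    callstack.append(name)
    path = " => ".join(callstack)            # e.g. "main => solve"
    callpath_profile[path] = callpath_profile.get(path, 0) + 1

def leave():
    callstack.pop()

enter("main")
enter("solve"); leave()
enter("io");    leave()
leave()
print(sorted(callpath_profile))
```

Calldepth profiling works the same way but truncates the chain at a configured depth, which prunes the search on the callstack.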
5.4 Tracing
- While profiling is used to get aggregate summaries of metrics in a compact form, it cannot highlight the time varying aspect of the execution.
- With tracing enabled, every node/context/thread will generate a trace for instrumented events.
- For runtime trace reading and analysis, it is important to understand what takes place when TAU records performance events in traces.
- In their more general and dynamic scheme, the event identifiers are generated on the fly, local to a context.
- It can parse binary merged or unmerged traces (and their respective event definition files) and provides this information to an analysis tool using a trace analysis API.
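The dynamic event-identifier scheme can be sketched as follows: identifiers are assigned on first occurrence, local to a context, and an event-definition table lets a trace reader decode the records it encounters. This is a simplification of the actual trace format:

```python
import itertools

# Sketch of on-the-fly event identifiers: the trace stores compact ids,
# and the event-definition table maps them back to names for analysis.
event_ids = {}                  # name -> locally assigned id
event_defs = {}                 # id -> name (the "event definition file")
next_id = itertools.count(1)
trace = []                      # (timestamp, event_id, kind) records

def log(timestamp, name, kind):
    if name not in event_ids:   # first occurrence: define the event
        eid = next(next_id)
        event_ids[name] = eid
        event_defs[eid] = name
    trace.append((timestamp, event_ids[name], kind))

log(10, "main", "enter")
log(12, "solve", "enter")
log(20, "solve", "exit")
log(21, "main", "exit")
decoded = [(t, event_defs[e], k) for t, e, k in trace]
print(decoded)
```

A reader that refreshes its table whenever it meets an unknown identifier can therefore process merged traces whose definitions arrive incrementally.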
5.5 Measurement Overhead
- The performance events of interest depend mainly on what aspect of the execution the user wants to see, so as to construct a meaningful performance view from the measurements made.
- Typical events include control flow events that identify points in the program that are executed, or operational events that occur when some operation or action has been performed.
- The authors define performance accuracy as the degree to which their performance measures correctly represent “actual” performance.
- If the authors attempt to measure a lot of events, the performance intrusion may be high because of the accumulated measurement overhead, regardless of the measurement accuracy for that event.
- TAU is a highly-engineered performance system and delivers excellent measurement efficiencies and low measurement overhead.
5.6 Overhead Compensation
- Unfortunately, by eliminating events from instrumentation, the authors lose the ability to see those events at all.
- On the other hand, accurate measurement is confounded by high relative overheads.
- The distortion in gathered performance data could be significant for a parallel program where the effects of perturbation are compounded by parallel execution and accumulation of overhead from remote processes.
- The authors have developed techniques in TAU profiling to compensate for measurement overhead at runtime.
- This is accomplished by tracking the number of descendant events and adjusting the total inclusive time at event exit.
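The compensation step reduces to a simple adjustment: subtract the estimated per-probe overhead, multiplied by the number of descendant measurement events, from the raw inclusive time at event exit. A sketch, with an arbitrary illustrative overhead constant:

```python
# Sketch of runtime overhead compensation: each timer counts the
# measurement events that fired beneath it and removes their estimated
# cost from its inclusive time. The constant is illustrative only.
OVERHEAD_PER_EVENT = 0.5        # estimated cost of one measurement probe

def compensate(raw_inclusive, descendant_events):
    return raw_inclusive - descendant_events * OVERHEAD_PER_EVENT

# A timer that measured 100.0 time units while 40 instrumented events
# fired inside it reports a compensated inclusive time of 80.0.
print(compensate(100.0, 40))
```

In a real system the per-probe overhead would itself be measured at startup rather than hard-coded.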
5.7 Performance Mapping
- The ability to associate low-level performance measurements with higher-level execution semantics is important in understanding parallel performance data with respect to application structure and dynamics.
- The idea is to provide a mechanism whereby performance measurements, made by the occurrence of instrumented performance events, can be associated with semantic abstractions, possibly at a different level of performance observation.
- TAU has implemented performance mapping as an integral part of its measurement system.
- To do this, the authors construct a key array that includes the identities of the current event and the parent phase.
- If the authors find the key, they access the profiling object and update its performance metrics.
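The key-array lookup described above amounts to: build a key from the parent phase and the current event, find or create the profile object for that key, and update its metrics. A hypothetical sketch:

```python
# Sketch of performance mapping: the profile object is selected by a
# key combining the current event with its parent phase, so the same
# event accumulates separately per phase. All names are illustrative.
profile_objects = {}

def lookup(current_event, parent_phase):
    key = (parent_phase, current_event)      # the "key array"
    if key not in profile_objects:           # first occurrence: create it
        profile_objects[key] = {"calls": 0, "time": 0.0}
    return profile_objects[key]

obj = lookup("MPI_Send", "initialization")
obj["calls"] += 1
obj = lookup("MPI_Send", "solve_phase")      # same event, different phase
obj["calls"] += 1
print(len(profile_objects))
```

The payoff is that a low-level event such as MPI_Send can be attributed to the high-level phase in which it occurred.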
6 Analysis
- TAU gives us the ability to track performance data in widely diverse environments, and thus provides a wealth of information to the user.
- It has been a continuing effort to include as part of TAU a set of analysis tools which can scale not only to the task of analyzing TAU data, but also to a more diverse arena outside of the TAU paradigm.
- This section discusses the development of these tools, and the resulting benefits to the user in performing the often complex task of analyzing performance data.
- The authors’ approach in this section will be to show the use of the TAU analysis tools on a single parallel application, S3D (Subramanya and Reddy 2000).
- S3D is a high-fidelity finite difference solver for compressible reacting flows which includes detailed chemistry computations.
6.1 ParaProf
- The TAU performance measurement system is capable of producing parallel profiles for thousands of processes consisting of hundreds of events.
- The result is high extensibility and flexibility, enabling us to tackle the issues of re-use and scalability.
- For example, DSS can be configured with profile input modules to read profiles from different sources.
- It supports many advanced capabilities required in a modern performance analysis system, such as derived metrics for relating performance data, cross-experiment analysis for analyzing data from disparate experiments, and data reduction for elimination of redundant data, thus allowing large data sources to be handled efficiently.
- To get a sense of the type of analysis displays ParaProf supports, Figure 7 shows the S3D flat profile (stacked view) on sixteen processes.
6.2 Performance Database Framework
- Empirical performance evaluation of parallel and distributed systems or applications often generates significant amounts of performance data and analysis results from multiple experiments and trials as performance is investigated and problems diagnosed.
- Profile data is organized such that for each combination of these items, an aggregate measurement is recorded.
- It builds on robust SQL relational database engines, some of which are freely distributed.
- To facilitate performance analysis development, the PerfDMF architecture includes a well-documented data management API to abstract query and analysis operations into a more programmatic, non-SQL, form.
- The last component, the profile analysis toolkit, is an extensible suite of common base analysis routines that can be reused across performance analysis programs.
6.3 Tracing
- The authors made an early decision in the TAU system to leverage existing trace analysis and visualization tools.
- For convenience, the TAU tracing system also allows trace files to be output directly in VTF3 and EPILOG formats.
- Expert is trace-based in its analysis and looks for performance problems that arise in the execution.
- Figure 13 shows a view from Expert using CUBE for S3D.
- The tool can take parameters identifying where to start and stop the profile generation in time, allowing parallel profiles to be generated for specific regions of the traces.
7 Conclusion
- Complex parallel systems and software pose challenging performance evaluation problems that require robust methodologies and tools.
- The TAU performance system addresses performance technology problems at three levels: instrumentation, measurement, and analysis.
- Portability, robustness, and extensibility are the hallmarks of the TAU parallel performance system.
- It is in use in scientific research groups, HPC centers, and industrial laboratories around the world.
Frequently Asked Questions (15)
Q2. What are the future works mentioned in the paper "The tau parallel performance system" ?
While performance evaluation of a system is directly affected by what constraints the system imposes on performance instrumentation and measurement capabilities, the desire for performance problem solving tools that are common and portable, now and into the future, suggests that performance tools hardened and customized for a particular system platform will be short-lived, with limited utility. Unless performance technology evolves with system technology, a chasm will remain between the users’ expectations and the capabilities that performance tools provide. However, effective exploration of performance will necessarily require prudent selection from the range of alternative methods TAU provides to assemble meaningful performance experiments that shed light on the relevant performance properties.
Q3. What are the two goals of performance technology for complex systems?
To address the dual goals of performance technology for complex systems – robust performance capabilities and widely available performance problem solving methodologies – the authors need to contend with problems of system diversity while providing flexibility in tool composition, configuration, and integration.
Q4. What is the function that constructs the proxy component?
It infers the arguments and return types of a port and its interfaces and constructs the source code of a proxy component, which when compiled and instantiated in the framework allows us to measure the performance of a component without any changes to its source or object code.
Q5. What is the function used to link events with performance data?
Using the TAU measurement API, event information is passed in the probe calls to be used during measurement operations to link events with performance data.
Q6. What is the advantage of using a common thread layer?
To deal with Java’s multi-threaded environment, TAU uses a common thread layer for operations such as getting the thread identifier, locking and unlocking the performance database, getting the number of concurrent threads, and so on.
Q7. What are the types of events that are used to identify points in the program that are executed?
Typical events include control flow events that identify points in the program that are executed, or operational events that occur when some operation or action has been performed.
Q8. What is the common technique to instrument library routines?
A common technique to instrument library routines is to substitute the standard library routine with an instrumented version which in turn calls the original routine.
Q9. Why does the tool programmer need to worry about vendor-specific SQL syntax?
Because all supported databases are accessed through a common interface, the tool programmer does not need to worry about vendor-specific SQL syntax.
Q10. What is the reason why tracing the program execution is not always feasible?
Tracing the program execution is not always feasible due to the high volume of performance data generated and the amount of trace processing needed.
Q11. What is the purpose of the profile analysis toolkit?
The last component, the profile analysis toolkit, is an extensible suite of common base analysis routines that can be reused across performance analysis programs.
Q12. How does the Python interpreter instrument all entry and exit calls?
TAU has been integrated with Python by leveraging the Python interpreter’s debugging and profiling capabilities to instrument all entry and exit calls.
Q13. Why does tau_merge not read the global event definitions?
It ensures that the trace analysis tools down the line that read the merged traces also read the global event definitions and refresh their internal tables when they encounter an event for which event definitions are not known.
Q14. What is the difference between calldepth and callpath profiling?
The implementation of calldepth profiling is similar to callpath profiling in that it requires dynamic event generation and profile object creation, but it benefits from certain efficiencies in pruning its search on the callstack.
Q15. Why is tau_merge unable to write trace records that contain one or more new events?
The trace generation library ensures that event tables are written to disk before writing trace records that contain one or more new events.