
The Tau Parallel Performance System

01 May 2006 – Vol. 20, Iss. 2, pp. 287–311
TL;DR: This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
Abstract: The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.

Summary (7 min read)

Introduction

  • The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving.
  • Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed.
  • This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
  • Lack of portable performance evaluation environments forces users to adopt different techniques on different systems, even for common performance analysis.

2 A General Computation Model for Parallel Performance Technology

  • To address the dual goals of performance technology for complex systems – robust performance capabilities and widely available performance problem solving methodologies – the authors need to contend with problems of system diversity while providing flexibility in tool composition, configuration, and integration.
  • In the model, a node is defined as a physically distinct machine with one or more processors sharing a physical memory system (i.e. a shared memory multiprocessor (SMP)).
  • A context is a distinct virtual address space within a node providing shared memory support for parallel software execution.
  • The computation model above is general enough to apply to many high-performance architectures as well as to different parallel programming paradigms.
  • When the authors consider a performance system to accommodate the range of instances, they can look to see what features are common and can be abstracted in the performance tool design.

3 TAU Performance System Architecture

  • The TAU performance system (Shende et al. 1998; Malony and Shende 2000) is designed as a tool framework in which tool components and modules are integrated and coordinate their operation using well-defined interfaces and data formats.
  • The TAU framework architecture is organized into three layers – instrumentation, measurement, and analysis – where within each layer multiple modules are available and can be configured in a flexible manner under user control.
  • The instrumentation layer is used to define events for performance experiments.
  • The performance measurement part supports two measurement forms: profiling and tracing.
  • Also distributed with TAU is the PerfDMF (Huck et al. 2005) tool providing multi-experiment parallel profile management.

4 Instrumentation

  • In order to observe performance, additional instructions or probes are typically inserted into a program.
  • As events execute, they activate the probes which perform measurements.
  • Thus, instrumentation exposes key characteristics of an execution.
  • In this section the authors describe the instrumentation options supported by TAU.

4.1 Source-Based Instrumentation

  • TAU provides an API that allows programmers to manually annotate the source code of the program.
  • Thus, language specific features (e.g. runtime type information for tracking templates in C++) can be leveraged.
  • TAU’s API can be broadly classified into five interfaces: the interval event, atomic event, query, control, and sampling interfaces (a usage sketch follows this list).
  • There are several ways to identify interval events and performance tools have used different techniques.
  • Control of interrupt period and selection of system properties to track are provided.
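
A minimal C++ sketch of how the interval and atomic event interfaces are used for manual annotation is shown below. The macro names (TAU_PROFILE, TAU_PROFILE_INIT, TAU_PROFILE_SET_NODE, TAU_REGISTER_EVENT, TAU_EVENT) follow the TAU measurement API as commonly documented, but exact signatures and required initialization should be checked against the TAU distribution; treat this as an illustration rather than authoritative usage.

```cpp
// Illustrative manual source instrumentation with TAU's C/C++ API.
// Verify macro names and signatures against your TAU version.
#include <TAU.h>

void compute(double *field, int n) {
  // Interval (timer) event bracketing this routine; in C++ the timer stops
  // automatically when the scope is exited.
  TAU_PROFILE("compute()", "void (double *, int)", TAU_USER);

  // Atomic event: records a value each time it is triggered.
  TAU_REGISTER_EVENT(bytes_event, "Bytes processed in compute");
  TAU_EVENT(bytes_event, n * sizeof(double));

  for (int i = 0; i < n; i++)
    field[i] = 0.5 * field[i] + 1.0;
}

int main(int argc, char **argv) {
  TAU_PROFILE("main()", "int (int, char **)", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);   // initialize measurement (non-MPI executable)
  TAU_PROFILE_SET_NODE(0);        // single-node example

  double field[1024] = {0.0};
  compute(field, 1024);
  return 0;
}
```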

4.2 Preprocessor-Based Instrumentation

  • This approach typically involves parsing the source code to infer where instrumentation probes are to be inserted.
  • Preprocessor-based instrumentation is also commonly used to insert performance measurement calls at interval entry and exit points in the source code.
  • PDT is comprised of commercial-grade front-ends that emit an intermediate language (IL) file, IL analyzers that walk the abstract syntax tree and generate a subset of semantic entities in program database (PDB) ASCII text files, and a library interface to the PDB files that allows us to write static analysis tools.
  • The instrumented source code is then compiled and linked with the TAU measurement library to produce an executable code.
  • Opari inserts POMP (Mohr et al. 2002) annotations and rewrites OpenMP directives in the source code.

4.3 Compiler-Based Instrumentation

  • A compiler can add instrumentation calls in the object code that it generates.
  • The compiler has full access to source-level mapping information.
  • It has the ability to choose the granularity of instrumentation and can include fine-grained instrumentation.
  • The compiler strips the instrumentation calls from the source code and optimizes the compiled source code.
  • The code then executes a branch to the instruction following the original instruction to continue execution.

4.4 Wrapper Library-Based Instrumentation

  • A common technique to instrument library routines is to substitute the standard library routine with an instrumented version which in turn calls the original routine.
  • The problem is that one would like to do this without having to develop a different library just to alter the calling interface.
  • The advantage of this approach is that library-level instrumentation can be implemented by defining a wrapper interposition library layer that inserts instrumentation calls before and after calls to the native routines.
  • The authors developed a TAU MPI wrapper library that intercepts calls to the native library by defining routines with the same name, such as MPI_Send (a sketch of this interposition idea follows this list).
  • In addition, TAU’s performance grouping capabilities allows MPI events to be presented with respect to high-level categories such as send and receive types.
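
The interposition idea can be pictured with a small sketch. The PMPI_ name-shifted entry points are part of the MPI standard's profiling interface; the measurement hooks below are placeholders standing in for calls into a measurement library, not TAU's actual internals.

```cpp
// Sketch of wrapper-library interposition for MPI_Send (not TAU's source code).
#include <mpi.h>

// Placeholder hooks representing a measurement library's start/stop calls.
static void wrapper_event_start(const char *name) { (void)name; /* start timer */ }
static void wrapper_event_stop(const char *name)  { (void)name; /* stop timer  */ }

// Defining MPI_Send in the wrapper library shadows the native routine at link
// time; the real implementation remains reachable through PMPI_Send.
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
  wrapper_event_start("MPI_Send()");
  int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
  wrapper_event_stop("MPI_Send()");
  return rc;
}
```

An application links against the wrapper library ahead of the native MPI library, so no application source changes are required.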

4.5 Binary Instrumentation

  • TAU uses DyninstAPI (Buck and Hollingsworth 2000) for instrumenting the executable code of a program.
  • The authors’ approach for TAU uses the DyninstAPI to construct calls to the TAU measurement library and then insert these calls into the executable code.
  • Using the list of routines and their names, unique identifiers are assigned to each routine.
  • Dynaprof is another tool that uses DyninstAPI for instrumentation.
  • An interval event timer is defined to track the time spent in un-instrumented code.

4.6 Interpreter-Based Instrumentation

  • Interpreted language environments present an interesting target for TAU integration.
  • TAU has been integrated with Python by leveraging the Python interpreter’s debugging and profiling capabilities to instrument all entry and exit calls.
  • A TAU interval event is created when a call is dispatched for the first time.
  • Since shared objects are used in Python, instrumentation from multiple levels sees the same runtime performance data.
  • Python is particularly interesting since it can be used to dynamically link and control multi-language executable modules.

4.7 Component-Based Instrumentation

  • Component technology extends the benefits of scripting systems and object-oriented design to support reuse and interoperability of component software, transparent of language and location (Szyperski 1997).
  • Components are compiled into shared libraries and are loaded in, instantiated and composed into a useful code at runtime.
  • There are two ways to instrument a component based application using TAU.
  • A proxy component implements a port interface and has a provides and a uses port.
  • The provides port is connected to the caller’s uses port and its uses port is connected to the callee’s provides port.

4.8 Virtual Machine-Based Instrumentation

  • Support of performance instrumentation and measurement in language systems based on virtual machine (VM) execution poses several challenges.
  • JVMPI provides profiling hooks into the virtual machine and allows a profiler agent to instrument the Java application without any changes to the source code, bytecode, or the executable code of the JVM.
  • TAU maintains a per-thread performance data structure that is updated when a method entry or exit takes place.
  • Since this is maintained on a per-thread basis, it does not require mutual exclusion with other threads and is a low-overhead, scalable data structure (the sketch after this list illustrates the idea).
  • When it receives a JVM shutdown event, it flushes the performance data for all running threads to the disk.
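
The per-thread bookkeeping idea can be illustrated with a short C++ sketch (our illustration of the principle, not TAU's implementation): thread-local storage gives each thread its own table, so entry and exit updates need no locking, and each table is flushed when the runtime signals shutdown.

```cpp
// Illustration of lock-free per-thread performance bookkeeping (not TAU's code).
#include <cstdio>
#include <string>
#include <unordered_map>

struct MethodStats {
  long   calls = 0;
  double inclusive_time = 0.0;   // accumulated at method exit
};

// One table per thread: no mutual exclusion is needed for updates.
thread_local std::unordered_map<std::string, MethodStats> per_thread_profile;

void on_method_entry(const std::string &name) {
  per_thread_profile[name].calls++;   // a real system also records a timestamp
}

void flush_thread_profile(std::FILE *out) {
  // Invoked per thread at shutdown (e.g., on a JVM shutdown event).
  for (const auto &kv : per_thread_profile)
    std::fprintf(out, "%s %ld %g\n", kv.first.c_str(),
                 kv.second.calls, kv.second.inclusive_time);
}
```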

4.9 Multi-Level Instrumentation

  • As the source code undergoes a series of transformations in the compilation, linking, and execution phases, it poses several constraints and opportunities for instrumentation.
  • Instead of restricting the choice of instrumentation to one phase in the program transformation, TAU allows multiple instrumentation interfaces to be deployed concurrently for better coverage.
  • It taps into performance data from multiple levels and presents it in a consistent and a uniform manner by integrating events from different languages and instrumentation levels in the same address space.
  • TAU maintains performance data in a common structure for all events and allows external tools access to the performance data using a common interface.

4.10 Selective Instrumentation

  • In support of the different instrumentation schemes TAU provides, a facility for selecting which of the possible events to instrument has been developed (Malony et al. 2003).
  • The file is then used during the instrumentation process to restrict the event set.
  • The basic structure of the file is a list of names separated into include and exclude lists (see the sketch after this list).
  • The selective instrumentation mechanism is being used in TAU for all automatic instrumentation methods, including PDT source instrumentation, DyninstAPI executable instrumentation, and component instrumentation.
  • It has proven invaluable as a means to both weed out unwanted performance events, such as high frequency, small routines that generate excessive measurement overhead, and provide easy event configuration for customized performance experiments.
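
For illustration, a selective instrumentation file might look like the sketch below: routines in the exclude list (for example, small, high-frequency routines) are skipped, while an include list restricts instrumentation to the named events. The directive keywords and routine-signature syntax shown here are assumptions about the file format and should be checked against the TAU documentation.

```
BEGIN_EXCLUDE_LIST
void swap(int *, int *)
double dot_product(double *, double *, int)
END_EXCLUDE_LIST

BEGIN_INCLUDE_LIST
void solve(Grid *)
void exchange_halo(Grid *)
END_INCLUDE_LIST
```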

4.11 TAU_COMPILER

  • To simplify the integration of the source instrumentor and the MPI wrapper library in the build process, TAU provides a tool, tau_compiler.sh, that can be invoked using a prefix of $(TAU_COMPILER) before the name of the compiler.
  • In an application makefile, the variable F90=mpxlf90 is modified to F90=$(TAU_COMPILER) mpxlf90, as sketched after this list.
  • It can distinguish between object code creation and linking phases of compilation and during linking, it inserts the MPI wrapper library and the TAU measurement library in the link command line.
  • A user can easily integrate TAU’s portable performance instrumentation in the code generation process.
  • Optional parameters can be passed to all four compilation phases.
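
The makefile change amounts to a one-line edit, sketched below. The stub makefile path is hypothetical and installation-specific; it is assumed to define $(TAU_COMPILER), which expands to tau_compiler.sh.

```makefile
# Hypothetical application Makefile fragment.
# The include path is illustrative; the stub makefile defines $(TAU_COMPILER).
include /path/to/tau/lib/Makefile.tau-mpi-pdt

# Before: F90 = mpxlf90
F90 = $(TAU_COMPILER) mpxlf90

# Existing compile and link rules stay unchanged; during the link step,
# tau_compiler.sh adds the MPI wrapper and TAU measurement libraries.
```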

5 Measurement

  • All TAU instrumentation code makes calls to the TAU measurement system through an API that provides a portable and consistent set of measurement services.
  • Again, the instrumentation layer is responsible for defining the performance events for an experiment, establishing relationships between events (e.g. groups, mappings), and managing those events in the context of the parallel computing model being used.
  • Using the TAU measurement API, event information is passed in the probe calls to be used during measurement operations to link events with performance data.
  • It is in the measurement system configuration and usage where all choices for what performance data to capture and in what manner are made.
  • It is highly robust, scalable, and has been ported to all HPC platforms.

5.1 Performance Data Sources

  • TAU provides access to various sources of performance data.
  • Time is perhaps the most important and ubiquitous data type, but it comes in various forms on different system platforms.
  • Through TAU configuration, all of the linkages to these packages are taken care of.
  • Within the measurement system, TAU allows for multiple sources of performance data to be concurrently active.
  • That is, it is possible for both profiling and tracing to work with multiple sources of performance data.

5.2 Profiling

  • Profiles are typically represented as a list of various metrics (such as wall-clock time) and associated statistics for all performance events in the program.
  • There are different statistics kept for interval events (such as routines or statements in the program) versus atomic events.
  • Typically one metric is measured during a profiling run.
  • Internally, the TAU measurement system maintains a profile data structure for each node/context/thread (see the sketch after this list).
  • When the program execution completes, a separate profile file is created for each.
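
The per-thread profile bookkeeping can be pictured with the following sketch (an illustration, not TAU's actual data structures): each node/context/thread keeps one record per event, and at program completion each table is written to its own profile file.

```cpp
// Illustrative per node/context/thread profile storage (not TAU's implementation).
#include <map>
#include <string>

struct IntervalEventProfile {
  long   calls       = 0;    // times the interval was entered
  long   child_calls = 0;    // instrumented events invoked from within it
  double exclusive   = 0.0;  // metric spent in the event itself
  double inclusive   = 0.0;  // metric including all descendants
};

struct AtomicEventProfile {
  long   count = 0;                         // number of recorded samples
  double sum = 0.0, min = 0.0, max = 0.0;   // summary statistics of the values
};

// One such table exists per node/context/thread; at exit, each is written
// to a separate profile file for later analysis.
struct ThreadProfile {
  std::map<std::string, IntervalEventProfile> interval_events;
  std::map<std::string, AtomicEventProfile>   atomic_events;
};
```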

5.3 Flat Profiling

  • The TAU profiling system supports several profiling variants.
  • Trace analysis can then easily calculate callpath profiles.
  • Thus, a parallel profile that showed how performance data was distributed at different levels of an unfolding event call tree could help to understand the performance better.
  • When TAU is configured with the -PROFILEPHASE option, TAU will effectively generate a separate profile for each phase in the program’s execution.
  • This top level phase contains other routines and phases that it directly invokes, but excludes routines called by child phases.

5.4 Tracing

  • While profiling is used to get aggregate summaries of metrics in a compact form, it cannot highlight the time varying aspect of the execution.
  • With tracing enabled, every node/context/thread will generate a trace for instrumented events.
  • For runtime trace reading and analysis, it is important to understand what takes place when TAU records performance events in traces.
  • In their more general and dynamic scheme, the event identifiers are generated on the fly, local to a context (the sketch after this list illustrates the idea).
  • It can parse binary merged or unmerged traces (and their respective event definition files) and provides this information to an analysis tool using a trace analysis API.
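
The role of locally generated event identifiers can be made concrete with a small sketch (purely illustrative; this is not TAU's trace format): each context carries its own identifier-to-name definitions alongside timestamped records, and merging traces amounts to unifying those local identifier spaces.

```cpp
// Illustrative trace structures (not TAU's on-disk trace format).
#include <cstdint>
#include <string>
#include <vector>

// Event identifiers are generated on the fly, local to a context, so every
// context keeps its own id-to-name definition table.
struct EventDefinition {
  std::uint32_t local_id;
  std::string   name;          // e.g. "MPI_Send()"
};

// One timestamped record per event occurrence.
struct TraceRecord {
  std::uint64_t timestamp;       // e.g. microseconds since program start
  std::uint32_t node, thread;    // where the event occurred
  std::uint32_t local_event_id;  // resolved against the context's definitions
  std::int64_t  parameter;       // entry/exit flag, message size, and so on
};

struct ContextTrace {
  std::vector<EventDefinition> definitions;
  std::vector<TraceRecord>     records;
};
```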

5.5 Measurement Overhead

  • The performance events of interest depend mainly on what aspect of the execution the user wants to see, so as to construct a meaningful performance view from the measurements made.
  • Typical events include control flow events that identify points in the program that are executed, or operational events that occur when some operation or action has been performed.
  • The authors define performance accuracy as the degree to which their performance measures correctly represent “actual” performance.
  • If the authors attempt to measure a lot of events, the performance intrusion may be high because of the accumulated measurement overhead, regardless of the measurement accuracy for that event.
  • TAU is a highly-engineered performance system and delivers excellent measurement efficiencies and low measurement overhead.

5.6 Overhead Compensation

  • Unfortunately, by eliminating events from instrumentation, the authors lose the ability to see those events at all.
  • On the other hand, accurate measurement is confounded by high relative overheads.
  • The distortion in gathered performance data could be significant for a parallel program where the effects of perturbation are compounded by parallel execution and accumulation of overhead from remote processes.
  • The authors have developed techniques in TAU profiling to compensate for measurement overhead at runtime.
  • This is accomplished by tracking the number of descendant events and adjusting the total inclusive time at event exit, as sketched after this list.
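
The compensation arithmetic can be sketched as follows, under the assumption of a fixed, pre-calibrated cost per measured event (for example, estimated by timing a tight loop of start/stop pairs): at event exit, the inclusive time is reduced by the overhead attributable to descendant measurements.

```cpp
// Sketch of runtime overhead compensation at event exit (illustrative only).
struct TimerState {
  double start_time;        // timestamp taken at event entry
  long   descendants = 0;   // instrumented events observed inside this interval
};

double compensated_inclusive(const TimerState &t, double stop_time,
                             double overhead_per_event) {
  double measured = stop_time - t.start_time;
  // Subtract the measurement cost accumulated by descendant events so the
  // reported inclusive time approximates the uninstrumented execution.
  return measured - t.descendants * overhead_per_event;
}
```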

5.7 Performance Mapping

  • The ability to associate low-level performance measurements with higher-level execution semantics is important in understanding parallel performance data with respect to application structure and dynamics.
  • The idea is to provide a mechanism whereby performance measurements, made by the occurrence of instrumented performance events, can be associated with semantic abstractions, possibly at a different level of performance observation.
  • TAU has implemented performance mapping as an integral part of its measurement system.
  • To do this, the authors construct a key array that includes the identities of the current event and the parent phase.
  • If the authors find the key, they access the profiling object and update its performance metrics (a sketch of this lookup follows this list).
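
A minimal sketch of this lookup (an illustration of the mechanism, not TAU's internal code): the key combines the current event with its parent phase, and the first occurrence of a key creates a fresh profile object that subsequent occurrences update.

```cpp
// Illustrative (event, parent phase) mapping table for phase-aware profiling.
#include <map>
#include <utility>

struct ProfileObject {
  long   calls = 0;
  double exclusive = 0.0, inclusive = 0.0;
};

// Key built from the identities of the current event and its parent phase.
using MappingKey = std::pair<long, long>;
std::map<MappingKey, ProfileObject> mapping_table;

ProfileObject &profile_for(long event_id, long parent_phase_id) {
  MappingKey key{event_id, parent_phase_id};
  // If the key exists, the caller updates the existing profile object;
  // otherwise a new one is created for this (event, phase) pair.
  return mapping_table[key];
}
```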

6 Analysis

  • TAU gives us the ability to track performance data in widely diverse environments, and thus provides a wealth of information to the user.
  • It has been a continuing effort to include as part of TAU a set of analysis tools which can scale not only to the task of analyzing TAU data, but also to a more diverse arena outside of the TAU paradigm.
  • This section discusses the development of these tools, and the resulting benefits to the user in performing the often complex task of analyzing performance data.
  • The authors’ approach in this section will be to show the use of the TAU analysis tools on a single parallel application, S3D (Subramanya and Reddy 2000).
  • S3D is a high-fidelity finite difference solver for compressible reacting flows which includes detailed chemistry computations.

6.1 ParaProf

  • The TAU performance measurement system is capable of producing parallel profiles for thousands of processes consisting of hundreds of events.
  • The result is high extensibility and flexibility, enabling us to tackle the issues of re-use and scalability.
  • For one, DSS can be configured with profile input modules to read profiles from different sources.
  • It supports many advanced capabilities required in a modern performance analysis system, such as derived metrics for relating performance data, cross-experiment analysis for analyzing data from disparate experiments, and data reduction for elimination of redundant data, thus allowing large data sources to be handled efficiently.
  • To get a sense of the type of analysis displays ParaProf supports, Figure 7 shows the S3D flat profile (stacked view) on sixteen processes.

6.2 Performance Database Framework

  • Empirical performance evaluation of parallel and distributed systems or applications often generates significant amounts of performance data and analysis results from multiple experiments and trials as performance is investigated and problems diagnosed.
  • Profile data is organized such that for each combination of these items, an aggregate measurement is recorded.
  • It builds on robust SQL relational database engines, some of which are freely distributed.
  • To facilitate performance analysis development, the PerfDMF architecture includes a well-documented data management API to abstract query and analysis operations into a more programmatic, non-SQL, form.
  • The last component, the profile analysis toolkit, is an extensible suite of common base analysis routines that can be reused across performance analysis programs.

6.3 Tracing

  • The authors made an early decision in the TAU system to leverage existing trace analysis and visualization tools.
  • For convenience, the TAU tracing system also allows trace files to be output directly in VTF3 and EPILOG formats.
  • Expert is trace-based in its analysis and looks for performance problems that arise in the execution.
  • Figure 13 shows a view from Expert using CUBE for S3D.
  • The tool can take parameters identifying where to start and stop the profile generation in time, allowing parallel profiles to be generated for specific regions of the traces.

7 Conclusion

  • Complex parallel systems and software pose challenging performance evaluation problems that require robust methodologies and tools.
  • The TAU performance system addresses performance technology problems at three levels: instrumentation, measurement, and analysis.
  • Portability, robustness, and extensibility are the hallmarks of the TAU parallel performance system.
  • It is in use in scientific research groups, HPC centers, and industrial laboratories around the world.


THE TAU PARALLEL PERFORMANCE SYSTEM
Sameer S. Shende
Allen D. Malony
DEPARTMENT OF COMPUTER AND INFORMATION
SCIENCE, UNIVERSITY OF OREGON, EUGENE, OR
(SAMEER@CS.UOREGON.EDU)
Abstract
The ability of performance technology to keep pace with
the growing complexity of parallel and distributed systems
depends on robust performance frameworks that can at
once provide system-specific performance capabilities and
support high-level performance problem solving. Flexibility
and portability in empirical methods and processes are
influenced primarily by the strategies available for instru-
mentation and measurement, and how effectively they are
integrated and composed. This paper presents the TAU
(Tuning and Analysis Utilities) parallel performance sys-
tem and describes how it addresses diverse requirements
for performance observation and analysis.
Key words: Performance evaluation, instrumentation, meas-
urement, analysis, TAU
1 Introduction
The evolution of computer systems and of the applications
that run on them – towards more sophisticated modes of
operation, higher levels of abstraction, and larger scale of
execution – challenges the state of technology for empiri-
cal performance evaluation. The increasing complexity of
parallel and distributed systems, coupled with emerging
portable parallel programming methods, demands that
empirical performance tools provide robust performance
observation capabilities at all levels of a system, while
mapping low-level behavior to high-level performance
abstractions in a uniform manner.
Given the diversity of performance problems, evalua-
tion methods, and types of events and metrics, the instru-
mentation and measurement mechanisms needed to support
performance observation must be flexible, to give maximum
opportunity for configuring performance experiments,
and portable, to allow consistent cross-platform perform-
ance problem solving. In general, flexibility in empirical
performance evaluation implies freedom in experiment
design, and choices in selection and control of experiment
mechanisms. Using tools that otherwise limit the type and
structure of performance methods will restrict evaluation
scope. Portability, on the other hand, looks for common
abstractions in performance methods and how these can
be supported by reusable and consistent techniques across
different computing environments (software and hardware).
Lack of portable performance evaluation environments
forces users to adopt different techniques on different sys-
tems, even for common performance analysis.
The TAU (Tuning and Analysis Utilities) parallel per-
formance system is the product of fourteen years of devel-
opment to create a robust, flexible, portable, and integrated
framework and toolset for performance instrumentation,
measurement, analysis, and visualization of large-scale
parallel computer systems and applications. The success
of the TAU project represents the combined efforts of
researchers at the University of Oregon and colleagues at
the Research Centre Juelich and Los Alamos National
Laboratory. The purpose of this paper is to provide a
complete overview of the TAU system. The discussion
will be organized first according to the TAU system archi-
tecture and second from the point of view of how to use
TAU in practice.
2 A General Computation Model for
Parallel Performance Technology
To address the dual goals of performance technology for
complex systems – robust performance capabilities and
widely available performance problem solving method-
ologies – we need to contend with problems of system
diversity while providing flexibility in tool composition,
configuration, and integration. One approach to address
these issues is to focus attention on a sub-class of compu-
tation models and performance problems as a way to
restrict the performance technology requirements. The
obvious consequence of this approach is limited tool cov-
erage. Instead, our idea is to define an abstract computation
model that captures general architecture and software
execution features and can be mapped straightforwardly
to existing complex system types. For this model, we can
target performance capabilities and create a tool frame-
work that can adapt and be optimized for particular com-
plex system cases.
Our choice of general computation model must reflect
real computing environments, both in terms of the paral-
lel systems architecture and the parallel software envi-
ronment. The computational model we target was initially
proposed by the HPC++ consortium (HPC++ Working
Group 1995) and is illustrated in Figure 1. Two com-
bined views of the model are shown: a physical (hard-
ware) view and an abstract software view. In the model, a
node is defined as a physically distinct machine with one
or more processors sharing a physical memory system
(i.e. a shared memory multiprocessor (SMP)). A node
may link to other nodes via a protocol-based interconnect,
ranging from proprietary networks, as found in tradi-
tional MPPs, to local- or global-area networks. Nodes and
their interconnection infrastructure provide a hardware
execution environment for parallel software computa-
tion. A context is a distinct virtual address space within a
node providing shared memory support for parallel soft-
ware execution. Multiple contexts may exist on a single
node. Multiple threads of execution, both user and sys-
tem level, may exist within a context; threads within a
context share the same virtual address space. Threads in
different contexts on the same node can interact via inter-
process communication (IPC) facilities, while threads in
contexts on different nodes communicate using message
passing libraries (e.g. MPI) or network IPC. Shared-mem-
ory implementations of message passing can also be used
for fast intra-node context communication. The bold arrows
in the figure reflect scheduling of contexts and threads on
the physical node resources.
The computation model above is general enough to
apply to many high-performance architectures as well as
to different parallel programming paradigms. Particular
instances of the model and how it is programmed defines
requirements for performance tool technology. That is,
by considering different instances of the general computing
model and the abstract operation of each, we can identify
important capabilities that a performance tool should sup-
port for each model instance. When we consider a per-
formance system to accommodate the range of instances,
we can look to see what features are common and can be
abstracted in the performance tool design. In this way,
the capability abstraction allows the performance system
to retain uniform interfaces across the range of parallel
platforms, while specializing tool support for the particu-
lar model instance.
3 TAU Performance System Architecture
The TAU performance system (Shende et al. 1998; Malony
and Shende 2000; University of Oregon b) is designed as
a tool framework, whereby tool components and modules
are integrated and coordinate their operation using well-
defined interfaces and data formats.

Fig. 1 Execution model supported by TAU.

The TAU framework
architecture is organized into three layers – instrumenta-
tion, measurement, and analysis – where within each layer
multiple modules are available and can be configured in
a flexible manner under user control.
The instrumentation and measurement layers of the
TAU framework are shown in Figure 2. TAU supports
a flexible instrumentation model that allows the user
to insert performance instrumentation calling the TAU
measurement API at different, multiple levels of pro-
gram code representation, transformation, compilation,
and execution. The key concept of the instrumentation
layer is that it is here where performance events are
defined. The instrumentation mechanisms in TAU sup-
port several types of performance events, including
events defined by code location (e.g. routines or blocks),
library interface events, system events, and arbitrary
user-defined events. TAU is also aware of events associ-
ated with message passing and multi-threading parallel
execution. The instrumentation layer is used to define
events for performance experiments. Thus, one output of
instrumentation is information about the events for a
performance experiment. This information will be used
by other tools.
Fig. 2 Architecture of TAU Performance System – Instrumentation and Measurement.

The instrumentation layer interfaces with the measure-
ment layer through the TAU measurement API. TAU’s
measurement system is organized into four parts. The event
creation and management part determines how events are
processed. Events are dynamically created in the TAU
system as the result of their instrumentation and occur-
rence during execution. Two types of events are sup-
ported: entry/exit events and atomic events. In addition,
TAU provides the mapping of performance measurements
for “low-level” events to high-level execution entities.
Overall, this part provides the mechanisms to manage
events as a performance experiment proceeds. It includes
the grouping of events and their runtime measurement
control. The performance measurement part supports two
measurement forms: profiling and tracing. For each form,
TAU provides the complete infrastructure to manage the
measured data during execution at any scale (number of
events or parallelism). The performance data sources part
defines what performance data is measurable and can be
used in profiling or tracing. TAU supports different timing
sources, choice of hardware counters through the PAPI
(Browne et al. 2000) or PCL (Berrendorf, Ziegler, and
Mohr) interfaces, and access to system performance data.
The OS and runtime system part provides the coupling
between TAU’s measurement system and the underlying
parallel system platform. TAU specializes and optimizes
its execution according to the platform features available.
The TAU measurement systems can be customized and
configured for each performance experiment by compos-
ing specific modules for each part and setting runtime
controls. For instance, based on the composition of mod-
ules, an experiment could easily be configured to measure
the profile that shows the inclusive and exclusive counts
of secondary data cache misses associated with basic
blocks such as routines, or a group of statements. By pro-
viding a flexible measurement infrastructure, a user can
experiment with different attributes of the system and
iteratively refine the performance characterization of a
parallel application.
The TAU analysis and visualization layer is shown in
Figure 3. As in the instrumentation and measurement
layer, TAU flexibility allows use of several modules. These
are separated between those for parallel profile analysis
and parallel trace analysis. For each, support is given to the
management of the performance data (profiles or traces),
including the conversion to/from different formats. TAU
comes with both text-based and graphical tools to visual-
ize the performance profiles. ParaProf (Bell, Malony, and
Shende 2003) is TAU’s parallel profile analysis and visu-
alization tool. Also distributed with TAU is the PerfDMF
(Huck et al. 2005) tool providing multi-experiment paral-
lel profile management. Given the wealth of third-party
trace analysis and visualization tools, TAU does not imple-
ment its own. However, trace translation tools are imple-
mented to enable use of Vampir (Intel Corporation; Nagel
et al. 1996), Jumpshot (Wu et al. 2000), and Paraver (Euro-
pean Center for Parallelism of Barcelona (CEPBA)). It is
also possible to generate EPILOG (Mohr and Wolf 2003)
trace files for use with the Expert (Wolf et al. 2004) anal-
ysis tool. All TAU profile and trace data formats are open.
The framework approach to TAU’s architecture design
guarantees the most flexibility in configuring TAU capa-
bilities to the requirements of the parallel performance
experimentation and problem solving the user demands.
In addition, it allows TAU to extend these capabilities to
include the rich technology being developed by other per-
formance tool research groups. In the sections that follow,
we look at each framework layer in more depth and dis-
cuss in detail what can be done with the TAU perform-
ance system.
4 Instrumentation
In order to observe performance, additional instructions
or probes are typically inserted into a program. This proc-
ess is called instrumentation. From this perspective, the
execution of a program is regarded as a sequence of
significant performance events. As events execute, they
activate the probes which perform measurements. Thus,
instrumentation exposes key characteristics of an execu-
tion. Instrumentation can be introduced in a program at
several levels of the program transformation process. In
this section we describe the instrumentation options sup-
ported by TAU.
4.1 Source-Based Instrumentation
TAU provides an API that allows programmers to manu-
ally annotate the source code of the program. Source-
level instrumentation can be placed at any point in the
program and it allows a direct association between lan-
guage- and program-level semantics and performance
measurements. Using cross-language bindings, TAU pro-
vides its API in C++, C, Fortran, Java, and Python lan-
guages. Thus, language specific features (e.g. runtime type
information for tracking templates in C++) can be lever-
aged. TAU also provides a higher-level specification in
SIDL (Kohn et al. 2001; Shende et al. 2003) for cross-lan-
guage portability and deployment in component-based
programming environments (Bernholdt et al. 2005).
TAU’s API can be broadly classified into the following
five interfaces:
Interval event interface
Atomic event interface
Query interface
Control interface
Sampling interface

4.1.1 Interval event interface TAU supports the abil-
ity to make performance measurements with respect to
event intervals. An event interval is defined by its start
events and its stop events. A user may bracket parts of
his/her code to specify a region of interest using a pair of
start and stop event calls. There are several ways to iden-
tify interval events and performance tools have used dif-
ferent techniques. It is probably more recognizable to
talk about interval events as timers. To identify a timer,
some tools advocate the use of numeric identifiers and an
associated table mapping the identifiers to timer names.
While it is easy to specify and pass the timer identifier
among start and stop routines, it has its drawbacks. Main-
taining a table statically might work for languages such
as Fortran 90 and C, but it extends poorly to C++, where
a template may be instantiated with different parameters.
This aspect of compile time polymorphism makes it dif-
ficult to disambiguate between different instantiations of
the same code. Also, it can introduce instrumentation
errors in maintaining the table that maps the identifiers to
names. This is true for large projects that involve several
application modules and developers.
Our interface uses a dynamic naming scheme where
interval event (timer) names are associated with the per-
formance data (timer) object at runtime. An interval event
can have a unique name and a signature that can be obtained
at runtime. In the case of C++, this is done using runtime
type information of objects. Several logically related inter-
Fig. 3 Architecture of TAU Performance System – Analysis and Visualization.

Citations
Journal ArticleDOI
TL;DR: An overview of HPCTOOLKIT is provided and its utility for performance analysis of parallel applications is illustrated.
Abstract: SUMMARY HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully-optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully-optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space-time diagrams based on traces of asynchronous call stack samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.

536 citations

Journal ArticleDOI
TL;DR: In this article, the authors present results from terascale direct numerical simulations (DNS) of turbulent flames, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame.
Abstract: Computational science is paramount to the understanding of underlying processes in internal combustion engines of the future that will utilize non-petroleum-based alternative fuels, including carbon-neutral biofuels, and burn in new combustion regimes that will attain high efficiency while minimizing emissions of particulates and nitrogen oxides. Next-generation engines will likely operate at higher pressures, with greater amounts of dilution and utilize alternative fuels that exhibit a wide range of chemical and physical properties. Therefore, there is a significant role for high-fidelity simulations, direct numerical simulations (DNS), specifically designed to capture key turbulence-chemistry interactions in these relatively uncharted combustion regimes, and in particular, that can discriminate the effects of differences in fuel properties. In DNS, all of the relevant turbulence and flame scales are resolved numerically using high-order accurate numerical algorithms. As a consequence terascale DNS are computationally intensive, require massive amounts of computing power and generate tens of terabytes of data. Recent results from terascale DNS of turbulent flames are presented here, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame. Computing at this scale requires close collaborations between computer and combustion scientists to provide optimized scaleable algorithms and software for terascale simulations, efficient collective parallel I/O, tools for volume visualization of multiscale, multivariate data and automating the combustion workflow. The enabling computer science, applied to combustion science, is also required in many other terascale physics and engineering simulations. In particular, performance monitoring is used to identify the performance of key kernels in the DNS code, S3D and especially memory intensive loops in the code. Through the careful application of loop transformations, data reuse in cache is exploited thereby reducing memory bandwidth needs, and hence, improving S3D's nodal performance. To enhance collective parallel I/O in S3D, an MPI-I/O caching design is used to construct a two-stage write-behind method for improving the performance of write-only operations. The simulations generate tens of terabytes of data requiring analysis. Interactive exploration of the simulation data is enabled by multivariate time-varying volume visualization. The visualization highlights spatial and temporal correlations between multiple reactive scalar fields using an intuitive user interface based on parallel coordinates and time histogram. Finally, an automated combustion workflow is designed using Kepler to manage large-scale data movement, data morphing, and archival and to provide a graphical display of run-time diagnostics.

510 citations

Cites methods from "The Tau Parallel Performance System..."

  • ...We performed a detailed performance analysis of runs on heterogeneous allocations using TAU [18]....

Journal IssueDOI
TL;DR: The current toolset architecture is reviewed, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior.
Abstract: Scalasca is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems with many thousands of processors. It offers an incremental performance-analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very large numbers of processes and to combine these with efficiently summarized local measurements. In this article, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior. The scalability and effectiveness of Scalasca are then surveyed from experience measuring and analyzing real-world applications on a range of computer systems. Copyright © 2010 John Wiley & Sons, Ltd.

360 citations


Cites background from "The Tau Parallel Performance System..."

  • ...Based on the postmortem analysis presentation of direct measurements, traced or summarized at runtime, Scalasca is closely related to TAU [18]....

References
Book
23 Nov 2002
TL;DR: Anyone responsible for developing software strategy, evaluating new technologies, buying or building software will find Clemens Szyperski's objective and market-aware perspective of this new area invaluable.
Abstract: From the Publisher: Component Software: Beyond Object-Oriented Programming explains the technical foundations of this evolving technology and its importance in the software market place. It provides in-depth discussion of both the technical and the business issues to be considered, then moves on to suggest approaches for implementing component-oriented software production and the organizational requirements for success. The author draws on his own experience to offer tried-and-tested solutions to common problems and novel approaches to potential pitfalls. Anyone responsible for developing software strategy, evaluating new technologies, buying or building software will find Clemens Szyperski's objective and market-aware perspective of this new area invaluable.

4,791 citations


"The Tau Parallel Performance System..." refers background in this paper

  • ...To study the post-mortem spatial and temporal aspect of performance data, event tracing, that is, the activity of capturing an event or an action that takes place in the program, is more appropriate....

  • ...Component technology extends the benefits of scripting systems and object-oriented design to support reuse and interoperability of component software, transparent of language and location (Szyperski 1997)....

01 Apr 1994
TL;DR: This document contains all the technical features proposed for the interface and the goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs.
Abstract: The Message Passing Interface Forum (MPIF), with participation from over 40 organizations, has been meeting since November 1992 to discuss and define a set of library standards for message passing. MPIF is not sanctioned or supported by any official standards organization. The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient and flexible standard for message passing. This is the final report, Version 1.0, of the Message Passing Interface Forum. This document contains all the technical features proposed for the interface. This copy of the draft was processed by LaTeX on April 21, 1994. Please send comments on MPI to mpi-comments@cs.utk.edu. Your comment will be forwarded to MPIF committee members who will attempt to respond.

3,181 citations

Proceedings ArticleDOI
01 Jun 1982
TL;DR: The gprof profiler accounts for the running time of called routines in therunning time of the routines that call them, and the design and use of this profiler is described.
Abstract: Large complex programs are composed of many small routines that implement abstractions for the routines that call them. To be useful, an execution profiler must attribute execution time in a way that is significant for the logical structure of a program as well as for its textual decomposition. This data must then be displayed to the user in a convenient and informative way. The gprof profiler accounts for the running time of called routines in the running time of the routines that call them. The design and use of this profiler is described.

1,134 citations

Journal ArticleDOI
Shirley Browne, Jack Dongarra, N. Garner, G. Ho, Philip J. Mucci
01 Aug 2000
TL;DR: The purpose of the PAPI project is to specify a standard application programming interface for accessing hardware performance counters available on most modern microprocessors, which exist as a small set of registers that count events.
Abstract: The purpose of the PAPI project is to specify a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis, including hand tuning, compiler optimization, debugging, benchmarking, monitoring, and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development toward alleviating commonly occurring bottlenecks in high performance computing.

692 citations


"The Tau Parallel Performance System..." refers background or methods in this paper

  • ...Dynaprof can also use a PAPI probe and generate performance data that can be read by ParaProf....

  • ...A user may choose to use TAU instrumentation, measurement, and analysis using tau_run and ParaProf or she may choose Dynaprof for instrumentation, TAU for measurement, and ParaProf or Vampir for analysis, or she may choose Dynaprof for instrumentation, a PAPI probe for measurement, and ParaProf for analysis....

  • ...PDT is comprised of commercial-grade front-ends that emit an intermediate language (IL) file, IL analyzers that walk the abstract syntax tree and generate a subset of semantic entities in program database (PDB) ASCII text files, and a library interface (DUCTAPE) to the PDB files that allows us to…...

  • ...TAU supports different timing sources, choice of hardware counters through the PAPI (Browne et al. 2000) or PCL (Berrendorf, Ziegler, and Mohr) interfaces, and access to system performance data....

  • ...In a similar manner, TAU integrates alternative interfaces for access to hardware counters (PAPI (Browne et al. 2000) and PCL (Berrendorf, Ziegler, and Mohr) are supported) and other system-accessible performance data sources....

Journal ArticleDOI
01 Nov 2000
TL;DR: The authors present a postcompiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation that permits machine-independent binary instrumentation programs to be written.
Abstract: The authors present a postcompiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation. Using this library, it is possible to instrument and modify application programs during execution. A unique feature of this library is that it permits machine-independent binary instrumentation programs to be written. The authors describe the interface that a tool sees when using this library. They also discuss three simple tools built using this interface: a utility to count the number of times a function is called, a program to capture the output of an already running program to a file, and an implementation of conditional breakpoints. For the conditional breakpoint example, the authors show that by using their interface compared with gdb, they are able to execute a program with conditional breakpoints up to 900 times faster.

640 citations


"The Tau Parallel Performance System..." refers methods in this paper

  • ...Our approach for TAU uses the DyninstAPI to construct calls to the TAU measurement library and then insert these calls into the executable code....

  • ...Dynaprof (Mucci) is another tool that uses DyninstAPI for instrumentation....

  • ...For DyninstAPI to be useful with a measurement strategy, calls to a measurement library (or the measurement code itself) must be correctly constructed in the code snippets....

  • ...The selective instrumentation mechanism is being used in TAU for all automatic instrumentation methods, including PDT source instrumentation, DyninstAPI executable instrumentation, and component instrumentation....

  • ...DyninstAPI is a dynamic instrumentation package that allows a tool to insert code snippets into a running program using a portable C++ class library....

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "The TAU Parallel Performance System"?

The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.

While performance evaluation of a system is directly affected by what constraints the system imposes on performance instrumentation and measurement capabilities, the desire for performance problem solving tools that are common and portable, now and into the future, suggests that performance tools hardened and customized for a particular system platform will be short-lived, with limited utility. Unless performance technology evolves with system technology, a chasm will remain between users' expectations and the capabilities that performance tools provide. However, effective exploration of performance will necessarily require prudent selection from the range of alternative methods TAU provides to assemble meaningful performance experiments that shed light on the relevant performance properties.

To address the dual goals of performance technology for complex systems – robust performance capabilities and widely available performance problem solving methodologies – the authors need to contend with problems of system diversity while providing flexibility in tool composition, configuration, and integration. 

It infers the arguments and return types of a port and its interfaces and constructs the source code of a proxy component, which when compiled and instantiated in the framework allows us to measure the performance of a component without any changes to its source or object code. 

Using the TAU measurement API, event information is passed in the probe calls to be used during measurement operations to link events with performance data. 

To deal with Java’s multi-threaded environment, TAU uses a common thread layer for operations such as getting the thread identifier, locking and unlocking the performance database, getting the number of concurrent threads, and so on. 

Typical events include control flow events that identify points in the program that are executed, or operational events that occur when some operation or action has been performed. 

A common technique to instrument library routines is to substitute the standard library routine with an instrumented version which in turn calls the original routine.

Because all supported databases are accessed through a common interface, the tool programmer does not need to worry about vendor-specific SQL syntax. 

Tracing the program execution is not always feasible due to the high volume of performance data generated and the amount of trace processing needed. 

The last component, the profile analysis toolkit, is an extensible suite of common base analysis routines that can be reused across performance analysis programs. 

TAU has been integrated with Python by leveraging the Python interpreter’s debugging and profiling capabilities to instrument all entry and exit calls. 

It ensures that the trace analysis tools down the line that read the merged traces also read the global event definitions and refresh their internal tables when they encounter an event for which event definitions are not known. 

The implementation of calldepth profiling is similar to callpath profiling in that it requires dynamic event generation and profile object creation, but it benefits from certain efficiencies in pruning its search on the callstack. 

The trace generation library ensures that event tables are written to disk before writing trace records that contain one or more new events.