
The Tau Parallel Performance System

01 May 2006 – Vol. 20, Iss. 2, pp. 287–311
TL;DR: This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
Abstract: The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.

Summary (7 min read)

Introduction

  • The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving.
  • Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed.
  • This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
  • Lack of portable performance evaluation environments forces users to adopt different techniques on different systems, even for common performance analysis.

2 A General Computation Model for Parallel Performance Technology

  • To address the dual goals of performance technology for complex systems – robust performance capabilities and widely available performance problem solving methodologies – the authors need to contend with problems of system diversity while providing flexibility in tool composition, configuration, and integration.
  • In the model, a node is defined as a physically distinct machine with one or more processors sharing a physical memory system (i.e. a shared memory multiprocessor (SMP)).
  • A context is a distinct virtual address space within a node providing shared memory support for parallel software execution.
  • The computation model above is general enough to apply to many high-performance architectures as well as to different parallel programming paradigms.
  • When the authors consider a performance system to accommodate the range of instances, they can look to see what features are common and can be abstracted in the performance tool design.

3 TAU Performance System Architecture

  • The TAU performance system (Shende et al. 1998; Malony and Shende 2000) is designed as a tool framework in which tool components and modules are integrated and coordinate their operation using well-defined interfaces and data formats.
  • The TAU framework architecture is organized into three layers – instrumentation, measurement, and analysis – where within each layer multiple modules are available and can be configured in a flexible manner under user control.
  • The instrumentation layer is used to define events for performance experiments.
  • The performance measurement part supports two measurement forms: profiling and tracing.
  • Also distributed with TAU is the PerfDMF (Huck et al. 2005) tool providing multi-experiment parallel profile management.

4 Instrumentation

  • In order to observe performance, additional instructions or probes are typically inserted into a program.
  • As events execute, they activate the probes which perform measurements.
  • Thus, instrumentation exposes key characteristics of an execution.
  • In this section the authors describe the instrumentation options supported by TAU.

4.1 Source-Based Instrumentation

  • TAU provides an API that allows programmers to manually annotate the source code of the program.
  • Thus, language specific features (e.g. runtime type information for tracking templates in C++) can be leveraged.
  • TAU’s API can be broadly classified into five interfaces: the interval event, atomic event, query, control, and sampling interfaces (a usage sketch follows this list).
  • There are several ways to identify interval events and performance tools have used different techniques.
  • Control of interrupt period and selection of system properties to track are provided.
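
A minimal C++ sketch of how the interval and atomic event interfaces are used for manual annotation is shown below. The macro names (TAU_PROFILE, TAU_PROFILE_INIT, TAU_PROFILE_SET_NODE, TAU_REGISTER_EVENT, TAU_EVENT) follow the TAU measurement API as commonly documented, but exact signatures and required initialization should be checked against the TAU distribution; treat this as an illustration rather than authoritative usage.

```cpp
// Illustrative manual source instrumentation with TAU's C/C++ API.
// Verify macro names and signatures against your TAU version.
#include <TAU.h>

void compute(double *field, int n) {
  // Interval (timer) event bracketing this routine; in C++ the timer stops
  // automatically when the scope is exited.
  TAU_PROFILE("compute()", "void (double *, int)", TAU_USER);

  // Atomic event: records a value each time it is triggered.
  TAU_REGISTER_EVENT(bytes_event, "Bytes processed in compute");
  TAU_EVENT(bytes_event, n * sizeof(double));

  for (int i = 0; i < n; i++)
    field[i] = 0.5 * field[i] + 1.0;
}

int main(int argc, char **argv) {
  TAU_PROFILE("main()", "int (int, char **)", TAU_DEFAULT);
  TAU_PROFILE_INIT(argc, argv);   // initialize measurement (non-MPI executable)
  TAU_PROFILE_SET_NODE(0);        // single-node example

  double field[1024] = {0.0};
  compute(field, 1024);
  return 0;
}
```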

4.2 Preprocessor-Based Instrumentation

  • This approach typically involves parsing the source code to infer where instrumentation probes are to be inserted.
  • Preprocessor-based instrumentation is also commonly used to insert performance measurement calls at interval entry and exit points in the source code.
  • PDT is comprised of commercial-grade front-ends that emit an intermediate language (IL) file, IL analyzers that walk the abstract syntax tree and generate a subset of semantic entities in program database (PDB) ASCII text files, and a library interface to the PDB files that allows us to write static analysis tools.
  • The instrumented source code is then compiled and linked with the TAU measurement library to produce an executable code.
  • Opari inserts POMP (Mohr et al. 2002) annotations and rewrites OpenMP directives in the source code.

4.3 Compiler-Based Instrumentation

  • A compiler can add instrumentation calls in the object code that it generates.
  • The compiler has full access to source-level mapping information.
  • It has the ability to choose the granularity of instrumentation and can include fine-grained instrumentation.
  • The compiler strips the instrumentation calls from the source code and optimizes the compiled source code.
  • The code then executes a branch to the instruction following the original instruction to continue execution.

4.4 Wrapper Library-Based Instrumentation

  • A common technique to instrument library routines is to substitute the standard library routine with an instrumented version which in turn calls the original routine.
  • The problem is that one would like to do this without having to develop a different library just to alter the calling interface.
  • The advantage of this approach is that library-level instrumentation can be implemented by defining a wrapper interposition library layer that inserts instrumentation calls before and after calls to the native routines.
  • The authors developed a TAU MPI wrapper library that intercepts calls to the native library by defining routines with the same name, such as MPI_Send (a sketch of this interposition idea follows this list).
  • In addition, TAU’s performance grouping capabilities allows MPI events to be presented with respect to high-level categories such as send and receive types.
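
The interposition idea can be pictured with a small sketch. The PMPI_ name-shifted entry points are part of the MPI standard's profiling interface; the measurement hooks below are placeholders standing in for calls into a measurement library, not TAU's actual internals.

```cpp
// Sketch of wrapper-library interposition for MPI_Send (not TAU's source code).
#include <mpi.h>

// Placeholder hooks representing a measurement library's start/stop calls.
static void wrapper_event_start(const char *name) { (void)name; /* start timer */ }
static void wrapper_event_stop(const char *name)  { (void)name; /* stop timer  */ }

// Defining MPI_Send in the wrapper library shadows the native routine at link
// time; the real implementation remains reachable through PMPI_Send.
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
  wrapper_event_start("MPI_Send()");
  int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
  wrapper_event_stop("MPI_Send()");
  return rc;
}
```

An application links against the wrapper library ahead of the native MPI library, so no application source changes are required.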

4.5 Binary Instrumentation

  • TAU uses DyninstAPI (Buck and Hollingsworth 2000) for instrumenting the executable code of a program.
  • The authors’ approach for TAU uses the DyninstAPI to construct calls to the TAU measurement library and then insert these calls into the executable code.
  • Using the list of routines and their names, unique identifiers are assigned to each routine.
  • Dynaprof is another tool that uses DyninstAPI for instrumentation.
  • An interval event timer is defined to track the time spent in un-instrumented code.

4.6 Interpreter-Based Instrumentation

  • Interpreted language environments present an interesting target for TAU integration.
  • TAU has been integrated with Python by leveraging the Python interpreter’s debugging and profiling capabilities to instrument all entry and exit calls.
  • A TAU interval event is created when a call is dispatched for the first time.
  • Since shared objects are used in Python, instrumentation from multiple levels sees the same runtime performance data.
  • Python is particularly interesting since it can be used to dynamically link and control multi-language executable modules.

4.7 Component-Based Instrumentation

  • Component technology extends the benefits of scripting systems and object-oriented design to support reuse and interoperability of component software, transparent of language and location (Szyperski 1997).
  • Components are compiled into shared libraries and are loaded in, instantiated and composed into a useful code at runtime.
  • There are two ways to instrument a component based application using TAU.
  • A proxy component implements a port interface and has a provides and a uses port.
  • The provides port is connected to the caller’s uses port and its uses port is connected to the callee’s provides port.

4.8 Virtual Machine-Based Instrumentation

  • Support of performance instrumentation and measurement in language systems based on virtual machine (VM) execution poses several challenges.
  • JVMPI provides profiling hooks into the virtual machine and allows a profiler agent to instrument the Java application without any changes to the source code, bytecode, or the executable code of the JVM.
  • TAU maintains a per-thread performance data structure that is updated when a method entry or exit takes place.
  • Since this is maintained on a per-thread basis, it does not require mutual exclusion with other threads and is a low-overhead, scalable data structure (the sketch after this list illustrates the idea).
  • When it receives a JVM shutdown event, it flushes the performance data for all running threads to the disk.
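
The per-thread bookkeeping idea can be illustrated with a short C++ sketch (our illustration of the principle, not TAU's implementation): thread-local storage gives each thread its own table, so entry and exit updates need no locking, and each table is flushed when the runtime signals shutdown.

```cpp
// Illustration of lock-free per-thread performance bookkeeping (not TAU's code).
#include <cstdio>
#include <string>
#include <unordered_map>

struct MethodStats {
  long   calls = 0;
  double inclusive_time = 0.0;   // accumulated at method exit
};

// One table per thread: no mutual exclusion is needed for updates.
thread_local std::unordered_map<std::string, MethodStats> per_thread_profile;

void on_method_entry(const std::string &name) {
  per_thread_profile[name].calls++;   // a real system also records a timestamp
}

void flush_thread_profile(std::FILE *out) {
  // Invoked per thread at shutdown (e.g., on a JVM shutdown event).
  for (const auto &kv : per_thread_profile)
    std::fprintf(out, "%s %ld %g\n", kv.first.c_str(),
                 kv.second.calls, kv.second.inclusive_time);
}
```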

4.9 Multi-Level Instrumentation

  • As the source code undergoes a series of transformations in the compilation, linking, and execution phases, it poses several constraints and opportunities for instrumentation.
  • Instead of restricting the choice of instrumentation to one phase in the program transformation, TAU allows multiple instrumentation interfaces to be deployed concurrently for better coverage.
  • It taps into performance data from multiple levels and presents it in a consistent and a uniform manner by integrating events from different languages and instrumentation levels in the same address space.
  • TAU maintains performance data in a common structure for all events and allows external tools access to the performance data using a common interface.

4.10 Selective Instrumentation

  • In support of the different instrumentation schemes TAU provides, a facility for selecting which of the possible events to instrument has been developed (Malony et al. 2003).
  • The file is then used during the instrumentation process to restrict the event set.
  • The basic structure of the file is a list of names separated into include and exclude lists (see the sketch after this list).
  • The selective instrumentation mechanism is being used in TAU for all automatic instrumentation methods, including PDT source instrumentation, DyninstAPI executable instrumentation, and component instrumentation.
  • It has proven invaluable as a means to both weed out unwanted performance events, such as high frequency, small routines that generate excessive measurement overhead, and provide easy event configuration for customized performance experiments.
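
For illustration, a selective instrumentation file might look like the sketch below: routines in the exclude list (for example, small, high-frequency routines) are skipped, while an include list restricts instrumentation to the named events. The directive keywords and routine-signature syntax shown here are assumptions about the file format and should be checked against the TAU documentation.

```
BEGIN_EXCLUDE_LIST
void swap(int *, int *)
double dot_product(double *, double *, int)
END_EXCLUDE_LIST

BEGIN_INCLUDE_LIST
void solve(Grid *)
void exchange_halo(Grid *)
END_INCLUDE_LIST
```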

4.11 TAU_COMPILER

  • To simplify the integration of the source instrumentor and the MPI wrapper library in the build process, TAU provides a tool, tau_compiler.sh, that can be invoked using a prefix of $(TAU_COMPILER) before the name of the compiler.
  • In an application makefile, the variable F90=mpxlf90 is modified to F90=$(TAU_COMPILER) mpxlf90, as sketched after this list.
  • It can distinguish between object code creation and linking phases of compilation and during linking, it inserts the MPI wrapper library and the TAU measurement library in the link command line.
  • A user can easily integrate TAU’s portable performance instrumentation in the code generation process.
  • Optional parameters can be passed to all four compilation phases.
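
The makefile change amounts to a one-line edit, sketched below. The stub makefile path is hypothetical and installation-specific; it is assumed to define $(TAU_COMPILER), which expands to tau_compiler.sh.

```makefile
# Hypothetical application Makefile fragment.
# The include path is illustrative; the stub makefile defines $(TAU_COMPILER).
include /path/to/tau/lib/Makefile.tau-mpi-pdt

# Before: F90 = mpxlf90
F90 = $(TAU_COMPILER) mpxlf90

# Existing compile and link rules stay unchanged; during the link step,
# tau_compiler.sh adds the MPI wrapper and TAU measurement libraries.
```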

5 Measurement

  • All TAU instrumentation code makes calls to the TAU measurement system through an API that provides a portable and consistent set of measurement services.
  • Again, the instrumentation layer is responsible for defining the performance events for an experiment, establishing relationships between events (e.g. groups, mappings), and managing those events in the context of the parallel computing model being used.
  • Using the TAU measurement API, event information is passed in the probe calls to be used during measurement operations to link events with performance data.
  • It is in the measurement system configuration and usage where all choices for what performance data to capture and in what manner are made.
  • It is highly robust, scalable, and has been ported to all HPC platforms.

5.1 Performance Data Sources

  • TAU provides access to various sources of performance data.
  • Time is perhaps the most important and ubiquitous data type, but it comes in various forms on different system platforms.
  • Through TAU configuration, all of the linkages to these packages are taken care of.
  • Within the measurement system, TAU allows for multiple sources of performance data to be concurrently active.
  • That is, it is possible for both profiling and tracing to work with multiple sources of performance data.

5.2 Profiling

  • Profiles are typically represented as a list of various metrics (such as wall-clock time) and associated statistics for all performance events in the program.
  • There are different statistics kept for interval events (such as routines or statements in the program) versus atomic events.
  • Typically one metric is measured during a profiling run.
  • Internally, the TAU measurement system maintains a profile data structure for each node/context/thread (see the sketch after this list).
  • When the program execution completes, a separate profile file is created for each.
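
The per-thread profile bookkeeping can be pictured with the following sketch (an illustration, not TAU's actual data structures): each node/context/thread keeps one record per event, and at program completion each table is written to its own profile file.

```cpp
// Illustrative per node/context/thread profile storage (not TAU's implementation).
#include <map>
#include <string>

struct IntervalEventProfile {
  long   calls       = 0;    // times the interval was entered
  long   child_calls = 0;    // instrumented events invoked from within it
  double exclusive   = 0.0;  // metric spent in the event itself
  double inclusive   = 0.0;  // metric including all descendants
};

struct AtomicEventProfile {
  long   count = 0;                         // number of recorded samples
  double sum = 0.0, min = 0.0, max = 0.0;   // summary statistics of the values
};

// One such table exists per node/context/thread; at exit, each is written
// to a separate profile file for later analysis.
struct ThreadProfile {
  std::map<std::string, IntervalEventProfile> interval_events;
  std::map<std::string, AtomicEventProfile>   atomic_events;
};
```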

5.3 Flat Profiling

  • The TAU profiling system supports several profiling variants.
  • Trace analysis can then easily calculate callpath profiles.
  • Thus, a parallel profile that showed how performance data was distributed at different levels of an unfolding event call tree could help to understand the performance better.
  • When TAU is configured with the -PROFILEPHASE option, TAU will effectively generate a separate profile for each phase in the program’s execution.
  • This top level phase contains other routines and phases that it directly invokes, but excludes routines called by child phases.

5.4 Tracing

  • While profiling is used to get aggregate summaries of metrics in a compact form, it cannot highlight the time varying aspect of the execution.
  • With tracing enabled, every node/context/thread will generate a trace for instrumented events.
  • For runtime trace reading and analysis, it is important to understand what takes place when TAU records performance events in traces.
  • In their more general and dynamic scheme, the event identifiers are generated on the fly, local to a context (the sketch after this list illustrates the idea).
  • It can parse binary merged or unmerged traces (and their respective event definition files) and provides this information to an analysis tool using a trace analysis API.
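
The role of locally generated event identifiers can be made concrete with a small sketch (purely illustrative; this is not TAU's trace format): each context carries its own identifier-to-name definitions alongside timestamped records, and merging traces amounts to unifying those local identifier spaces.

```cpp
// Illustrative trace structures (not TAU's on-disk trace format).
#include <cstdint>
#include <string>
#include <vector>

// Event identifiers are generated on the fly, local to a context, so every
// context keeps its own id-to-name definition table.
struct EventDefinition {
  std::uint32_t local_id;
  std::string   name;          // e.g. "MPI_Send()"
};

// One timestamped record per event occurrence.
struct TraceRecord {
  std::uint64_t timestamp;       // e.g. microseconds since program start
  std::uint32_t node, thread;    // where the event occurred
  std::uint32_t local_event_id;  // resolved against the context's definitions
  std::int64_t  parameter;       // entry/exit flag, message size, and so on
};

struct ContextTrace {
  std::vector<EventDefinition> definitions;
  std::vector<TraceRecord>     records;
};
```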

5.5 Measurement Overhead

  • The performance events of interest depend mainly on what aspect of the execution the user wants to see, so as to construct a meaningful performance view from the measurements made.
  • Typical events include control flow events that identify points in the program that are executed, or operational events that occur when some operation or action has been performed.
  • The authors define performance accuracy as the degree to which their performance measures correctly represent “actual” performance.
  • If the authors attempt to measure a lot of events, the performance intrusion may be high because of the accumulated measurement overhead, regardless of the measurement accuracy for that event.
  • TAU is a highly-engineered performance system and delivers excellent measurement efficiencies and low measurement overhead.

5.6 Overhead Compensation

  • Unfortunately, by eliminating events from instrumentation, the authors lose the ability to see those events at all.
  • On the other hand, accurate measurement is confounded by high relative overheads.
  • The distortion in gathered performance data could be significant for a parallel program where the effects of perturbation are compounded by parallel execution and accumulation of overhead from remote processes.
  • The authors have developed techniques in TAU profiling to compensate for measurement overhead at runtime.
  • This is accomplished by tracking the number of descendant events and adjusting the total inclusive time at event exit, as sketched after this list.
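
The compensation arithmetic can be sketched as follows, under the assumption of a fixed, pre-calibrated cost per measured event (for example, estimated by timing a tight loop of start/stop pairs): at event exit, the inclusive time is reduced by the overhead attributable to descendant measurements.

```cpp
// Sketch of runtime overhead compensation at event exit (illustrative only).
struct TimerState {
  double start_time;        // timestamp taken at event entry
  long   descendants = 0;   // instrumented events observed inside this interval
};

double compensated_inclusive(const TimerState &t, double stop_time,
                             double overhead_per_event) {
  double measured = stop_time - t.start_time;
  // Subtract the measurement cost accumulated by descendant events so the
  // reported inclusive time approximates the uninstrumented execution.
  return measured - t.descendants * overhead_per_event;
}
```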

5.7 Performance Mapping

  • The ability to associate low-level performance measurements with higher-level execution semantics is important in understanding parallel performance data with respect to application structure and dynamics.
  • The idea is to provide a mechanism whereby performance measurements, made by the occurrence of instrumented performance events, can be associated with semantic abstractions, possibly at a different level of performance observation.
  • TAU has implemented performance mapping as an integral part of its measurement system.
  • To do this, the authors construct a key array that includes the identities of the current event and the parent phase.
  • If the authors find the key, they access the profiling object and update its performance metrics (a sketch of this lookup follows this list).
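
A minimal sketch of this lookup (an illustration of the mechanism, not TAU's internal code): the key combines the current event with its parent phase, and the first occurrence of a key creates a fresh profile object that subsequent occurrences update.

```cpp
// Illustrative (event, parent phase) mapping table for phase-aware profiling.
#include <map>
#include <utility>

struct ProfileObject {
  long   calls = 0;
  double exclusive = 0.0, inclusive = 0.0;
};

// Key built from the identities of the current event and its parent phase.
using MappingKey = std::pair<long, long>;
std::map<MappingKey, ProfileObject> mapping_table;

ProfileObject &profile_for(long event_id, long parent_phase_id) {
  MappingKey key{event_id, parent_phase_id};
  // If the key exists, the caller updates the existing profile object;
  // otherwise a new one is created for this (event, phase) pair.
  return mapping_table[key];
}
```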

6 Analysis

  • TAU gives us the ability to track performance data in widely diverse environments, and thus provides a wealth of information to the user.
  • It has been a continuing effort to include as part of TAU a set of analysis tools which can scale not only to the task of analyzing TAU data, but also to a more diverse arena outside of the TAU paradigm.
  • This section discusses the development of these tools, and the resulting benefits to the user in performing the often complex task of analyzing performance data.
  • The authors’ approach in this section will be to show the use of the TAU analysis tools on a single parallel application, S3D (Subramanya and Reddy 2000).
  • S3D is a high-fidelity finite difference solver for compressible reacting flows which includes detailed chemistry computations.

6.1 ParaProf

  • The TAU performance measurement system is capable of producing parallel profiles for thousands of processes consisting of hundreds of events.
  • The result is high extensibility and flexibility, enabling us to tackle the issues of re-use and scalability.
  • For one, DSS can be configured with profile input modules to read profiles from different sources.
  • It supports many advanced capabilities required in a modern performance analysis system, such as derived metrics for relating performance data, cross-experiment analysis for analyzing data from disparate experiments, and data reduction for elimination of redundant data, thus allowing large data sources to be handled efficiently.
  • To get a sense of the type of analysis displays ParaProf supports, Figure 7 shows the S3D flat profile (stacked view) on sixteen processes.

6.2 Performance Database Framework

  • Empirical performance evaluation of parallel and distributed systems or applications often generates significant amounts of performance data and analysis results from multiple experiments and trials as performance is investigated and problems diagnosed.
  • Profile data is organized such that for each combination of these items, an aggregate measurement is recorded.
  • It builds on robust SQL relational database engines, some of which are freely distributed.
  • To facilitate performance analysis development, the PerfDMF architecture includes a well-documented data management API to abstract query and analysis operations into a more programmatic, non-SQL, form.
  • The last component, the profile analysis toolkit, is an extensible suite of common base analysis routines that can be reused across performance analysis programs.

6.3 Tracing

  • The authors made an early decision in the TAU system to leverage existing trace analysis and visualization tools.
  • For convenience, the TAU tracing system also allows trace files to be output directly in VTF3 and EPILOG formats.
  • Expert is trace-based in its analysis and looks for performance problems that arise in the execution.
  • Figure 13 shows a view from Expert using CUBE for S3D.
  • The tool can take parameters identifying where to start and stop the profile generation in time, allowing parallel profiles to be generated for specific regions of the traces.

7 Conclusion

  • Complex parallel systems and software pose challenging performance evaluation problems that require robust methodologies and tools.
  • The TAU performance system addresses performance technology problems at three levels: instrumentation, measurement, and analysis.
  • Portability, robustness, and extensibility are the hallmarks of the TAU parallel performance system.
  • It is in use in scientific research groups, HPC centers, and industrial laboratories around the world.


THE TAU PARALLEL PERFORMANCE SYSTEM
Sameer S. Shende
Allen D. Malony
DEPARTMENT OF COMPUTER AND INFORMATION
SCIENCE, UNIVERSITY OF OREGON, EUGENE, OR
(SAMEER@CS.UOREGON.EDU)
Abstract
The ability of performance technology to keep pace with
the growing complexity of parallel and distributed systems
depends on robust performance frameworks that can at
once provide system-specific performance capabilities and
support high-level performance problem solving. Flexibility
and portability in empirical methods and processes are
influenced primarily by the strategies available for instru-
mentation and measurement, and how effectively they are
integrated and composed. This paper presents the TAU
(Tuning and Analysis Utilities) parallel performance sys-
tem and describes how it addresses diverse requirements
for performance observation and analysis.
Key words: Performance evaluation, instrumentation, meas-
urement, analysis, TAU
1 Introduction
The evolution of computer systems and of the applications
that run on them – towards more sophisticated modes of
operation, higher levels of abstraction, and larger scale of
execution – challenges the state of technology for empiri-
cal performance evaluation. The increasing complexity of
parallel and distributed systems, coupled with emerging
portable parallel programming methods, demands that
empirical performance tools provide robust performance
observation capabilities at all levels of a system, while
mapping low-level behavior to high-level performance
abstractions in a uniform manner.
Given the diversity of performance problems, evalua-
tion methods, and types of events and metrics, the instru-
mentation and measurement mechanisms needed to support
performance observation must be flexible, to give maximum
opportunity for configuring performance experiments,
and portable, to allow consistent cross-platform perform-
ance problem solving. In general, flexibility in empirical
performance evaluation implies freedom in experiment
design, and choices in selection and control of experiment
mechanisms. Using tools that otherwise limit the type and
structure of performance methods will restrict evaluation
scope. Portability, on the other hand, looks for common
abstractions in performance methods and how these can
be supported by reusable and consistent techniques across
different computing environments (software and hardware).
Lack of portable performance evaluation environments
forces users to adopt different techniques on different sys-
tems, even for common performance analysis.
The TAU (Tuning and Analysis Utilities) parallel per-
formance system is the product of fourteen years of devel-
opment to create a robust, flexible, portable, and integrated
framework and toolset for performance instrumentation,
measurement, analysis, and visualization of large-scale
parallel computer systems and applications. The success
of the TAU project represents the combined efforts of
researchers at the University of Oregon and colleagues at
the Research Centre Juelich and Los Alamos National
Laboratory. The purpose of this paper is to provide a
complete overview of the TAU system. The discussion
will be organized first according to the TAU system archi-
tecture and second from the point of view of how to use
TAU in practice.
2 A General Computation Model for
Parallel Performance Technology
To address the dual goals of performance technology for
complex systems – robust performance capabilities and
widely available performance problem solving method-
ologies – we need to contend with problems of system
diversity while providing flexibility in tool composition,
configuration, and integration. One approach to address
these issues is to focus attention on a sub-class of compu-
tation models and performance problems as a way to
restrict the performance technology requirements. The
obvious consequence of this approach is limited tool cov-
erage. Instead, our idea is to define an abstract computation
model that captures general architecture and software
execution features and can be mapped straightforwardly
to existing complex system types. For this model, we can
target performance capabilities and create a tool frame-
work that can adapt and be optimized for particular com-
plex system cases.
Our choice of general computation model must reflect
real computing environments, both in terms of the paral-
lel systems architecture and the parallel software envi-
ronment. The computational model we target was initially
proposed by the HPC++ consortium (HPC++ Working
Group 1995) and is illustrated in Figure 1. Two com-
bined views of the model are shown: a physical (hard-
ware) view and an abstract software view. In the model, a
node is defined as a physically distinct machine with one
or more processors sharing a physical memory system
(i.e. a shared memory multiprocessor (SMP)). A node
may link to other nodes via a protocol-based interconnect,
ranging from proprietary networks, as found in tradi-
tional MPPs, to local- or global-area networks. Nodes and
their interconnection infrastructure provide a hardware
execution environment for parallel software computa-
tion. A context is a distinct virtual address space within a
node providing shared memory support for parallel soft-
ware execution. Multiple contexts may exist on a single
node. Multiple threads of execution, both user and sys-
tem level, may exist within a context; threads within a
context share the same virtual address space. Threads in
different contexts on the same node can interact via inter-
process communication (IPC) facilities, while threads in
contexts on different nodes communicate using message
passing libraries (e.g. MPI) or network IPC. Shared-mem-
ory implementations of message passing can also be used
for fast intra-node context communication. The bold arrows
in the figure reflect scheduling of contexts and threads on
the physical node resources.
The computation model above is general enough to
apply to many high-performance architectures as well as
to different parallel programming paradigms. Particular
instances of the model and how it is programmed defines
requirements for performance tool technology. That is,
by considering different instances of the general computing
model and the abstract operation of each, we can identify
important capabilities that a performance tool should sup-
port for each model instance. When we consider a per-
formance system to accommodate the range of instances,
we can look to see what features are common and can be
abstracted in the performance tool design. In this way,
the capability abstraction allows the performance system
to retain uniform interfaces across the range of parallel
platforms, while specializing tool support for the particu-
lar model instance.
3 TAU Performance System Architecture
The TAU performance system (Shende et al. 1998; Malony
and Shende 2000; University of Oregon b) is designed as
a tool framework, whereby tool components and modules
are integrated and coordinate their operation using well-
defined interfaces and data formats.

Fig. 1 Execution model supported by TAU.

The TAU framework
architecture is organized into three layers – instrumenta-
tion, measurement, and analysis – where within each layer
multiple modules are available and can be configured in
a flexible manner under user control.
The instrumentation and measurement layers of the
TAU framework are shown in Figure 2. TAU supports
a flexible instrumentation model that allows the user
to insert performance instrumentation calling the TAU
measurement API at different, multiple levels of pro-
gram code representation, transformation, compilation,
and execution. The key concept of the instrumentation
layer is that it is here where performance events are
defined. The instrumentation mechanisms in TAU sup-
port several types of performance events, including
events defined by code location (e.g. routines or blocks),
library interface events, system events, and arbitrary
user-defined events. TAU is also aware of events associ-
ated with message passing and multi-threading parallel
execution. The instrumentation layer is used to define
events for performance experiments. Thus, one output of
instrumentation is information about the events for a
performance experiment. This information will be used
by other tools.
Fig. 2 Architecture of TAU Performance System – Instrumentation and Measurement.

The instrumentation layer interfaces with the measure-
ment layer through the TAU measurement API. TAU’s
measurement system is organized into four parts. The event
creation and management part determines how events are
processed. Events are dynamically created in the TAU
system as the result of their instrumentation and occur-
rence during execution. Two types of events are sup-
ported: entry/exit events and atomic events. In addition,
TAU provides the mapping of performance measurements
for “low-level” events to high-level execution entities.
Overall, this part provides the mechanisms to manage
events as a performance experiment proceeds. It includes
the grouping of events and their runtime measurement
control. The performance measurement part supports two
measurement forms: profiling and tracing. For each form,
TAU provides the complete infrastructure to manage the
measured data during execution at any scale (number of
events or parallelism). The performance data sources part
defines what performance data is measurable and can be
used in profiling or tracing. TAU supports different timing
sources, choice of hardware counters through the PAPI
(Browne et al. 2000) or PCL (Berrendorf, Ziegler, and
Mohr) interfaces, and access to system performance data.
The OS and runtime system part provides the coupling
between TAU’s measurement system and the underlying
parallel system platform. TAU specializes and optimizes
its execution according to the platform features available.
The TAU measurement systems can be customized and
configured for each performance experiment by compos-
ing specific modules for each part and setting runtime
controls. For instance, based on the composition of mod-
ules, an experiment could easily be configured to measure
the profile that shows the inclusive and exclusive counts
of secondary data cache misses associated with basic
blocks such as routines, or a group of statements. By pro-
viding a flexible measurement infrastructure, a user can
experiment with different attributes of the system and
iteratively refine the performance characterization of a
parallel application.
The TAU analysis and visualization layer is shown in
Figure 3. As in the instrumentation and measurement
layer, TAU flexibility allows use of several modules. These
are separated between those for parallel profile analysis
and parallel trace analysis. For each, support is given to the
management of the performance data (profiles or traces),
including the conversion to/from different formats. TAU
comes with both text-based and graphical tools to visual-
ize the performance profiles. ParaProf (Bell, Malony, and
Shende 2003) is TAU’s parallel profile analysis and visu-
alization tool. Also distributed with TAU is the PerfDMF
(Huck et al. 2005) tool providing multi-experiment paral-
lel profile management. Given the wealth of third-party
trace analysis and visualization tools, TAU does not imple-
ment its own. However, trace translation tools are imple-
mented to enable use of Vampir (Intel Corporation; Nagel
et al. 1996), Jumpshot (Wu et al. 2000), and Paraver (Euro-
pean Center for Parallelism of Barcelona (CEPBA)). It is
also possible to generate EPILOG (Mohr and Wolf 2003)
trace files for use with the Expert (Wolf et al. 2004) anal-
ysis tool. All TAU profile and trace data formats are open.
The framework approach to TAU’s architecture design
guarantees the most flexibility in configuring TAU capa-
bilities to the requirements of the parallel performance
experimentation and problem solving the user demands.
In addition, it allows TAU to extend these capabilities to
include the rich technology being developed by other per-
formance tool research groups. In the sections that follow,
we look at each framework layer in more depth and dis-
cuss in detail what can be done with the TAU perform-
ance system.
4 Instrumentation
In order to observe performance, additional instructions
or probes are typically inserted into a program. This proc-
ess is called instrumentation. From this perspective, the
execution of a program is regarded as a sequence of
significant performance events. As events execute, they
activate the probes which perform measurements. Thus,
instrumentation exposes key characteristics of an execu-
tion. Instrumentation can be introduced in a program at
several levels of the program transformation process. In
this section we describe the instrumentation options sup-
ported by TAU.
4.1 Source-Based Instrumentation
TAU provides an API that allows programmers to manu-
ally annotate the source code of the program. Source-
level instrumentation can be placed at any point in the
program and it allows a direct association between lan-
guage- and program-level semantics and performance
measurements. Using cross-language bindings, TAU pro-
vides its API in C++, C, Fortran, Java, and Python lan-
guages. Thus, language specific features (e.g. runtime type
information for tracking templates in C++) can be lever-
aged. TAU also provides a higher-level specification in
SIDL (Kohn et al. 2001; Shende et al. 2003) for cross-lan-
guage portability and deployment in component-based
programming environments (Bernholdt et al. 2005).
TAU’s API can be broadly classified into the following
five interfaces:
Interval event interface
Atomic event interface
Query interface
Control interface
Sampling interface

4.1.1 Interval event interface TAU supports the abil-
ity to make performance measurements with respect to
event intervals. An event interval is defined by its start
events and its stop events. A user may bracket parts of
his/her code to specify a region of interest using a pair of
start and stop event calls. There are several ways to iden-
tify interval events and performance tools have used dif-
ferent techniques. It is probably more recognizable to
talk about interval events as timers. To identify a timer,
some tools advocate the use of numeric identifiers and an
associated table mapping the identifiers to timer names.
While it is easy to specify and pass the timer identifier
among start and stop routines, it has its drawbacks. Main-
taining a table statically might work for languages such
as Fortran 90 and C, but it extends poorly to C++, where
a template may be instantiated with different parameters.
This aspect of compile time polymorphism makes it dif-
ficult to disambiguate between different instantiations of
the same code. Also, it can introduce instrumentation
errors in maintaining the table that maps the identifiers to
names. This is true for large projects that involve several
application modules and developers.
Our interface uses a dynamic naming scheme where
interval event (timer) names are associated with the per-
formance data (timer) object at runtime. An interval event
can have a unique name and a signature that can be obtained
at runtime. In the case of C++, this is done using runtime
type information of objects. Several logically related inter-
Fig. 3 Architecture of TAU Performance System – Analysis and Visualization.

Citations
Journal ArticleDOI
TL;DR: An overview of HPCTOOLKIT is provided and its utility for performance analysis of parallel applications is illustrated.
Abstract: SUMMARY HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully-optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully-optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space-time diagrams based on traces of asynchronous call stack samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.

536 citations

Journal ArticleDOI
TL;DR: In this article, the authors present results from terascale direct numerical simulations (DNS) of turbulent flames, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame.
Abstract: Computational science is paramount to the understanding of underlying processes in internal combustion engines of the future that will utilize non-petroleum-based alternative fuels, including carbon-neutral biofuels, and burn in new combustion regimes that will attain high efficiency while minimizing emissions of particulates and nitrogen oxides. Next-generation engines will likely operate at higher pressures, with greater amounts of dilution and utilize alternative fuels that exhibit a wide range of chemical and physical properties. Therefore, there is a significant role for high-fidelity simulations, direct numerical simulations (DNS), specifically designed to capture key turbulence-chemistry interactions in these relatively uncharted combustion regimes, and in particular, that can discriminate the effects of differences in fuel properties. In DNS, all of the relevant turbulence and flame scales are resolved numerically using high-order accurate numerical algorithms. As a consequence terascale DNS are computationally intensive, require massive amounts of computing power and generate tens of terabytes of data. Recent results from terascale DNS of turbulent flames are presented here, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame. Computing at this scale requires close collaborations between computer and combustion scientists to provide optimized scaleable algorithms and software for terascale simulations, efficient collective parallel I/O, tools for volume visualization of multiscale, multivariate data and automating the combustion workflow. The enabling computer science, applied to combustion science, is also required in many other terascale physics and engineering simulations. In particular, performance monitoring is used to identify the performance of key kernels in the DNS code, S3D and especially memory intensive loops in the code. Through the careful application of loop transformations, data reuse in cache is exploited thereby reducing memory bandwidth needs, and hence, improving S3D's nodal performance. To enhance collective parallel I/O in S3D, an MPI-I/O caching design is used to construct a two-stage write-behind method for improving the performance of write-only operations. The simulations generate tens of terabytes of data requiring analysis. Interactive exploration of the simulation data is enabled by multivariate time-varying volume visualization. The visualization highlights spatial and temporal correlations between multiple reactive scalar fields using an intuitive user interface based on parallel coordinates and time histogram. Finally, an automated combustion workflow is designed using Kepler to manage large-scale data movement, data morphing, and archival and to provide a graphical display of run-time diagnostics.

510 citations

Cites methods from "The Tau Parallel Performance System..."

  • ...We performed a detailed performance analysis of runs on heterogeneous allocations using TAU [18]....

Journal IssueDOI
TL;DR: The current toolset architecture is reviewed, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior.
Abstract: Scalasca is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems with many thousands of processors. It offers an incremental performance-analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very large numbers of processes and to combine these with efficiently summarized local measurements. In this article, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior. The scalability and effectiveness of Scalasca are then surveyed from experience measuring and analyzing real-world applications on a range of computer systems. Copyright © 2010 John Wiley & Sons, Ltd.

360 citations


Cites background from "The Tau Parallel Performance System..."

  • ...Based on the postmortem analysis presentation of direct measurements, traced or summarized at runtime, Scalasca is closely related to TAU [18]....

References
Book
23 Nov 2002
TL;DR: Anyone responsible for developing software strategy, evaluating new technologies, buying or building software will find Clemens Szyperski's objective and market-aware perspective of this new area invaluable.
Abstract: From the Publisher: Component Software: Beyond Object-Oriented Programming explains the technical foundations of this evolving technology and its importance in the software market place. It provides in-depth discussion of both the technical and the business issues to be considered, then moves on to suggest approaches for implementing component-oriented software production and the organizational requirements for success. The author draws on his own experience to offer tried-and-tested solutions to common problems and novel approaches to potential pitfalls. Anyone responsible for developing software strategy, evaluating new technologies, buying or building software will find Clemens Szyperski's objective and market-aware perspective of this new area invaluable.

4,791 citations


"The Tau Parallel Performance System..." refers background in this paper

  • ...To study the post-mortem spatial and temporal aspect of performance data, event tracing, that is, the activity of capturing an event or an action that takes place in the program, is more appropriate....

  • ...Component technology extends the benefits of scripting systems and object-oriented design to support reuse and interoperability of component software, transparent of language and location (Szyperski 1997)....

01 Apr 1994
TL;DR: This document contains all the technical features proposed for the interface and the goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs.
Abstract: The Message Passing Interface Forum (MPIF), with participation from over 40 organizations, has been meeting since November 1992 to discuss and define a set of library standards for message passing. MPIF is not sanctioned or supported by any official standards organization. The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient and flexible standard for message passing. This is the final report, Version 1.0, of the Message Passing Interface Forum. This document contains all the technical features proposed for the interface. This copy of the draft was processed by LaTeX on April 21, 1994. Please send comments on MPI to mpi-comments@cs.utk.edu. Your comment will be forwarded to MPIF committee members who will attempt to respond.

3,181 citations

Proceedings ArticleDOI
01 Jun 1982
TL;DR: The gprof profiler accounts for the running time of called routines in therunning time of the routines that call them, and the design and use of this profiler is described.
Abstract: Large complex programs are composed of many small routines that implement abstractions for the routines that call them. To be useful, an execution profiler must attribute execution time in a way that is significant for the logical structure of a program as well as for its textual decomposition. This data must then be displayed to the user in a convenient and informative way. The gprof profiler accounts for the running time of called routines in the running time of the routines that call them. The design and use of this profiler is described.

1,134 citations

Journal ArticleDOI
Shirley Browne, Jack Dongarra, N. Garner, G. Ho, Philip J. Mucci
01 Aug 2000
TL;DR: The purpose of the PAPI project is to specify a standard application programming interface for accessing hardware performance counters available on most modern microprocessors, which exist as a small set of registers that count events.
Abstract: The purpose of the PAPI project is to specify a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis, including hand tuning, compiler optimization, debugging, benchmarking, monitoring, and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development toward alleviating commonly occurring bottlenecks in high performance computing.

692 citations


"The Tau Parallel Performance System..." refers background or methods in this paper

  • ...Dynaprof can also use a PAPI probe and generate performance data that can be read by ParaProf....

  • ...A user may choose to use TAU instrumentation, measurement, and analysis using tau_run and ParaProf or she may choose Dynaprof for instrumentation, TAU for measurement, and ParaProf or Vampir for analysis, or she may choose Dynaprof for instrumentation, a PAPI probe for measurement, and ParaProf for analysis....

  • ...PDT is comprised of commercial-grade front-ends that emit an intermediate language (IL) file, IL analyzers that walk the abstract syntax tree and generate a subset of semantic entities in program database (PDB) ASCII text files, and a library interface (DUCTAPE) to the PDB files that allows us to…...

  • ...TAU supports different timing sources, choice of hardware counters through the PAPI (Browne et al. 2000) or PCL (Berrendorf, Ziegler, and Mohr) interfaces, and access to system performance data....

  • ...In a similar manner, TAU integrates alternative interfaces for access to hardware counters (PAPI (Browne et al. 2000) and PCL (Berrendorf, Ziegler, and Mohr) are supported) and other system-accessible performance data sources....

Journal ArticleDOI
01 Nov 2000
TL;DR: The authors present a postcompiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation that permits machine-independent binary instrumentation programs to be written.
Abstract: The authors present a postcompiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation. Using this library, it is possible to instrument and modify application programs during execution. A unique feature of this library is that it permits machine-independent binary instrumentation programs to be written. The authors describe the interface that a tool sees when using this library. They also discuss three simple tools built using this interface: a utility to count the number of times a function is called, a program to capture the output of an already running program to a file, and an implementation of conditional breakpoints. For the conditional breakpoint example, the authors show that by using their interface compared with gdb, they are able to execute a program with conditional breakpoints up to 900 times faster.

640 citations


"The Tau Parallel Performance System..." refers methods in this paper

  • ...Our approach for TAU uses the DyninstAPI to construct calls to the TAU measurement library and then insert these calls into the executable code....

  • ...Dynaprof (Mucci) is another tool that uses DyninstAPI for instrumentation....

  • ...For DyninstAPI to be useful with a measurement strategy, calls to a measurement library (or the measurement code itself) must be correctly constructed in the code snippets....

  • ...The selective instrumentation mechanism is being used in TAU for all automatic instrumentation methods, including PDT source instrumentation, DyninstAPI executable instrumentation, and component instrumentation....

  • ...DyninstAPI is a dynamic instrumentation package that allows a tool to insert code snippets into a running program using a portable C++ class library....

Frequently Asked Questions (15)
Q1. What contributions have the authors mentioned in the paper "The TAU Parallel Performance System"?

The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.

While performance evaluation of a system is directly affected by what constraints the system imposes on performance instrumentation and measurement capabilities, the desire for performance problem solving tools that are common and portable, now and into the future, suggests that performance tools hardened and customized for a particular system platform will be short-lived, with limited utility. Unless performance technology evolves with system technology, a chasm will remain between users' expectations and the capabilities that performance tools provide. However, effective exploration of performance will necessarily require prudent selection from the range of alternative methods TAU provides to assemble meaningful performance experiments that shed light on the relevant performance properties.

To address the dual goals of performance technology for complex systems – robust performance capabilities and widely available performance problem solving methodologies – the authors need to contend with problems of system diversity while providing flexibility in tool composition, configuration, and integration. 

It infers the arguments and return types of a port and its interfaces and constructs the source code of a proxy component, which when compiled and instantiated in the framework allows us to measure the performance of a component without any changes to its source or object code. 

Using the TAU measurement API, event information is passed in the probe calls to be used during measurement operations to link events with performance data. 

To deal with Java’s multi-threaded environment, TAU uses a common thread layer for operations such as getting the thread identifier, locking and unlocking the performance database, getting the number of concurrent threads, and so on. 

Typical events include control flow events that identify points in the program that are executed, or operational events that occur when some operation or action has been performed. 

A common technique to instrument library routines is to substitute the standard library routine with an instrumented version which in turn calls the original routine.

Because all supported databases are accessed through a common interface, the tool programmer does not need to worry about vendor-specific SQL syntax. 

Tracing the program execution is not always feasible due to the high volume of performance data generated and the amount of trace processing needed. 

The last component, the profile analysis toolkit, is an extensible suite of common base analysis routines that can be reused across performance analysis programs. 

TAU has been integrated with Python by leveraging the Python interpreter’s debugging and profiling capabilities to instrument all entry and exit calls. 

It ensures that the trace analysis tools down the line that read the merged traces also read the global event definitions and refresh their internal tables when they encounter an event for which event definitions are not known. 

The implementation of calldepth profiling is similar to callpath profiling in that it requires dynamic event generation and profile object creation, but it benefits from certain efficiencies in pruning its search on the callstack. 

The trace generation library ensures that event tables are written to disk before writing trace records that contain one or more new events.