Karen L. Karavanic
Bio: Karen L. Karavanic is an academic researcher from Portland State University. The author has contributed to research on topics including performance tuning and tracing. The author has an h-index of 9 and has co-authored 32 publications receiving 1,179 citations. Previous affiliations of Karen L. Karavanic include the San Diego Supercomputer Center and the University of Wisconsin-Madison.
TL;DR: Dynamic instrumentation lets us defer insertion until the moment it is needed (and remove it when it is no longer needed); Paradyn's Performance Consultant decides when and where to insert instrumentation.
Abstract: Paradyn is a tool for measuring the performance of large-scale parallel programs. Our goal in designing a new performance tool was to provide detailed, flexible performance information without incurring the space (and time) overhead typically associated with trace-based tools. Paradyn achieves this goal by dynamically instrumenting the application and automatically controlling this instrumentation in search of performance problems. Dynamic instrumentation lets us defer insertion until the moment it is needed (and remove it when it is no longer needed); Paradyn's Performance Consultant decides when and where to insert instrumentation.
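For intuition only, here is a toy C++ analogue of the deferral idea: calls are routed through an indirection that can be switched between an uninstrumented and a timed version while the program runs. Paradyn itself patches the running binary directly rather than using source-level indirection, so everything below (names included) is illustrative.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Toy analogue of deferred instrumentation: the timing probe is attached
// only while a bottleneck hypothesis is being tested, then removed, so
// the common case pays no measurement cost. Paradyn achieves this by
// patching the running binary, not by indirection like this.
static void work() { /* application code under study */ }

static double probed_ns = 0.0;
static void probed_work() {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    probed_ns += std::chrono::duration<double, std::nano>(t1 - t0).count();
}

static std::function<void()> work_entry = work;  // start uninstrumented

int main() {
    for (int i = 0; i < 1000; ++i) work_entry();

    work_entry = probed_work;  // "insert" instrumentation on demand
    for (int i = 0; i < 1000; ++i) work_entry();
    work_entry = work;         // "remove" it once the data is collected

    std::printf("time in work while probed: %.0f ns\n", probed_ns);
    return 0;
}
```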
19 Apr 2010
TL;DR: This paper proposes a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles, and demonstrates that this approach can identify application and middleware performance deficiencies, resulting in more than a 4× run-time improvement for both examined applications.
Abstract: Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme-scale parallel computers. The principal cause of this is the concurrency explosion of high-end computation, coupled with the complexity of providing parallel file systems that perform reliably at such scales. More than just being a bottleneck, parallel I/O performance at scale is notoriously variable, being influenced by numerous factors inside and outside the application, making it extremely difficult to isolate cause and effect for performance events. In this paper, we propose a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles. Using this methodology, we examine two I/O-intensive scientific computations from cosmology and climate science, and demonstrate that our approach can identify application and middleware performance deficiencies, resulting in more than a 4× run-time improvement for both examined applications.
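A minimal sketch of the ensemble viewpoint, with invented timings and thresholds (the paper's statistical machinery is considerably richer): pool many observations of the same I/O operation and characterize the distribution instead of reasoning about individual events.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch: characterize an ensemble of I/O timings instead of single
// events. The data and metric choices here are illustrative only.
int main() {
    std::vector<double> write_ms = { /* per-rank or per-run timings */
        12.1, 11.8, 13.0, 12.4, 95.6, 12.2, 11.9, 12.7 };

    double sum = 0;
    for (double t : write_ms) sum += t;
    double mean = sum / write_ms.size();

    double var = 0;
    for (double t : write_ms) var += (t - mean) * (t - mean);
    double sd = std::sqrt(var / write_ms.size());

    std::sort(write_ms.begin(), write_ms.end());
    double median = write_ms[write_ms.size() / 2];

    std::printf("mean %.1f ms, median %.1f ms, cv %.2f\n",
                mean, median, sd / mean);
    return 0;
}
```

A large gap between mean and median, or a high coefficient of variation, is the kind of distributional signature consistent with the variability the abstract attributes to factors outside the application.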
01 Jan 1999
TL;DR: This work describes a new technique that uses historical performance data, gathered in previous executions of an application, to increase the effectiveness of automated performance diagnosis, and incorporates several different types of historical knowledge about the application’s performance into an existing profiling tool, the Paradyn Parallel Performance Tool.
Abstract: Accurate performance diagnosis of parallel and distributed programs is a difficult and time-consuming task. We describe a new technique that uses historical performance data, gathered in previous executions of an application, to increase the effectiveness of automated performance diagnosis. We incorporate several different types of historical knowledge about the application's performance into an existing profiling tool, the Paradyn Parallel Performance Tool. We gather performance and structural data from previous executions of the same program, extract knowledge useful for diagnosis from this collection of data in the form of search directives, then input the directives to an enhanced version of Paradyn, which conducts a directed online diagnosis. Compared to existing approaches, incorporating historical data shortens the time required to identify bottlenecks, decreases the amount of unhelpful instrumentation, and improves the usefulness of the information obtained from a diagnostic session.
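The following sketch gestures at the directive idea; the data layout, threshold, and output format are invented for illustration and are not Paradyn's actual directive format.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Sketch: turn historical per-function costs into search directives,
// so a new diagnostic session probes likely bottlenecks first.
int main() {
    // Fraction of run time attributed to each function in prior runs
    // (hypothetical names and values).
    std::vector<std::pair<std::string, double>> history = {
        {"exchange_halo", 0.31}, {"solve", 0.22},
        {"io_checkpoint", 0.19}, {"init", 0.01}};

    std::sort(history.begin(), history.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });

    for (const auto &[fn, cost] : history)
        if (cost > 0.05)  // skip code that was never a problem before
            std::printf("directive: instrument %s first (%.0f%% historically)\n",
                        fn.c_str(), cost * 100);
    return 0;
}
```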
15 Nov 1997
TL;DR: This paper reports on the design and preliminary implementation of a tool that views each execution as a scientific experiment and provides the functionality to answer questions about a program's performance that span more than a single execution or environment.
Abstract: The development of a high-performance parallel system or application is an evolutionary process. It may begin with models or simulations, followed by an initial implementation of the program. The code is then incrementally modified to tune its performance and continues to evolve throughout the application's life span. At each step, the key question for developers is: how and how much did the performance change? This question arises when comparing an implementation to models or simulations; when considering versions of an implementation that use a different algorithm, communication or numeric library, or language; when studying code behavior by varying the number or type of processors, type of network, type of processes, input data set or workload, or scheduling algorithm; and in benchmarking or regression testing. Despite the broad utility of this type of comparison, no existing performance tool provides the necessary functionality to answer it; even state-of-the-art research tools such as Paradyn and Pablo focus instead on measuring the performance of a single program execution. We describe an infrastructure for answering this question at all stages of the life of an application. We view each program run, simulation result, or program model as an experiment, and provide this functionality in an Experiment Management system. Our project has three parts: (1) a representation for the space of executions, (2) techniques for quantitatively and automatically comparing two or more executions, and (3) enhanced performance diagnosis abilities based on historical performance data. In this paper we present initial results on the first two parts. The measure of success of this project is that we can automate an activity that was formerly complex and cumbersome to do manually. The first part is a concise representation for the set of executions collected over the life of an application. We store information about each experiment in a Program Event, which enumerates the components of the code executed and the execution environment, and stores the performance data collected. The possible combinations of code and execution environment form the multi-dimensional Program Space, with one dimension for each axis of variation and one point for each Program Event. We enable exploration of this space with a simple naming mechanism, a selection and query facility, and a set of interactive visualizations. Queries on a Program Space may be made both on the contents of the performance data and on the metadata that describes the multi-dimensional program space. A graphical representation of the Program Space serves as the user interface to the Experiment Management system. The second part of the project is to develop techniques for automating comparison between experiments. Performance tuning across multiple executions must answer the deceptively simple question: what changed in this run of the program? We have developed techniques for determining the "difference" between two or more program runs, automatically describing both the structural differences (differences in program execution structure and resources used) and the performance variation (how the resources were used and how this changed from one run to the next). We can apply our technique to compare an actual execution with a predicted or desired performance measure for the application, and to compare distinct time intervals of a single program execution.
Uses for this include performance tuning efforts, automated scalability studies, resource allocation for metacomputing, performance model validation studies, and dynamic execution models in which processes are created, destroyed, or migrated, communication patterns and use of distributed shared memory are optimized [6,9], or data values or code are changed by steering [7,8]. The difference information is not necessarily a simple measure such as total execution time, but may be a more complex measure derived from details of the program structure, an analytical performance prediction, an actual previous execution of the code, a set of performance thresholds that the application is required to meet or exceed, or an incomplete set of data from selected intervals of an execution. The third part of this research is to investigate the use of the predicted, summary, and historical data contained in the Program Events and Program Space for performance diagnosis. We are exploring novel opportunities for exploiting this collection of data to focus data gathering and analysis efforts on the critical sections of a large application, and for isolating spurious effects from interesting performance variations. Details of this are outside the scope of this paper.
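To make the "what changed in this run?" comparison concrete, here is a small hedged sketch with invented resource names: diff two runs' resource-to-time maps, reporting structural differences (resources present in only one run) separately from performance variation (time shifts on shared resources).

```cpp
#include <cstdio>
#include <map>
#include <string>

// Sketch of a two-run "difference": structural change = resources that
// appear in only one run; performance change = shifted time per shared
// resource. Resource names and times are illustrative.
int main() {
    std::map<std::string, double> run_a = {
        {"cpu:solve", 41.0}, {"msg:halo", 12.5}, {"io:dump", 8.0}};
    std::map<std::string, double> run_b = {
        {"cpu:solve", 40.2}, {"msg:halo", 29.8}, {"lock:queue", 4.1}};

    for (const auto &[res, ta] : run_a) {
        auto it = run_b.find(res);
        if (it == run_b.end())
            std::printf("only in run A: %-10s %.1fs\n", res.c_str(), ta);
        else
            std::printf("%-12s %+.1fs\n", res.c_str(), it->second - ta);
    }
    for (const auto &[res, tb] : run_b)
        if (!run_a.count(res))
            std::printf("only in run B: %-10s %.1fs\n", res.c_str(), tb);
    return 0;
}
```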
14 Nov 2009
TL;DR: This paper investigates pattern-based methods for reducing traces that will be used for performance analysis and evaluates the different methods against several criteria, including size reduction, introduced error, and retention of performance trends, using both benchmarks with carefully chosen performance behaviors and a real application.
Abstract: Event traces are required to correctly diagnose a number of performance problems that arise on today's highly parallel systems. Unfortunately, the collection of event traces can produce a large volume of data that is difficult, or even impossible, to store and analyze. One approach for compressing a trace is to identify repeating trace patterns and retain only one representative of each pattern. However, determining the similarity of sections of traces, i.e., identifying patterns, is not straightforward. In this paper, we investigate pattern-based methods for reducing traces that will be used for performance analysis. We evaluate the different methods against several criteria, including size reduction, introduced error, and retention of performance trends, using both benchmarks with carefully chosen performance behaviors, and a real application.
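The crudest instance of the pattern idea, shown only to fix intuition (the methods evaluated in the paper match longer and approximate patterns): collapse adjacent repeats of the same event into one representative plus a count.

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Crude pattern-based reduction: run-length encode adjacent repeats of
// the same event, keeping one representative of each run.
int main() {
    std::vector<std::string> trace = {
        "send", "recv", "compute", "compute", "compute",
        "send", "recv", "compute", "compute"};

    std::vector<std::pair<std::string, int>> reduced;
    for (const auto &ev : trace) {
        if (!reduced.empty() && reduced.back().first == ev)
            ++reduced.back().second;
        else
            reduced.push_back({ev, 1});
    }

    for (const auto &[ev, n] : reduced)
        std::printf("%s x%d\n", ev.c_str(), n);
    return 0;
}
```

The trade-offs the paper evaluates (size reduction versus introduced error and lost trends) appear even here: the counts preserve structure but discard per-instance timing.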
01 Nov 2000
TL;DR: The authors present a post-compiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation and permits machine-independent binary instrumentation programs to be written.
Abstract: The authors present a post-compiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation. Using this library, it is possible to instrument and modify application programs during execution. A unique feature of this library is that it permits machine-independent binary instrumentation programs to be written. The authors describe the interface that a tool sees when using this library. They also discuss three simple tools built using this interface: a utility to count the number of times a function is called, a program to capture the output of an already running program to a file, and an implementation of conditional breakpoints. For the conditional breakpoint example, the authors show that, compared with gdb, their interface can execute a program with conditional breakpoints up to 900 times faster.
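As a flavor of the library, here is a hedged sketch of the canonical call-counting mutator, modeled on the published BPatch interface; header names and exact signatures vary across Dyninst releases, so treat them as assumptions rather than a checked build.

```cpp
// Hedged sketch of a Dyninst mutator that counts calls to one function.
// Modeled on the published BPatch interface; headers and signatures
// vary by Dyninst release, so this is an approximation, not a build.
#include "BPatch.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include "BPatch_process.h"
#include "BPatch_snippet.h"

#include <vector>

int main(int argc, const char *argv[]) {
    BPatch bpatch;
    // Launch the program to be instrumented (the "mutatee").
    BPatch_process *proc = bpatch.processCreate(argv[1], argv + 1);
    BPatch_image *image = proc->getImage();

    std::vector<BPatch_function *> funcs;
    image->findFunction("target", funcs);  // hypothetical function name

    // Allocate a counter in the mutatee; build "counter = counter + 1".
    BPatch_variableExpr *counter = proc->malloc(*image->findType("int"));
    BPatch_arithExpr increment(
        BPatch_assign, *counter,
        BPatch_arithExpr(BPatch_plus, *counter, BPatch_constExpr(1)));

    // Patch the increment into every entry point of the target function.
    proc->insertSnippet(increment, *funcs[0]->findPoint(BPatch_entry));

    proc->continueExecution();
    while (!proc->isTerminated())
        bpatch.waitForStatusChange();
    return 0;
}
```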
TL;DR: An overview of HPCTOOLKIT is provided and its utility for performance analysis of parallel applications is illustrated.
Abstract: HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space-time diagrams based on traces of asynchronous call stack samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.
23 Mar 2003
TL;DR: This work provides an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system, achieving efficiency by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions.
Abstract: Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate the usefulness and effectiveness of our framework, we implemented several optimizations. These improve the performance of some applications by as much as 40% relative to native execution. The average speedup relative to base DynamoRIO performance is 12%.
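For flavor, here is a hedged sketch of a client against the present-day dr_api, which descends from the interface the paper describes (the 2003 API differed in its details): it counts basic blocks as DynamoRIO builds them.

```cpp
/* Hedged sketch of a DynamoRIO client using the modern dr_api; the
 * interface in the 2003 paper differed. Counts basic blocks as they
 * are first built for the code cache. */
#include "dr_api.h"

static uint64 bb_count;

static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb,
         bool for_trace, bool translating)
{
    if (!translating)
        bb_count++;  /* unsynchronized; adequate for a sketch */
    return DR_EMIT_DEFAULT;
}

static void
event_exit(void)
{
    dr_printf("basic blocks built: %llu\n", (unsigned long long)bb_count);
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(event_bb);
    dr_register_exit_event(event_exit);
}
```

Such a client is built as a shared library and loaded with DynamoRIO's drrun launcher (e.g. drrun -c libbbcount.so -- ./app); the library name here is hypothetical.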
TL;DR: Valgrind is a programmable framework for creating program supervision tools such as bug detectors and profilers that executes supervised programs using dynamic binary translation, giving it total control over their every part without requiring source code, and without the need for recompilation or relinking prior to execution.
Abstract: Valgrind is a programmable framework for creating program supervision tools such as bug detectors and profilers. It executes supervised programs using dynamic binary translation, giving it total control over their every part without requiring source code, and without the need for recompilation or relinking prior to execution. New supervision tools can be easily created by writing skins that plug into Valgrind's core. As an example, we describe one skin that performs Purify-style memory checks for C and C++ programs.