
Showing papers by "Martin Rinard published in 1997"


Journal ArticleDOI
TL;DR: This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures, and reports performance results for the generated parallel code running on the Stanford DASH machine.
Abstract: This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique. We have used this system to automatically parallelize three complete scientific computations: the Barnes-Hut N-body solver, the Water liquid simulation code, and the String seismic simulation code. This article presents performance results for the generated parallel code running on the Stanford DASH machine. These results provide encouraging evidence that commutativity analysis can serve as the basis for a successful parallelizing compiler.
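To make the property concrete, here is a minimal C++ sketch (ours, not code from the article) of the kind of object operation the analysis looks for: an update whose final effect does not depend on execution order.

```cpp
// Minimal sketch: an object whose update operations commute in the sense the
// analysis looks for.
class Body {
public:
    // Two calls to accumulate() leave the object in the same final state
    // regardless of the order in which they execute, because += is commutative
    // and associative here. If all operations in a computation commute like
    // this, the compiler can execute them in parallel, provided each operation
    // executes atomically.
    void accumulate(double f) { force += f; ++updates; }

    // By contrast, an overwrite such as  void set(double f) { force = f; }
    // would not commute with accumulate(): the final value of force would
    // depend on execution order, so the analysis would reject parallelization.
private:
    double force = 0.0;
    int updates = 0;
};
```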

137 citations


Proceedings ArticleDOI
01 May 1997
TL;DR: This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments, and performs a theoretical analysis which provides a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm.
Abstract: This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment. We have implemented dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that the synchronization optimization policy has a significant impact on the overall performance of the computation, that the best policy varies from program to program, that the compiler is unable to statically choose the best policy, and that dynamic feedback enables the generated code to exhibit performance that is comparable to that of code that has been manually tuned to use the best policy. We have also performed a theoretical analysis which provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.
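The sampling/production structure is easiest to see as code. The following C++ sketch is our illustration of that structure, not the generated code; measure_overhead, run_production, and computation_finished are hypothetical stubs.

```cpp
// Schematic sketch of the sampling/production loop that dynamic feedback
// generates (our illustration). The overhead measurement and production work
// are stubbed out.
#include <array>
#include <limits>

enum class Policy { Original, Coarsened, Eliminated };  // hypothetical policy names

// Stubs standing in for the compiler-generated versions of the computation.
double measure_overhead(Policy p, double sampling_interval);   // run briefly, return overhead
void   run_production(Policy p, double production_interval);   // run the chosen version
bool   computation_finished();

void run_with_dynamic_feedback(double sampling_interval, double production_interval) {
    const std::array<Policy, 3> versions = {Policy::Original, Policy::Coarsened,
                                            Policy::Eliminated};
    while (!computation_finished()) {
        // Sampling phase: execute each version for a short interval and
        // record its overhead in the current environment.
        Policy best = versions[0];
        double best_overhead = std::numeric_limits<double>::infinity();
        for (Policy p : versions) {
            double overhead = measure_overhead(p, sampling_interval);
            if (overhead < best_overhead) { best_overhead = overhead; best = p; }
        }
        // Production phase: run the version with the least measured overhead,
        // then resample to adapt to changes in the environment.
        run_production(best, production_interval);
    }
}
```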

121 citations


01 Jan 1997
TL;DR: Commutativity analysis, as discussed by the authors, views the computation as composed of operations on objects and analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute).
Abstract: This paper presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis framework. We have used this system to automatically parallelize two complete scientific computations: the Barnes-Hut N-body solver and the Water code. This paper presents performance results for the generated parallel code running on the Stanford DASH machine. These results provide encouraging evidence that commutativity analysis can serve as the basis for a successful parallelizing compiler.

77 citations


Proceedings ArticleDOI
01 Jan 1997
TL;DR: This paper describes a new framework for synchronization optimizations and a new set of transformations for programs that implement critical sections using mutual exclusion locks; these transformations allow the compiler to move constructs that acquire and release locks both within and between procedures and to eliminate acquire and release constructs.
Abstract: As parallel machines become part of the mainstream computing environment, compilers will need to apply synchronization optimizations to deliver efficient parallel software. This paper describes a new framework for synchronization optimizations and a new set of transformations for programs that implement critical sections using mutual exclusion locks. These transformations allow the compiler to move constructs that acquire and release locks both within and between procedures and to eliminate acquire and release constructs. The paper also presents a new synchronization algorithm, lock elimination, for reducing synchronization overhead. This optimization locates computations that repeatedly acquire and release the same lock, then uses the transformations to obtain equivalent computations that acquire and release the lock only once. Experimental results from a parallelizing compiler for object-based programs illustrate the practical utility of this optimization. For three benchmark programs the optimization dramatically reduces the number of times the computations acquire and release locks, which significantly reduces the amount of time processors spend acquiring and releasing locks. For one of the three benchmarks, the optimization always significantly improves the overall performance. Depending on the number of processors executing the computation, the optimized version runs between 2.11 and 1.83 times faster than the unoptimized version. For one of the other benchmarks, the optimized version runs between 1.13 and 0.96 times faster than the unoptimized version, with a mean of 1.08 times faster. For the final benchmark, the optimization reduces the overall performance.
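The effect of the lock elimination optimization can be illustrated with a small C++ sketch (ours, not taken from the paper): a loop that repeatedly acquires and releases the same lock is transformed into one that acquires and releases it once.

```cpp
// Illustrative before/after sketch of lock elimination.
#include <mutex>
#include <vector>

struct Node { double value = 0.0; std::mutex lock; };

// Before: each update acquires and releases node->lock separately.
void update_unoptimized(Node* node, const std::vector<double>& contributions) {
    for (double c : contributions) {
        node->lock.lock();
        node->value += c;
        node->lock.unlock();
    }
}

// After: the transformations move the acquire and release out of the loop, so
// the computation acquires and releases the lock only once. Each update still
// executes atomically, but the lock is held longer, which is one reason the
// benefit varies from program to program.
void update_optimized(Node* node, const std::vector<double>& contributions) {
    node->lock.lock();
    for (double c : contributions) {
        node->value += c;
    }
    node->lock.unlock();
}
```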

38 citations


Proceedings ArticleDOI
21 Jun 1997
TL;DR: This paper presents the first published algorithm that enables compilers to automatically generate optimistically synchronized parallel code; the experimental results indicate that optimistic synchronization is clearly the superior choice for this set of applications.
Abstract: As shared-memory multiprocessors become the dominant commodity source of computation, parallelizing compilers must support mainstream computations that manipulate irregular, pointer-based data structures such as lists, trees, and graphs. Our experience with a parallelizing compiler for this class of applications shows that their synchronization requirements differ significantly from those of traditional parallel computations. Instead of coarse-grain barrier synchronization, irregular computations require synchronization primitives that support efficient fine-grain atomic operations. The standard implementation mechanism for atomic operations uses mutual exclusion locks. But the overhead of acquiring and releasing locks can reduce the performance. Locks can also consume significant amounts of memory. Optimistic synchronization primitives such as load linked/store conditional are an attractive alternative. They require no additional memory and eliminate the use of heavyweight blocking synchronization constructs. This paper presents our experience using optimistic synchronization to implement fine-grain atomic operations in the context of a parallelizing compiler for irregular object-based programs. We have implemented two versions of the compiler. One version generates code that uses mutual exclusion locks to make operations execute atomically. The other version uses optimistic synchronization. This paper presents the first published algorithm that enables compilers to automatically generate optimistically synchronized parallel code. The presented experimental results indicate that optimistic synchronization is clearly the superior choice for our set of applications. Our results show that it can significantly reduce the memory consumption and improve the overall performance.
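As a rough illustration (ours, not the paper's generated code), the two strategies look like this in C++. Standard C++ exposes no portable load linked/store conditional, so the optimistic version uses compare_exchange_weak as the closest analogue; on LL/SC machines the same retry loop maps onto those instructions.

```cpp
// Sketch of the two ways the compiler can make a fine-grain update atomic.
#include <atomic>
#include <mutex>

struct LockedCell {
    double value = 0.0;
    std::mutex lock;                 // extra memory per object for the lock
    void add(double d) {             // pessimistic: mutual exclusion
        std::lock_guard<std::mutex> g(lock);
        value += d;
    }
};

struct OptimisticCell {
    std::atomic<double> value{0.0};  // no per-object lock storage
    void add(double d) {             // optimistic: retry until the update commits
        double old = value.load();
        while (!value.compare_exchange_weak(old, old + d)) {
            // another processor changed value; old now holds the new value, retry
        }
    }
};
```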

11 citations


Journal ArticleDOI
TL;DR: This paper describes the authors' experience automatically applying locality optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency, and presents performance results for several Jade applications running on the Stanford DASH machine.
Abstract: Given the large communication overheads characteristic of modern parallel machines, optimizations that improve locality by executing tasks close to data that they will access may improve the performance of parallel computations. This paper describes our experience automatically applying locality optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply locality optimizations. We present performance results for several Jade applications running on the Stanford DASH machine. We use these results to characterize the overall performance impact of the locality optimizations. In our application set the locality optimization level has little effect on the performance of two of the applications and a large effect on the performance of the rest of the applications. We also found that, if the locality optimization level had a significant effect on the performance, the maximum performance was obtained when the programmer explicitly placed tasks on processors rather than relying on the scheduling algorithm inside the Jade implementation.
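The programming model is easiest to see schematically. The following C++ sketch is only an analogue of the Jade constructs, not Jade syntax; declare_task and AccessSpec are hypothetical stand-ins for the access declaration mechanism.

```cpp
// Conceptual analogue of the Jade model: the program keeps its serial
// structure and only declares which shared objects each task reads and writes;
// the implementation uses these declarations to extract concurrency and to
// schedule each task near the data it will access.
#include <cstddef>
#include <functional>
#include <vector>

struct AccessSpec { std::vector<const void*> reads, writes; };

// Hypothetical runtime entry point: records the declared accesses, then runs
// the task when its accesses no longer conflict with those of earlier tasks.
void declare_task(const AccessSpec& spec, const std::function<void()>& body);

void relax(const std::vector<double>& left, const std::vector<double>& right,
           std::vector<double>& result) {
    declare_task({ { &left, &right }, { &result } }, [&] {
        for (std::size_t i = 0; i < result.size(); ++i)
            result[i] = 0.5 * (left[i] + right[i]);
    });
    // Tasks whose declared accesses do not conflict may run concurrently; the
    // locality optimizations try to execute each task close to its data.
}
```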

5 citations


Journal ArticleDOI
TL;DR: The commutativity decision problem is defined and its complexity is established for a variety of basic instructions and control constructs.
Abstract: Two operations commute if they generate the same result regardless of the order in which they execute. Commutativity is an important property — commuting operations enable significant optimizations in the fields of parallel computing, optimizing compilers, parallelizing compilers and database concurrency control. Algorithms that statically decide if operations commute can be an important component of systems in these fields because they enable the automatic application of these optimizations. In this paper we define the commutativity decision problem and establish its complexity for a variety of basic instructions and control constructs. Although deciding commutativity is, in general, undecidable or computationally intractable, we believe that efficient algorithms exist that can solve many of the cases that arise in practice.
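A small example (ours, not from the paper) makes the property concrete:

```cpp
// Two operations on the same object: add commutes with add, and scale commutes
// with scale, but add and scale do not commute with each other.
struct Cell { int x = 0; };

void add(Cell& c, int a)   { c.x = c.x + a; }   // add(a); add(b) and add(b); add(a)
                                                // leave x with the same final value.
void scale(Cell& c, int b) { c.x = c.x * b; }   // add(a) then scale(b) gives (x + a) * b,
                                                // but scale(b) then add(a) gives x * b + a,
                                                // so these two operations do not commute.
```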

4 citations