
Showing papers on "Distributed memory published in 1994"


Journal ArticleDOI
TL;DR: The Vienna RNA package as mentioned in this paper is based on dynamic programming algorithms and aims at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities.
Abstract: Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are.
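
For readers unfamiliar with the underlying dynamic programming, the sketch below is a minimal Nussinov-style base-pair maximization in C. It only illustrates the recurrence structure; the Vienna RNA package itself implements energy-based (Zuker-style) folding, partition functions, and the parallel variants described above, none of which are shown here.

/* Nussinov-style base-pair maximization: a simplified stand-in for the
 * thermodynamic DP used in RNA folding codes (illustration only). */
#include <stdio.h>
#include <string.h>

static int pairs(char a, char b) {
    return (a=='A'&&b=='U')||(a=='U'&&b=='A')||
           (a=='G'&&b=='C')||(a=='C'&&b=='G')||
           (a=='G'&&b=='U')||(a=='U'&&b=='G');
}

int main(void) {
    const char *s = "GGGAAAUCC";           /* toy sequence */
    int n = (int)strlen(s);
    int N[64][64] = {{0}};                 /* N[i][j] = max pairs in s[i..j] */

    for (int span = 1; span < n; span++)
        for (int i = 0; i + span < n; i++) {
            int j = i + span;
            int best = N[i][j-1];          /* case: j unpaired */
            for (int k = i; k < j; k++)    /* case: j pairs with k */
                if (pairs(s[k], s[j])) {
                    int v = (k > i ? N[i][k-1] : 0) + N[k+1][j-1] + 1;
                    if (v > best) best = v;
                }
            N[i][j] = best;                /* real codes add energies and a
                                              minimum hairpin-loop size */
        }
    printf("max base pairs: %d\n", N[0][n-1]);
    return 0;
}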

2,136 citations


Journal ArticleDOI
TL;DR: In this paper, a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples, and it is shown that RSB in its simplest form is expensive.
Abstract: SUMMARY If problems involving unstructured meshes are to be solved efficiently on distributed-memory parallel computers, the meshes must be partitioned and distributed across processors in a way that balances the computational load and minimizes communication. The recursive spectral bisection method (RSB) has been shown to be very effective for such partitioning problems compared to alternative methods, but RSB in its simplest form is expensive. Here a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples. 1. INTRODUCTION Unstructured meshes are used in several large-scale scientific and engineering problems, including finite-volume methods for computational fluid dynamics and finite-element methods for structural analysis. If unstructured problems such as these are to be solved on distributed-memory parallel computers, their data structures must be partitioned and distributed across processors; if they are to be solved efficiently, the partitioning must maximize load balance and minimize interprocessor communication. Recently, the recursive spectral bisection method (RSB)[1] has been shown to be very effective for such partitioning problems compared to alternative methods. Unfortunately, RSB in its simplest form is expensive. We shall describe a multilevel version of RSB that attains about an order-of-magnitude improvement in run time on typical examples.
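
As a rough illustration of the recursive bisection driver around the spectral step, here is a small C sketch. The spectral_split() placeholder simply halves the vertex list so the example runs; in RSB it would compute the Fiedler vector of the subgraph Laplacian and order the vertices by it, and the multilevel variant would coarsen the graph before each split and refine afterwards.

#include <stdio.h>

/* Placeholder for the expensive spectral step (Fiedler-vector computation). */
static int spectral_split(int *verts, int n) {
    (void)verts;
    return n / 2;                           /* trivial split, for illustration */
}

/* Recursive bisection driver: assign each vertex a part in [base, base+nparts). */
static void recursive_bisect(int *verts, int n, int *part, int base, int nparts) {
    if (nparts == 1 || n == 0) {
        for (int i = 0; i < n; i++) part[verts[i]] = base;
        return;
    }
    int mid = spectral_split(verts, n);
    int left = nparts / 2;
    recursive_bisect(verts, mid, part, base, left);
    recursive_bisect(verts + mid, n - mid, part, base + left, nparts - left);
}

int main(void) {
    int verts[8] = {0, 1, 2, 3, 4, 5, 6, 7}, part[8];
    recursive_bisect(verts, 8, part, 0, 4);
    for (int v = 0; v < 8; v++) printf("vertex %d -> part %d\n", v, part[v]);
    return 0;
}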

567 citations


01 Apr 1994
TL;DR: An overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers, which includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies is presented.
Abstract: This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.
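
A minimal C sketch of the interface being standardized, showing one point-to-point exchange and one collective operation:

/* Minimal MPI example: run with at least two processes (e.g. mpirun -np 4). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, token = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* point-to-point */
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Bcast(&token, 1, MPI_INT, 1, MPI_COMM_WORLD);              /* collective: rank 1
                                                                       broadcasts the value */
    printf("rank %d of %d sees token %d\n", rank, size, token);
    MPI_Finalize();
    return 0;
}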

458 citations


Proceedings ArticleDOI
31 Jan 1994
TL;DR: MINT is a software package designed to ease the process of constructing event-driven memory hierarchy simulators for multiprocessors that uses a novel hybrid technique that exploits the best aspects of native execution and software interpretation to minimize the overhead of processor simulation.
Abstract: MINT is a software package designed to ease the process of constructing event-driven memory hierarchy simulators for multiprocessors. It provides a set of simulated processors that run standard Unix executable files compiled for a MIPS R3000 based multiprocessor. These generate multiple streams of memory reference events that drive a user-provided memory system simulator. MINT uses a novel hybrid technique that exploits the best aspects of native execution and software interpretation to minimize the overhead of processor simulation. Combined with related techniques to improve performance, this approach makes simulation on uniprocessor hosts extremely efficient.

283 citations


Proceedings ArticleDOI
01 Nov 1994
TL;DR: In this paper, the authors discuss implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions, and incorporate three techniques that require no additional hardware into Blizzard.
Abstract: This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
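
A hand-written sketch of what the first (software lookup) technique inserts before each shared-memory reference. The table layout, 32-byte block size, and miss handler below are illustrative assumptions, not Blizzard's actual implementation.

#include <stdint.h>
#include <stdio.h>

/* Software fine-grain access control: a state tag per cache-block-sized
 * region is checked before every shared load or store. */
#define BLOCK_SHIFT 5                        /* 32-byte blocks (assumed) */
#define TABLE_SIZE  (1u << 16)
enum { INVALID, READONLY, WRITABLE };

static uint8_t block_state[TABLE_SIZE];      /* one access tag per block (hashed) */

static size_t block_of(const void *addr) {
    return ((uintptr_t)addr >> BLOCK_SHIFT) % TABLE_SIZE;
}

static void access_fault(void *addr, int is_write) {
    /* A real DSM would fetch the block or upgrade its permission here. */
    block_state[block_of(addr)] = is_write ? WRITABLE : READONLY;
}

static int shared_load(int *addr) {
    if (block_state[block_of(addr)] == INVALID)     /* inserted check */
        access_fault(addr, 0);
    return *addr;
}

static void shared_store(int *addr, int value) {
    if (block_state[block_of(addr)] != WRITABLE)    /* inserted check */
        access_fault(addr, 1);
    *addr = value;
}

int main(void) {
    int x = 0;
    shared_store(&x, 7);
    printf("x = %d\n", shared_load(&x));
    return 0;
}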

275 citations


Patent
03 Oct 1994
TL;DR: In this article, a multiprocessor system is described in which each processor module comprises a processor, a distributed shared memory, a distributed memory coupler, and a distributed memory protector; the distributed shared memories are assigned global addresses common to all the processor modules, and the distributed shared memory of each processor module has its addresses shared with the distributed shared memory of each processor module that is the destination of data transfer.
Abstract: In a multiprocessor system, each processor module comprises a processor, a distributed shared memory, a distributed memory coupler for controlling copying between distributed shared memories and a distributed memory protector for protecting said distributed shared memory against illegal access. The distributed shared memories are assigned global addresses common to all the processor modules, and the distributed shared memory of each processor module has its addresses shared with the distributed shared memory of each processor module which is the destination of data transfer. Message buffers and message control areas on the distributed shared memory are divided into areas specified by a combination of sending and receiving processor modules. A processing request area on the distributed shared memory is divided corresponding to each receiving processor module and arranged accordingly. The processing request area on the receiver's side distributed shared memory has a FIFO structure. The sender's side distributed memory coupler stores identifying information of the destination processor module for inter-processor-module communication and, upon occurrence of a write into the distributed shared memory, sends a write address and write data to the destination processor module. The receiver's side distributed memory coupler copies the received write data into the distributed shared memory of the processor module to which it belongs, by receiving the write address and write data from the sender's side distributed memory coupler.

252 citations


Proceedings ArticleDOI
14 Aug 1994
TL;DR: An efficient, non-blocking, disjoint-access-parallel implementation of LL and SCn, using Read and C&S in the asynchronous shared memory model, is presented.
Abstract: In this paper, we present efficient implementations of strong shared memory primitives. We use the asynchronous shared memory model. In this model, processes communicate by applying primitive operations (e.g. Read, Write) to a shared memory. We define disjoint-access-parallel implementations. Intuitively, an implementation of shared memory primitives is disjoint-access-parallel if processes which execute shared memory operations that access disjoint sets of words progress concurrently, without interfering with each other (under an assumption described in the paper). Two commonly used primitives, both in theory and in practice, are Compare&Swap (C&S) and the pair Load Linked (LL) and Store Conditional (SC). We present an efficient, non-blocking, disjoint-access-parallel implementation of LL and SCn, using Read and C&S. SCn is a generalization of SC which accesses n memory words. This implementation is constructed in three stages. We first present an implementation of LL, SC and an additional primitive, called Validate (VL), using Read and C&S. We then present an implementation of Read and C&Sn, using LL, SC and VL (C&Sn is a generalization of C&S which accesses n memory words). Finally, we present an implementation of SCn, using Read and C&Sn. The work and space complexities of the implementations presented in this paper improve on those of previous works.
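
For readers unfamiliar with the primitives, the C11 sketch below emulates the LL/SC interface with a compare-and-swap. It only illustrates what the operations mean; unlike the paper's construction, it is neither disjoint-access-parallel nor safe against the ABA problem.

#include <stdatomic.h>
#include <stdio.h>

/* LL records the value observed; SC succeeds only if the word still holds
 * that value (naive CAS emulation, for illustration only). */
static long load_linked(_Atomic long *w, long *linked) {
    *linked = atomic_load(w);
    return *linked;
}

static int store_conditional(_Atomic long *w, long *linked, long newval) {
    return atomic_compare_exchange_strong(w, linked, newval);
}

int main(void) {
    _Atomic long counter = 0;
    long seen;
    do {                                     /* classic LL/SC update loop */
        load_linked(&counter, &seen);
    } while (!store_conditional(&counter, &seen, seen + 1));
    printf("counter = %ld\n", (long)atomic_load(&counter));
    return 0;
}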

249 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references, and an approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
Abstract: Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
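
A small sketch of the software-directed style, using GCC's __builtin_prefetch as a stand-in for the explicit prefetch instructions a compiler would insert; the builtin and the fixed prefetch distance are modern conveniences, not the paper's mechanism.

#include <stdio.h>

#define N 1024
#define PREFETCH_DISTANCE 16     /* tune to memory latency / loop body length */

/* Dot product with software-directed prefetching of elements a fixed
 * distance ahead of their use. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);  /* read, low temporal reuse */
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0, 1);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("%f\n", dot(a, b, N));
    return 0;
}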

238 citations


Proceedings ArticleDOI
14 Nov 1994
TL;DR: The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes.
Abstract: Portability, efficiency and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes a new approach, called Global Arrays (GA), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented GA libraries on a variety of computer systems, including the Intel DELTA and Paragon, the IBM SP-1 (all message-passers), the Kendall Square KSR-2 (a nonuniform access shared-memory machine), and networks of Unix workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GA in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.
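
The sketch below shows the programming model: any process fetches an arbitrary block of a distributed matrix with a one-sided get, with no matching receive posted by the owners. The calls follow the later Global Arrays C interface (NGA_Create/NGA_Get) as best recalled; the 1994 library exposed a Fortran-flavoured interface, so treat the exact names and headers as assumptions.

/* Global Arrays-style usage sketch (function names assumed, see note above). */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[2] = {1000, 1000};
    int chunk[2] = {-1, -1};                   /* let the library choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);   /* distributed 1000x1000 matrix */

    /* One-sided fetch of a 10x10 block: no cooperation from the owning
     * processes is required. */
    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    double block[10][10];
    NGA_Get(g_a, lo, hi, block, ld);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}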

224 citations


Journal ArticleDOI
TL;DR: A detailed performance and scalability analysis of the communication primitives is presented, carried out using a workload generator, kernels from real applications, and a large unstructured adaptive application.

211 citations


Patent
21 Jun 1994
TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links, and the entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: Evidence is presented that application-specific protocols substantially improved the performance of three application programs-appbt, em3d, and barnes-over carefully tuned transparent shared memory implementations.
Abstract: Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match an application's communication patterns and memory semantics. This paper presents evidence that this approach can lead to large performance improvements. It shows that application-specific protocols substantially improved the performance of three application programs--appbt, em3d, and barnes--over carefully tuned transparent shared memory implementations. The speed-ups were obtained on Blizzard, a fine-grained DSM system running on a 32-node Thinking Machines CM-5.

Journal ArticleDOI
01 Apr 1994
TL;DR: An overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers, which includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies is presented.
Abstract: This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: A new method for write detection that relies on the compiler and runtime system to detect writes to shared data without invoking the operating system, and has low average write latency and supports fine-grained sharing with low overhead.
Abstract: Most software-based distributed shared memory (DSM) systems rely on the operating system's virtual memory interface to detect writes to shared data. Strategies based on virtual memory page protection create two problems for a DSM system. First, writes can have high overhead since they are detected with a page fault. As a result, a page must be written many times to amortize the cost of that fault. Second, the size of a virtual memory page is too big to serve as a unit of coherency, inducing false sharing. Mechanisms to handle false sharing can increase runtime overhead and may cause data to be unnecessarily communicated between processors. In this paper, we present a new method for write detection that solves these problems. Our method relies on the compiler and runtime system to detect writes to shared data without invoking the operating system. We measure and compare implementations of a distributed shared memory system using both strategies, virtual memory and compiler/runtime, running a range of applications on a small scale distributed memory multicomputer. We show that the new method has low average write latency and supports fine-grained sharing with low overhead. Further, we show that the dominant cost of write detection with either strategy is due to the mechanism used to handle fine-grain sharing.
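
A hand-written illustration of the compiler/runtime strategy: each store to shared data is preceded by code that marks the enclosing block dirty in a bitmap, so no page fault and no page-sized coherence unit is involved. The names, bitmap layout, and 64-byte block size are assumptions for illustration.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SHIFT 6                      /* 64-byte coherence blocks (assumed) */
#define SHARED_WORDS 4096

static double  shared_region[SHARED_WORDS];
static uint8_t dirty[(SHARED_WORDS * sizeof(double)) >> BLOCK_SHIFT];

/* What the compiler would emit in place of a plain "shared_region[i] = v". */
static void shared_write(int i, double v) {
    size_t byte = (size_t)i * sizeof(double);
    dirty[byte >> BLOCK_SHIFT] = 1;        /* record the write, no OS involvement */
    shared_region[i] = v;                  /* the original store */
}

int main(void) {
    for (int i = 0; i < 100; i++) shared_write(i, 1.0);
    int nd = 0;
    for (size_t b = 0; b < sizeof dirty; b++) nd += dirty[b];
    printf("dirty blocks to flush at the next synchronization: %d\n", nd);
    return 0;
}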

Patent
22 Jun 1994
TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links and an inter-processor communication link allows the processors to communicate with each other for the purpose of establishing operational modes. A parameter memory, accessible via the crossbar switch, is used in conjunction with the communication link for control purposes. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Patent
22 Jun 1994
TL;DR: In this paper, a multi-processor system and method arranged, in one embodiment, as an image and graphics processor is described; the individual processors have communication links to several memories, a crossbar switch establishes the processor-memory links, and the entire image processor is contained on a single silicon chip.
Abstract: There is disclosed a multi-processor system and method arranged, in one embodiment, as an image and graphics processor. The image processor is structured with several individual processors all having communication links to several memories. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: This paper provides an extensive analysis on several complex scientific algorithms written in SAM on a variety of hardware platforms and finds that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware.
Abstract: This paper describes the design and evaluation of SAM, a shared object system for distributed memory machines. SAM is a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates mechanisms to address the problem of high communication overheads on distributed memory machines; these mechanisms include tying synchronization to data access, chaotic access to data, prefetching of data, and pushing of data to remote processors. SAM has been implemented on the CM-5, Intel iPSC/860 and Paragon, IBM SP1, and networks of workstations running PVM. SAM applications run on all these platforms without modification. This paper provides an extensive analysis of several complex scientific algorithms written in SAM on a variety of hardware platforms. We find that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware. Our experience suggests that SAM is successful in allowing programmers to use distributed memory machines effectively with much less programming effort than required today.

Journal ArticleDOI
01 Feb 1994
TL;DR: In this article, a message-passing multi-cell approach is proposed for scalable short-range molecular dynamics simulations on distributed memory MIMD multicomputers.
Abstract: We present a new scalable algorithm for short-range molecular dynamics simulations on distributed memory MIMD multicomputers based on a message-passing multi-cell approach. We have implemented the algorithm on the Connection Machine 5 (CM-5) and demonstrate that meso-scale molecular dynamics with more than 10^8 particles is now possible on massively parallel MIMD computers. Typical runs show single-particle update times of 0.15 μs in 2 dimensions (2D) and approximately 1 μs in 3 dimensions (3D) on a 1024-node CM-5 without vector units, corresponding to more than 1.8 Gflops overall performance. We also present a scaling equation which agrees well with actually observed timings.
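
A serial 2-D sketch of the cell (multi-cell) idea: particles are binned into cells no smaller than the cutoff, so interactions are only examined between particles in the same or neighbouring cells. The message-passing exchange of boundary cells between nodes, which the paper's algorithm adds, is omitted here.

#include <stdio.h>
#include <stdlib.h>

#define NP   1000
#define L    10.0          /* box side */
#define RCUT 1.0           /* interaction cutoff */
#define NC   10            /* cells per dimension = L / RCUT */

typedef struct { double x, y; int next; } particle_t;

int main(void) {
    particle_t p[NP];
    int head[NC][NC];
    srand(1);
    for (int i = 0; i < NP; i++) {
        p[i].x = L * rand() / (double)RAND_MAX;
        p[i].y = L * rand() / (double)RAND_MAX;
    }

    /* build linked cell lists */
    for (int cx = 0; cx < NC; cx++)
        for (int cy = 0; cy < NC; cy++) head[cx][cy] = -1;
    for (int i = 0; i < NP; i++) {
        int cx = (int)(p[i].x / RCUT), cy = (int)(p[i].y / RCUT);
        if (cx == NC) cx--;                 /* guard particles exactly at L */
        if (cy == NC) cy--;
        p[i].next = head[cx][cy];
        head[cx][cy] = i;
    }

    /* count pairs within the cutoff, visiting only neighbouring cells */
    long pairs = 0;
    for (int cx = 0; cx < NC; cx++)
        for (int cy = 0; cy < NC; cy++)
            for (int i = head[cx][cy]; i != -1; i = p[i].next)
                for (int dx = -1; dx <= 1; dx++)
                    for (int dy = -1; dy <= 1; dy++) {
                        int nx = cx + dx, ny = cy + dy;
                        if (nx < 0 || ny < 0 || nx >= NC || ny >= NC) continue;
                        for (int j = head[nx][ny]; j != -1; j = p[j].next) {
                            if (j <= i) continue;      /* count each pair once */
                            double rx = p[i].x - p[j].x, ry = p[i].y - p[j].y;
                            if (rx*rx + ry*ry < RCUT*RCUT) pairs++;
                        }
                    }
    printf("pairs within cutoff: %ld\n", pairs);
    return 0;
}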

Proceedings ArticleDOI
01 Jan 1994
TL;DR: The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system.
Abstract: This paper presents the architecture, implementation, and performance results for the SGI Challenge symmetric multiprocessor system. Novel aspects of the architecture are highlighted, as well as key design trade-offs targeted at increasing performance and reducing complexity. Multiprocessor design verification techniques and their impact are also presented. The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system. Hardware cache coherence mechanisms maintain a consistent view of shared memory for all processors, with no software overhead and minimal impact on processor performance. HDL simulation with random, self checking vector generation and a lightweight operating system on full processor models contributed to a concept to customer shipment cycle of 26 months.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: CHAOS is described, a library of efficient runtime primitives that provides support for dynamic data partitioning, efficient preprocessing and fast data migration in adaptive irregular problems and is used to parallelize kernels from two adaptive applications.
Abstract: In adaptive irregular problems, data arrays are accessed via indirection arrays, and data access patterns change during computation. Parallelizing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This paper describes CHAOS, a library of efficient runtime primitives that provides such support. To demonstrate the effectiveness of the runtime support, two adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a code for simulating gas flows (DSMC). We have also proposed minor extensions to Fortran D which would enable compilers to parallelize irregular forall loops in such adaptive applications by embedding calls to primitives provided by a runtime library. We have implemented our proposed extensions in the Syracuse Fortran 90D/HPF prototype compiler, and have used the compiler to parallelize kernels from two adaptive applications.
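
A toy illustration of the inspector phase in the CHAOS style: the indirection array is scanned once to find off-processor references, from which a communication schedule and gather would be built before the executor loop runs. The block distribution and names are illustrative, not the library's actual interface.

#include <stdio.h>

#define NGLOBAL 100
#define NPROCS  4
#define MYRANK  1
#define NEDGES  8

int main(void) {
    int block = NGLOBAL / NPROCS;               /* block data distribution */
    int lo = MYRANK * block, hi = lo + block;   /* locally owned index range */
    int ia[NEDGES] = { 27, 3, 45, 30, 99, 26, 60, 31 };  /* indirection array */

    int fetch[NEDGES], nfetch = 0;
    for (int e = 0; e < NEDGES; e++) {          /* inspector: one pass over ia[] */
        if (ia[e] < lo || ia[e] >= hi)          /* off-processor reference */
            fetch[nfetch++] = ia[e];
        /* the executor would later use locally translated indices */
    }

    printf("process %d must gather %d off-processor elements:", MYRANK, nfetch);
    for (int k = 0; k < nfetch; k++) printf(" %d", fetch[k]);
    printf("\n");
    return 0;
}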

Journal ArticleDOI
TL;DR: This thesis describes an advanced compiler that can generate efficient parallel programs when the source programming language naturally represents an application's parallelism; Fortran 90D/HPF, described in this thesis, is such a language.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: This work introduces the design of a runtime interface, called Chant, that supports communicating threads in a distributed memory environment, and is layered atop standard message passing and lightweight thread libraries, and supports efficient point-to-point and remote service request communication primitives.
Abstract: Lightweight threads are becoming increasingly useful for supporting parallelism and asynchronous control structures in applications and language implementations. However, lightweight thread packages for distributed memory systems have received little attention. In this paper, we introduce the design of a runtime interface, called Chant, that supports communicating threads in a distributed memory environment. In particular, Chant is layered atop standard message passing and lightweight thread libraries, and supports efficient point-to-point and remote service request communication primitives. We examine the design issues of Chant, the efficiency of its point-to-point communication layer, and the evaluation of scheduling policies to poll for the presence of incoming messages.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases.
Abstract: We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.

Journal ArticleDOI
01 Apr 1994
TL;DR: This paper focuses on the expressiveness of the Linda model, on techniques required to build efficient implementations, and on observed performance both on workstation networks and distributed-memory parallel machines.
Abstract: The use of distributed data structures in a logically-shared memory is a natural, readily-understood approach to parallel programming. The principal argument against such an approach for portable software has always been that efficient implementations could not scale to massively-parallel, distributed memory machines. Now, however, there is growing evidence that it is possible to develop efficient and portable implementations of virtual shared memory models on scalable architectures. In this paper we discuss one particular example: Linda. After presenting an introduction to the Linda model, we focus on the expressiveness of the model, on techniques required to build efficient implementations, and on observed performance both on workstation networks and distributed-memory parallel machines. Finally, we conclude by briefly discussing the range of applications developed with Linda and Linda's suitability for the sorts of heterogeneous, dynamically-changing computational environments that are of growing significance.
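
A hedged illustration of Linda's coordination model in C-Linda style: out() deposits tuples, in() withdraws matching tuples (with ?var marking formals to be filled in), and eval() creates live tuples that run as processes. The syntax is reproduced from memory and requires a Linda preprocessor rather than a plain C compiler; NTASKS, NWORKERS and compute() are placeholders.

/* Master/worker bag-of-tasks in C-Linda style (illustrative, see note above). */
#define NTASKS   100
#define NWORKERS 8

int compute(int id);                   /* application kernel (not shown) */
int worker(void);

int real_main(int argc, char **argv) { /* C-Linda entry point */
    for (int i = 0; i < NTASKS; i++)
        out("task", i);                /* drop work tuples into tuple space */
    for (int w = 0; w < NWORKERS; w++)
        eval("worker", worker());      /* create live tuples: worker processes */
    for (int i = 0; i < NTASKS; i++) {
        int id, result;
        in("result", ?id, ?result);    /* withdraw results as they appear */
    }
    return 0;
}

int worker(void) {
    for (;;) {
        int id;
        in("task", ?id);               /* atomically withdraw one work tuple */
        out("result", id, compute(id));
    }
}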

Proceedings ArticleDOI
01 Nov 1994
TL;DR: The design and implementation of the Jovian library is discussed, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays.
Abstract: There has been a great deal of recent interest in parallel I/O. We discuss the design and implementation of the Jovian library, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays. We also present preliminary performance measurements from benchmarking the Jovian I/O library on the IBM SP1 distributed memory parallel machine for two application templates.

Journal ArticleDOI
TL;DR: The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s ALIGN and DISTRIBUTE directives, or the CM-2 virtual processor system.
Abstract: This paper considers the problem of distributing data and code among the processors of a distributed memory supercomputer. Provided that the source program is amenable to detailed dataflow analysis, one may determine a placement function by an incremental analogue of Gaussian elimination. Such a function completely characterizes the distribution by giving the identity of the virtual processor on which each elementary calculation is done. One has then to “realize” the virtual processors on the PEs. The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s ALIGN and DISTRIBUTE directives, or the CM-2 virtual processor system.
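
A minimal sketch of the “owner computes” rule with a toy placement function: each processor scans the whole iteration space but executes only the instances whose results it owns. The cyclic placement used here is just an example, not a placement derived by the paper's dataflow analysis.

#include <stdio.h>

#define N      16
#define NPROCS 4

static int place(int i) { return i % NPROCS; }   /* toy placement function */

int main(void) {
    double a[N];
    for (int p = 0; p < NPROCS; p++)          /* stand-in for SPMD processes */
        for (int i = 0; i < N; i++)
            if (place(i) == p)                /* owner-computes guard */
                a[i] = 2.0 * i;               /* p owns a[i], so p computes it */
    for (int i = 0; i < N; i++) printf("%g ", a[i]);
    printf("\n");
    return 0;
}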

Journal ArticleDOI
TL;DR: The article enumerates and classifies parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network, and determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.
Abstract: The computational expense of volume rendering motivates the development of parallel implementations on multicomputers. Parallelism achieves higher frame rates, which provide more natural viewing control and enhanced comprehension of 3D structure. Although many parallel implementations exist, we have no framework to compare their relative merits independent of host hardware. The article attempts to establish that framework by enumerating and classifying parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network. It determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.

Book ChapterDOI
08 Aug 1994
TL;DR: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines that uses available C compilers and packet-transport primitives, and links with existing libraries.
Abstract: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines. A major objective is to keep the entry cost low. For users, the language should be easily comprehensible to a C programmer. For implementors, it should run on standard hardware (including workstation farms); it should not require major new compilation techniques (which may not even be widely applicable); and it should be compatible with existing code, run-time systems and tools. Cid is implemented with a simple pre-processor and a library, uses available C compilers and packet-transport primitives, and links with existing libraries.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.
Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.

Journal ArticleDOI
TL;DR: This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers, and shows that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor.