
Showing papers on "Distributed memory published in 1994"


Journal ArticleDOI
TL;DR: The Vienna RNA package as mentioned in this paper is based on dynamic programming algorithms and aims at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities.
Abstract: Computer codes for computation and comparison of RNA secondary structures, the Vienna RNA package, are presented, that are based on dynamic programming algorithms and aim at predictions of structures with minimum free energies as well as at computations of the equilibrium partition functions and base pairing probabilities. An efficient heuristic for the inverse folding problem of RNA is introduced. In addition we present compact and efficient programs for the comparison of RNA secondary structures based on tree editing and alignment. All computer codes are written in ANSI C. They include implementations of modified algorithms on parallel computers with distributed memory. Performance analysis carried out on an Intel Hypercube shows that parallel computing becomes gradually more and more efficient the longer the sequences are.
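
For readers unfamiliar with the underlying dynamic programming, the sketch below is a minimal Nussinov-style base-pair maximization in C. It only illustrates the recurrence structure; the Vienna RNA package itself implements energy-based (Zuker-style) folding, partition functions, and the parallel variants described above, none of which are shown here.

/* Nussinov-style base-pair maximization: a simplified stand-in for the
 * thermodynamic DP used in RNA folding codes (illustration only). */
#include <stdio.h>
#include <string.h>

static int pairs(char a, char b) {
    return (a=='A'&&b=='U')||(a=='U'&&b=='A')||
           (a=='G'&&b=='C')||(a=='C'&&b=='G')||
           (a=='G'&&b=='U')||(a=='U'&&b=='G');
}

int main(void) {
    const char *s = "GGGAAAUCC";           /* toy sequence */
    int n = (int)strlen(s);
    int N[64][64] = {{0}};                 /* N[i][j] = max pairs in s[i..j] */

    for (int span = 1; span < n; span++)
        for (int i = 0; i + span < n; i++) {
            int j = i + span;
            int best = N[i][j-1];          /* case: j unpaired */
            for (int k = i; k < j; k++)    /* case: j pairs with k */
                if (pairs(s[k], s[j])) {
                    int v = (k > i ? N[i][k-1] : 0) + N[k+1][j-1] + 1;
                    if (v > best) best = v;
                }
            N[i][j] = best;                /* real codes add energies and a
                                              minimum hairpin-loop size */
        }
    printf("max base pairs: %d\n", N[0][n-1]);
    return 0;
}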

2,136 citations


Journal ArticleDOI
TL;DR: In this paper, a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples, and it is shown that RSB in its simplest form is expensive.
Abstract: SUMMARY If problems involving unstructured meshes are to be solved efficiently on distributed-memory parallel computers, the meshes must be partitioned and distributed across processors in a way that balances the computational load and minimizes communication. The recursive spectral bisection method (RSB) has been shown to be very effective for such partitioning problems compared to alternative methods, but RSB in its simplest form is expensive. Here a multilevel version of RSB is introduced that attains about an order-of-magnitude improvement in run time on typical examples. 1. INTRODUCTION Unstructured meshes are used in several large-scale scientific and engineering problems, including finite-volume methods for computational fluid dynamics and finite-element methods for structural analysis. If unstructured problems such as these are to be solved on distributed-memory parallel computers, their data structures must be partitioned and distributed across processors; if they are to be solved efficiently, the partitioning must maximize load balance and minimize interprocessor communication. Recently, the recursive spectral bisection method (RSB)[1] has been shown to be very effective for such partitioning problems compared to alternative methods. Unfortunately, RSB in its simplest form is expensive. We shall describe a multilevel version of RSB that attains about an order-of-magnitude improvement in run time on typical examples.
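
As a rough illustration of the recursive bisection driver around the spectral step, here is a small C sketch. The spectral_split() placeholder simply halves the vertex list so the example runs; in RSB it would compute the Fiedler vector of the subgraph Laplacian and order the vertices by it, and the multilevel variant would coarsen the graph before each split and refine afterwards.

#include <stdio.h>

/* Placeholder for the expensive spectral step (Fiedler-vector computation). */
static int spectral_split(int *verts, int n) {
    (void)verts;
    return n / 2;                           /* trivial split, for illustration */
}

/* Recursive bisection driver: assign each vertex a part in [base, base+nparts). */
static void recursive_bisect(int *verts, int n, int *part, int base, int nparts) {
    if (nparts == 1 || n == 0) {
        for (int i = 0; i < n; i++) part[verts[i]] = base;
        return;
    }
    int mid = spectral_split(verts, n);
    int left = nparts / 2;
    recursive_bisect(verts, mid, part, base, left);
    recursive_bisect(verts + mid, n - mid, part, base + left, nparts - left);
}

int main(void) {
    int verts[8] = {0, 1, 2, 3, 4, 5, 6, 7}, part[8];
    recursive_bisect(verts, 8, part, 0, 4);
    for (int v = 0; v < 8; v++) printf("vertex %d -> part %d\n", v, part[v]);
    return 0;
}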

567 citations


01 Apr 1994
TL;DR: An overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers, which includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies is presented.
Abstract: This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.
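
A minimal C sketch of the interface being standardized, showing one point-to-point exchange and one collective operation:

/* Minimal MPI example: run with at least two processes (e.g. mpirun -np 4). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, token = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* point-to-point */
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Bcast(&token, 1, MPI_INT, 1, MPI_COMM_WORLD);              /* collective: rank 1
                                                                       broadcasts the value */
    printf("rank %d of %d sees token %d\n", rank, size, token);
    MPI_Finalize();
    return 0;
}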

458 citations


Proceedings ArticleDOI
31 Jan 1994
TL;DR: MINT is a software package designed to ease the process of constructing event-driven memory hierarchy simulators for multiprocessors that uses a novel hybrid technique that exploits the best aspects of native execution and software interpretation to minimize the overhead of processor simulation.
Abstract: MINT is a software package designed to ease the process of constructing event-driven memory hierarchy simulators for multiprocessors. It provides a set of simulated processors that run standard Unix executable files compiled for a MIPS R3000 based multiprocessor. These generate multiple streams of memory reference events that drive a user-provided memory system simulator. MINT uses a novel hybrid technique that exploits the best aspects of native execution and software interpretation to minimize the overhead of processor simulation. Combined with related techniques to improve performance, this approach makes simulation on uniprocessor hosts extremely efficient.

283 citations


Proceedings ArticleDOI
01 Nov 1994
TL;DR: In this paper, the authors discuss implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions, and incorporate three techniques that require no additional hardware into Blizzard.
Abstract: This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
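
A hand-written sketch of what the first (software lookup) technique inserts before each shared-memory reference. The table layout, 32-byte block size, and miss handler below are illustrative assumptions, not Blizzard's actual implementation.

#include <stdint.h>
#include <stdio.h>

/* Software fine-grain access control: a state tag per cache-block-sized
 * region is checked before every shared load or store. */
#define BLOCK_SHIFT 5                        /* 32-byte blocks (assumed) */
#define TABLE_SIZE  (1u << 16)
enum { INVALID, READONLY, WRITABLE };

static uint8_t block_state[TABLE_SIZE];      /* one access tag per block (hashed) */

static size_t block_of(const void *addr) {
    return ((uintptr_t)addr >> BLOCK_SHIFT) % TABLE_SIZE;
}

static void access_fault(void *addr, int is_write) {
    /* A real DSM would fetch the block or upgrade its permission here. */
    block_state[block_of(addr)] = is_write ? WRITABLE : READONLY;
}

static int shared_load(int *addr) {
    if (block_state[block_of(addr)] == INVALID)     /* inserted check */
        access_fault(addr, 0);
    return *addr;
}

static void shared_store(int *addr, int value) {
    if (block_state[block_of(addr)] != WRITABLE)    /* inserted check */
        access_fault(addr, 1);
    *addr = value;
}

int main(void) {
    int x = 0;
    shared_store(&x, 7);
    printf("x = %d\n", shared_load(&x));
    return 0;
}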

275 citations


Patent
03 Oct 1994
TL;DR: In this article, a multiprocessor system is described in which each processor module comprises a processor, a distributed shared memory, a distributed memory coupler, and a distributed memory protector; the distributed shared memories are assigned global addresses common to all the processor modules, and the distributed shared memory of each processor module has its addresses shared with the distributed shared memory of each processor module that is the destination of data transfer.
Abstract: In a multiprocessor system, each processor module comprises a processor, a distributed shared memory, a distributed memory coupler for controlling copying between distributed shared memories and a distributed memory protector for protecting said distributed shared memory against illegal access. The distributed shared memories are assigned global addresses common to all the processor modules, and the distributed shared memory of each processor module has its addresses shared with the distributed shared memory of each processor module which is the destination of data transfer. Message buffers and message control areas on the distributed shared memory are divided into areas specified by a combination of sending and receiving processor modules. A processing request area on the distributed shared memory is divided corresponding to each receiving processor module and arranged accordingly. The processing request area on the receiver's side distributed shared memory has a FIFO structure. The sender's side distributed memory coupler stores identifying information of the destination processor module for inter-processor-module communication and, upon occurrence of a write into the distributed shared memory, sends a write address and write data to the destination processor module. The receiver's side distributed memory coupler copies the received write data into the distributed shared memory of the processor module to which it belongs, by receiving the write address and write data from the sender's side distributed memory coupler.

252 citations


Proceedings ArticleDOI
14 Aug 1994
TL;DR: An efficient, non-blocking, disjoint-access-parallel implementation of LL and SCn, using Read and C&S in the asynchronous shared memory model, is presented.
Abstract: In this paper, we present efficient implementations of strong shared memory primitives. We use the asynchronous shared memory model. In this model, processes communicate by applying primitive operations (e.g. Read, Write) to a shared memory. We define disjoint-access-parallel implementations. Intuitively, an implementation of shared memory primitives is disjoint-access-parallel if processes which execute shared memory operations that access disjoint sets of words progress concurrently, without interfering with each other (under an assumption described in the paper). Two commonly used primitives, both in theory and in practice, are Compare&Swap (C&S) and the pair Load Linked (LL) and Store Conditional (SC). We present an efficient, non-blocking, disjoint-access-parallel implementation of LL and SCn, using Read and C&S. SCn is a generalization of SC which accesses n memory words. This implementation is constructed in three stages. We first present an implementation of LL, SC and an additional primitive, called Validate (VL), using Read and C&S. We then present an implementation of Read and C&Sn, using LL, SC and VL (C&Sn is a generalization of C&S which accesses n memory words). Finally, we present an implementation of SCn, using Read and C&Sn. The work and space complexities of the implementations presented in this paper improve on those of previous works.
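
For readers unfamiliar with the primitives, the C11 sketch below emulates the LL/SC interface with a compare-and-swap. It only illustrates what the operations mean; unlike the paper's construction, it is neither disjoint-access-parallel nor safe against the ABA problem.

#include <stdatomic.h>
#include <stdio.h>

/* LL records the value observed; SC succeeds only if the word still holds
 * that value (naive CAS emulation, for illustration only). */
static long load_linked(_Atomic long *w, long *linked) {
    *linked = atomic_load(w);
    return *linked;
}

static int store_conditional(_Atomic long *w, long *linked, long newval) {
    return atomic_compare_exchange_strong(w, linked, newval);
}

int main(void) {
    _Atomic long counter = 0;
    long seen;
    do {                                     /* classic LL/SC update loop */
        load_linked(&counter, &seen);
    } while (!store_conditional(&counter, &seen, seen + 1));
    printf("counter = %ld\n", (long)atomic_load(&counter));
    return 0;
}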

249 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references, and an approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
Abstract: Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
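
A small sketch of the software-directed style, using GCC's __builtin_prefetch as a stand-in for the explicit prefetch instructions a compiler would insert; the builtin and the fixed prefetch distance are modern conveniences, not the paper's mechanism.

#include <stdio.h>

#define N 1024
#define PREFETCH_DISTANCE 16     /* tune to memory latency / loop body length */

/* Dot product with software-directed prefetching of elements a fixed
 * distance ahead of their use. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);  /* read, low temporal reuse */
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0, 1);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("%f\n", dot(a, b, N));
    return 0;
}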

238 citations


Proceedings ArticleDOI
14 Nov 1994
TL;DR: The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes.
Abstract: Portability, efficiency and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes a new approach, called Global Arrays (GA), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented GA libraries on a variety of computer systems, including the Intel DELTA and Paragon, the IBM SP-1 (all message-passers), the Kendall Square KSR-2 (a nonuniform access shared-memory machine), and networks of Unix workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GA in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.
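
The sketch below shows the programming model: any process fetches an arbitrary block of a distributed matrix with a one-sided get, with no matching receive posted by the owners. The calls follow the later Global Arrays C interface (NGA_Create/NGA_Get) as best recalled; the 1994 library exposed a Fortran-flavoured interface, so treat the exact names and headers as assumptions.

/* Global Arrays-style usage sketch (function names assumed, see note above). */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[2] = {1000, 1000};
    int chunk[2] = {-1, -1};                   /* let the library choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);   /* distributed 1000x1000 matrix */

    /* One-sided fetch of a 10x10 block: no cooperation from the owning
     * processes is required. */
    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    double block[10][10];
    NGA_Get(g_a, lo, hi, block, ld);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}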

224 citations


Journal ArticleDOI
TL;DR: A detailed performance and scalability analysis of the communication primitives is presented, carried out using a workload generator, kernels from real applications, and a large unstructured adaptive application.

211 citations


Patent
21 Jun 1994
TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links, and the entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: Evidence is presented that application-specific protocols substantially improved the performance of three application programs-appbt, em3d, and barnes-over carefully tuned transparent shared memory implementations.
Abstract: Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match an application's communication patterns and memory semantics. This paper presents evidence that this approach can lead to large performance improvements. It shows that application-specific protocols substantially improved the performance of three application programs--appbt, em3d, and barnes--over carefully tuned transparent shared memory implementations. The speed-ups were obtained on Blizzard, a fine-grained DSM system running on a 32-node Thinking Machines CM-5.

Journal ArticleDOI
01 Apr 1994
TL;DR: An overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers, which includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies is presented.
Abstract: This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: A new method for write detection that relies on the compiler and runtime system to detect writes to shared data without invoking the operating system, and has low average write latency and supports fine-grained sharing with low overhead.
Abstract: Most software-based distributed shared memory (DSM) systems rely on the operating system's virtual memory interface to detect writes to shared data. Strategies based on virtual memory page protection create two problems for a DSM system. First, writes can have high overhead since they are detected with a page fault. As a result, a page must be written many times to amortize the cost of that fault. Second, the size of a virtual memory page is too big to serve as a unit of coherency, inducing false sharing. Mechanisms to handle false sharing can increase runtime overhead and may cause data to be unnecessarily communicated between processors. In this paper, we present a new method for write detection that solves these problems. Our method relies on the compiler and runtime system to detect writes to shared data without invoking the operating system. We measure and compare implementations of a distributed shared memory system using both strategies, virtual memory and compiler/runtime, running a range of applications on a small scale distributed memory multicomputer. We show that the new method has low average write latency and supports fine-grained sharing with low overhead. Further, we show that the dominant cost of write detection with either strategy is due to the mechanism used to handle fine-grain sharing.
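
A hand-written illustration of the compiler/runtime strategy: each store to shared data is preceded by code that marks the enclosing block dirty in a bitmap, so no page fault and no page-sized coherence unit is involved. The names, bitmap layout, and 64-byte block size are assumptions for illustration.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SHIFT 6                      /* 64-byte coherence blocks (assumed) */
#define SHARED_WORDS 4096

static double  shared_region[SHARED_WORDS];
static uint8_t dirty[(SHARED_WORDS * sizeof(double)) >> BLOCK_SHIFT];

/* What the compiler would emit in place of a plain "shared_region[i] = v". */
static void shared_write(int i, double v) {
    size_t byte = (size_t)i * sizeof(double);
    dirty[byte >> BLOCK_SHIFT] = 1;        /* record the write, no OS involvement */
    shared_region[i] = v;                  /* the original store */
}

int main(void) {
    for (int i = 0; i < 100; i++) shared_write(i, 1.0);
    int nd = 0;
    for (size_t b = 0; b < sizeof dirty; b++) nd += dirty[b];
    printf("dirty blocks to flush at the next synchronization: %d\n", nd);
    return 0;
}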

Patent
22 Jun 1994
TL;DR: In this article, a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor is described, with several individual processors all having communication links to several memories without restriction.
Abstract: There is disclosed a multiprocessor system and method arranged, in one embodiment, as an image and graphics processor. The processor is structured with several individual processors all having communication links to several memories without restriction. A crossbar switch serves to establish the processor memory links and an inter-processor communication link allows the processors to communicate with each other for the purpose of establishing operational modes. A parameter memory, accessible via the crossbar switch, is used in conjunction with the communication link for control purposes. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Patent
22 Jun 1994
TL;DR: In this paper, a multi-processor system and method arranged, in one embodiment, as an image and graphics processor is described; the individual processors have communication links to several memories, a crossbar switch establishes the processor-memory links, and the entire image processor is contained on a single silicon chip.
Abstract: There is disclosed a multi-processor system and method arranged, in one embodiment, as an image and graphics processor. The image processor is structured with several individual processors all having communication links to several memories. A crossbar switch serves to establish the processor memory links. The entire image processor, including the individual processors, the crossbar switch and the memories, is contained on a single silicon chip.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: This paper provides an extensive analysis on several complex scientific algorithms written in SAM on a variety of hardware platforms and finds that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware.
Abstract: This paper describes the design and evaluation of SAM, a shared object system for distributed memory machines. SAM is a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates mechanisms to address the problem of high communication overheads on distributed memory machines; these mechanisms include tying synchronization to data access, chaotic access to data, prefetching of data, and pushing of data to remote processors. SAM has been implemented on the CM-5, Intel iPSC/860 and Paragon, IBM SP1, and networks of workstations running PVM. SAM applications run on all these platforms without modification. This paper provides an extensive analysis of several complex scientific algorithms written in SAM on a variety of hardware platforms. We find that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware. Our experience suggests that SAM is successful in allowing programmers to use distributed memory machines effectively with much less programming effort than required today.

Journal ArticleDOI
01 Feb 1994
TL;DR: In this article, a message-passing multi-cell approach is proposed for scalable short-range molecular dynamics simulations on distributed memory MIMD multicomputers.
Abstract: We present a new scalable algorithm for short-range molecular dynamics simulations on distributed memory MIMD multicomputers based on a message-passing multi-cell approach. We have implemented the algorithm on the Connection Machine 5 (CM-5) and demonstrate that meso-scale molecular dynamics with more than 10^8 particles is now possible on massively parallel MIMD computers. Typical runs show single-particle update times of 0.15 μs in 2 dimensions (2D) and approximately 1 μs in 3 dimensions (3D) on a 1024-node CM-5 without vector units, corresponding to more than 1.8 Gflops overall performance. We also present a scaling equation which agrees well with actually observed timings.
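
A serial 2-D sketch of the cell (multi-cell) idea: particles are binned into cells no smaller than the cutoff, so interactions are only examined between particles in the same or neighbouring cells. The message-passing exchange of boundary cells between nodes, which the paper's algorithm adds, is omitted here.

#include <stdio.h>
#include <stdlib.h>

#define NP   1000
#define L    10.0          /* box side */
#define RCUT 1.0           /* interaction cutoff */
#define NC   10            /* cells per dimension = L / RCUT */

typedef struct { double x, y; int next; } particle_t;

int main(void) {
    particle_t p[NP];
    int head[NC][NC];
    srand(1);
    for (int i = 0; i < NP; i++) {
        p[i].x = L * rand() / (double)RAND_MAX;
        p[i].y = L * rand() / (double)RAND_MAX;
    }

    /* build linked cell lists */
    for (int cx = 0; cx < NC; cx++)
        for (int cy = 0; cy < NC; cy++) head[cx][cy] = -1;
    for (int i = 0; i < NP; i++) {
        int cx = (int)(p[i].x / RCUT), cy = (int)(p[i].y / RCUT);
        if (cx == NC) cx--;                 /* guard particles exactly at L */
        if (cy == NC) cy--;
        p[i].next = head[cx][cy];
        head[cx][cy] = i;
    }

    /* count pairs within the cutoff, visiting only neighbouring cells */
    long pairs = 0;
    for (int cx = 0; cx < NC; cx++)
        for (int cy = 0; cy < NC; cy++)
            for (int i = head[cx][cy]; i != -1; i = p[i].next)
                for (int dx = -1; dx <= 1; dx++)
                    for (int dy = -1; dy <= 1; dy++) {
                        int nx = cx + dx, ny = cy + dy;
                        if (nx < 0 || ny < 0 || nx >= NC || ny >= NC) continue;
                        for (int j = head[nx][ny]; j != -1; j = p[j].next) {
                            if (j <= i) continue;      /* count each pair once */
                            double rx = p[i].x - p[j].x, ry = p[i].y - p[j].y;
                            if (rx*rx + ry*ry < RCUT*RCUT) pairs++;
                        }
                    }
    printf("pairs within cutoff: %ld\n", pairs);
    return 0;
}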

Proceedings ArticleDOI
01 Jan 1994
TL;DR: The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system.
Abstract: This paper presents the architecture, implementation, and performance results for the SGI Challenge symmetric multiprocessor system. Novel aspects of the architecture are highlighted, as well as key design trade-offs targeted at increasing performance and reducing complexity. Multiprocessor design verification techniques and their impact are also presented. The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system. Hardware cache coherence mechanisms maintain a consistent view of shared memory for all processors, with no software overhead and minimal impact on processor performance. HDL simulation with random, self checking vector generation and a lightweight operating system on full processor models contributed to a concept to customer shipment cycle of 26 months.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: CHAOS is described, a library of efficient runtime primitives that provides support for dynamic data partitioning, efficient preprocessing and fast data migration in adaptive irregular problems and is used to parallelize kernels from two adaptive applications.
Abstract: In adaptive irregular problems, data arrays are accessed via indirection arrays, and data access patterns change during computation. Parallelizing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This paper describes CHAOS, a library of efficient runtime primitives that provides such support. To demonstrate the effectiveness of the runtime support, two adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a code for simulating gas flows (DSMC). We have also proposed minor extensions to Fortran D which would enable compilers to parallelize irregular forall loops in such adaptive applications by embedding calls to primitives provided by a runtime library. We have implemented our proposed extensions in the Syracuse Fortran 90D/HPF prototype compiler, and have used the compiler to parallelize kernels from two adaptive applications.
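
A toy illustration of the inspector phase in the CHAOS style: the indirection array is scanned once to find off-processor references, from which a communication schedule and gather would be built before the executor loop runs. The block distribution and names are illustrative, not the library's actual interface.

#include <stdio.h>

#define NGLOBAL 100
#define NPROCS  4
#define MYRANK  1
#define NEDGES  8

int main(void) {
    int block = NGLOBAL / NPROCS;               /* block data distribution */
    int lo = MYRANK * block, hi = lo + block;   /* locally owned index range */
    int ia[NEDGES] = { 27, 3, 45, 30, 99, 26, 60, 31 };  /* indirection array */

    int fetch[NEDGES], nfetch = 0;
    for (int e = 0; e < NEDGES; e++) {          /* inspector: one pass over ia[] */
        if (ia[e] < lo || ia[e] >= hi)          /* off-processor reference */
            fetch[nfetch++] = ia[e];
        /* the executor would later use locally translated indices */
    }

    printf("process %d must gather %d off-processor elements:", MYRANK, nfetch);
    for (int k = 0; k < nfetch; k++) printf(" %d", fetch[k]);
    printf("\n");
    return 0;
}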

Journal ArticleDOI
TL;DR: This thesis describes an advanced compiler that can generate efficient parallel programs when the source programming language naturally represents an application's parallelism; Fortran 90D/HPF, described in this thesis, is such a language.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: This work introduces the design of a runtime interface, called Chant, that supports communicating threads in a distributed memory environment, and is layered atop standard message passing and lightweight thread libraries, and supports efficient point-to-point and remote service request communication primitives.
Abstract: Lightweight threads are becoming increasingly useful for supporting parallelism and asynchronous control structures in applications and language implementations. However, lightweight thread packages for distributed memory systems have received little attention. In this paper, we introduce the design of a runtime interface, called Chant, that supports communicating threads in a distributed memory environment. In particular, Chant is layered atop standard message passing and lightweight thread libraries, and supports efficient point-to-point and remote service request communication primitives. We examine the design issues of Chant, the efficiency of its point-to-point communication layer, and the evaluation of scheduling policies to poll for the presence of incoming messages.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases.
Abstract: We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck.

Journal ArticleDOI
01 Apr 1994
TL;DR: This paper focuses on the expressiveness of the Linda model, on techniques required to build efficient implementations, and on observed performance both on workstation networks and distributed-memory parallel machines.
Abstract: The use of distributed data structures in a logically-shared memory is a natural, readily-understood approach to parallel programming. The principal argument against such an approach for portable software has always been that efficient implementations could not scale to massively-parallel, distributed memory machines. Now, however, there is growing evidence that it is possible to develop efficient and portable implementations of virtual shared memory models on scalable architectures. In this paper we discuss one particular example: Linda. After presenting an introduction to the Linda model, we focus on the expressiveness of the model, on techniques required to build efficient implementations, and on observed performance both on workstation networks and distributed-memory parallel machines. Finally, we conclude by briefly discussing the range of applications developed with Linda and Linda's suitability for the sorts of heterogeneous, dynamically-changing computational environments that are of growing significance.
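
A hedged illustration of Linda's coordination model in C-Linda style: out() deposits tuples, in() withdraws matching tuples (with ?var marking formals to be filled in), and eval() creates live tuples that run as processes. The syntax is reproduced from memory and requires a Linda preprocessor rather than a plain C compiler; NTASKS, NWORKERS and compute() are placeholders.

/* Master/worker bag-of-tasks in C-Linda style (illustrative, see note above). */
#define NTASKS   100
#define NWORKERS 8

int compute(int id);                   /* application kernel (not shown) */
int worker(void);

int real_main(int argc, char **argv) { /* C-Linda entry point */
    for (int i = 0; i < NTASKS; i++)
        out("task", i);                /* drop work tuples into tuple space */
    for (int w = 0; w < NWORKERS; w++)
        eval("worker", worker());      /* create live tuples: worker processes */
    for (int i = 0; i < NTASKS; i++) {
        int id, result;
        in("result", ?id, ?result);    /* withdraw results as they appear */
    }
    return 0;
}

int worker(void) {
    for (;;) {
        int id;
        in("task", ?id);               /* atomically withdraw one work tuple */
        out("result", id, compute(id));
    }
}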

Proceedings ArticleDOI
01 Nov 1994
TL;DR: The design and implementation of the Jovian library is discussed, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays.
Abstract: There has been a great deal of recent interest in parallel I/O. We discuss the design and implementation of the Jovian library, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays. We also present preliminary performance measurements from benchmarking the Jovian I/O library on the IBM SP1 distributed memory parallel machine for two application templates.

Journal ArticleDOI
TL;DR: The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s ALIGN and DISTRIBUTE directives, or the CM-2 virtual processor system.
Abstract: This paper considers the problem of distributing data and code among the processors of a distributed memory supercomputer. Provided that the source program is amenable to detailed dataflow analysis, one may determine a placement function by an incremental analogue of Gaussian elimination. Such a function completely characterizes the distribution by giving the identity of the virtual processor on which each elementary calculation is done. One has then to “realize” the virtual processors on the PEs. The resulting structure satisfies the “owner computes” rule and is reminiscent of two-level distribution schemes, like HPF’s ALIGN and DISTRIBUTE directives, or the CM-2 virtual processor system.
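
A minimal sketch of the “owner computes” rule with a toy placement function: each processor scans the whole iteration space but executes only the instances whose results it owns. The cyclic placement used here is just an example, not a placement derived by the paper's dataflow analysis.

#include <stdio.h>

#define N      16
#define NPROCS 4

static int place(int i) { return i % NPROCS; }   /* toy placement function */

int main(void) {
    double a[N];
    for (int p = 0; p < NPROCS; p++)          /* stand-in for SPMD processes */
        for (int i = 0; i < N; i++)
            if (place(i) == p)                /* owner-computes guard */
                a[i] = 2.0 * i;               /* p owns a[i], so p computes it */
    for (int i = 0; i < N; i++) printf("%g ", a[i]);
    printf("\n");
    return 0;
}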

Journal ArticleDOI
TL;DR: The article enumerates and classifies parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network, and determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.
Abstract: The computational expense of volume rendering motivates the development of parallel implementations on multicomputers. Parallelism achieves higher frame rates, which provide more natural viewing control and enhanced comprehension of 3D structure. Although many parallel implementations exist, we have no framework to compare their relative merits independent of host hardware. The article attempts to establish that framework by enumerating and classifying parallel volume-rendering algorithms suitable for multicomputers with distributed memory and a communication network. It determines the communication costs for classes of parallel algorithms by considering their inherent communication requirements.

Book ChapterDOI
08 Aug 1994
TL;DR: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines that uses available C compilers and packet-transport primitives, and links with existing libraries.
Abstract: Cid is a parallel, “shared-memory” superset of C for distributed-memory machines. A major objective is to keep the entry cost low. For users, the language should be easily comprehensible to a C programmer. For implementors, it should run on standard hardware (including workstation farms); it should not require major new compilation techniques (which may not even be widely applicable); and it should be compatible with existing code, run-time systems and tools. Cid is implemented with a simple pre-processor and a library, uses available C compilers and packet-transport primitives, and links with existing libraries.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.
Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.

Journal ArticleDOI
TL;DR: This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers, and shows that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor.