
Showing papers on "Distributed memory published in 1989"


Journal ArticleDOI
TL;DR: Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.
Abstract: The memory coherence problem in designing and implementing a shared virtual memory on loosely coupled multiprocessors is studied in depth. Two classes of algorithms, centralized and distributed, for solving the problem are presented. A prototype shared virtual memory on an Apollo ring based on these algorithms has been implemented. Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.

1,139 citations
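
The protocols studied here are page-based. As a rough illustration of one centralized-manager invalidation scheme in this spirit (a simplified sketch with illustrative names, not code from the paper), the manager only needs to track an owner and a copy set per page:

    # Illustrative sketch of a page-based, centralized-manager invalidation
    # protocol for shared virtual memory (simplified; not the paper's code).

    class CentralManager:
        def __init__(self, num_pages, initial_owner=0):
            # Per page: the current owner and the processors holding copies.
            self.owner = {p: initial_owner for p in range(num_pages)}
            self.copy_set = {p: {initial_owner} for p in range(num_pages)}

        def read_fault(self, page, requester):
            # The reader obtains a read-only copy from the owner and joins the copy set.
            self.copy_set[page].add(requester)
            return self.owner[page]          # requester fetches the page from here

        def write_fault(self, page, requester):
            # The writer invalidates all other copies and becomes the new owner.
            to_invalidate = self.copy_set[page] - {requester}
            self.copy_set[page] = {requester}
            self.owner[page] = requester
            return to_invalidate             # requester sends invalidations here

A distributed-manager variant spreads this table across processors, for example by assigning page p to manager p mod N.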


Journal ArticleDOI
George Cybenko1
TL;DR: This paper completely analyzes the hypercube network by explicitly computing the eigenstructure of its node adjacency matrix and shows that a diffusion approach to load balancing on a hypercube multiprocessor is inferior to another approach, called the dimension exchange method.

1,074 citations
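
For context, the dimension exchange method mentioned above pairs each node with its neighbor across one hypercube dimension at a time and equalizes the two loads. The small simulation below uses a naive equal-split rule and fractional loads purely for illustration; it is not the paper's analysis.

    # One balancing sweep of dimension exchange on a d-dimensional hypercube.
    def dimension_exchange_sweep(load, dims):
        # load[i] is the work held by node i; len(load) == 2**dims.
        for d in range(dims):
            for node in range(len(load)):
                partner = node ^ (1 << d)       # neighbor across dimension d
                if node < partner:              # handle each pair once
                    avg = (load[node] + load[partner]) / 2.0
                    load[node] = load[partner] = avg
        return load

    print(dimension_exchange_sweep([8.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0], 3))
    # -> every node ends up with 1.5 units of work after a single sweep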


Proceedings ArticleDOI
21 Jun 1989
TL;DR: A system is described which, given a sequential program and its domain decomposition, performs process decomposition so as to enhance spatial locality of reference, along with an application: generating code from shared-memory programs for the (distributed memory) Intel iPSC/2.
Abstract: In the context of sequential computers, it is common practice to exploit temporal locality of reference through devices such as caches and virtual memory. In the context of multiprocessors, we believe that it is equally important to exploit spatial locality of reference. We are developing a system which, given a sequential program and its domain decomposition, performs process decomposition so as to enhance spatial locality of reference. We describe an application of this method - generating code from shared-memory programs for the (distributed memory) Intel iPSC/2.

256 citations


Proceedings Article
01 Jan 1989
TL;DR: A direct cancellation mechanism is proposed that eliminates the need for anti-messages, provides an efficient way to cancel erroneous computations, and thereby removes many of the overheads associated with conventional, message-based implementations of Time Warp.
Abstract: A variation of the Time Warp parallel discrete event simulation mechanism is presented that is optimized for execution on a shared memory multiprocessor. In particular, the direct cancellation mechanism is proposed, which eliminates the need for anti-messages and provides an efficient mechanism for canceling erroneous computations. The mechanism thereby eliminates many of the overheads associated with conventional, message-based implementations of Time Warp. More importantly, this mechanism effects rapid repair of the parallel computation when an error is discovered. Initial performance measurements of an implementation of the mechanism executing on a BBN Butterfly multiprocessor are presented. These measurements indicate that the mechanism achieves good performance, particularly for many workloads where conservative clock synchronization algorithms perform poorly. Speedups as high as 56.8 using 64 processors were obtained. However, our studies also indicate that state-saving overheads represent a significant stumbling block for many parallel simulations using Time Warp.

211 citations
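
The idea behind direct cancellation is that, in shared memory, an event can hold direct pointers to the events it scheduled, so a rollback can cancel descendants in place instead of sending anti-messages. The following is a highly simplified sketch of that bookkeeping, not the paper's implementation.

    # Each event records pointers to the events it caused, so rollback can
    # cancel descendants directly, without anti-messages.
    class Event:
        def __init__(self, timestamp):
            self.timestamp = timestamp
            self.caused = []        # events scheduled while processing this one
            self.cancelled = False

    def schedule(parent, child, pending):
        parent.caused.append(child)     # a direct pointer replaces an anti-message
        pending.append(child)

    def cancel_descendants(event):
        # On rollback of 'event', mark everything it (transitively) caused.
        for child in event.caused:
            if not child.cancelled:
                child.cancelled = True
                cancel_descendants(child)
        event.caused.clear()

    root, child, pending = Event(10.0), Event(12.0), []
    schedule(root, child, pending)
    cancel_descendants(root)            # rollback of root
    print(child.cancelled)              # -> True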


Proceedings ArticleDOI
01 Nov 1989
TL;DR: The design of a distributed shared memory (DSM) system called Mirage is described, which provides a form of network transparency to make network boundaries invisible for shared memory and is upward compatible with an existing interface to shared memory.
Abstract: Shared memory is an effective and efficient paradigm for interprocess communication. We are concerned with software that makes use of shared memory in a single site system and its extension to a multimachine environment. Here we describe the design of a distributed shared memory (DSM) system called Mirage developed at UCLA. Mirage provides a form of network transparency to make network boundaries invisible for shared memory and is upward compatible with an existing interface to shared memory. We present the rationale behind our design decisions and important details of the implementation. Mirage's basic performance is examined by component timings, a worst case application, and a “representative” application. In some instances of page contention, the tuning parameter in our design improves application throughput. In other cases, thrashing is reduced and overall system performance improved using our tuning parameter.

205 citations


Journal ArticleDOI
TL;DR: A new algorithm for implementing plasma particle-in-cell (PIC) simulation codes on concurrent processors with distributed memory, named the general concurrent PIC algorithm (GCPIC), has been used to implement an electrostatic PIC code on the 32-node JPL Mark III Hypercube parallel computer.

149 citations


Proceedings ArticleDOI
01 Nov 1989
TL;DR: The design and implementation of the PLATINUM memory management system are described, emphasizing the coherent memory, and the cost and performance of a set of application programs running on PLATINUM are measured.
Abstract: PLATINUM is an operating system kernel with a novel memory management system for Non-Uniform Memory Access (NUMA) multiprocessor architectures. This memory management system implements a coherent memory abstraction. Coherent memory is uniformly accessible from all processors in the system. When used by applications coded with appropriate programming styles, it appears to be nearly as fast as local physical memory and it reduces memory contention. Coherent memory makes programming NUMA multiprocessors easier for the user while attaining a level of performance comparable with hand-tuned programs. This paper describes the design and implementation of the PLATINUM memory management system, emphasizing the coherent memory. We measure the cost of basic operations implementing the coherent memory. We also measure the performance of a set of application programs running on PLATINUM. Finally, we comment on the interaction between architecture and the coherent memory system. PLATINUM currently runs on the BBN Butterfly Plus Multiprocessor.

138 citations


Journal ArticleDOI
01 Apr 1989
TL;DR: Traces of parallel programs are used to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol, and show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs.
Abstract: Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization. Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor locality perform better than those with fine-grain sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.

130 citations
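
As a rough illustration of the kind of measurement involved, a trace-driven model of a write-invalidate protocol needs only per-processor cache contents plus counters for misses and bus traffic. The toy sketch below (unbounded caches, a single block size) is illustrative and is not the simulator used in the study.

    # Toy trace-driven model of a write-invalidate protocol.
    def simulate(trace, num_procs, block_size=16):
        caches = [set() for _ in range(num_procs)]   # blocks present per processor
        misses = bus_transactions = 0
        for proc, op, addr in trace:                 # op is 'r' or 'w'
            block = addr // block_size
            if block not in caches[proc]:
                misses += 1
                bus_transactions += 1                # fetch the block over the bus
                caches[proc].add(block)
            if op == 'w':
                for other in range(num_procs):       # invalidate all other copies
                    if other != proc and block in caches[other]:
                        caches[other].discard(block)
                        bus_transactions += 1
        return misses, bus_transactions

    print(simulate([(0, 'r', 0), (1, 'r', 0), (0, 'w', 0), (1, 'r', 0)], 2))
    # -> (3, 4): the second read by processor 1 misses again because of sharing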


Proceedings Article
01 Dec 1989

105 citations


Patent
07 Nov 1989
TL;DR: A high speed computer is presented that permits the partitioning of a single computer program into smaller concurrent processes running in different parallel processors; program execution time is divided into synchronous phases, each of which may require a shared memory to be configured in a distinct way.
Abstract: A high speed computer that permits the partitioning of a single computer program into smaller concurrent processes running in different parallel processors. The program execution time is divided into synchronous phases, each of which may require a shared memory to be configured in a distinct way. At the end of each execution phase, the processors are resynchronized such that the composite system will be in a known state at a known point in time. The computer makes efficient use of hardware such that n processors can solve a problem almost n times as fast as a single processor.

101 citations


Journal ArticleDOI
TL;DR: A form of coherence in the ray-tracing algorithm is identified that can be exploited to develop optimum schemes for data distribution in a multiprocessor system, which gives rise to high processor efficiency for systems with limited distributed memory.
Abstract: The scalability and cost effectiveness of general-purpose distributed-memory multiprocessor systems make them particularly suitable for ray-tracing applications. However, the limited memory available to each processor in such a system requires schemes to distribute the model database among the processors. The authors identify a form of coherence in the ray-tracing algorithm that can be exploited to develop optimum schemes for data distribution in a multiprocessor system. This in turn gives rise to high processor efficiency for systems with limited distributed memory.

Book ChapterDOI
TL;DR: This chapter describes the models of working memory, a functional part of human memory that accomplishes the temporary holding and manipulation of information during the performance of a range of cognitive tasks such as comprehension, learning, and reasoning.
Abstract: This chapter describes models of working memory. Working memory refers to a functional part of human memory that accomplishes the temporary holding and manipulation of information during the performance of a range of cognitive tasks such as comprehension, learning, and reasoning. At least three different functions performed by working memory, as expressed by models of cognitive processing, can be described in computational terms. Working memory functions as a place to hold operands or things to be operated on by the operations of cognitive processing, a cache to hold in a rapidly accessible state recently input or used information, and a buffer between processes that happen at incommensurate rates. In addition to its functions, working memory has also been characterized from two other points of view: (1) time and (2) structure. It is also distinguished from very short-term memory, which lasts for only a fraction of a second. While a number of partial models of working memory exist, they do not yet embrace all the phenomena related to it in a computational framework.

Patent
21 Mar 1989
TL;DR: A multiprocessor system is described in which a control processor and a high-level data-transfer processor are clocked by a shared variable-duration clock; the duration of the clock is adjusted on the fly to accommodate whichever of the two processors needs the longer cycle time on that particular cycle.
Abstract: A multiprocessor system which includes a control processor and a high-level data-transfer processor. Both processors are clocked by a shared variable-duration clock. The duration of the clock is adjusted on the fly, to accommodate whichever of the two processors needs the longer cycle time on that particular cycle. Thus, the control processor 110 and the data transfer processor 120 are enabled to run synchronously, even though they are concurrently running separate streams of instructions.

Journal ArticleDOI
TL;DR: A new efficient parallel triangular solver is described; it is based on the previous method of Li and Coleman [1986] but is considerably more efficient when $\frac{n}{p}$ is relatively modest, where $p$ is the number of processors and $n$ is the problem dimension.
Abstract: Efficient triangular solvers for use on message passing multiprocessors are required, in several contexts, under the assumption that the matrix is distributed by columns (or rows) in a wrap fashion. In this paper we describe a new efficient parallel triangular solver for this problem. This new algorithm is based on the previous method of Li and Coleman [1986] but is considerably more efficient when $\frac{n}{p}$ is relatively modest, where $p$ is the number of processors and $n$ is the problem dimension. A useful theoretical analysis is provided as well as extensive numerical results obtained on an Intel iPSC with $p \leq 128$.
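
Under the assumed wrap (cyclic) column distribution, the owner of column j computes x[j] and applies that column's update to the right-hand side. The serial sketch below mimics that organization and charges the update work to each column's owner; it is illustrative only and not the pipelined algorithm of the paper.

    # Column-oriented forward substitution with a wrap (cyclic) mapping of
    # columns to processors; 'work' records how the updates would be spread.
    def wrap_forward_solve(L, b, p):
        n = len(b)
        x = [0.0] * n
        work = [0] * p
        for j in range(n):
            owner = j % p                   # processor holding column j
            x[j] = b[j] / L[j][j]           # owner computes x[j] ...
            for i in range(j + 1, n):       # ... and applies its column's update
                b[i] -= L[i][j] * x[j]
                work[owner] += 1
        return x, work

    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [4.0, 5.0, 6.0]]
    print(wrap_forward_solve(L, [2.0, 5.0, 23.0], p=2))
    # -> ([1.0, 1.333..., 2.055...], [2, 1])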



03 Jan 1989
TL;DR: This thesis presents a parallelizing compiler that automates the program mapping and generates efficient parallel code, achieving 8-fold speedup on the 10-processor array for small matrices of size 180 x 180.
Abstract: Distributed memory parallel computers provide an attractive approach to high speed computing because their performance can be easily scaled up by increasing the number of processor-memory modules. To use these computers, we have to design parallel algorithms and produce parallel programs. In many cases, parallel algorithm design is a mapping of existing algorithms to parallel architectures. In this thesis, we study such a mapping process and present a parallelizing compiler which can: (1) automate the program mapping, and (2) generate efficient parallel code. There are three key components in our program mapping: data decomposition, loop distribution and data relations. Data decomposition maps data structures to the distributed memory system; loop distribution maps the computation to processors; and data relations determine the interprocessor communication. The compiler applies data flow analysis and data dependence analysis to minimize interprocessor communication overhead and parallelize program execution. Based on these ideas, we have implemented the AL parallelizing compiler for the Warp machine. AL is a generic programming language for the prototype implementation. The target machine, Warp, is a programmable linear systolic array of 10 processors. AL has been successfully used to program many applications on Warp. These applications include matrix computations, image processing, finite element analysis, and partial differential equations. The AL compiler is able to generate efficient parallel code. For example, for the LINPACK routines such as LU decomposition, QR decomposition, and singular value decomposition (SVD), the AL compiler generates parallel code which achieves 8-fold speedup on the 10-processor array for small matrices of size 180 x 180. This thesis makes contributions to the research area of parallelizing compilers by introducing a model for mapping programs to distributed memory parallel computers. This thesis also makes contributions to the research area of parallel programming by introducing an approach to programming distributed memory parallel computers.
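
The core of such a mapping is simple to state: the data decomposition assigns array elements to processors, loop distribution lets each processor execute only the iterations that write its own data (the owner-computes rule), and the data relations identify the off-processor reads that must become messages. The hand-written sketch below illustrates that idea with hypothetical names; it is not AL code or compiler output.

    # Block decomposition of a[0..n-1] plus owner-computes loop distribution.
    def block_owner(i, n, p):
        return min(i * p // n, p - 1)

    def local_stencil_step(a, me, n, p):
        new = {}
        for i in range(1, n - 1):
            if block_owner(i, n, p) != me:   # loop distribution: skip others' work
                continue
            # a[i-1] or a[i+1] may live on a neighbor; those reads are the
            # "data relations" a compiler would turn into communication.
            new[i] = 0.5 * (a[i - 1] + a[i + 1])
        return new

    a = [float(i) for i in range(8)]
    print(local_stencil_step(a, me=1, n=8, p=2))   # processor 1 owns iterations 4..6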

Proceedings ArticleDOI
01 Aug 1989
TL;DR: The experimental results show that the ACWN algorithm achieves better performance in most cases than randomized allocation, and that its agility in spreading the work helps it outperform the gradient model in performance and scalability.
Abstract: One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly acute for computations with unpredictable dynamic behavior or irregular structure. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The Adaptive Contracting Within Neighborhood (ACWN) scheme is dynamic, distributed, self-adaptive, and scalable. The basic scheme and its adaptive extensions are described, and contrasted with other schemes that have been proposed in this context. The performance of all three schemes on an iPSC/2 hypercube is presented and analyzed. The experimental results show that the ACWN algorithm achieves better performance in most cases than randomized allocation. Its agility in spreading the work helps it outperform the gradient model in performance and scalability.
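
The contracting-within-neighborhood decision can be stated in a few lines: newly created work stays local when the node is lightly loaded, and otherwise is handed to the least-loaded immediate neighbor. The sketch below shows such a placement rule; the threshold and the load measure are placeholders, not the paper's exact scheme.

    # ACWN-style placement decision for a newly created task.
    def place_new_task(my_id, loads, neighbors, threshold=4):
        # loads: current queue length per node; neighbors: adjacency list.
        if loads[my_id] <= threshold:
            return my_id                              # keep the task here
        best = min(neighbors[my_id], key=lambda n: loads[n])
        return best if loads[best] < loads[my_id] else my_id

    # 4-node ring example: node 0 is overloaded, so the task moves to node 3.
    loads = {0: 9, 1: 6, 2: 5, 3: 2}
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(place_new_task(0, loads, ring))             # -> 3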

Proceedings ArticleDOI
01 Aug 1989
TL;DR: A methodology based on a unified approach to task and data partitioning is presented, showing how to derive data and task partitions for computations expressed as nested loops that exhibit regular dependencies and how to map these onto distributed memory multiprocessors.
Abstract: The availability of large distributed memory multiprocessors and parallel computers with complex memory hierarchies has left the programmer with the difficult task of planning the detailed parallel execution of a program; for example, in the case of distributed memory machines, the programmer is forced to manually distribute code and data in addition to managing communication among tasks explicitly. Current work in compiler support for this task has focused on automating task partitioning, assuming a fixed data partition or a programmer-specified (in the form of annotations) data partition. This paper argues that one of the reasons for inefficient parallel execution is the lack of synergism between task partitioning and data partitioning and allocation; hence data and task allocation should both be influenced by the inherent dependence structure of the computation (which is the source of synergism). We present a methodology based on a unified approach to task and data partitioning; we show how to derive data and task partitions for computations expressed as nested loops that exhibit regular dependencies and how to map these onto distributed memory multiprocessors. Based on the mapping, we show how to derive code for the nodes of a distributed memory machine with appropriate message transmission constructs. We also discuss related communication optimizations.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: The results indicate that the MHN organization can have substantial performance benefits and so should be of increasing interest as the enabling technology becomes available.
Abstract: As VLSI technology continues to improve, circuit area is gradually being replaced by pin restrictions as the limiting factor in design. Thus, it is reasonable to anticipate that on-chip memory will become increasingly inexpensive since it is a simple, regular structure that can easily take advantage of higher densities. In this paper we examine one way in which this trend can be exploited to improve the performance of multistage interconnection networks (MINs). In particular, we consider the performance benefits of placing significant memory in each MIN switch. This memory is used in two ways: to store (the unique copies of) data items and to maintain directories. The data storage function allows data to be placed nearer processors that reference it relatively frequently, at the cost of increased distance to other processors. The directory function allows data items to migrate in reaction to changes in program locality. We call our MIN architecture the Memory Hierarchy Network (MHN). In a preliminary investigation of the merits of this design [8] we examined the performance of MHNs under the simplifying assumption that an unlimited amount of memory was available in each switch. We found that despite the longer switch processing times of the MHN, system performance is improved over simpler, conventional schemes based on caching. In this paper we refine the earlier model to account for practical storage limitations. We study ways to reduce the amount of directory storage required by keeping only partial information regarding the current location of data items. The price paid for this reduction in memory requirement is more complicated (and in some circumstances slower) protocols. We obtain comparative performance estimates in an environment containing a single global memory module and a tree-structured MIN. Our results indicate that the MHN organization can have substantial performance benefits and so should be of increasing interest as the enabling technology becomes available.

Proceedings ArticleDOI
01 Apr 1989
TL;DR: The VMP-MC design is described, a distributed parallel multi-computer based on the VMP multiprocessor design that is intended to provide a set of building blocks for configuring machines from one to several thousand processors.
Abstract: The problem of building a scalable shared memory multiprocessor can be reduced to that of building a scalable memory hierarchy, assuming interprocessor communication is handled by the memory system. In this paper, we describe the VMP-MC design, a distributed parallel multi-computer based on the VMP multiprocessor design, that is intended to provide a set of building blocks for configuring machines from one to several thousand processors. VMP-MC uses a memory hierarchy based on shared caches, ranging from on-chip caches to board-level caches connected by busses to, at the bottom, a high-speed fiber optic ring. In addition to describing the building block components of this architecture, we identify the key performance issues associated with the design and provide performance evaluation of these issues using trace-driven simulation and measurements from the VMP. This work was sponsored in part by the Defense Advanced Research Projects Agency under Contract N00014-88-K-0619.

01 Jul 1989
TL;DR: A new design for a Sparse Distributed Memory, called the selected-coordinate design, is described, where there are a large number of memory locations, each of which may be activated by many different addresses (binary vectors) in a very large address space.
Abstract: A new design for a Sparse Distributed Memory, called the selected-coordinate design, is described. As in the original design, there are a large number of memory locations, each of which may be activated by many different addresses (binary vectors) in a very large address space. Each memory location is defined by specifying ten selected coordinates (bit positions in the address vectors) and a set of corresponding assigned values, consisting of one bit for each selected coordinate. A memory location is activated by an address if, for all ten of the location's selected coordinates, the corresponding bits in the address vector match the respective assigned value bits, regardless of the other bits in the address vector. Some comparative memory capacity and signal-to-noise ratio estimates for both the new and original designs are given. A few possible hardware embodiments of the new design are described.
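
The activation rule quoted above is concrete enough to write down directly: a location stores ten (coordinate, assigned-bit) pairs and is activated by any address that agrees with all ten. The sketch below follows that rule and adds the usual Sparse Distributed Memory convention of counter-based writes and a majority-vote read; the sizes are illustrative.

    import random

    ADDRESS_BITS, WORD_BITS, SELECTED = 1000, 256, 10

    def make_location():
        coords = random.sample(range(ADDRESS_BITS), SELECTED)
        return {c: random.randint(0, 1) for c in coords}, [0] * WORD_BITS

    def activated(location, address):
        assigned, _ = location          # fire only if all ten selected bits match
        return all(address[c] == bit for c, bit in assigned.items())

    def write(memory, address, word):
        for loc in memory:
            if activated(loc, address):
                counters = loc[1]
                for k, bit in enumerate(word):    # +1 for a 1 bit, -1 for a 0 bit
                    counters[k] += 1 if bit else -1

    def read(memory, address):
        sums = [0] * WORD_BITS
        for loc in memory:
            if activated(loc, address):
                for k, c in enumerate(loc[1]):
                    sums[k] += c
        return [1 if s > 0 else 0 for s in sums]  # majority vote over active locations

    memory = [make_location() for _ in range(20000)]
    addr = [random.randint(0, 1) for _ in range(ADDRESS_BITS)]
    word = [random.randint(0, 1) for _ in range(WORD_BITS)]
    write(memory, addr, word)
    print(read(memory, addr) == word)   # almost always True: about 20 locations fire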

Book ChapterDOI
12 Jun 1989
TL;DR: The Data Diffusion Machine (DDM), a scalable shared memory multiprocessor in which the location of a datum in the machine is completely decoupled from its address, provides an automatic duplication and migration of the data to wherever needed.
Abstract: The Data Diffusion Machine (DDM) is a scalable shared memory multiprocessor in which the location of a datum in the machine is completely decoupled from its address. A data access "snooping" protocol provides an automatic duplication and migration of the data to wherever needed. The protocol also handles data coherence and replacement. The hardware organization consists of a hierarchy of buses and data controllers linking an arbitrary number of processors, each having a large set-associative memory. Each data controller has a set-associative directory containing status bits for data under its control. The rest of the system appears to each processor like a shared memory system, which makes the DDM a general architecture. The DDM is scalable in that there may be any number of levels in the hierarchy. The logical topmost bus (or any other bus) can be implemented by an unlimited number of physical buses, removing an anticipated bottleneck.

Journal ArticleDOI
TL;DR: Strategies for performing database operations in a cube-connected multicomputer system with parallel I/O are presented, which account for the non-uniform distribution of data across parallel paths by incorporating data redistribution steps as part of the overall algorithm.
Abstract: Distributed memory architectures, specifically hypercubes, for parallel database processing are treated. The cube interconnects support efficient data combination for the various database operations, and nonuniform data distributions are handled by dynamically redistributing data utilizing these interconnections. Selection and scalar aggregation operations are easily supported. An algorithm for the join operation is discussed in some detail. The cube is compared with another multicomputer database machine, SM3, and the performance of the join operation in these systems is described. The join performance in a cube is comparable to that in SM3 even when the cube is assumed to have a nonuniform data distribution.
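
The redistribution step referred to here amounts to hash-partitioning both relations on the join attribute so that matching tuples land on the same node, followed by a purely local join. The generic sketch below shows that pattern; it is not the paper's cube-specific algorithm.

    # Join by redistribution: hash both relations on the join key, then join locally.
    def redistribute(relation, key_index, num_nodes):
        buckets = [[] for _ in range(num_nodes)]
        for tup in relation:
            buckets[hash(tup[key_index]) % num_nodes].append(tup)
        return buckets

    def parallel_join(r, s, num_nodes=4):
        r_parts = redistribute(r, 0, num_nodes)
        s_parts = redistribute(s, 0, num_nodes)
        result = []
        for node in range(num_nodes):             # each node joins its own partition
            local = {}
            for tup in r_parts[node]:
                local.setdefault(tup[0], []).append(tup)
            for tup in s_parts[node]:
                for match in local.get(tup[0], []):
                    result.append(match + tup[1:])
        return result

    print(parallel_join([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')]))   # -> [(1, 'a', 'x')]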


Journal ArticleDOI
TL;DR: A specialized programming language, called Apply, is developed, which reduces the problem of writing the algorithm for this class of programs to the task ofWriting the function to be applied to a window around a single pixel.
Abstract: Low-level vision is particularly amenable to implementation on parallel architectures, which offer an enormous speedup at this level. To take advantage of this, the algorithm must be adapted to the particular parallel architecture. Having to adapt programs in this way poses a significant barrier to the vision programmer, who must learn and practice a different method of parallelization for each different parallel machine. There is also no possibility of portability for programs written for a particular parallel architecture. We have developed a specialized programming language, called Apply, which reduces the problem of writing the algorithm for this class of programs to the task of writing the function to be applied to a window around a single pixel. Apply provides a method for programming these applications which is easy, consistent, and efficient. Apply is programming model specific (it implements the input partitioning model) but is architecture independent. It is possible to implement versions of Apply which run efficiently on a wide variety of computers. We describe implementations of Apply on Warp, various Warp-like architectures, UNIX, and the Hughes HBA, and sketch implementations on bit-serial processor arrays and distributed memory machines.
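
The programming model is easy to illustrate: the user writes only the function applied to a window around one pixel, and the system supplies the iteration (and, on a distributed memory machine, the input partitioning with overlapping borders). The sketch below is in that spirit; the names are hypothetical and it is not Apply syntax.

    # The framework loops over pixels; the user writes only the window function.
    def run_window_op(image, window_fn, radius=1):
        h, w = len(image), len(image[0])
        out = [[0] * w for _ in range(h)]
        for y in range(radius, h - radius):
            for x in range(radius, w - radius):
                window = [row[x - radius:x + radius + 1]
                          for row in image[y - radius:y + radius + 1]]
                out[y][x] = window_fn(window)
        return out

    def box_blur(window):                 # the only code the user has to write
        return sum(sum(row) for row in window) // 9

    image = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
    print(run_window_op(image, box_blur))  # interior pixels become 4, borders stay 0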

Journal ArticleDOI
TL;DR: A method is described that allows conditioning the response of a linear distributed memory on a variable context; its capacity for the conditional extraction of features from a complex perceptual input, its capacity to perform quasi-logical operations, and the potential importance of its capacity to establish arbitrary contexts are evaluated.
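
One common way to realize this kind of context conditioning in a linear distributed (correlation-matrix) memory is to build the key from the input together with a context vector, so the same input retrieves different outputs under different contexts. The toy sketch below illustrates that general idea only; it is not necessarily the authors' formulation.

    # Linear distributed memory with the context appended to the key.
    def store(M, key, value):
        for i, v in enumerate(value):
            for j, k in enumerate(key):
                M[i][j] += v * k              # outer-product (Hebbian) update

    def recall(M, key):
        return [1 if sum(m * k for m, k in zip(row, key)) > 0 else -1 for row in M]

    def make_key(inp, ctx):
        return inp + ctx                      # conditioning: context is part of the key

    dim_in, dim_ctx, dim_out = 4, 2, 3
    M = [[0.0] * (dim_in + dim_ctx) for _ in range(dim_out)]
    x = [1, -1, 1, -1]
    store(M, make_key(x, [1, -1]), [1, 1, -1])     # same input, context A
    store(M, make_key(x, [-1, 1]), [-1, 1, 1])     # same input, context B
    print(recall(M, make_key(x, [1, -1])))         # -> [1, 1, -1]
    print(recall(M, make_key(x, [-1, 1])))         # -> [-1, 1, 1]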

Book ChapterDOI
12 Jun 1989
TL;DR: The realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory, is described, and results of an implementation of PAM on an Occam/Transputer system are given.
Abstract: Programmed graph reduction has been shown to be an efficient implementation technique for lazy functional languages on sequential machines. Considering programmed graph reduction as a generalization of conventional environment-based implementations, where the activation records are allocated in a graph instead of on a stack, it becomes very easy to use this technique for the execution of functional programs on a parallel machine with distributed memory. We describe in this paper the realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory. Results of our implementation of PAM on an Occam/Transputer system are given.

Patent
29 Nov 1989
TL;DR: In parallel processing computational apparatus, a switching network employing both routing switch elements and concentrator elements efficiently couples bit serial messages from a multiplicity of processors to a multiplicity of memory modules.
Abstract: In parallel processing computational apparatus, a switching network employing both routing switch elements and concentrator elements efficiently couples bit serial messages from a multiplicity of processors to a multiplicity of memory modules. The apparatus operates in a highly synchronous mode in which all processors issue memory requests only at essentially the same predetermined time within a frame interval encompassing a predetermined substantial number of clock periods. The routing switch elements and concentrator elements incorporate circuitry for comparing the addresses of requests which may be blocked at any element with requests which get through and, if the addresses are the same, returning the memory response to all processors seeking the same memory location.

Journal ArticleDOI
01 Nov 1989
TL;DR: The asynchronous fast adaptive composite grid method (AFAC) is developed here as a method that can process refinement levels in parallel while maintaining full multilevel convergence speeds.
Abstract: Several mesh refinement methods exist for solving partial differential equations that make efficient use of local grids on scalar computers. On distributed memory multiprocessors, such methods benefit from their tendency to create multiple refinement regions, yet they suffer from the sequential way that the levels of refinement are treated. The asynchronous fast adaptive composite grid method (AFAC) is developed here as a method that can process refinement levels in parallel while maintaining full multilevel convergence speeds. In the present paper, we develop a simple two-level AFAC theory and provide estimates of its asymptotic convergence factors as it applies to very large scale examples. In a companion paper, we report on extensive timing results for AFAC, implemented on an Intel iPSC hypercube.