
Showing papers on "Distributed memory published in 1986"


Proceedings ArticleDOI
Kai Li1, Paul Hudak1
01 Nov 1986
TL;DR: Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely-coupled multiprocessor.
Abstract: This paper studies the memory coherence problem in designing and implementing a shared virtual memory on loosely-coupled multiprocessors. Two classes of algorithms for solving the problem are presented. A prototype shared virtual memory on an Apollo ring has been implemented based on these algorithms. Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely-coupled multiprocessor.

580 citations
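The coherence algorithms the paper classifies can be sketched at a small scale. The sketch below assumes a centralized-manager style protocol; the `Manager` class, method names, and per-page copyset bookkeeping are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal single-process sketch of a centralized-manager page-coherence
# protocol for a shared virtual memory. Names and structure are illustrative
# assumptions, not the paper's actual code.

class Manager:
    def __init__(self, num_pages):
        self.owner = {p: 0 for p in range(num_pages)}        # current owner of each page
        self.copyset = {p: set() for p in range(num_pages)}  # readers holding a copy

    def read_fault(self, page, requester):
        # Requester gets a read-only copy from the owner; track it in the copyset.
        self.copyset[page].add(requester)
        return self.owner[page]

    def write_fault(self, page, requester):
        # Invalidate all read copies, then transfer ownership to the writer.
        invalidated = self.copyset[page] - {requester}
        self.copyset[page] = set()
        self.owner[page] = requester
        return invalidated

mgr = Manager(num_pages=4)
assert mgr.read_fault(page=2, requester=1) == 0     # copy comes from owner 0
assert mgr.write_fault(page=2, requester=3) == {1}  # reader 1 must be invalidated
assert mgr.owner[2] == 3
```

A distributed-manager variant would shard the `owner`/`copyset` tables across processors instead of keeping them in one place.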


01 Jan 1986
TL;DR: A burial vault apparatus for the interment of bodies either partially or entirely below ground level, and such vault is provided with structure to resist vertical movement.
Abstract: A burial vault apparatus for the interment of bodies either partially or entirely below ground level. The vault is constructed of corrugated lightweight material which is airtight and moisture-proof and from which air may be evacuated, and such vault is provided with structure to resist vertical movement.

381 citations


Patent
10 Feb 1986
TL;DR: In this article, a system for dynamically partitioning processors in a multiprocessor system intercoupled by a network utilizes, in association with each processor, a network accessible, locally changeable memory section.
Abstract: A system for dynamically partitioning processors in a multiprocessor system intercoupled by a network utilizes, in association with each processor, a network accessible, locally changeable memory section. An available one of a number of common dynamic group addresses in each of the memories is reserved for a subgroup for the performance of subtasks within an overall task, and members of the group are designated as they receive messages to be processed. The members then locally update status words which establish membership, group validity and semaphore conditions, so that transactions may be initiated, coordinated and terminated with minimum involvement of processors that have no relevant subtasks. When the full task is completed the dynamic group is relinquished for use when a new task is to be undertaken. The system enables many tasks to be carried out concurrently with higher intercommunication efficiency.

147 citations


Journal ArticleDOI
01 May 1986
TL;DR: It is observed that to obtain this limited factor of 10-fold speed-up, it is necessary to exploit parallelism at a very fine granularity, and it is proposed that a suitable architecture to exploit such fine-grain parallelism is a bus-based shared-memory multiprocessor with 32-64 processors.
Abstract: Rule-based systems, on the surface, appear to be capable of exploiting large amounts of parallelism—it is possible to match each rule to the data memory in parallel. In practice, however, we show that the speed-up from parallelism is quite limited, less than 10-fold. The reasons for the small speed-up are: (1) the small number of rules relevant to each change to data memory; (2) the large variation in the processing required by the relevant rules; and (3) the small number of changes made to data memory between synchronization steps. Furthermore, we observe that to obtain this limited factor of 10-fold speed-up, it is necessary to exploit parallelism at a very fine granularity. We propose that a suitable architecture to exploit such fine-grain parallelism is a bus-based shared-memory multiprocessor with 32-64 processors. Using such a multiprocessor (with individual processors working at 2 MIPS), it is possible to obtain execution speeds of about 3800 rule-firings/sec. This speed is significantly higher than that obtained by other proposed parallel implementations of rule-based systems.

121 citations


Journal ArticleDOI
TL;DR: The algorithm is faster than the traditional locked-counter approach for two processors, has an attractive log2 N time scaling for larger N, and requires a shared memory bandwidth which grows linearly with N, the number of participating processors.
Abstract: We describe an algorithm for barrier synchronization that requires only reads and writes to shared store. The algorithm is faster than the traditional locked-counter approach for two processors and has an attractive log2 N time scaling for larger N. The algorithm is free of hot spots and critical regions and requires a shared memory bandwidth which grows linearly with N, the number of participating processors. We verify the technique using both a real shared memory multiprocessor, for numbers of processors up to 30, and a shared memory multiprocessor simulator, for numbers of processors up to 256.

111 citations
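A log2 N barrier built from only reads and writes can be sketched with a butterfly pairing scheme in the spirit of the abstract; the flag layout and the Python threading harness below are illustrative assumptions, and the sketch is one-shot (reusable versions need per-episode sense flags).

```python
# One-shot butterfly barrier using only reads and writes of shared flags:
# in round k, thread tid pairs with tid XOR 2^k, sets its own flag for the
# round, then spins until the partner's flag for that round is set.
import math
import threading

def butterfly_barrier(flags, tid, nthreads):
    rounds = int(math.log2(nthreads))
    for k in range(rounds):
        partner = tid ^ (1 << k)
        flags[k][tid] = True
        while not flags[k][partner]:
            pass                      # busy-wait: reads only, no locks

N = 4
flags = [[False] * N for _ in range(int(math.log2(N)))]
order = []

def worker(tid):
    butterfly_barrier(flags, tid, N)
    order.append(tid)                 # reached only after all N threads arrive

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(order) == [0, 1, 2, 3]
```

Each thread touches a different flag word per round, which is what keeps the scheme free of hot spots.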


Journal ArticleDOI
01 May 1986
TL;DR: This paper shows how the VMP design provides the high memory bandwidth required by modern high-performance processors with a minimum of hardware complexity and cost, and describes simple solutions to the consistency problems associated with virtually addressed caches.
Abstract: VMP is an experimental multiprocessor that follows the familiar basic design of multiple processors, each with a cache, connected by a shared bus to global memory. Each processor has a synchronous, virtually addressed, single master connection to its cache, providing very high memory bandwidth. An unusually large cache page size and fast sequential memory copy hardware make it feasible for cache misses to be handled in software, analogously to the handling of virtual memory page faults. Hardware support for cache consistency is limited to a simple state machine that monitors the bus and interrupts the processor when a cache consistency action is required. In this paper, we show how the VMP design provides the high memory bandwidth required by modern high-performance processors with a minimum of hardware complexity and cost. We also describe simple solutions to the consistency problems associated with virtually addressed caches. Simulation results indicate that the design achieves good performance provided that data contention is not excessive.

99 citations


Proceedings ArticleDOI
01 Nov 1986

93 citations


Patent
05 May 1986
TL;DR: In this paper, the authors present a multiprocessor system architecture in which two processors are each provided with an autonomous bus and the two buses can be selectively connected to each other to form a unique system bus which enables access by all processors to common memory resources connected to one of the autonomous buses.
Abstract: A multiprocessor system architecture in which two processors are each provided with an autonomous bus and the two buses can be selectively connected to each other to form a unique system bus which enables access by all processors to common memory resources connected to one of the autonomous buses. The communication between processors takes place through messages stored into mailboxes located in the common memory. The presence of a message is evidenced by a notify/interrupt signal generated by a logic unit whose status each processor can modify and verify using its own autonomous bus, without interfering with operations on the other processor's autonomous bus. Such verification and access require neither access to common memory resources nor polling operations to verify the status of messages stored into "mailboxes".

86 citations


Patent
Steve S. Chen1, Alan J. Schiffleger1
30 Apr 1986
TL;DR: In this article, the authors propose a multiprocessor with a plurality of shared registers for scalar and address information and registers for coordinating the transfer of information through the shared registers, which allow independent tasks of different jobs or related tasks of a single job to be run concurrently.
Abstract: A pair of processors are each connected to a central memory through a plurality of memory reference ports. The processors are further each connected to a plurality of shared registers which may be directly addressed by either processor at rates commensurate with intra-processor operation. The shared registers include registers for holding scalar and address information and registers for holding information to be used in coordinating the transfer of information through the shared registers. A multiport memory is provided and includes a conflict resolution circuit which senses and prioritizes conflicting references to the central memory. Each CPU is interfaced with the central memory through three ports, with each of the ports handling different ones of several different types of memory references which may be made. At least one I/O port is provided to be shared by the processors in transferring information between the central memory and peripheral storage devices. A vector register design is also disclosed for use in vector processing computers, and provides that each register consist of at least two independently addressable memories, to deliver data to or accept data from a functional unit. The method of multiprocessing permits multitasking in the multiprocessor, in which the shared registers allow independent tasks of different jobs or related tasks of a single job to be run concurrently, and facilitate multithreading of the operating system by permitting multiple critical code regions to be independently synchronized.

82 citations


Journal ArticleDOI
TL;DR: A simulation study of the CRAY X-MP interleaved memory system with attention focused on steady state performance for sequences of vector operations, identifying the occurrence of linked conflicts, repeating sequences of conflicts between two or more vector streams that result in reduced steadyState performance.
Abstract: One of the significant differences between the CRAY X-MP and its predecessor, the CRAY-1S, is a considerably increased memory bandwidth for vector operations. Up to three vector streams in each of the two processors may be active simultaneously. These streams contend for memory banks as well as data paths. All memory conflicts are resolved dynamically by the memory system. This paper describes a simulation study of the CRAY X-MP interleaved memory system with attention focused on steady state performance for sequences of vector operations. Because it is more amenable to analysis, we first study the interaction of vector streams issued from a single processor. We identify the occurrence of linked conflicts, repeating sequences of conflicts between two or more vector streams that result in reduced steady state performance. Both worst case and average case performance measures are given. The discussion then turns to dual processor interactions. Finally, based on our simulations, possible modifications to the CRAY X-MP memory system are proposed and compared. These modifications are intended to eliminate or reduce the effects of linked conflicts.

81 citations
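The kind of bank contention such a study simulates can be illustrated with a toy interleaved-memory model; the bank count and recovery time below are illustrative parameters, not CRAY X-MP timings.

```python
# Toy model of an interleaved memory: element i of a stride-s vector stream
# maps to bank (i*s) % nbanks, and a bank is busy for `bank_time` cycles
# after each access. Parameters are illustrative only.

def stream_cycles(n, stride, nbanks=16, bank_time=4):
    ready = [0] * nbanks                     # cycle at which each bank is next free
    cycle = 0
    for i in range(n):
        bank = (i * stride) % nbanks
        cycle = max(cycle + 1, ready[bank])  # stall if the bank is still busy
        ready[bank] = cycle + bank_time
    return cycle

# Stride 1 cycles through all 16 banks: about one access per cycle.
# Stride 8 reuses only 2 banks: throughput limited by bank recovery time.
assert stream_cycles(64, 1) < stream_cycles(64, 8)
```

Extending the model with several concurrent streams contending for the same `ready` array is one way to reproduce the repeating "linked conflict" patterns the paper identifies.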



Patent
Hironobu Sakata1
16 Jun 1986
TL;DR: In this article, a multi-processing device includes three or more processing systems, each having a processor and a corresponding main memory connected to each other by means of an individual memory bus.
Abstract: A multi-processing device includes three or more processing systems, each having a processor and a corresponding main memory connected to each other by means of an individual memory bus. The multi-processing device also includes a common memory bus connectable to all the processors and all the main memories of the respective systems, an asynchronism detection circuit connected to the respective processors to produce an asynchronism detection signal indicating which system or systems are in an asynchronous state, and a device control circuit responsive to the asynchronism detection signal to send a common memory bus select signal to the main memory of each failed system to change its bus connection from the individual memory bus to the common memory bus. The device control circuit also generates a master designation signal for allowing an arbitrary processor of the normal non-faulty systems to be designated as a master processor, and a copy request signal to the respective processors. The copy request signal causes the master processor to copy the content of the main memory of the normal system to the main memory of each failed system. When the synchronization between the respective systems is established, the device control circuit outputs a restart request signal to the respective processors, thus initiating the execution from a fixed, stored address in a control memory of each processor to enable synchronous starting of all of the processors. The multi-processing device further includes a communication control circuit connected to the common memory bus, thus permitting parallel loading of an initial program to the main memories of the respective systems for achieving recovery in the case where all the systems are asynchronous with each other.

Proceedings Article
01 Dec 1986
TL;DR: The goal was to analyze the tradeoffs between the shared memory model, as exemplified by the BBN Uniform System package, and a simple message-passing model, and the two models are compared with respect to performance, scalability, and ease of programming.
Abstract: The BBN Butterfly Parallel Processor can support a user model of computation that is based on either shared memory or message-passing. A description is given of the results of experiments with the message-passing model. The goal was to analyze the tradeoffs between the shared memory model, as exemplified by the BBN Uniform System package, and a simple message-passing model. The two models are compared with respect to performance, scalability, and ease of programming. It is concluded that the particular model of computation used is less important than how well it is matched to the application.

Patent
11 Aug 1986
TL;DR: In this article, a modular switching system for connecting a plurality of digital computers to the same computer memory is presented, employing interconnecting digital switching modules, each of which has the capability of recognizing an access request to a computer memory by recognizing a generated address within a range of addresses assigned to that computer memory.
Abstract: A modular switching system for connecting a plurality of digital computers to a plurality of computer memories, employing interconnecting digital switching modules. Each module has the capability of recognizing an access request to a computer memory by recognizing a generated address within a range of addresses assigned to that computer memory. The digital switching modules are interconnected with a priority network which permits arbitration between simultaneous requests for access by several digital computers to the same computer memory.

Patent
02 Apr 1986
TL;DR: In this paper, the authors describe a computer system where two independent processors communicate via a bus system and operate substantially concurrently, each computer having its own operating system software and share a common memory.
Abstract: A computer system is described wherein two independent processors communicate via a bus system and operate substantially concurrently, each having its own operating system software while sharing a common memory. The architecture of the computer system is such that one of the processors is allocated the bulk of memory bandwidth with the other processor taking the remainder. Arbitration for memory allocation is accomplished via a combination of a new firmware instruction and a semaphore.

Proceedings ArticleDOI
01 May 1986
TL;DR: It is shown that the logical problem of buffering is directly related to the problem of synchronization, and a simple model is presented to evaluate the performance improvement resulting from buffering.
Abstract: In highly-pipelined machines, instructions and data are prefetched and buffered in both the processor and the cache. This is done to reduce the average memory access latency and to take advantage of memory interleaving. Lock-up free caches are designed to avoid processor blocking on a cache miss. Write buffers are often included in a pipelined machine to avoid processor waiting on writes. In a shared memory multiprocessor, there are more advantages in buffering memory requests, since each memory access has to traverse the memory-processor interconnection and has to compete with memory requests issued by different processors. Buffering, however, can cause logical problems in multiprocessors. These problems are aggravated if each processor has a private memory in which shared writable data may be present, such as in a cache-based system or in a system with a distributed global memory. In this paper, we analyze the benefits and problems associated with the buffering of memory requests in shared memory multiprocessors. We show that the logical problem of buffering is directly related to the problem of synchronization. A simple model is presented to evaluate the performance improvement resulting from buffering.
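The logical problem that buffering creates can be illustrated with a toy store-buffer model: each processor's buffered writes are invisible to the other until flushed, so a Dekker-style flag handshake misbehaves without a synchronization point. Everything below is an illustrative simulation, not the paper's model.

```python
# Toy simulation of buffered writes in a shared-memory multiprocessor: each
# CPU's writes sit in a private store buffer until flushed, so a Dekker-style
# flag handshake can observe both flags as 0 unless a flush (fence) happens
# before the read. Illustrative sketch only.

memory = {"flag0": 0, "flag1": 0}

class CPU:
    def __init__(self):
        self.store_buffer = []                   # pending (addr, value) writes

    def write(self, addr, value):
        self.store_buffer.append((addr, value))  # buffered, not yet visible

    def read(self, addr):
        # A processor sees its own buffered writes; other CPUs do not.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return memory[addr]

    def flush(self):                             # acts as a fence: drain to memory
        for a, v in self.store_buffer:
            memory[a] = v
        self.store_buffer.clear()

p0, p1 = CPU(), CPU()
p0.write("flag0", 1)                             # each CPU raises its flag...
p1.write("flag1", 1)
# ...but without a flush, each still reads the other's flag as 0:
assert p0.read("flag1") == 0 and p1.read("flag0") == 0
p0.flush(); p1.flush()
assert p0.read("flag1") == 1 and p1.read("flag0") == 1
```

This is exactly why the paper ties the legality of buffering to synchronization: buffers may only be drained (or bypassed) at synchronization points without changing program outcomes.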

Patent
15 Dec 1986
TL;DR: In this paper, the data memory existing in the state save unit has pairs of memory words, in which the state at the recovery point last reached is in each case saved in one of the memory words and the current changes are recorded in the respective other memory word.
Abstract: When a fault is detected in a processor, program execution in this processor is interrupted and taken up again by a standby processor from an earlier uncorrupted state, a recovery point. Such recovery points are specially provided in the program. A save copy of the state at the recovery points is created in each case in a state save unit by recording changes compared with the respective previous state. The data memory existing in the state save unit has pairs of memory words, in which arrangement the state at the recovery point last reached is in each case saved in one of the memory words and the current changes are in each case recorded in the respective other memory word. The memory words are accessed via pointers which are formed by a control logic from two check bits allocated to each memory word pair. Processing of the check bits and of the pointers is very fast. It is not necessary to copy data within the state save unit. The standby processor has direct access to the saved data.

Patent
15 Sep 1986
TL;DR: In this paper, a system for error correction in the reading and writing of data to memory in a multiprocessor environment such as a parallel processor is presented, which effectively treats the data for plural memories associated with plural processors as a single data word and generates a single error correcting code for that combined data word.
Abstract: A system for error correction in the reading and writing of data to memory in a multiprocessor environment such as a parallel processor. The data written to and read from memory for each processor is channeled through a single error correcting system which effectively treats the data for plural memories associated with plural processors as a single data word and generates a single error correcting code for that combined data word. By applying a single error correcting methodology to a plurality of memories and associated processors, far greater efficiency is achieved in the parallel processor environment. The read and write operations for the plural memories must be accomplished substantially simultaneously in order that the read and write operations can be treated as acting on a single word and a single error correcting code generated. This ideally suits the system for use in parallel processor environments where the processing function is distributed over a multiplicity of processors and associated memories, acting in parallel.

Journal ArticleDOI
01 Jan 1986
TL;DR: As the cost of computer memory continues to decline faster than that of processors, it may be realistic to effectively apply pattern recognition methodology to security evaluation of an electric power system with a modest level of memory requirement.
Abstract: As the cost of computer memory continues to decline faster than that of processors, it may be realistic to effectively apply pattern recognition methodology to security evaluation of an electric power system. Efficient implementation techniques are developed to achieve assessment in real time with a modest level of memory requirement. The basic idea is to assess the unknown security of a particular operating state by matching it against stored knowledge about similar operating patterns. Two efficient data structures are proposed here for its implementation. First, a distributed memory device, an associative memory, is developed for recognition. This particular memory is found to be capable of parallel pattern matching along with reduced computer storage. Second, for an efficient implementation of the memory structure, these associative memories are configured in a hierarchical structure which not only expands storage capacity but also utilizes the speed of tree search. This structure provides the basis for an error-free, rapid, and memory-saving recognition algorithm.

Patent
11 Aug 1986
TL;DR: In this article, a distributed processing system where a series of processes are carried out by distributed processors connected through a network without provision of a control processor which controls the overall system, an arbitrary processor generates test information and, if test-run is carried out, send out the result of the test run program into the network when necessary.
Abstract: In a distributed processing system where a series of processes are carried out by distributed processors connected through a network without provision of a control processor which controls the overall system, an arbitrary processor generates test information and the processors, each having a memory for storing a program to be tested, decide whether or not the program is to be test-run in accordance with the test information and, if test-run is carried out, send out the result of the test-run program into the network when necessary.

Journal ArticleDOI
TL;DR: A framework for synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform and Strassen's matrix multiplication is presented, based on an algebraic representation of the algorithms, which involves the tensor (Kronecker) product and other matrix operations.

Proceedings ArticleDOI
13 Feb 1986
TL;DR: This paper contains some new results which show that both of the above functions, viz. formation of the internal representations and their storage, can be implemented simultaneously by an adaptive, massively parallel, self-organizing network.
Abstract: Information processing in future computers as well as in higher animals must refer to a complicated knowledge base which is somewhat vaguely called memory. Especially if one is dealing with natural data such as images and sounds, one has to realize the two aspects to be discussed: 1. The internal representations of sensory information in the computing networks. 2. The memory mechanism itself. Most of the experimental and theoretical works have concentrated on the latter problem, which might be named the "back-end" problem of memory. This paper contains some new results which show that both of the above functions, viz. formation of the internal representations and their storage, can be implemented simultaneously by an adaptive, massively parallel, self-organizing network.

Journal ArticleDOI
01 May 1986
TL;DR: Simulation of the Concert RingBus and arbiter show their performance to lie between that of a crossbar switch and a simple shared intercluster bus.
Abstract: Concert is a shared-memory multiprocessor testbed intended to facilitate experimentation with parallel programs and programming languages. It consists of up to eight clusters, with 4-8 processors in each cluster. The processors in each cluster communicate using a shared bus, but each processor also has a private path to some memory. The novel feature of Concert is the RingBus, a segmented bus in the shape of a ring that permits communication between clusters at relatively low cost. Efficient arbitration among requests to use the RingBus is a major challenge, which is met by a novel hardware organization, the criss-cross arbiter. Simulation of the Concert RingBus and arbiter show their performance to lie between that of a crossbar switch and a simple shared intercluster bus.

Patent
17 Jan 1986
TL;DR: In this paper, a plurality of processors use a common memory under a time division control mode by way of a time-division data bus and flip-flops are mounted for holding respective write permission flags.
Abstract: A plurality of processors use a common memory under a time division control mode by way of a time division data bus. In the multiprocessor system, flip-flops are mounted for holding respective write permission flags. Also, a logic gate is employed, operative to allow the processor to write data in the common memory when both the write permission flag and the write request signal from the processor are generated simultaneously. Further, multiplexers are used so that the write operation can be achieved under the time division control mode.

Journal ArticleDOI
W Oed, O Lange1
01 Oct 1986
TL;DR: Some analytical results regarding the access in vector mode to an interleaved memory system and the number and type of memory conflicts that were encountered are presented.
Abstract: Memory interleaving and multiple access ports are the key to a high memory bandwidth in vector processing systems. Each of the active ports supports an independent access stream to memory among which access conflicts may arise. Such conflicts lead to a decrease in memory bandwidth and consequently to longer execution times. We present some analytical results regarding the access in vector mode to an interleaved memory system. In order to demonstrate the practical effects of our analytical results we have done time measurements of some simple vector loops on a 2-CPU, 16-bank CRAY X-MP. By corresponding simulations we obtained the number and type of memory conflicts that were encountered.

Patent
07 May 1986
TL;DR: In this paper, a high-speed, intelligent, distributed control memory system is described, which is comprised of an array of modular, cascadable, integrated circuit devices, referred to as "memory elements".
Abstract: A high-speed, intelligent, distributed control memory system is comprised of an array of modular, cascadable, integrated circuit devices, hereinafter referred to as "memory elements." Each memory element is further comprised of storage means, programmable on-board processing ("distributed control") means and means for interfacing with both the host system and the other memory elements in the array utilizing a single shared bus. Each memory element of the array is capable of transferring (reading or writing) data between adjacent memory elements once per clock cycle. In addition, each memory element is capable of broadcasting data to all memory elements of the array once per clock cycle. This ability to asynchronously transfer data between the memory elements at the clock rate, using the distributed control, facilitates unburdening host system hardware and software from tasks more efficiently performed by the distributed control. As a result, the memory itself can, for example, perform such tasks as sorting and searching, even across memory element boundaries, in a manner which conserves host system resources and is faster and more efficient than using them.

Proceedings ArticleDOI
01 Nov 1986
TL;DR: It is shown that there are a variety of definitions which can reasonably be applied to what a process can know about the global state, and the first proof methods for proving knowledge assertions are presented.
Abstract: The importance of the notion of knowledge in reasoning about distributed systems has been recently pointed out by several works. It has been argued that a distributed computation can be understood and analyzed by considering how it affects the state of knowledge of the system. We show that there are a variety of definitions which can reasonably be applied to what a process can know about the global state. We also move beyond the semantic definitions, and present the first proof methods for proving knowledge assertions. Both shared memory and message passing models are considered.

Patent
14 Mar 1986
TL;DR: In this paper, two or more processors (1, 2) with differing address space sizes are used in a multi-processor system, where a portion of the memory means accessible by the first processor (1) with a larger address space size is also accessible by a second processor (2) for elevating a utilization efficiency of memory capacity.
Abstract: Two or more processors (1, 2) with differing address space sizes are used in a multi-processor system. A portion of the memory means (6) accessible by the first processor (1) with the larger address space size is also accessible by the second processor (2), elevating the utilization efficiency of the memory capacity. A window (101) for the memory space of the second processor (2) is provided in the address space of the first processor (1). The memory devices controlled by the second processor (2) are also controlled through this window (101) by the first processor (1), further elevating the utilization efficiency of the memory devices.

Journal ArticleDOI
TL;DR: In this article, a distributed algorithm for distributed match-making in store-and-forward computer networks is presented, and the theoretical limitations of distributed matchmaking are established, and techniques are applied to several network topologies.
Abstract: In the very large multiprocessor systems and, on a grander scale, computer networks now emerging, processes are not tied to fixed processors but run on processors taken from a pool of processors. Processors are released when a process dies, migrates or when the process crashes. In distributed operating systems using the service concept, processes can be clients asking for a service, servers giving a service or both. Establishing communication between a process asking for a service and a process giving that service, without centralized control in a distributed environment with mobile processes, constitutes the problem of distributed match-making. Logically, such a match-making phase precedes routing in store-and-forward computer networks of this type. Algorithms for distributed match-making are developed and their complexity is investigated in terms of message passes and in terms of storage needed. The theoretical limitations of distributed match-making are established, and the techniques are applied to several network topologies.
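One classic technique fitting this framework is a grid-based post/query scheme: on a k*k grid of nodes, a server advertises along its row and a client queries along its column, so the two sets always intersect in exactly one node. The node numbering and helper names below are illustrative assumptions.

```python
# Hedged sketch of a grid technique for distributed match-making: a server
# posts its address along its row, a client queries along its column, and the
# row/column intersection guarantees a match using about 2k messages instead
# of broadcasting to all k*k nodes.

def row_nodes(node, k):
    return {(node // k) * k + j for j in range(k)}

def column_nodes(node, k):
    return {(node % k) + k * i for i in range(k)}

k = 4                                     # 16 nodes, numbered 0..15
posts = {}                                # node -> advertised server address
server_home, client_home = 6, 13

for n in row_nodes(server_home, k):       # server advertises along its row
    posts[n] = "server@6"

# Client queries along its column; the intersection guarantees a hit.
hits = [posts[n] for n in column_nodes(client_home, k) if n in posts]
assert hits == ["server@6"]
assert row_nodes(server_home, k) & column_nodes(client_home, k) == {5}
```

The message cost grows as the square root of the network size, which is the flavor of trade-off whose theoretical limits the paper establishes.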

Proceedings ArticleDOI
01 Dec 1986
TL;DR: The method described is being applied to parallelize the CORBAN combat simulation for operation on a hypercube architecture as part of preliminary feasibility analysis concerning simulation support of Airland Battle Management (ALBM).
Abstract: Parallel processing offers the possibility of greatly increased performance for simulations which are computationally bound on existing machines. On shared memory machines, such as the BBN Butterfly, a natural approach is to allocate entities to be processed on different processors with locks used to prevent synchronization problems for a state space in global memory. Parallel processors having local memory only, such as the hypercube architectures, cannot use this approach. Such machines are potentially less expensive than shared memory architectures with similar local computational power, since the interconnection network is simpler. The most natural simulation paradigm for such machines, object oriented programming with interactions limited to messages, may become communications bound if the entities represented are tightly coupled. This paper presents an alternative approach based on use of replicated state spaces on each processor, and consolidation of these changes on a processor basis rather than an interaction basis to minimize message passing. The effect is to trade a parallel processing synchronization problem for a data consistency problem. The approach then relies only on a message broadcasting or passing architecture. For small degrees of parallelism, this requirement can be met by a variety of architectures. The method described is being applied to parallelize the CORBAN combat simulation for operation on a hypercube architecture as part of preliminary feasibility analysis concerning simulation support of Airland Battle Management (ALBM).