
Showing papers in "ACM Sigarch Computer Architecture News in 1989"


Journal ArticleDOI
TL;DR: This is the first entirely asynchronous microprocessor ever built; the authors are aware that asynchronous techniques may influence computer architects in completely new ways that this first design is only starting to explore.
Abstract: Prejudices are as tenacious in science and engineering as in any other human activity. One of the most firmly held prejudices in digital VLSI design is that asynchronous circuits-a.k.a. self-timed or delay-insensitive circuits-are necessarily slow and wasteful in area and logic. Whereas asynchronous techniques would be appropriate for control, they would be inadequate for data paths because of the cost of dual-rail encoding of data, the cost of generating completion signals for write operations on registers, and the difficulty of designing self-timed buses. Because a general-purpose microprocessor contains a complex data path, a corollary of the previous opinion is that it is impossible to design an efficient asynchronous microprocessor. Since we have been developing a design method for asynchronous circuits that gives excellent results, and since the above objections to large-scale data path designs are genuine but untested, we decided to "pick up the gauntlet" and design a complete processor. The design of an asynchronous microprocessor poses new challenges and opens new avenues to the computer architect. Hence, the experiment unavoidably developed a dual purpose: We are refining an already well-tested design method and we are starting a new series of experiments in asynchronous architectures. (As far as we know, this is the first entirely asynchronous microprocessor ever built.) The results we are reporting have a different implication depending on whether they are related to the first or second goal of the experiment. Whereas we are convinced that our design methods have reached maturity, we are quite aware that asynchronous techniques may influence the computer architects in completely new ways that this first design is just starting to explore.

265 citations


Journal ArticleDOI
TL;DR: It is argued that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications, and empirical results on a 32-processor implementation agree with this prediction.
Abstract: High-performance parallel computers require high-performance file systems. Exotic I/O hardware will be of little use if file system software runs on a single processor of a many-processor machine. We believe that cost-effective I/O for large multiprocessors can best be obtained by spreading both data and file system computation over a large number of processors and disks. To assess the effectiveness of this approach, we have implemented a prototype system called Bridge, and have studied its performance on several data intensive applications, among them external sorting. A detailed analysis of our sorting algorithm indicates that Bridge can profitably be used on configurations in excess of one hundred processors with disks. Empirical results on a 32-processor implementation agree with the analysis, providing us with a high degree of confidence in this prediction. Based on our experience, we argue that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications.

83 citations


Journal ArticleDOI
TL;DR: The purpose of this note is to publish these results, which are quite remarkable because of the speed reached on this first design and, as importantly, because of the surprising robustness of the chips to variations in temperature and VDD voltage values.
Abstract: We have designed the first entirely asynchronous (also called self-timed or delay-insensitive) microprocessor. The design was reported at the Decennial Caltech Conference on VLSI, last March. The conference paper is included here as an appendix. Since the chips had not yet been fabricated at the moment of writing the conference paper, the paper does not include the results of the experiment. The purpose of this note is to publish these results, which are quite remarkable because of the speed reached on this first design, and, as importantly, because of the surprising robustness of the chips to variations in temperature and VDD voltage values.

58 citations


Journal ArticleDOI
TL;DR: The increasing speed of new-generation processors will exacerbate the already large difference between CPU cycle times and main memory access times.
Abstract: The increasing speed of new generation processors will exacerbate the already large difference between CPU cycle times and main memory access times. As this difference grows, it will be increasingl...

58 citations


Journal ArticleDOI
TL;DR: A project on high-performance I/O subsystems at Berkeley, led by Katz, Ousterhout, and Patterson, with overall goals including high performance and high reliability.
Abstract: A Project on High Performance I/O Subsystems. Randy H. Katz, John K. Ousterhout, David A. Patterson, Peter Chen, Ann Chervenak, Rich Drewes, Garth Gibson, Ed Lee, Ken Lutz, Ethan Miller, Mendel Rosenblum. Computer Science Division, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720. 1. Introduction and Overview. Computing is seeing an unprecedented improvement in performance; over the last five years there has been an order-of-magnitude improvement in the speeds of workstation CPUs. At least another order of magnitude seems likely in the next five years, to machines with 100 MIPS or more. DARPA has already launched a program to develop even larger, more powerful machines, executing as many as 10^12 operations per second. Unfortunately, we have seen no comparable breakthroughs in I/O performance; the speeds of I/O devices and the hardware and software architectures for managing them have not changed substantially in many years [Katz 89]. What will unbalanced improvements in performance mean? Twenty years ago, Gene Amdahl was asked to comment about the Illiac-IV. He noted that while the vector portion of programs might run much faster, a major portion of the programs would run essentially at the same speed. In what has come to be known as Amdahl's Law, he observed that no matter how much faster one piece of the program would go over traditional computers, overall performance improvement is limited by the part of the program that is not improved. Without major increases in I/O performance and reliability we think that transaction processing systems, supercomputers, and high-performance workstations will be unable to achieve their true potential. Our research group is pursuing a program of research to develop hardware and software I/O architectures capable of supporting the kinds of internetworked workstations and supercompute servers that will appear in the early 1990s.
The project has three overall goals: High Performance. We are developing new I/O architectures and a prototype system that can scale to achieve significant factors of improvement in I/O performance, relative to today's commercially available I/O systems, for the same cost. We believe this speedup can be achieved using a combination of arrays of inexpensive personal computer disks coupled with a file system that can accommodate both striped and partitioned file organizations. High Reliability. To support the high-performance computing of the mid-1990's, I/O…
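Amdahl's Law, as recounted in the abstract, reduces to a one-line formula; the sketch below is a generic statement of the law, not code from the Berkeley project:

```python
def amdahl_speedup(improved_fraction, factor):
    """Overall speedup when `improved_fraction` of the run time is
    accelerated by `factor` and the remainder is left unchanged."""
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)

# The paper's point: if 10% of the time is (unimproved) I/O, then even
# an arbitrarily fast CPU can never yield more than a 10x speedup.
```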

30 citations


Journal ArticleDOI
TL;DR: IOBENCH has proven to be a very good indicator of system IO performance; Prime proposes that IOBENCH and a standard spectrum of runs be adopted as an industry standard for measuring IO performance.
Abstract: IOBENCH is an operating system and processor independent synthetic input/output (IO) benchmark designed to put a configurable IO and processor (CP) load on the system under test. It is meant to stress the system under test in a manner consistent with the way in which Oracle, Ingres, Prime INFORMATION or other data management products do IO. The IO and CP load is generated by background processes doing as many "transactions" as they can on a specified set of files during a specified time interval. By appropriately choosing and varying the benchmark parameters, IOBENCH can be configured to approximate the IO access patterns of real applications. IOBENCH can be used to compare different hardware platforms, different implementations of the operating system, different disk buffering mechanisms, and so forth. IOBENCH has proven to be a very good indicator of system IO performance. Use of IOBENCH has enabled us to pinpoint operating system bugs and bottlenecks. IOBENCH currently runs on PRIMOS and a number of UNIX systems; this paper discusses the UNIX versions. IOBENCH can be ported to a new platform in a few days. Prime proposes that IOBENCH and a standard spectrum of runs be adopted as an industry standard for measuring IO performance. Sources and documentation for IOBENCH will be made available free of charge.
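The kind of time-boxed, configurable "transaction" load the abstract describes can be sketched as follows; the function and parameter names are illustrative, not IOBENCH's actual interface:

```python
import os
import random
import tempfile
import time

def synthetic_io_load(path, file_size, record_size, duration_s):
    """Issue as many random-record read 'transactions' as possible
    within duration_s, IOBENCH-style. Parameter names are
    illustrative, not IOBENCH's actual ones."""
    transactions = 0
    deadline = time.monotonic() + duration_s
    records = file_size // record_size
    with open(path, "rb") as f:
        while time.monotonic() < deadline:
            f.seek(random.randrange(records) * record_size)
            f.read(record_size)
            transactions += 1
    return transactions

# Build a small scratch file and run a brief load against it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(64 * 1024))
count = synthetic_io_load(tmp.name, 64 * 1024, 512, 0.1)
os.unlink(tmp.name)
```

The real benchmark runs many such processes concurrently and mixes reads with writes; the single-process read loop above only shows the core measurement idea.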

25 citations


Journal ArticleDOI
TL;DR: It is concluded that, for a given number of disk drives, both RAID 1 and RAID 5 have acceptable performance in a read environment, while RAID 5 degrades significantly in an update-intensive environment.
Abstract: Large arrays of disks have been proposed as a way to meet the need for increasing IO bandwidth. This paper examines disk array performance in a random IO environment. It also presents the results of performance testing using the Prime IOBENCH™ benchmark on a combination of disk striping, RAID 1, and RAID 5 disk arrays. It concludes that, for a given number of disk drives, both RAID 1 and RAID 5 have acceptable performance in a read environment, while RAID 5 degrades significantly in an update-intensive environment.
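The read/update asymmetry the paper reports follows from the small-write penalty of parity arrays: a RAID 5 small write requires a read-modify-write of data and parity (four disk I/Os), where RAID 1 needs two. A back-of-envelope model (a textbook simplification, not the paper's measurements):

```python
def random_iops(n_disks, per_disk_iops, level, workload):
    """Back-of-envelope random-I/O throughput for a disk array.
    A simplified textbook model, not the paper's measured results."""
    total = n_disks * per_disk_iops
    if workload == "read":
        return total         # both levels service a read with one disk I/O
    if level == "raid1":
        return total / 2     # each small write hits both mirror copies
    if level == "raid5":
        return total / 4     # read-modify-write: 2 reads + 2 writes
    raise ValueError(level)
```

With eight 100-IOPS drives, both levels read at the full 800 IOPS, but random updates fall to 400 for RAID 1 and 200 for RAID 5, matching the qualitative conclusion above.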

25 citations


Journal ArticleDOI
Sakai S., Yamaguchi Y., Hiraki K., Kodama Y., Yuba T.
TL;DR: The EM-4, a highly parallel (more than a thousand processing elements) dataflow machine now under development, is designed to achieve high performance using a compact architecture.
Abstract: A highly parallel (more than a thousand) dataflow machine EM-4 is now under development. The EM-4 design principle is to construct a high performance computer using a compact architecture by overco...

15 citations


Journal ArticleDOI
TL;DR: A review of historical machines demonstrates the need for a more comprehensive categorization than previously published and reveals the historical firsts of I/O interrupts in the NBS DYSEAC, DMA in the IBM SAGE (AN/FSQ-7), the interrupt vector concept in the Lincoln Labs TX-2, and fully symmetric I/o in the Burroughs D-825 multiprocessor.
Abstract: A new taxonomy for I/O systems is proposed that is based on the program sequencing necessary for the control of I/O devices. A review of historical machines demonstrates the need for a more comprehensive categorization than previously published and reveals the historical firsts of I/O interrupts in the NBS DYSEAC, DMA in the IBM SAGE (AN/FSQ-7), the interrupt vector concept in the Lincoln Labs TX-2, and fully symmetric I/O in the Burroughs D-825 multiprocessor.

14 citations


Journal ArticleDOI
TL;DR: Results of a performance evaluation of several parallel disk organizations are presented, along with a characterization of the disk systems.
Abstract: In this paper, several issues related to designing a parallel disk system are discussed. Results of performance evaluation of several parallel disk organizations are presented. A characterization of the disk systems is also presented. Issues such as scalability and networking are discussed. Several problems for future research on improving I/O performance are pointed out.

13 citations


Journal ArticleDOI
TL;DR: A hardware mechanism for supporting priority queues is described that provides the primitive operations on a priority queue; the principal mechanisms for attaining this goal are a fast fully parallel Content Associative Memory (CAM), a high-speed priority memory, and a multiple-response resolver.
Abstract: In this paper a hardware mechanism for supporting priority queues is described that provides the primitive operations on a priority queue. The principal mechanisms for attaining this goal are a fast fully parallel Content Associative Memory (CAM), a high-speed priority memory, and a good multiple-response resolver.
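The primitive operations such hardware exposes can be modeled in software; the caveat is that the CAM-based design performs them in essentially constant time through parallel comparison, while this heap-based model takes O(log n) per operation:

```python
import heapq

class PriorityQueueModel:
    """Software model of the primitives the proposed hardware provides:
    insert and remove-highest-priority. The hardware does these via a
    parallel CAM lookup; this heap model only mirrors the semantics."""

    def __init__(self):
        self._heap = []
        self._seq = 0                      # tie-breaker for equal priorities

    def insert(self, priority, item):
        heapq.heappush(self._heap, (priority, self._seq, item))
        self._seq += 1

    def remove_highest(self):
        """Remove and return the item with the smallest priority value."""
        return heapq.heappop(self._heap)[2]
```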

Journal ArticleDOI
TL;DR: This paper presents an information exchange process between main memory and cache-equipped processors that uses serial multiport memories and high-throughput serial transmission links, generating a family of architectures in which serial transfers of information are parallelized.
Abstract: This paper presents an inventive information exchange process between the main memory and cache-equipped processors. It makes use of serial multiport memories and high-throughput serial transmission supports. It is then possible to consider the realization of a multiprocessor with a common memory shared by several hundred processors, with a performance level close to that of a crossbar network without its disadvantages. This exchange process generates a family of possible architectures in which serial transfers of information are parallelized, in contrast to conventional architectures, which serialize parallel transfers of information.

Journal ArticleDOI
TL;DR: Simulations of real parallel applications show that large-scale cache-coherent multiprocessors suffer significant performance degradation when shared variables are used for synchronization.
Abstract: Shared-memory multiprocessors commonly use shared variables for synchronization. Our simulations of real parallel applications show that large-scale cache-coherent multiprocessors suffer significan…

Journal ArticleDOI
TL;DR: A distributed shared memory (DSM) architecture is presented that is the basis for the design of a scalable high performance multiprocessor system that is able to process very large processing tasks with supercomputer performance.
Abstract: The rapid progress of microprocessors provides economic solutions for small and medium-scale data processing tasks, e.g., workstations. It is a challenging task to combine many powerful microprocessors into a fixed or reconfigurable array which is able to process very large processing tasks with supercomputer performance. Fortunately, many very large applications are regularly structured and can easily be partitioned. One example is physical phenomena, which are often described by mathematical models, e.g. by sets of partial differential equations (PDEs). In most cases, the mathematical models can only be computed approximately. The finer the model used, the higher the necessary computational effort. With the appearance of more powerful computers, more complicated and more refined models can be calculated. Such user problems are compute-intensive and have strong inherent computational parallelism. Therefore, the needed high performance can be achieved by using many computers working in parallel. In particular, parallel architectures of the MIMD (multiple-instruction multiple-data) type, known as multiprocessors, are well suited because of their higher flexibility with respect to SIMD (single-instruction multiple-data). In this paper, the authors present a distributed shared memory (DSM) architecture that is the basis for the design of a scalable high performance multiprocessor system.

Journal ArticleDOI
TL;DR: A report on a panel session on future directions in parallel computer architecture, aimed at identifying the likely trajectory of future parallel computer system progress, particularly from the standpoint of marketplace acceptance.
Abstract: One of the program highlights of the 15th Annual International Symposium on Computer Architecture, held May 30 - June 2, 1988 in Honolulu, was a panel session on future directions in parallel computer architecture. The panel was organized and chaired by the author, and was comprised of Prof. Jack Dennis (NASA Ames Research Institute for Advanced Computer Science), Prof. H.T. Kung (Carnegie Mellon), and Dr. Burton Smith (Tera Computer Company). The objective of the panel was to identify the likely trajectory of future parallel computer system progress, particularly from the standpoint of marketplace acceptance. Approximately 250 attendees participated in the session, in which each panelist began with a ten minute viewgraph explanation of his views, followed by an open and sometimes lively exchange with the audience and fellow panelists. The session ran for ninety minutes.

Journal ArticleDOI
TL;DR: A strategy is described which treats cache blocks as components of a working set rather than merely as statistically related entities, and extension of this technique to global memory multiprocessors is discussed as a means to reduce memory contention during synchronized phase transitions.
Abstract: A variety of strategies has been proposed for fetching code and data into cache memories before it is explicitly referenced by a processor (prefetching), including both history-based and "prescient" strategies. Here, a strategy is described which treats cache blocks as components of a working set rather than merely as statistically related entities. Extension of this technique to global memory multiprocessors is discussed as a means to reduce memory contention during synchronized phase transitions.
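A working-set-based prefetch policy of the kind described can be sketched as follows (an illustrative policy, not the paper's exact strategy; the working-set grouping is assumed to come from the compiler or runtime rather than from per-block reference statistics):

```python
def blocks_to_fetch(missed_block, working_sets, cached):
    """On a cache miss, fetch the missed block together with the
    not-yet-cached members of its working set. A sketch of set-based
    prefetching: blocks are treated as components of a working set,
    not merely as statistically related entities."""
    for ws in working_sets:
        if missed_block in ws:
            return [b for b in ws if b not in cached]
    return [missed_block]     # no known working set: plain demand fetch
```

A history-based prefetcher would instead predict from past reference pairs; the contrast is that here the grouping is known ahead of time, which is what allows the multiprocessor extension to schedule fetches around synchronized phase transitions.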

Journal ArticleDOI
TL;DR: The synthesized CLs and FSMs can serve as "correct-by-construction" building blocks for self-timed silicon system compilation and are shown to require fewer gates than other proposed methods.
Abstract: Self-timed logic provides a method for designing logic circuits such that their correct behavior depends neither on the speed of their components nor on the delay along the communication wires. General synthesis methods for efficiently implementing self-timed combinational logic (CL) and finite state machines (FSMs) are presented. The resulting CL is shown to require fewer gates than other proposed methods. The FSM is implemented by interconnecting a CL module with a self-timed master-slave register. The FSM synthesis method is also compared with other approaches. A formal system of behavioral sequential constraints is presented for each of the systems, and their behavior is proven correct. Thus, the synthesized CLs and FSMs can serve as "correct-by-construction" building blocks for self-timed silicon system compilation.

Journal ArticleDOI
TL;DR: The authors explore the extent to which multiple hardware contexts per processor allow a scalable multiprocessor to tolerate high-latency memory operations.
Abstract: A fundamental problem that any scalable multiprocessor must address is the ability to tolerate high latency memory operations. This paper explores the extent to which multiple hardware contexts per...

Journal ArticleDOI
TL;DR: It is proved that the linear-array view of memory follows not from the consecutive nature of the memory but from the group structure of the law performed in the address arithmetic unit; by changing that law, one can obtain a memory with non-commutative access.
Abstract: Memory in the von Neumann computer is usually viewed as a linear array. We prove that this view does not follow from the consecutive nature of this memory, but from the group structure of the law performed in the address arithmetic unit. By changing that law, we can get a memory with non-commutative access. As an example we describe the metacyclic memory.
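The point about the group law can be made concrete: ordinary linear memory corresponds to addition modulo N (a commutative law), while a metacyclic (semidirect-product) law is non-commutative. A small sketch with illustrative parameters, not the paper's construction:

```python
M, N, R = 7, 3, 2            # R**N % M == 1, required for the law to be a group

def op(x, y):
    """Group law of the metacyclic group Z_M ⋊ Z_N on address pairs
    (a, b). Replacing modular addition with this non-commutative law
    in the address arithmetic unit changes the memory's access
    structure, as the abstract describes."""
    (a, b), (c, d) = x, y
    return ((a + pow(R, b, M) * c) % M, (b + d) % N)
```

"Stepping through memory" by repeatedly composing with a fixed generator now visits addresses in an order that depends on the operand order, which is exactly the non-commutative access the paper exhibits.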

Journal ArticleDOI
TL;DR: With low implementation cost and efficient support for important operating system functions, the SCCDC scheme appears attractive for multi-user, multi-thread environments where the actual sharing rate is modest and mostly caused by synchronization mechanisms.
Abstract: We present and evaluate a snoopy cache memory protocol, the Single Cache Copy Data Coherence (SCCDC) protocol, for multiprocessors, which allows only a single cache to hold a given shared datum at any time. The simulations presented here indicate that despite its simplicity, the scheme has the potential for good performance, comparable with more complex snoopy cache schemes. We have also shown in related work [8] that the existence of only a single copy of data in cache allows efficient access control to shared data by minimizing the overhead caused by critical sections. Thus, with low implementation cost and efficient support for important operating system functions, the SCCDC scheme appears attractive for multi-user, multi-thread environments where the actual sharing rate is modest and mostly caused by synchronization mechanisms.

Journal ArticleDOI
TL;DR: In addition to existing methods for improving computer performance, high performance systems of the 1990's and beyond will benefit from a sustained performance architecture (SPA), which is independent of technology and instruction set and can be incorporated in machines without sacrificing binary compatibility.
Abstract: The thesis of this report is that, in addition to existing methods for improving computer performance, high performance systems of the 1990's and beyond will benefit from a sustained performance architecture (SPA). This architecture, to be introduced informally here, is independent of technology and instruction set, and can be incorporated in machines without sacrificing binary compatibility. The purpose of SPA is to keep the instruction pipeline of a uniprocessor full to the greatest possible extent; this efficiency comes at the cost of increased hardware complexity in the prefetch mechanism and software complexity in the use of program flow information.

Journal ArticleDOI
TL;DR: A methodology and an architecture are introduced that greatly reduce the task-switch overhead of register windows while maintaining their inherent advantages, along with ways of implementing traditional stacks, queues, and hierarchical storage structures using windows.
Abstract: The organization of large register banks into windows has been shown to be effective in enhancing the performance of sequential programs. One drawback of such an organization, which is of minor importance to sequential languages, is the overhead encountered when the register bank must be replaced during a task switch. With concurrent language paradigms, such as are found in Ada, Occam, and Modula-2, these switches will be more frequent. We introduce here a methodology, and an architecture, which greatly reduces this overhead while maintaining the inherent advantages of the register window approach. In addition, we present ways of implementing traditional stacks and queues, as well as hierarchical storage structures, using windows.

Journal ArticleDOI
TL;DR: The response time behavior of access requests for page-sized objects made over a network connection is discussed; future architectures must include networked resources as a component of the storage hierarchy, and such performance measurements will help architects place these resources intelligently.
Abstract: Previous work has examined the response times associated with disk accesses and the response times associated with various memory access operations. Here, we discuss the response time behavior of access requests for page-sized objects which are made using a network connection. We feel that future architectures must include networked resources as a component of the storage hierarchy; such performance measurements will help architects place these resources intelligently.

Journal ArticleDOI
TL;DR: PARCBench is a non-synthetic, multi-component benchmark written in C and directed at Unix-based shared-memory multiprocessor architectures.
Abstract: PARCBench is a non-synthetic, multi-component benchmark written in C and directed at Unix-based shared-memory multiprocessor architectures. Unlike most conventional benchmarks, PARCBench generates multivariate data to produce a characterization of shared-memory architectures that necessarily goes beyond a single timing datum.

Journal ArticleDOI
TL;DR: In present CISCs the processor must store the task state information, or Process Control Block (PCB), of the currently running task into main memory at each task switch, which is time consuming.
Abstract: In the present CISC the processor needs to store the task state information, known as Task State Segment (TSS), or Process Control Block (PCB), of the present running task into the main memory each time the task switching occurs. The TSS of the new task should be loaded into the TSS register of the processor. This transfer of TSS to/from main memory for each task switching is time consuming, especially when the TSS size is large and the task switching is more frequent.

Journal ArticleDOI
TL;DR: A hardware approach and new message-passing mechanisms that the pure software approach does not support are presented and the resulting system is suitable for studying concurrent software, scientific computations and distributed problem solving.
Abstract: Although message-passing is a versatile communication paradigm in the multiprocessing arena [AthSe88], a pure message-passing mechanism via special communication channels is inefficient. Shared-memory multiprocessor systems are usually much cheaper and more efficient. Conventional approaches tend to implement message-passing on top of shared-memory architectures using a pure software approach. Even with special techniques [FinHe88], the performance of such systems is still worse than simple shared-memory communication systems. In this paper, we shall present a hardware approach and new message-passing mechanisms that the pure software approach does not support. Such an approach is quite cost effective. The resulting system is suitable for studying concurrent software, scientific computations and distributed problem solving.

Journal ArticleDOI
TL;DR: SIMP is a novel multiple instruction-pipeline parallel architecture targeted at drastically enhancing the performance of SISD processors by exploiting both temporal and spatial parallelism.
Abstract: SIMP is a novel multiple instruction-pipeline parallel architecture. It is targeted for enhancing the performance of SISD processors drastically by exploiting both temporal and spatial parallelisms...

Journal ArticleDOI
TL;DR: In an input/output (IO) intensive benchmark, using server-based locking instead of shared-memory locking resulted in a significant degradation in throughput of IO-limited multiprocessor runs.
Abstract: This paper discusses the performance implications of using server-based locking instead of shared memory locking in an input/output (IO) intensive benchmark. Uniprocessor and multiprocessor systems were investigated. Server-based locking was not a problem in uniprocessor runs or in multiprocessor runs that were processor limited. Server-based locking resulted in a significant degradation in throughput of IO limited multiprocessor runs.

Journal ArticleDOI
TL;DR: In this paper, the multiprocessor Sequent Symmetry was first delivered to customers with write-through caches and later on each machine was upgraded with copy-back caches.
Abstract: The multiprocessor Sequent Symmetry was first delivered to customers with write-through caches. Later on each machine was upgraded with copy-back caches. Because all the other architectural paramet...

Journal ArticleDOI
TL;DR: The difficulty in optimizing the conditional branch instruction lies in having the two successor paths in different parts of memory; interlacing the two paths provides fast access to both.
Abstract: The difficulty in optimizing the conditional branch instruction lies in having the two successor paths in different parts of memory. Interlacing the two paths provides fast access to both paths. The new instruction is called "test". The Test Instruction: The test instruction provides the function of a conditional branch instruction. Following a test instruction in a program, the two program paths that follow the test are interlaced. By interlacing the two program paths, both program paths are availabl…