
Showing papers on "Scalability published in 1989"


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The extent to which multiple hardware contexts per processor can help to mitigate the negative effects of high latency is explored and it is shown that two or four contexts can achieve substantial performance gains over a single context.
Abstract: A fundamental problem that any scalable multiprocessor must address is the ability to tolerate high latency memory operations. This paper explores the extent to which multiple hardware contexts per processor can help to mitigate the negative effects of high latency. In particular, we evaluate the performance of a directory-based cache coherent multiprocessor using memory reference traces obtained from three parallel applications. We explore the case where there are a small fixed number (2-4) of hardware contexts per processor and the context switch overhead is low. In contrast to previously proposed approaches, we also use a very simple context switch criterion, namely a cache miss or a write-hit to shared data. Our results show that the effectiveness of multiple contexts depends on the nature of the applications, the context switch overhead, and the inherent latency of the machine architecture. Given reasonably low overhead hardware context switches, we show that two or four contexts can achieve substantial performance gains over a single context. For one application, the processor utilization increased by about 46% with two contexts and by about 80% with four contexts.
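
For intuition about the mechanism being evaluated, here is a minimal sketch (not the paper's trace-driven simulator) of a processor with a small number of hardware contexts that switches on every cache miss; the miss rate, miss latency, and switch cost are assumed values chosen only to show how utilization grows with two and four contexts.

```python
import random

def utilization(num_contexts, miss_rate=0.05, miss_latency=50,
                switch_cost=4, total_cycles=200_000, seed=1):
    """Estimate processor utilization with multiple hardware contexts.

    Illustrative model with assumed parameters: each context issues one
    instruction per cycle until it misses in the cache; a miss stalls that
    context for `miss_latency` cycles, and the processor switches to the
    next context at a cost of `switch_cost` cycles.
    """
    rng = random.Random(seed)
    ready_at = [0] * num_contexts   # cycle at which each context becomes runnable
    cycle = busy = current = 0
    while cycle < total_cycles:
        if ready_at[current] <= cycle:
            run = 1
            while rng.random() >= miss_rate:   # run until the next cache miss
                run += 1
            busy += run
            cycle += run
            ready_at[current] = cycle + miss_latency  # context stalled on the miss
            cycle += switch_cost                      # hardware context switch
            current = (current + 1) % num_contexts
        else:
            # No runnable work in this context yet; spin a cycle and try the next one.
            current = (current + 1) % num_contexts
            cycle += 1
    return busy / cycle

for k in (1, 2, 4):
    print(f"{k} context(s): utilization ~ {utilization(k):.2f}")
```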

163 citations


Journal ArticleDOI
M.M. Theimer, K.A. Lantz
TL;DR: The authors describe the design and performance of scheduling facilities for finding idle hosts in a workstation-based distributed system and focus on the tradeoffs between centralized and decentralized architectures with respect to scalability, fault tolerance, and simplicity of design.
Abstract: The authors describe the design and performance of scheduling facilities for finding idle hosts in a workstation-based distributed system. They focus on the tradeoffs between centralized and decentralized architectures with respect to scalability, fault tolerance, and simplicity of design, as well as several implementation issues of interest when multicast communication is used. They conclude that the principal tradeoff between the two approaches is that a centralized architecture can be scaled to a significantly greater degree and can more easily monitor global system statistics, whereas a decentralized architecture is simpler to implement.

139 citations


Journal ArticleDOI
TL;DR: A form of coherence in the ray-tracing algorithm is identified that can be exploited to develop optimum schemes for data distribution in a multiprocessor system, which gives rise to high processor efficiency for systems with limited distributed memory.
Abstract: The scalability and cost effectiveness of general-purpose distributed-memory multiprocessor systems make them particularly suitable for ray-tracing applications. However, the limited memory available to each processor in such a system requires schemes to distribute the model database among the processors. The authors identify a form of coherence in the ray-tracing algorithm that can be exploited to develop optimum schemes for data distribution in a multiprocessor system. This in turn gives rise to high processor efficiency for systems with limited distributed memory.

99 citations


Proceedings ArticleDOI
01 Aug 1989
TL;DR: The experimental results show that the ACWN algorithm achieves better performance than randomized allocation in most cases, and its agility in spreading the work helps it outperform the gradient model in both performance and scalability.
Abstract: One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly acute for computations with unpredictable dynamic behavior or irregular structure. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The Adaptive Contracting Within Neighborhood (ACWN) scheme is dynamic, distributed, self-adaptive, and scalable. The basic scheme and its adaptive extensions are described and contrasted with other schemes that have been proposed in this context. The performance of all three schemes on an iPSC/2 hypercube is presented and analyzed. The experimental results show that the ACWN algorithm achieves better performance than randomized allocation in most cases. Its agility in spreading the work helps it outperform the gradient model in performance and scalability.
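
A minimal sketch of the neighborhood-based placement idea described above; the load metric, threshold, hop limit, and ring topology are assumptions made for illustration, not the published ACWN parameters.

```python
from collections import deque

class Node:
    """One processor in a hypothetical neighborhood-based scheduler."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()   # locally scheduled medium-grained tasks
        self.neighbors = []    # e.g. hypercube or ring neighbors

    @property
    def load(self):
        return len(self.queue)

def acwn_place(origin, task, threshold=4, max_hops=3):
    """Place `task`, contracting within the neighborhood of `origin`.

    Assumed rule (illustrative, not the published algorithm verbatim):
    keep the task locally if the local queue is short enough or no
    neighbor is less loaded; otherwise hand it to the least-loaded
    neighbor, up to `max_hops` forwarding steps.
    """
    node = origin
    for _ in range(max_hops):
        best = min(node.neighbors, key=lambda n: n.load, default=None)
        if node.load <= threshold or best is None or best.load >= node.load:
            break
        node = best
    node.queue.append(task)
    return node

# Tiny example: a 4-node ring with one overloaded node.
nodes = [Node(i) for i in range(4)]
for i, n in enumerate(nodes):
    n.neighbors = [nodes[(i - 1) % 4], nodes[(i + 1) % 4]]
nodes[0].queue.extend(range(6))
placed_on = acwn_place(nodes[0], "new-task")
print("task placed on node", placed_on.name)
```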

64 citations


Proceedings ArticleDOI
01 Apr 1989
TL;DR: The VMP-MC design is described, a distributed parallel multi-computer based on the VMP multiprocessor design that is intended to provide a set of building blocks for configuring machines from one to several thousand processors.
Abstract: The problem of building a scalable shared memory multiprocessor can be reduced to that of building a scalable memory hierarchy, assuming interprocessor communication is handled by the memory system. In this paper, we describe the VMP-MC design, a distributed parallel multi-computer based on the VMP multiprocessor design, which is intended to provide a set of building blocks for configuring machines from one to several thousand processors. VMP-MC uses a memory hierarchy based on shared caches, ranging from on-chip caches to board-level caches connected by busses to, at the bottom, a high-speed fiber optic ring. In addition to describing the building block components of this architecture, we identify the key performance issues associated with the design and provide a performance evaluation of these issues using trace-driven simulation and measurements from the VMP. This work was sponsored in part by the Defense Advanced Research Projects Agency under Contract N00014-88-K-0619.

55 citations


01 Jun 1989
TL;DR: How layout theory engendered the notion of area- and volume-universal networks, such as fat-trees, is discussed; these scalable networks offer a flexible alternative to the more common hypercube-based networks for interconnecting the processors of large parallel supercomputers.
Abstract: Since its inception, VLSI theory has expanded in many fruitful and interesting directions. One major branch is layout theory, which studies the efficiency with which graphs can be embedded in the plane according to VLSI design rules. In this survey paper, I review some of the major accomplishments of VLSI layout theory and discuss how layout theory engendered the notion of area- and volume-universal networks, such as fat-trees. These scalable networks offer a flexible alternative to the more common hypercube-based networks for interconnecting the processors of large parallel supercomputers. Keywords: Integrated circuits; Interconnection networks; Parallel computing; Supercomputing; Universality; Thompson's model; Tree of meshes.
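
As background for the survey's starting point, the classical Thompson-style layout bound can be stated as follows (quoted from memory as a standard result, not from the paper itself):

```latex
% Thompson-model layout lower bound (background sketch).
% If a graph $G$ has minimum bisection width $B$, then any VLSI layout of
% $G$ under the usual grid-model design rules requires area
\[
  A(G) \;=\; \Omega\!\left(B^{2}\right).
\]
% Low-bisection networks such as trees therefore lay out compactly, and a
% fat-tree fattens its channels toward the root to buy bandwidth while
% keeping layout area (or volume) under control.
```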

36 citations


Proceedings Article
01 Jan 1989
TL;DR: In this paper, the authors describe an approach for a system that accesses the distributed collection of repositories that naturally maintain resource information, rather than building a global database to register all resources.
Abstract: Large scale computer networks provide access to a bewilderingly large number and variety of resources, including retail products, network services, and people in various capacities. We consider the problem of allowing users to discover the existence of such resources in an administratively decentralized environment. We describe an approach for a system that accesses the distributed collection of repositories that naturally maintain resource information, rather than building a global database to register all resources. A key problem is organizing the resource space in a manner suitable to all participants. Rather than imposing an inflexible hierarchical organization, our approach allows the resource space organization to evolve in accordance with what resources exist and what types of queries users make. Concretely, a set of agents organize and search the resource space by constructing links between the repositories of resource information based on keywords that describe the contents of each repository, and the semantics of the resources being sought. The links form a general graph, with a flexible set of hierarchies embedded within the graph to provide some measure of scalability. The graph structure evolves over time through the use of cache aging protocols. Additional scalability is targeted through the use of probabilistic graph protocols. A prototype implementation and a measurement study are under way. This material is based upon work supported in part by the National Science Foundation under Cooperative Agreement DCR-84200944, and by a grant from AT&T Bell Laboratories.
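
A toy illustration of the link-following search the agents perform; the repository names, keyword sets, and breadth-first policy below are assumptions made for the example, not the authors' protocol.

```python
from collections import deque

# Each repository advertises keywords describing its contents; agents build
# links between repositories so queries follow keyword trails instead of
# consulting a global registry.
repositories = {
    "cs.printers":     {"keywords": {"printer", "postscript"}, "links": ["campus.services"]},
    "campus.services": {"keywords": {"printer", "license", "compute"}, "links": ["cs.printers", "vendors"]},
    "vendors":         {"keywords": {"retail", "license"}, "links": ["campus.services"]},
}

def discover(start, wanted, max_visits=10):
    """Breadth-first search over repository links for a keyword."""
    seen, frontier, hits = {start}, deque([start]), []
    while frontier and len(seen) <= max_visits:
        repo = frontier.popleft()
        if wanted in repositories[repo]["keywords"]:
            hits.append(repo)
        for nxt in repositories[repo]["links"]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return hits

print(discover("cs.printers", "license"))   # ['campus.services', 'vendors']
```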

27 citations


Proceedings ArticleDOI
01 Aug 1989
TL;DR: A technique is proposed that can be used to help determine whether a candidate model is correct, that is, whether it adequately approximates the system's scalability, and experimental results illustrate this technique for both a poorly scalable and a very scalable system.
Abstract: This paper discusses scalability and outlines a specific approach to measuring the scalability of parallel computer systems. The relationship between scalability and speedup is described. It is shown that a parallel system is scalable for a given algorithm if and only if its speedup is unbounded. A technique is proposed that can be used to help determine whether a candidate model is correct, that is, whether it adequately approximates the system's scalability. Experimental results illustrate this technique for both a poorly scalable and a very scalable system.
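
The paper's characterization can be written compactly; the notation below is added here, but the statement is the one given in the abstract.

```latex
% Speedup of a parallel system with N processors on a fixed algorithm:
\[
  S(N) \;=\; \frac{T(1)}{T(N)},
\]
% where $T(N)$ is the execution time on $N$ processors.  The paper's claim
% is then: the system is scalable for that algorithm if and only if the
% speedup is unbounded, i.e.
\[
  \text{scalable} \iff \sup_{N} S(N) = \infty .
\]
```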

27 citations


01 Mar 1989
TL;DR: Psyche is an operating system designed to enable the most effective use possible of large-scale shared-memory multiprocessors; its design permits multiple models of parallelism, both within and among applications, with information sharing as the default rather than the exception.
Abstract: Scalable shared-memory multiprocessors (those with non-uniform memory access times) are among the most flexible architectures for high-performance parallel computing, admitting efficient implementations of a wide range of process models, communication mechanisms, and granularities of parallelism. Such machines present opportunities for general-purpose parallel computing that cannot be exploited by existing operating systems, because the traditional approach to operating system design presents a virtual machine in which the definition of process, communication, and grain size are outside the control of the user. Psyche is an operating system designed to enable the most effective use possible of large-scale shared-memory multiprocessors. The Psyche project is characterized by (1) a design that permits the implementation of multiple models of parallelism, both within and among applications, (2) the ability to trade protection for performance, with information sharing as the default, rather than the exception, (3) explicit, user-level control of process structure and scheduling, and (4) a kernel implementation that uses shared memory itself, and that provides users with the illusion of uniform memory access times.

16 citations


Proceedings ArticleDOI
27 Feb 1989
TL;DR: The authors describe and motivate the design of a scalable and portable benchmark for database systems, the AS3AP benchmark (ANSI SQL Standard Scalable and Portable).
Abstract: The authors describe and motivate the design of a scalable and portable benchmark for database systems, the AS3AP benchmark (ANSI SQL Standard Scalable and Portable). The benchmark is designed to provide meaningful measures of database processing power, to be portable between different architectures, and to be scalable to facilitate comparisons between systems with different capabilities. The authors introduce a performance metric, namely, the equivalent database ratio, to be used in comparing systems.

16 citations


Journal ArticleDOI
TL;DR: Results of a performance evaluation of several parallel disk organizations are presented, along with a characterization of the disk systems.
Abstract: In this paper, several issues related to designing a parallel disk system are discussed. Results of a performance evaluation of several parallel disk organizations are presented, and a characterization of the disk systems is given. Issues such as scalability and networking are discussed, and several problems for future research on improving I/O performance are pointed out.

Journal ArticleDOI
TL;DR: The features offered by current high-performance 32-bit system buses are examined, and the factors that need to be taken into account when designing these buses are considered.
Abstract: The features offered by current high-performance 32-bit system buses are examined. They allow multiprocessing, scalability, block transfers to RAM, cache coherence, and autoconfiguration (the ability to poll boards connected to them, identify the boards, and adjust the software interface accordingly). The factors that need to be taken into account when designing these buses are considered, and their performance and limitations are discussed.

Book ChapterDOI
19 Jun 1989
TL;DR: It is concluded that although parallelism must be limited in some circumstances, in general the benefits of increased parallelism in shared-nothing systems exceed the costs.
Abstract: We describe results from an experiment that investigates the scalability of response time performance in shared-nothing systems, such as the Bubba parallel database machine. In particular, we show how—and how much—potential response time improvements for certain transaction types can be impaired in shared-nothing architectures by the increased cost of transaction startup, communication, and synchronization as the degree of execution parallelism is increased. We further show how these effects change under increased levels of concurrency and heterogeneity in the transaction workload. From the results, we conclude that although parallelism must be limited in some circumstances, in general the benefits of increased parallelism in shared-nothing systems exceed the costs.
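
The tradeoff being measured can be sketched with a generic response-time model; this particular functional form is an illustration, not the Bubba team's model.

```latex
% Response time of one transaction executed with degree of parallelism n:
\[
  R(n) \;\approx\; \frac{W}{n} \;+\; a \;+\; b\,n,
\]
% where $W$ is the transaction's total work, $a$ a fixed startup cost, and
% $b$ the per-participant cost of startup, communication, and
% synchronization.  Minimizing over $n$ gives an optimal degree of
% parallelism
\[
  n^{*} \;=\; \sqrt{W/b},
\]
% beyond which adding parallelism increases response time; this is the
% sense in which parallelism "must be limited in some circumstances."
```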

Journal ArticleDOI
TL;DR: A distributed shared memory (DSM) architecture is presented that is the basis for the design of a scalable high performance multiprocessor system that is able to process very large processing tasks with supercomputer performance.
Abstract: The rapid progress of microprocessors provides economic solutions for small and medium-scale data processing tasks, e.g., workstations. It is a challenging task to combine many powerful microprocessors into a fixed or reconfigurable array that is able to process very large processing tasks with supercomputer performance. Fortunately, many very large applications are regularly structured and can easily be partitioned. One example is physical phenomena, which are often described by mathematical models, e.g. by sets of partial differential equations (PDEs). In most cases, the mathematical models can only be computed approximately: the finer the model, the higher the necessary computational effort. With the appearance of more powerful computers, more complicated and more refined models can be calculated. Such user problems are compute-intensive and have strong inherent computational parallelism. Therefore, the needed high performance can be achieved by using many computers working in parallel. In particular, parallel architectures of the MIMD (multiple-instruction multiple-data) type, known as multiprocessors, are well suited because of their higher flexibility compared with SIMD (single-instruction multiple-data) architectures. In this paper, the authors present a distributed shared memory (DSM) architecture that is the basis for the design of a scalable high-performance multiprocessor system.
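
The regular, easily partitioned structure the authors appeal to can be illustrated with a tiny 1-D Jacobi relaxation split into contiguous blocks; this is a sequential sketch of the data decomposition only, with made-up sizes, and it stands in for neither the DSM hardware nor a real message-passing layer.

```python
import numpy as np

def jacobi_partitioned(u, num_parts, sweeps):
    """1-D Jacobi relaxation computed block-by-block to mimic a regular
    domain decomposition: each 'processor' owns a contiguous block and only
    needs one ghost value from each neighboring block per sweep."""
    n = len(u)
    bounds = np.linspace(0, n, num_parts + 1, dtype=int)
    for _ in range(sweeps):
        new = u.copy()
        for p in range(num_parts):
            lo, hi = bounds[p], bounds[p + 1]
            for i in range(max(lo, 1), min(hi, n - 1)):
                # u[i-1] or u[i+1] may live in a neighboring block: on a real
                # machine, fetching it is the halo/ghost-cell exchange.
                new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u

u0 = np.zeros(32)
u0[0], u0[-1] = 1.0, 1.0                     # fixed boundary conditions
print(jacobi_partitioned(u0, num_parts=4, sweeps=100).round(2))
```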

Proceedings ArticleDOI
01 Apr 1989
TL;DR: A queueing network model of the dynamic dataflow architecture is developed based on the idea of characterizing dataflow graphs by their average parallelism and the effect on the performance of the system due to factors such as scalability, coarse grain vs. fine grain parallelism, degree of decentralized scheduling of dataflow instructions, and locality is studied.
Abstract: This paper presents analytical results of computation-communication issues in dynamic dataflow architectures. The study is based on a generalized architecture which encompasses all the features of the proposed dynamic dataflow architectures. Based on the idea of characterizing dataflow graphs by their average parallelism, a queueing network model of the architecture is developed. Since the queueing network violates properties required for a product-form solution, a few approximations have been used. These approximations yield a multi-chain closed queueing network in which the population of each chain is related to the average parallelism of the dataflow graph executed in the architecture. Based on the model, we are able to study the effect on the performance of the system due to factors such as scalability, coarse-grain vs. fine-grain parallelism, degree of decentralized scheduling of dataflow instructions, and locality.
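
A single-chain simplification of such a closed queueing network can be solved with standard Mean Value Analysis; the sketch below treats the dataflow graph's average parallelism as the customer population, uses invented service demands, and omits the multi-chain and approximation details of the paper.

```python
def mva(service_demands, population):
    """Exact Mean Value Analysis for a closed, single-class, product-form
    queueing network of queueing centers.

    service_demands[k] = mean service demand at center k per token visit.
    population         = number of circulating customers, here the average
                         parallelism of the dataflow graph.
    Returns (throughput, per-center mean queue lengths)."""
    K = len(service_demands)
    queue = [0.0] * K
    throughput = 0.0
    for n in range(1, population + 1):
        # Response time at each center given the queues left by n-1 customers.
        resp = [service_demands[k] * (1.0 + queue[k]) for k in range(K)]
        throughput = n / sum(resp)
        queue = [throughput * resp[k] for k in range(K)]
    return throughput, queue

# Centers: processing elements, matching store, communication network
# (service demands are illustrative numbers only).
X, Q = mva([1.0, 0.4, 0.8], population=16)
print(f"throughput ~ {X:.2f}, mean queue lengths ~ {[round(q, 2) for q in Q]}")
```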

Proceedings ArticleDOI
Ruby B. Lee
03 Jan 1989
TL;DR: The author discusses the Hewlett-Packard Precision architecture, which was designed as a common architecture for HP computer systems with a RISC (reduced-instruction-set computer)-like execution model, with features for code compaction and execution time reduction for frequent instruction sequences.
Abstract: The author discusses the Hewlett-Packard Precision architecture, which was designed as a common architecture for HP computer systems. It has a RISC (reduced-instruction-set computer)-like execution model, with features for code compaction and execution time reduction for frequent instruction sequences. In addition, it has features for making the architecture extendible, for enhancing its longevity, and for supporting different operating environments. The author describes some aspects of the Precision processor architecture, its goals, how it addresses the spectrum of general-purpose information processing needs, and some architectural design tradeoffs.

Proceedings ArticleDOI
11 Jun 1989
TL;DR: A prototype broadband integrated services digital network is described that uses a scalable, distributed control architecture, which enhances reliability, to meet the need for flexible central office software control architectures for broadband services.
Abstract: It is noted that broadband services accelerate the need for central office software control architectures allowing flexible allocation of computing resources, since they increase the volume, complexity, and fluctuation of the workload. A description is given of a prototype broadband integrated services digital network which uses a scalable, distributed control architecture, which enhances reliability, to meet this goal. In contrast with other architectures in which processors are tightly coupled to subscriber lines, the prototype control architecture decomposes call processing into functions that are distributed among several processors with minimized common functions and coupling between subscribers and processors. Scalability in terms of lines and traffic volume is achieved. Two versions of the architecture, one using general-purpose computers and the other a single board computer system, are operational. Extensions of the architecture for unified network control offer the additional benefit of simplifying new service deployment.

01 Dec 1989
TL;DR: This research makes three major contributions: several distinct types of skew are identified; the relative partition model of skew, a simple analytic model that allows worst-case analysis of each type of data skew, is defined; and a systematic plan for investigating skew and scalability is laid out.
Abstract: This research will improve understanding of the interaction between data skew and scalability in parallel join algorithms. Previous work in this area assumes that data are uniformly distributed, but data skew is widespread in existing databases. This research makes three major contributions: 1. Several distinct types of skew are identified. Previous work treats skew as a homogeneous phenomenon, but a simple analytic argument shows that each type of skew has a different effect on response time. 2. The relative partition model of skew is defined. It is a simple analytic model that allows worst-case analysis of each type of data skew. The use of this model is demonstrated in an analysis of the sort-merge join algorithm. 3. A systematic plan for investigating skew and scalability is developed. The interplay between simple analytic models and detailed simulations is vital: analytic models bound the results expected from simulation, while more detailed simulation results validate the analytic models.
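
One way to see why skew caps scalability, as an illustrative worst-case bound in the spirit of (but not identical to) the relative partition model:

```latex
% Suppose a join of total work $W$ is split over $n$ nodes and, because of
% skew, the largest partition receives a fraction $p \ge 1/n$ of the work.
% With response time governed by the slowest node,
\[
  T(n) \;\ge\; p\,W
  \qquad\Longrightarrow\qquad
  \text{speedup} \;=\; \frac{W}{T(n)} \;\le\; \frac{1}{p},
\]
% so no matter how many processors are added, speedup is bounded by the
% reciprocal of the skew fraction.
```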

Dissertation
01 Dec 1989
TL;DR: This thesis discusses how the system software might manage the 'relative pointers' in a clean, transparent way, solutions to the problem of testing pointer equivalence, protocols and algorithms for migrating objects to maximize concurrency and communication locality, garbage collection techniques, and other aspects of the CNRA system design.
Abstract: The Computer Architecture Group is developing a new model of computation called L. This thesis describes a highly scalable architecture for implementing L called CNRA. In the CNRA architecture, processor/memory pairs are placed at the nodes of a low-dimensional Cartesian grid network. Addresses in the system are composed of a routing component, which describes a relative path through the interconnection network (the origin of the path is the node on which the address resides), and a 'memory location' component, which specifies the memory location to be addressed on the node at the destination of the routing path. The CNRA addressing system allows sharing of data structures in a style similar to that of global shared memory machines, but does not have the disadvantages normally associated with shared-memory machines (i.e., limited address space and memory access latency that increases with system size). This thesis discusses how a practical CNRA system might be built. It discusses how the system software might manage the 'relative pointers' in a clean, transparent way, solutions to the problem of testing pointer equivalence, protocols and algorithms for migrating objects to maximize concurrency and communication locality, garbage collection techniques, and other aspects of the CNRA system design. Simulation experiments with a toy program are presented. Keywords: Multiprocessors; Scalability; Topology; Address space; Relative addressing; Task migration; Parallelism.
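
The relative-pointer idea becomes concrete on a 2-D grid, where a routing path can be summarized by a displacement vector; the snippet shows the translation the system software would have to perform when such a pointer is copied between nodes, with hypothetical type and field names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelPtr:
    """A CNRA-style address: a relative route through the grid plus a
    memory location on the node the route reaches (field names are mine)."""
    dx: int       # displacement along the grid's x axis
    dy: int       # displacement along the y axis
    offset: int   # memory location on the destination node

def translate(ptr: RelPtr, holder, new_holder):
    """Re-express `ptr`, stored at node `holder`, so it still names the same
    target object after the pointer itself is copied to `new_holder`.
    Nodes are (x, y) grid coordinates."""
    target = (holder[0] + ptr.dx, holder[1] + ptr.dy)
    return RelPtr(target[0] - new_holder[0], target[1] - new_holder[1], ptr.offset)

def same_object(a: RelPtr, node_a, b: RelPtr, node_b):
    """Pointer equivalence: two relative pointers are equal iff they resolve
    to the same node and memory location."""
    return (node_a[0] + a.dx, node_a[1] + a.dy, a.offset) == \
           (node_b[0] + b.dx, node_b[1] + b.dy, b.offset)

p = RelPtr(dx=2, dy=-1, offset=0x40)            # stored on node (3, 3)
q = translate(p, holder=(3, 3), new_holder=(0, 0))
print(q, same_object(p, (3, 3), q, (0, 0)))     # True: both name node (5, 2)
```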

Proceedings ArticleDOI
10 Oct 1989
TL;DR: An analysis of the scalability of large-scale degradable homogeneous multiprocessors is presented by assessing the limitations imposed by reliability considerations on the number of processors, and it is demonstrated that graceful degradation in large-scale systems is not scalable.
Abstract: The authors present an analysis of the scalability of large-scale degradable homogeneous multiprocessors by assessing the limitations imposed by reliability considerations on the number of processors. They demonstrate that graceful degradation in large-scale systems is not scalable. An increase in the number of processors must be matched by a significant increase in the coverage factor in order to maintain the same performance and reliability levels.
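
The flavor of the argument can be reproduced with a textbook coverage model (a deliberate simplification, not the authors' derivation):

```latex
% Let each of $n$ processors fail at rate $\lambda$, and let $c$ be the
% coverage factor: the probability that a failure is detected and the
% system reconfigures (degrades gracefully) rather than crashes.  If the
% mission of length $t$ sees roughly $m \approx n\lambda t$ failures, the
% probability of surviving all of them is about
\[
  R(t) \;\approx\; c^{\,m} \;=\; c^{\,n\lambda t},
\]
% so holding $R(t)$ constant as $n$ grows forces $c$ toward 1; the
% required coverage increases with system size, which is the sense in
% which graceful degradation is not scalable.
```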

01 Jun 1989
TL;DR: A cost-effective sort engine and a novel algorithm that combines hashing and semijoins are proposed; results indicate that parallel joins scale very well and that, in some cases, synergistic effects lead to better-than-linear speedup.
Abstract: This paper focuses on parallel joins computed on a mesh-connected multicomputer. We propose a cost-effective sort engine and a novel algorithm that combines hashing and semijoins. An analytic model is used to select hardware configurations for detailed evaluation and to suggest refinements to the algorithm. Simulation of our model confirmed the analytic results. Results indicate that parallel joins scale very well. In some cases, synergistic effects lead to better than linear speedup.
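
To fix ideas, here is a tiny sequential sketch of a join that combines hash partitioning with a semijoin-style filter; it is a generic illustration, not the authors' mesh algorithm or sort engine.

```python
def hash_semijoin_join(R, S, num_parts):
    """Join R and S on their first attribute.

    Tuples are first hash-partitioned on the join key (on a mesh each
    partition would live on a different node); within a partition, the keys
    of R act as a semijoin filter so only matching S tuples ever need to be
    'shipped' and joined."""
    parts_R = [[] for _ in range(num_parts)]
    parts_S = [[] for _ in range(num_parts)]
    for t in R:
        parts_R[hash(t[0]) % num_parts].append(t)
    for t in S:
        parts_S[hash(t[0]) % num_parts].append(t)

    result = []
    for pr, ps in zip(parts_R, parts_S):
        keys = {t[0] for t in pr}                    # semijoin filter
        survivors = [t for t in ps if t[0] in keys]  # S semijoin R
        build = {}
        for t in pr:                                 # local hash join
            build.setdefault(t[0], []).append(t)
        for s in survivors:
            for r in build[s[0]]:
                result.append(r + s[1:])
    return result

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (2, "y"), (4, "z")]
print(hash_semijoin_join(R, S, num_parts=4))   # [(2, 'b', 'x'), (2, 'b', 'y')]
```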

Proceedings ArticleDOI
23 May 1989
TL;DR: A parallel computer architecture targeted at signal pattern analysis applications and scalable to configurations capable of TeraFLOP throughput is described, and a performance model is derived and used to analyze the impact of skewness of the embedded trees on the execution time of parallel recognition algorithms.
Abstract: Describes a parallel computer architecture targeted at signal pattern analysis applications, scalable to configurations capable of TeraFLOP (10^12 floating point operations per second) throughput. An important attribute of the architecture is its low interconnection overhead, making it well suited to miniaturization using advanced packaging. Preliminary design and thermal tests project a computing density of 300 GigaFLOPS per cubic foot. The architecture is reconfigurable as a tree machine, one or more rings, or a set of linear systolic arrays. Fault tolerance is achieved by embedding these topologies within a four-connected lattice, growing around any faults. A performance model is derived and used to analyze the impact of skewness of the embedded trees on the execution time of parallel recognition algorithms.