
Showing papers on "Scalability published in 1991"


Proceedings ArticleDOI
01 Apr 1991
TL;DR: Fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization are presented, based on widely available fetch-and-Φ instructions, that exploit local access to shared memory to avoid contention.
Abstract: Conventional wisdom holds that contention due to busy-wait synchronization is a major obstacle to scalability and acceptable performance in large shared-memory multiprocessors. We argue the contrary, and present fast, simple algorithms for contention-free mutual exclusion, reader-writer control, and barrier synchronization. These algorithms, based on widely available fetch-and-Φ instructions, exploit local access to shared memory to avoid contention. We compare our algorithms to previous approaches in both qualitative and quantitative terms, presenting their performance on the Sequent Symmetry and BBN Butterfly multiprocessors. Our results highlight the importance of local access to shared memory, provide a case against the construction of so-called "dance hall" machines, and suggest that special-purpose hardware support for synchronization is unlikely to be cost effective on machines with sequentially consistent memory.
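
A minimal sketch of the local-spin idea behind these queue-based algorithms (illustrative only, not the authors' fetch-and-Φ code: Python exposes no atomic swap, so a small internal mutex stands in for the swap-with-memory primitive, and the names Node and MCSLikeLock are ours). Each waiter waits only on a flag in its own queue node, and the releaser performs a single write to its successor's flag:

import threading

class Node:
    """Per-thread queue record: the locally-accessible flag a waiter watches."""
    def __init__(self):
        self.locked = threading.Event()
        self.next = None

class MCSLikeLock:
    def __init__(self):
        self._tail = None
        self._swap_guard = threading.Lock()   # stands in for an atomic fetch_and_store

    def _swap_tail(self, new_tail):
        with self._swap_guard:
            prev, self._tail = self._tail, new_tail
            return prev

    def acquire(self, node):
        node.next = None
        node.locked.clear()
        prev = self._swap_tail(node)          # enqueue myself at the tail
        if prev is not None:                  # lock is held: link in, then wait locally
            prev.next = node
            node.locked.wait()                # "spin" only on my own flag

    def release(self, node):
        succ = node.next
        if succ is None:
            with self._swap_guard:            # no visible successor: try to empty the queue
                if self._tail is node:
                    self._tail = None
                    return
            while node.next is None:          # a successor is mid-enqueue; let it link in
                pass
            succ = node.next
        succ.locked.set()                     # single remote write wakes the successor

Each thread allocates its own Node (the locally-accessible flag the abstract's argument hinges on) and passes it to acquire and release around its critical section.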

165 citations


Proceedings Article
03 Sep 1991
TL;DR: Compared to other join strategies, a hash-based join algorithm is particularly efficient and easily parallelized for this computation model, but this hardware structure is very sensitive to the data skew problem.
Abstract: Shared nothing multiprocessor architecture is known to be more scalable to support very large databases. Compared to other join strategies, a hash-based join algorithm is particularly efficient and easily parallelized for this computation model. However, this hardware structure is very sensitive to the data skew problem. Unless the parallel hash join algorithm includes some load balancing mechanism, skew effect can deteriorate the system performance.
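
A toy rendering of the hash-partitioned join and of why key skew hurts it (relation layout, node count, and the skewed data are invented for illustration): both relations are partitioned by hashing the join key, each node joins its two fragments independently, and a heavily repeated key funnels a disproportionate share of tuples to a single node.

from collections import defaultdict

def hash_partition(tuples, key_idx, n_nodes):
    parts = [[] for _ in range(n_nodes)]
    for t in tuples:
        parts[hash(t[key_idx]) % n_nodes].append(t)
    return parts

def local_hash_join(r_part, s_part):
    # classic build/probe hash join on the first attribute of each tuple
    table = defaultdict(list)
    for r in r_part:
        table[r[0]].append(r)
    return [r + s for s in s_part for r in table.get(s[0], [])]

def parallel_hash_join(R, S, n_nodes=4):
    R_parts = hash_partition(R, 0, n_nodes)
    S_parts = hash_partition(S, 0, n_nodes)
    print("per-node load:", [len(r) + len(s) for r, s in zip(R_parts, S_parts)])
    result = []
    for r, s in zip(R_parts, S_parts):        # on real hardware each pair runs on its own node
        result.extend(local_hash_join(r, s))
    return result

# Skew demo: 90% of S carries one join key, so one node receives most of the work.
R = [(k, "r%d" % k) for k in range(100)]
S = [(7, "s")] * 900 + [(k % 100, "s") for k in range(100)]
parallel_hash_join(R, S)

Without a load-balancing step, the per-node loads printed above show exactly the imbalance the abstract warns about.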

164 citations


ReportDOI
01 Nov 1991
TL;DR: This work provides a discussion of MMCC's peer-to-peer model of communication and an overview of its connection control protocol, and future directions for research in multimedia conference control are presented.
Abstract: MMCC, the multimedia conference control program, is a window-based tool for conference management. It serves as an application interface to the ISI/BBN teleconferencing system, where it is used not only to orchestrate multisite conferences, but also to provide local and remote audio and video control, and to interact with other conference-oriented tools that support shared workspaces. The motivation for this paper is to document the design, operation and continued evolution of MMCC. After presenting the context for this work, we provide a discussion of MMCC's peer-to-peer model of communication and an overview of its connection control protocol. Issues are also raised about heterogeneity, robust services, scalability and the impact of conferencing over the Internet. A description of the system's regular use offers insights into the feasibility of the architecture. Finally, future directions for research in multimedia conference control are presented.

145 citations


Journal ArticleDOI
TL;DR: The scalability of a given architecture is defined to be the fraction of the parallelism inherent in a given algorithm that can be exploited by any machine of that architecture as a function of problem size.
Abstract: We define the scalability of a given architecture to be the fraction of the parallelism inherent in a given algorithm that can be exploited by any machine of that architecture as a function of problem size.
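
Written out (the symbols below are ours, not necessarily the paper's), with S_alg(n) the speedup inherently available in the algorithm at problem size n and S_arch(n) the best speedup any machine of the given architecture can extract, the definition reads

\[
\Psi(n) \;=\; \frac{S_{\mathrm{arch}}(n)}{S_{\mathrm{alg}}(n)}, \qquad 0 \le \Psi(n) \le 1,
\]

so an ideally scalable architecture achieves \(\Psi(n) \to 1\) for the algorithm in question as the problem size grows.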

139 citations


Proceedings ArticleDOI
18 Oct 1991
TL;DR: The environment prototype contains a set of performance data transformation modules that can be interconnected in user-specified ways and allows users to interconnect and configure modules graphically to form an acyclic, directed data analysis graph.
Abstract: As parallel systems expand in size and complexity, the absence of performance tools for these parallel systems exacerbates the already difficult problems of application program and system software performance tuning. Moreover, given the pace of technological change, we can no longer afford to develop ad hoc, one-of-a-kind performance instrumentation software; we need scalable, portable performance analysis tools. We describe an environment prototype based on the lessons learned from two previous generations of performance data analysis software. Our environment prototype contains a set of performance data transformation modules that can be interconnected in user-specified ways. It is the responsibility of the environment infrastructure to hide details of module interconnection and data sharing. The environment is written in C++ with the graphical displays based on X windows and the Motif toolkit. It allows users to interconnect and configure modules graphically to form an acyclic, directed data analysis graph. Performance trace data are represented in a self-documenting stream format that includes internal definitions of data types, sizes, and names. The environment prototype supports the use of head-mounted displays and sonic data presentation in addition to the traditional use of visual techniques.

92 citations


Journal ArticleDOI
TL;DR: Paradigm (parallel distributed global memory), a shared-memory multicomputer architecture that is being developed to show that one can build a large-scale machine using high-performance microprocessors, is discussed and some results to date are summarized.
Abstract: Paradigm (parallel distributed global memory), a shared-memory multicomputer architecture that is being developed to show that one can build a large-scale machine using high-performance microprocessors, is discussed. The Paradigm architecture allows a parallel application program to execute any of its tasks on any processor in the machine, with all the tasks in a single address space. The focus is on novel design techniques that support scalability. The key performance issues are identified, and some results to date from this work and experience with the VMP architecture design on which it is based are summarized. >

90 citations


Proceedings ArticleDOI
13 May 1991
TL;DR: It is argued that multiuser distributed memory multiprocessors with dynamic mapping of the application onto the hardware structure are needed to make available the advantages of this type of architecture to a wider user community.
Abstract: It is argued that multiuser distributed memory multiprocessors with dynamic mapping of the application onto the hardware structure are needed to make available the advantages of this type of architecture to a wider user community. It is shown, based on an abstract model, that such architectures may be used efficiently. It is also shown that future developments in interconnection hardware will allow the fulfillment of the assumptions made in the model. Since a dynamic load balancing procedure will be one of the most important components in the systems software, the elements of its implementation are discussed and first results based on a testbed implementation are presented. >

77 citations


Journal ArticleDOI
TL;DR: In this article, the authors use the isoefficiency metric to analyze the scalability of parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph, and find the classic trade-offs of hardware cost vs scalability and memory vs time to be represented here as tradeoffs of HPCs vs. scalability.
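
For reference, the isoefficiency metric used in this analysis is conventionally stated as follows (general form, in our notation): with problem size W measured as serial work, p processors, parallel run time T_p, and total overhead T_o(W, p) = pT_p - W, the efficiency is

\[
E \;=\; \frac{S}{p} \;=\; \frac{W}{p\,T_p} \;=\; \frac{W}{W + T_o(W, p)},
\]

and holding E fixed requires \(W = \frac{E}{1-E}\,T_o(W, p)\); the asymptotic rate at which W must grow with p to maintain this balance is the isoefficiency function.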

73 citations


Proceedings ArticleDOI
01 Dec 1991
TL;DR: The design choices, architecture and performance evaluation of the current DBS3 prototype are presented, and initial performance experiments for single-user queries are promising and show excellent response times and scalability.
Abstract: DBS3 is a database system with extended relational capabilities designed for a shared-memory multiprocessor. This paper presents the design choices, architecture and performance evaluation of the current DBS3 prototype. The major contributions of DBS3 are: a parallel dataflow execution model based on data declustering, the compile-time optimization of both independent and pipelined parallelism, and the exploitation of shared-memory for efficient concurrent execution. The current DBS3 prototype runs on a shared-memory, commercially available multiprocessor. The initial performance experiments for single-user queries are promising and show excellent response times and scalability. >

71 citations


Proceedings ArticleDOI
01 Sep 1991
TL;DR: This research presents a probabilistic procedure that can be used to estimate the intensity of the response of the immune system to certain types of attacks.
Abstract: Peter B. Danzig, Jongsuk Ahn, John Nell, Katia Obraczka Computer Science Department University of Southern California Los Angeles, California 90089-0782

69 citations


Proceedings ArticleDOI
02 Apr 1991
TL;DR: It is observed that processor pools can be used to provide significant performance improvements as the system size increases, while maintaining the workload composition and intensity, and it is concluded that processor pool-based scheduling may be an effective and efficient technique for scalable systems.
Abstract: Large-scale Non-Uniform Memory Access (NUMA) multiprocessors are gaining increased attention due to their potential for achieving high performance through the replication of relatively simple components. Because of the complexity of such systems, scheduling algorithms for parallel applications are crucial in realizing the performance potential of these systems. In particular, scheduling methods must consider the scale of the system, with the increased likelihood of creating bottlenecks, along with the NUMA characteristics of the system, and the benefits to be gained by placing threads close to their code and data. We propose a class of scheduling algorithms based on processor pools. A processor pool is a software construct for organizing and managing a large number of processors by dividing them into groups called pools. The parallel threads of a job are run in a single processor pool, unless there are performance advantages for a job to span multiple pools. Several jobs may share one pool. Our simulation experiments show that processor pool-based scheduling may effectively reduce the average job response time. The performance improvements attained by using processor pools increase with the average parallelism of the jobs, the load level of the system, the differentials in memory access costs, and the likelihood of having system bottlenecks. As the system size increases, while maintaining the workload composition and intensity, we observed that processor pools can be used to provide significant performance improvements. We therefore conclude that processor pool-based scheduling may be an effective and efficient technique for scalable systems.
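
A schematic of the pool-first placement policy the abstract describes (a simplified sketch: the pool sizes, the spill rule, and names such as Pool and place_job are our assumptions, not the paper's simulator). A job's threads go to the least-loaded pool that can hold them all; only a job whose parallelism exceeds a single pool spans the fewest pools that suffice.

class Pool:
    def __init__(self, pid, n_procs):
        self.pid, self.n_procs, self.threads = pid, n_procs, 0
    def load(self):
        return self.threads / self.n_procs

def place_job(pools, n_threads):
    """Return {pool_id: thread_count} for one job under a pool-first policy."""
    by_load = sorted(pools, key=Pool.load)
    if n_threads <= by_load[0].n_procs:
        chosen = [by_load[0]]                    # the whole job fits in one pool
    else:
        chosen, capacity = [], 0
        for p in by_load:                        # otherwise span as few pools as needed
            chosen.append(p)
            capacity += p.n_procs
            if capacity >= n_threads:
                break
    placement, remaining = {}, n_threads
    for p in chosen:
        share = min(remaining, max(1, n_threads // len(chosen)))
        p.threads += share
        placement[p.pid] = share
        remaining -= share
    if remaining:                                # last pool absorbs any remainder
        p.threads += remaining
        placement[p.pid] += remaining
    return placement

pools = [Pool(i, 8) for i in range(4)]           # 32 processors organized as 4 pools of 8
print(place_job(pools, 6))                       # fits entirely in one pool
print(place_job(pools, 12))                      # must span two pools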

25 Dec 1991
TL;DR: This dissertation describes the properties of image-composition architectures and presents the design of a prototype z-buffer-based system called PixelFlow, which is expected to render 2.5 million triangles per second and 870 thousand antialiased triangles per second in a two-card-cage system.
Abstract: This dissertation describes a new approach for high-speed image generation based on image compositing. Application software distributes the primitives of a graphics database over a homogeneous array of processors called renderers. Each renderer transforms and rasterizes its primitives to form a full-sized image of its portion of the database. A hardware datapath, called an image-composition network, composites the partial images into a single image of the entire scene. Image-composition architectures are linearly scalable to arbitrary performance. This property arises because: (1) renderers compute their subimages independently, and (2) an image-composition network can accommodate an arbitrary number of renderers, with constant bandwidth in each link of the network. Because they are scalable, image-composition architectures promise to achieve much higher performance than existing commercial or experimental systems. They are flexible as well, supporting a variety of primitive types and rendering algorithms. Also, they are efficient, having approximately the same performance/price ratio as the underlying renderers. Antialiasing is a special challenge for image-composition architectures. The compositing method must retain primitive geometry within each pixel to avoid aliasing. Two alternatives are explored in this dissertation: simple z-buffer compositing combined with supersampling, and A-buffer compositing. This dissertation describes the properties of image-composition architectures and presents the design of a prototype z-buffer-based system called PixelFlow. The PixelFlow design, using only proven technology, is expected to render 2.5 million triangles per second and 870 thousand antialiased triangles per second in a two-card-cage system. Additional card cages can be added to achieve nearly linear increases in performance.
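
The z-buffer compositing step at the heart of such an image-composition network is just a per-pixel depth comparison, and it is associative, which is what lets renderers be added without changing per-link bandwidth. A small sketch (plain Python lists standing in for framebuffers; the pixel data are invented):

def composite(img_a, img_b):
    """Merge two partial images; each pixel is (depth, color) and the nearer depth wins."""
    return [a if a[0] <= b[0] else b for a, b in zip(img_a, img_b)]

FAR = (float("inf"), (0, 0, 0))                  # background: infinitely far, black

# Three renderers each produce a full-size partial image of their own primitives.
renderer_0 = [(2.0, (255, 0, 0)), FAR,                FAR]
renderer_1 = [(1.0, (0, 255, 0)), (3.0, (0, 255, 0)), FAR]
renderer_2 = [FAR,                (2.5, (0, 0, 255)), (4.0, (0, 0, 255))]

from functools import reduce
final = reduce(composite, [renderer_0, renderer_1, renderer_2])
print(final)     # [(1.0, green), (2.5, blue), (4.0, blue)]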

Proceedings ArticleDOI
01 Jun 1991
TL;DR: The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures.
Abstract: The scalability of a parallel algorithm on a parallel architecture is a measure of its capability to effectively utilize an increasing number of processors. The scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size it may be used to determine the optimal number of processors to be used and the maximum possible speedup for that problem size. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying the scalability issues, and discuss their interrelationships.

Journal ArticleDOI
25 Mar 1991
TL;DR: Results indicate that for the shallow-water equations, parallel efficiency is generally poor and it is predicted that for complete global climate models, the parallel efficiency will be significantly better; nevertheless, projected teraflop computers will have difficulty achieving acceptable throughput necessary for long-term regional climate studies.
Abstract: This paper investigates the suitability of the spectral transform method for parallel implementation. The spectral transform method is a natural candidate for general circulation models (GCMs) designed to run on large-scale parallel computers due to the large number of existing serial and moderately parallel implementations. Analytic and empirical studies are presented that allow the parallel performance, and hence the scalability, of the spectral transform method to be quantified on different parallel computer architectures. Both the shallow-water equations and complete GCMs are considered. Results indicate that for the shallow-water equations, parallel efficiency is generally poor because of high communication requirements. It is predicted that for complete global climate models, the parallel efficiency will be significantly better; nevertheless, projected teraflop computers will have difficulty achieving acceptable throughput necessary for long-term regional climate studies.

01 May 1991
TL;DR: The purpose of this research is to show how nonshared-memory machines can be programmed at a higher level than is currently possible by developing techniques for compiling shared-memory programs for execution on those architectures.
Abstract: Nonshared-memory parallel computers promise scalable performance for scientific computing needs. Unfortunately, these machines are now difficult to program because the message-passing languages available for them do not reflect the computational models used in designing algorithms. This introduces a semantic gap in the programming process which is difficult for the programmer to fill. The purpose of this research is to show how nonshared-memory machines can be programmed at a higher level than is currently possible. We do this by developing techniques for compiling shared-memory programs for execution on those architectures. The heart of the compilation process is translating references to shared memory into explicit messages between processors. To do this, we first define a formal model for distributing data structures across processor memories. Several abstract results describing the messages needed to execute a program are immediately derived from this formalism. We then develop two distinct forms of analysis to translate these formulas into actual programs. Compile-time analysis is used when enough information is available to the compiler to completely characterize the data sent in the messages. This allows excellent code to be generated for a program. Run-time analysis produces code to examine data references while the program is running. This allows dynamic generation of messages and a correct implementation of the program. While the overhead of the run-time approach is higher than the compile-time approach, run-time analysis is applicable to any program. Performance data from an initial implementation show that both approaches are practical and produce code with acceptable efficiency.
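
The core translation, turning a shared-array reference into either a local access or an explicit message exchange, can be sketched as follows (a hypothetical block distribution and message layer; the dissertation's formal model is more general than this):

BLOCK = 100                           # assumed block size of the distributed array

def owner(global_index):
    return global_index // BLOCK

def local_offset(global_index):
    return global_index % BLOCK

def read_shared(a_local, global_index, my_rank, send, recv):
    """What a compiler might emit for 'x = A[i]' on a nonshared-memory machine."""
    node = owner(global_index)
    if node == my_rank:                            # the reference resolves locally
        return a_local[local_offset(global_index)]
    send(node, ("get", global_index, my_rank))     # otherwise ask the owning processor
    return recv(node)                              # ... and wait for its reply

When the subscript is known well enough at compile time, the owner test and the messages can be resolved and aggregated ahead of execution (the compile-time analysis above); otherwise the same test is emitted inline and evaluated as the program runs (the run-time analysis).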

Journal ArticleDOI
TL;DR: The design of a benchmark is presented, SLALOM(TM), that scales automatically to the computing power available, and corrects several deficiencies in various existing benchmarks: it is highly scalable, it solves a real problem, it includes input and output times, and it can be run on parallel machines of all kinds, using any convenient language.
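
The scaling rule behind a fixed-time benchmark can be sketched in a few lines (a generic harness with a stand-in workload, not SLALOM's actual solver or reporting rules): rather than fixing the problem and measuring the time, fix a time budget and report the largest problem solved within it.

import time

def largest_problem_within(budget_seconds, solve):
    """Grow the problem geometrically until a run exceeds the budget,
    then report the largest size that still fit."""
    n, best = 1, 0
    while True:
        start = time.perf_counter()
        solve(n)
        if time.perf_counter() - start > budget_seconds:
            return best
        best, n = n, n * 2        # a real harness would refine by bisection

work = lambda n: sum(i * j for i in range(n) for j in range(n))   # stand-in O(n^2) solver
print(largest_problem_within(0.5, work))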

Proceedings ArticleDOI
01 Sep 1991
TL;DR: A genetic algorithm for MDAP is developed and the effects of varying the communication cost matrix representing the interprocessor communication topology and the uniformity of the distribution of documents to the clusters are studied.
Abstract: Information retrieval is the selection of documents that are potentially relevant to a user's information need. Given the vast volume of data stored in modern information retrieval systems, searching the document database requires vast computational resources. To meet these computational demands, various researchers have developed parallel information retrieval systems. As efficient exploitation of parallelism demands fast access to the documents, data organization and placement significantly affect the total processing time. We describe and evaluate a data placement strategy for distributed memory, distributed I/O multicomputers. Initially, a formal description of the Multiprocessor Document Allocation Problem (MDAP) and a proof that MDAP is NP Complete are presented. A document allocation algorithm for MDAP based on Genetic Algorithms is developed. This algorithm assumes that the documents are clustered using any one of the many clustering algorithms. We define a cost function for the derived allocation and evaluate the performance of our algorithm using this function. As part of the experimental analysis, the effects of varying the number of documents and their distribution across the clusters as well as the exploitation of various differing architectural interconnection topologies are studied. We also experiment with several parameters common to Genetic Algorithms, e.g., the probability of mutation and the population size. 1.0 Introduction: An efficient multiprocessor information retrieval system must maintain a low system response time and require relatively little storage overhead. As the volume of stored data continues to increase daily, the multiprocessor engines must likewise scale to a large number of processors. This demand for system scalability necessitates a distributed memory architecture, as a large number of processors is not currently possible in a shared-memory configuration. A distributed memory system, however, introduces the problem associated with the proper placement of data onto the given architecture. We refer to this problem as the Multiprocessor Document Allocation Problem (MDAP), a derivative of the Mapping Problem originally described by Bokhari [Bok81]. We assume a clustered document database. A clustered approach is taken since an index file organization can introduce vast storage overhead (up to roughly 300% according to Haskin [Has81]) and a full-text or signature analysis technique results in lengthy search times. In this context, a proper solution to MDAP is any mapping of the documents onto the processors such that the average cluster diameter is kept to a minimum while still providing for an even document distribution across the nodes. To achieve a significant reduction in the total query processing time using parallelism, the allocation of data among the processors should be distributed as evenly as possible and the interprocessor communication among the nodes should be minimized.
Achieving such an allocation is NP Complete. Thus, it is necessary to use heuristics to obtain satisfactory mappings, which may indeed be suboptimal. Genetic Algorithms [DeJ89, Gol89, Gre85, Gre87, Hol87, Rag87] approximate optimal solutions to computationally intractable problems. We develop a genetic algorithm for MDAP and examine the effects of varying the communication cost matrix representing the interprocessor communication topology and the uniformity of the distribution of documents to the clusters. 1.1 Mapping Problem Approximations: As the Mapping Problem and some of its derivatives are NP complete, heuristic algorithms are commonly employed to approximate the optimal solutions. Some of these approaches [Bok81, Bol88, Lee87] deal, in some manner,
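
A compact genetic-algorithm skeleton for this kind of allocation problem (the fitness function, parameters, and toy data below are illustrative, not the authors' operators): a chromosome assigns each document cluster to a processor, the cost combines interprocessor communication between related clusters with a load-imbalance penalty, and truncation selection with one-point crossover and mutation evolves the mapping.

import random
random.seed(0)

N_CLUSTERS, N_PROCS, POP, GENS, P_MUT = 20, 4, 30, 200, 0.05
# related[i][j]: how often clusters i and j are needed by the same query (toy data)
related = [[random.randint(0, 5) for _ in range(N_CLUSTERS)] for _ in range(N_CLUSTERS)]
# comm_cost[a][b]: hop distance between processors a and b on a linear array
comm_cost = [[abs(a - b) for b in range(N_PROCS)] for a in range(N_PROCS)]

def cost(assign):
    comm = sum(related[i][j] * comm_cost[assign[i]][assign[j]]
               for i in range(N_CLUSTERS) for j in range(i + 1, N_CLUSTERS))
    loads = [assign.count(p) for p in range(N_PROCS)]
    return comm + 10 * (max(loads) - min(loads))     # communication + imbalance penalty

def crossover(a, b):
    cut = random.randrange(1, N_CLUSTERS)
    return a[:cut] + b[cut:]

def mutate(a):
    return [random.randrange(N_PROCS) if random.random() < P_MUT else g for g in a]

pop = [[random.randrange(N_PROCS) for _ in range(N_CLUSTERS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=cost)
    parents = pop[:POP // 2]                         # truncation selection
    pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                     for _ in range(POP - len(parents))]
print("best allocation cost:", cost(min(pop, key=cost)))

Swapping comm_cost for a mesh or hypercube distance matrix is how the interconnection-topology experiments described above would be reproduced in this sketch.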

Proceedings ArticleDOI
30 Apr 1991
TL;DR: The paper presents two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyzes their scalability using the isoefficiency metric and presents a different variant of Lang's sort which is asymptotically as scalable as QSP 2 (for the multiple-element-per-processor case).
Abstract: The paper presents two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyzes their scalability using the isoefficiency metric. It shows that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers. The isoefficiency of QSP1 is also fairly close to optimal. Lang et al. (1985) and Schnorr et al. (1986) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have worse performance than these algorithms for the one-element-per-processor case. But QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). As a result, the new parallel formulations are better than these scaled-down variants in terms of speedup w.r.t the best sequential algorithms. The paper also presents a different variant of Lang's sort which is asymptotically as scalable as QSP2 (for the multiple-element-per-processor case). It briefly discusses another metric called 'resource consumption metric'. According to this metric, both QSP1 and QSP2 are strictly superior to Lang's sort and its variations. >
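
The group-splitting pattern that quicksort-based parallel sorts share can be sketched as follows (a generic formulation; QSP1 and QSP2 additionally exploit the mesh topology and cheaper pivot selection, neither of which is modeled here): pick a pivot for the current processor group, let each processor split its local block, move low elements to one half of the group and high elements to the other, and recurse on the two halves independently.

def parallel_quicksort(blocks):
    """blocks[i] is the local data of 'processor' i; returns blocks such that
    every element on processor i is <= every element on processor i+1."""
    if len(blocks) == 1:
        return [sorted(blocks[0])]
    flat = [x for b in blocks for x in b]
    pivot = sorted(flat)[len(flat) // 2]     # a real algorithm samples the pivot cheaply
    half = len(blocks) // 2
    low  = [[x for x in b if x <= pivot] for b in blocks[:half]]
    high = [[x for x in b if x >  pivot] for b in blocks[half:]]
    for i, b in enumerate(blocks[half:]):    # exchange phase: ship low elements left ...
        low[i % half].extend(x for x in b if x <= pivot)
    for i, b in enumerate(blocks[:half]):    # ... and high elements right
        high[i % (len(blocks) - half)].extend(x for x in b if x > pivot)
    return parallel_quicksort(low) + parallel_quicksort(high)

import random
blocks = [[random.randrange(1000) for _ in range(8)] for _ in range(4)]
out = parallel_quicksort(blocks)
assert [x for b in out for x in b] == sorted(x for b in blocks for x in b)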

01 Jun 1991
TL;DR: This work describes an approach for a system that accesses the distributed collection of repositories that naturally maintain resource information, rather than building a global database to register all resources.
Abstract: Large scale computer networks provide access to a bewilderingly large number and variety of resources, including retail products, network services, and people in various capacities. We consider the problem of allowing users to discover the existence of such resources in an administratively decentralized environment. We describe an approach for a system that accesses the distributed collection of repositories that naturally maintain resource information, rather than building a global database to register all resources. A key problem is organizing the resource space in a manner suitable to all participants. Rather than imposing an inflexible hierarchical organization, our approach allows the resource space organization to evolve in accordance with what resources exist and what types of queries users make. Concretely, a set of agents organize and search the resource space by constructing links between the repositories of resource information based on keywords that describe the contents of each repository, and the semantics of the resources being sought. The links form a general graph, with a flexible set of hierarchies embedded within the graph to provide some measure of scalability. The graph structure evolves over time through the use of cache aging protocols. Additional scalability is targeted through the use of probabilistic graph protocols. A prototype implementation and a measurement study are under way.
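
A minimal sketch of the link-and-search structure just described (the Repository class, the keyword sets, and the hop limit are invented; the system's caching, hierarchies, and probabilistic protocols are not modeled): repositories are linked by the keywords that describe them, and a query walks outward from a starting repository following only links whose keywords overlap the query.

from collections import deque

class Repository:
    def __init__(self, name, resources, keywords):
        self.name, self.resources, self.keywords = name, resources, set(keywords)
        self.links = []                              # (neighbor_keywords, neighbor) edges

def link(a, b):
    a.links.append((b.keywords, b))
    b.links.append((a.keywords, a))

def discover(start, query_keywords, max_hops=3):
    """Breadth-first search over the repository graph, pruned by keyword overlap."""
    query, seen, hits = set(query_keywords), {start.name}, []
    frontier = deque([(start, 0)])
    while frontier:
        repo, hops = frontier.popleft()
        hits += [r for r in repo.resources if query & set(r.split())]
        if hops == max_hops:
            continue
        for kw, nbr in repo.links:
            if nbr.name not in seen and query & kw:  # follow only relevant links
                seen.add(nbr.name)
                frontier.append((nbr, hops + 1))
    return hits

printers = Repository("printers", ["color laser printer"], ["printer", "hardware"])
people   = Repository("people", ["jane doe unix admin"], ["people", "staff"])
services = Repository("services", ["printer repair service", "backup service"], ["service", "printer"])
link(printers, services)
link(services, people)
print(discover(printers, ["printer"]))               # finds matches one hop away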

BookDOI
01 Jan 1991
TL;DR: Personality traits and psychological explanation; consistency in personality measurement; scalability and elevation; a single trait measure of scalability; development of the Situation Behaviour Inventory; scalability as measured by the SBI; scalability on standard personality inventories; scalability re-examined, as discussed by the authors.
Abstract: Personality traits and psychological explanation; consistency in personality measurement; scalability and elevation; a single trait measure of scalability; development of the Situation Behaviour Inventory; scalability and elevation as measured by the SBI; scalability on standard personality inventories; scalability re-examined.

Proceedings ArticleDOI
01 Jun 1991
TL;DR: A parallel implementation of a particle simulation method that is portable between a wide class of multiprocessor computers is presented, where a fine grain spatial decomposition is utilized where several subdomains having a regular structure are computed at each processing node.
Abstract: A parallel implementation of a particle simulation method that is portable between a wide class of multiprocessor computers is presented. A fine grain spatial decomposition is utilized where several subdomains having a regular structure are computed at each processing node. This leads directly to an efficient and straightforward load balancing scheme if the number of subdomains at each processor is permitted to vary in an appropriate manner. Three dimensional simulations incorporating full thermochemical nonequilibrium are possible using the resulting code. Vectorizable algorithms are retained from earlier work allowing efficient use of deeply pipelined node processors where available. Performance results are presented from three different machine architectures demonstrating the portability of the code. On a 128-node Intel iPSC/860, performance is twice that of a single Cray-Y/MP CPU running a highly vectorized simulation code. Speedup is linear over the full range of number of processors on all target machines, indicating scalability of the method to higher degrees of parallelism.
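
The load-balancing idea, letting the number of regular subdomains assigned to each processor vary with the work they contain, can be illustrated with a simple greedy reassignment (the particle counts and the heap-based heuristic are ours, not the paper's scheme):

import heapq

def balance(subdomain_particles, n_procs):
    """Assign many small subdomains to processors so particle counts even out:
    heaviest subdomain first, always onto the currently lightest processor."""
    heap = [(0, p, []) for p in range(n_procs)]      # (load, processor, subdomain ids)
    heapq.heapify(heap)
    for sd, count in sorted(enumerate(subdomain_particles), key=lambda x: -x[1]):
        load, p, owned = heapq.heappop(heap)
        owned.append(sd)
        heapq.heappush(heap, (load + count, p, owned))
    return sorted(heap, key=lambda x: x[1])

# 16 subdomains with uneven particle counts, spread over 4 processing nodes
counts = [10, 200, 30, 45, 5, 160, 80, 75, 20, 20, 90, 15, 60, 110, 40, 25]
for load, proc, subs in balance(counts, 4):
    print("proc %d: %4d particles in subdomains %s" % (proc, load, subs))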

Book ChapterDOI
01 Apr 1991
TL;DR: A locally adaptive enhancement to the basic deterministic routing mechanism is proposed and its impact on the network performance is proved by simulation results.
Abstract: Dynamically switched, sparse interconnection networks are the major element of scalable general purpose parallel computers. This paper presents different network structures and corresponding routing schemes based on a universal routing device. A locally adaptive enhancement to the basic deterministic routing mechanism is proposed and its impact on the network performance is proved by simulation results. This is done comparatively for different network topologies and with respect to network throughput and size scalability. Finally, the extension of the results to very large networks is demonstrated by analytical methods.
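
What "locally adaptive" adds to a deterministic scheme can be shown with 2-D mesh dimension-order routing as the baseline (the paper's universal routing device and its topologies are not modeled; coordinates, direction names, and queue lengths are invented): when both the X and Y directions are minimal, the router picks the output whose local queue is shorter, and otherwise it behaves exactly like the deterministic router.

def next_hop(cur, dst, queue_len):
    """cur, dst: (x, y) node coordinates; queue_len: local output-queue length per direction."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    x_dir = "E" if dx > 0 else "W"
    y_dir = "N" if dy > 0 else "S"
    if dx != 0 and dy != 0:                            # two minimal directions: adapt locally
        return x_dir if queue_len[x_dir] <= queue_len[y_dir] else y_dir
    if dx != 0:                                        # one profitable direction left:
        return x_dir                                   # fall back to deterministic routing
    if dy != 0:
        return y_dir
    return "DELIVER"

# A congested eastbound link makes this router detour through the Y dimension first.
print(next_hop((0, 0), (3, 2), {"E": 7, "W": 0, "N": 1, "S": 0}))   # -> 'N'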

10 Dec 1991
TL;DR: The notion of data distribution independence is embedded into the underlying methodology and practical software, which provides an excellent basis for building and then tuning the performance of whole applications, rather than single computational steps.
Abstract: In this paper, we consider what is required to develop parallel algorithms for engineering applications on message-passing concurrent computers. At Caltech, the first author studied the concurrent dynamic simulation of distillation column networks. This research was accomplished with attention to portability, high performance and reusability of the underlying algorithms. Emerging from this work are several key results: first, a methodology for explicit parallelization of algorithms and for the evaluation of parallel algorithms in the distributed-memory context; second, a set of portable, reusable numerical algorithms constituting a "Multicomputer Toolbox," suitable for use on both existing and future medium-grain concurrent computers; third, a working prototype simulation system, Cdyn, for distillation problems, that can be enhanced to address more complex flowsheeting problems in chemical engineering; fourth, ideas for how to achieve higher performance with Cdyn, using iterative methods for the underlying linear algebra. Of these, the chief emphasis in the present paper is the reusable collection of parallel libraries comprising the Toolbox. Concurrent algorithms for the solution of dense and sparse linear systems, and ordinary differential-algebraic equations were developed as part of this work. Importantly, we have embedded the notion of data distribution independence--support for algorithmic correctness independent of data locality choices--into the underlying methodology and practical software. This together with carefully designed message passing primitives, and concurrent data structures encapsulating data layout of matrices and vectors, provides an excellent basis for building and then tuning the performance of whole applications, rather than single computational steps.

Proceedings ArticleDOI
R.V. Iyer, S. Ghosh
13 Oct 1991
TL;DR: Experimental results indicate DARYN's feasibility, significant superiority over the traditional approach, and that it exhibits, in general, the notion of performance scalability.
Abstract: In DARYN, the decision process for every train is executed by an on-board process that negotiates for temporary ownership of the tracks with the respective station controlling the tracks, through explicit processor to processor communication primitives. This processor then computes its own route utilizing the results of its negotiation, its knowledge of the track layout of the entire system, and its evaluation of the cost function. Every station's decision process is also executed by a dedicated processor that maintains absolute control over a given set of tracks and participates in the negotiation with the trains. Since the computational responsibility is distributed over all the logical entities of the system, DARYN offers the potential of superior performance over the traditional uniprocessor approach. The development of a realistic model of a railway network based on the DARYN approach and implementation on a loosely coupled parallel processor system are reported. Experimental results indicate DARYN's feasibility, significant superiority over the traditional approach, and that it exhibits, in general, the notion of performance scalability. >

Proceedings ArticleDOI
01 Dec 1991
TL;DR: A hybrid architecture in which SE clusters are interconnected through a communication network to form a SN structure at the inter-cluster level is presented, and a generalized performance model was developed to perform sensitivity analysis for the hybrid structure.
Abstract: The most debated architectures for parallel database processing are Shared Nothing (SN) and Shared Everything (SE) structures. Although SN is considered to be most scalable, it is very sensitive to the data skew problem. On the other hand, SE allows the collaborating processors to share the work load more efficiently. It, however, suffers from the limitation of the memory and disk I/O band-width. The authors present a hybrid architecture in which SE clusters are interconnected through a communication network to form a SN structure at the inter-cluster level. In this approach, processing elements are clustered into SE systems to minimize the skew effect. Each cluster, however, is kept small within the limitation of the memory and I/O technology to avoid the data access bottleneck. A generalized performance model was developed to perform sensitivity analysis for the hybrid structure, and to compare it against SE and SN organizations. >

Journal ArticleDOI
TL;DR: A radical technology for databases, which implements a relational model for network services and scales to support throughput of thousands of transactions per second is proposed, and a set of data manipulation primitives useful in describing the logic network services is described.
Abstract: A radical technology for databases, called the Datacycle architecture, which implements a relational model for network services and scales to support throughput of thousands of transactions per second is proposed. A set of data manipulation primitives useful in describing the logic network services is described. The use of the relational model together with an extended structured-query-language-like query language to describe 800 service, network automatic call distribution, and directory-assisted call completion services, is examined. The architectural constraints on the scalability of traditional database systems is reviewed, and an alternative, the Datacycle architecture is presented. The Datacycle approach exploits the bandwidth of fiber optics to circulate database contents among processing nodes (e.g. switching offices or other network elements) in a network, providing highly flexible access to data and controlling the administrative and processing overhead of coordinating changes to database contents. A prototype system operating in the laboratory is described. The feasibility of the Datacycle approach for both existing and future applications is considered. >
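
The access model, with the database contents cycled continuously past processing nodes that filter them on the fly, can be caricatured in a few lines (record layout, predicate, and function names are invented; the real system does this filtering in hardware against a fiber-optic broadcast):

def datacycle_pump(relation, cycles=1):
    """Re-broadcast the whole relation over and over, as the pump does."""
    for _ in range(cycles):
        for record in relation:
            yield record

def select_on_the_fly(stream, predicate):
    """An access manager watches one full cycle and keeps the matching records."""
    return [rec for rec in stream if predicate(rec)]

subscribers = [("800-555-0100", "acme", "NJ"),
               ("800-555-0101", "zenco", "CA"),
               ("800-555-0102", "acme", "CA")]
print(select_on_the_fly(datacycle_pump(subscribers), lambda r: r[1] == "acme"))

Because every query is evaluated against the full circulating stream, adding query processors scales read access without building new access paths; changes to the contents are coordinated separately, which is the "coordinating changes" overhead the abstract mentions.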

Proceedings ArticleDOI
J. Wilkes
07 Oct 1991
TL;DR: The initial target of the DataMesh project is storage servers for groups of high-powered workstations, although the techniques are applicable to several different problems.
Abstract: A description is given of DataMesh, a research project with the goal of developing software to maximize the performance of mass storage I/O, while providing high availability, ease of use, and scalability. The DataMesh hardware architecture is that of an array of disk nodes, with each disk having a dedicated 20-MIPS single-chip processor and 8-32 MBytes RAM. The nodes are linked by a fast, reliable, small-area network, and programmed to appear as a single storage server. Phase 1 of the DataMesh project will provide smart disk arrays; phase 2 will expand this to include file systems; and phase 3 will support parallel databases, data searches, and other application-specific functions. The initial target of the work is storage servers for groups of high-powered workstations, although the techniques are applicable to several different problems. >

Proceedings ArticleDOI
D.M. Choy, P.G. Selinger
01 Dec 1991
TL;DR: A scalable catalog framework is described, which is an extension of previous work in a distributed relational DBMS research prototype called R*.
Abstract: To support a distributed, heterogeneous computing environment, an inter-system catalog protocol is needed so that remote resources can be located, used, and maintained with little human intervention. This paper describes a scalable catalog framework, which is an extension of previous work in a distributed relational DBMS research prototype called R*. This work builds on the R* concepts to accommodate heterogeneity, to handle partitioned and replicated data, to support non-DBMS resource managers, and to enhance catalog access performance and system extensibility. >

01 Jan 1991
TL;DR: The principal conclusion is that contention due to synchronisation need not be a problem for large-scale shared-memory multiprocessors, and the existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures.
Abstract: Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates O(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates O(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes. We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called "dance hall" architectures, in which shared memory locations are equally far from all processors.
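
To complement the lock sketch earlier in this listing, a one-shot dissemination-style barrier shows the same local-spinning discipline applied to barriers: each thread waits only on its own flags and performs one remote write per round to a partner (a simplified sketch; reusable barriers add sense reversal or per-episode parity, and Python Events replace the actual spinning).

import threading

P = 8
ROUNDS = (P - 1).bit_length()                      # ceil(log2 P) rounds
flags = [[threading.Event() for _ in range(ROUNDS)] for _ in range(P)]

def barrier(i):
    """One-shot barrier for thread i of P: O(log P) rounds, local waiting only."""
    for k in range(ROUNDS):
        partner = (i + (1 << k)) % P
        flags[partner][k].set()                    # single remote write per round
        flags[i][k].wait()                         # wait only on my own flag

def worker(i, log):
    log.append(("before", i))
    barrier(i)
    log.append(("after", i))

log = []
threads = [threading.Thread(target=worker, args=(i, log)) for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(tag == "before" for tag, _ in log[:P])  # nobody left the barrier early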

Proceedings ArticleDOI
M. Schwarz, Bedrich Hosticka, M. Kesper, P. Richert, M. Scholles
18 Nov 1991
TL;DR: A scalable MIMD computer system which was designed to be used as neurocomputer, capable of emulating different types of neurons, including complex biologically motivated models based on activity pulses, variable pulse transmission times, and multiple threshold learning rules is presented.
Abstract: The authors present a scalable MIMD computer system which was designed to be used as neurocomputer. It is capable of emulating different types of neurons, including complex biologically motivated models based on activity pulses, variable pulse transmission times, and multiple threshold learning rules. It is constructed as an array consisting of nodal computer chips, each containing an on-chip communication processor to realize a full global communication. Hence, not only neural networks featuring arbitrary topologies can be built, but also a wide range of nonneural processing applications can be implemented. As an example, the authors show how to use the system in solving optimization problems using genetic algorithms, and how to program it for real-time image processing using a combination of neural nets, genetic algorithms, and classical image processing techniques. >