Showing papers on "Scalability published in 1993"

PDF

Open Access

Proceedings Article•DOI•

Global optimizations for parallelism and locality on scalable parallel machines

[...]

01 Jun 1993

TL;DR: A compiler algorithm that automatically finds computation and data decompositions that optimize both parallelism and locality that is designed for use with both distributed and shared address space machines.

...read moreread less

Abstract: Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key issue in compiling programs for these architectures. This paper describes a compiler algorithm that automatically finds computation and data decompositions that optimize both parallelism and locality. This algorithm is designed for use with both distributed and shared address space machines. The scope of our algorithm is dense matrix computations where the array accesses are affine functions of the loop indices. Our algorithm can handle programs with general nestings of parallel and sequential loops. We present a mathematical framework that enables us to systematically derive the decompositions. Our algorithm can exploit parallelism in both fully parallelizable loops as well as loops that require explicit synchronization. The algorithm will trade off extra degrees of parallelism to eliminate communication. If communication is needed, the algorithm will try to introduce the least expensive forms of communication into those parts of the program that are least frequently executed.

...read moreread less

388 citations

Journal Article•DOI•

Utopia: a load sharing facility for large, heterogeneous distributed computer systems

[...]

Songnian Zhou¹, Xiaohu Zheng¹, Jingwen Wang¹, Pierre Delisle¹•Institutions (1)

Systems Research Institute¹

01 Dec 1993-Software - Practice and Experience

TL;DR: The design and implementation issues in Utopia, a load sharing facility specifically built for large and heterogeneous systems, are discussed, which has no restriction on the types of tasks that can be remotely executed, involves few application changes and no operating system change, and incurs low overhead.

...read moreread less

Abstract: Load sharing in large, heterogeneous distributed systems allows users to access vast amounts of computing resources scattered around the system and may provide substantial performance improvements to applications. We discuss the design and implementation issues in Utopia, a load sharing facility specifically built for large and heterogeneous systems. The system has no restriction on the types of tasks that can be remotely executed, involves few application changes and no operating system change, supports a high degree of transparency for remote task execution, and incurs low overhead. The algorithms for managing resource load information and task placement take advantage of the clustering nature of large-scale distributed systems; centralized algorithms are used within host clusters, and directed graph algorithms are used among the clusters to make Utopia scalable to thousands of hosts. Task placements in Utopia exploit the heterogeneous hosts and consider varying resource demands of the tasks. A range of mechanisms for remote execution is available in Utopia that provides varying degrees of transparency and efficiency. A number of applications have been developed for Utopia, ranging from a load sharing command interpreter, to parallel and distributed applications, to a distributed batch facility. For example, an enhanced Unix command interpreter allows arbitrary commands and user jobs to be executed remotely, and a parallel make facility achieves speed-ups of 15 or more by processing a collection of tasks in parallel on a number of hosts.

...read moreread less

380 citations

Journal Article•DOI•

Isoefficiency: measuring the scalability of parallel algorithms and architectures

[...]

Ananth Grama¹, Anshul Gupta¹, Vipin Kumar¹•Institutions (1)

University of Minnesota¹

01 Aug 1993-IEEE Parallel & Distributed Technology: Systems & Applications

TL;DR: Isoefficiency analysis helps to determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.

...read moreread less

Abstract: Isoefficiency analysis helps us determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions. >

...read moreread less

329 citations

Proceedings Article•DOI•

Scalable performance analysis: the Pablo performance analysis environment

[...]

Daniel A. Reed¹, P.C. Roth¹, Ruth A. Aydt¹, K.A. Shields¹, L.F. Tavera¹, R.J. Noe¹, B.W. Schwartz¹ - Show less +3 more•Institutions (1)

University of Illinois at Urbana–Champaign¹

06 Oct 1993

TL;DR: Pablo is a performance analysis environment designed to provide unobtrusive performance data capture, analysis, and presentation across a wide variety of scalable parallel systems.

...read moreread less

Abstract: Developers of application codes for massively parallel computer systems face daunting performance tuning and optimization problems that must be solved if massively parallel systems are to fulfill their promise. Recording and analyzing the dynamics of application program, system software, and hardware interactions is the key to understanding and the prerequisite to performance tuning, but this instrumentation and analysis must not unduly perturb program execution. Pablo is a performance analysis environment designed to provide unobtrusive performance data capture, analysis, and presentation across a wide variety of scalable parallel systems. Current efforts include dynamic statistical clustering to reduce the volume of data that must be captured and complete performance data immersion via head-mounted displays. >

...read moreread less

299 citations

Journal Article•DOI•

Scalable problems and memory-bounded speedup

[...]

Xian-He Sun¹, Lionel M. Ni¹•Institutions (1)

Michigan State University¹

01 Sep 1993-Journal of Parallel and Distributed Computing

TL;DR: The simplified memory-bounded speedup contains both Amdahl′s law and Gustafson′s scaled speedup as special cases and leads to a better understanding of parallel processing.

...read moreread less

176 citations

Journal Article•DOI•

The scalability of FFT on parallel computers

[...]

Anshul Gupta¹, Vipin Kumar¹•Institutions (1)

University of Minnesota¹

01 Aug 1993-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric and show that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh.

...read moreread less

Abstract: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture. >

...read moreread less

139 citations

Journal Article•DOI•

The CM-5 Connection Machine: a scalable supercomputer

[...]

W. Daniel Hillis, Lewis W. Tucker

01 Nov 1993-Communications of The ACM

TL;DR: The CM-5 Connection Machine is a scalable homogeneous multiprocessor designed for large-scale scientific and business applications and it is believed that architectures of this type will replace most other forms of supercomputing in the foreseeable future.

...read moreread less

Abstract: The CM-5 Connection Machine is a scalable homogeneous multiprocessor designed for large-scale scientific and business applications. In this article we describe its architecture and implementation from the standpoint of the programmer or user of parallel machines. In particular, we emphasize three features of the Connection Machine architecture: scalability, distributed memory/global addressing, and distributed execution/global synchronization. We believe that architectures of this type will replace most other forms of supercomputing in the foreseeable future. Examples of the current applications of the machine are included, focusing particularly on the machine's ability to support a variety of programming models

...read moreread less

116 citations

Proceedings Article•DOI•

The local Time Warp approach to parallel simulation

[...]

Hassan Rajaei, Rassul Ayani, Lars-Erik Thorelli

01 Jul 1993

TL;DR: The Local Time Warp method for parallel discrete-event simulation is proposed and a novel synchronization scheme for it called HCTW is presented, which hierarchically combines a Conservative Time Window algorithm with Time Warp and aims at reducing cascade rollbacks, sensitivity to lookahead, and the scalability problems.

...read moreread less

Abstract: The two main approaches to parallel discrete event simulation – conservative and optimistic – are likely to encounter some limitations when the size and complexity of the simulation system increases. For such large scale simulations, the conservative approach appears to be limited by blocking overhead and sensitivity to lookahead, whereas the optimistic approach may become prone to cascading rollbacks, state saving overhead, and demands for larger memory space. These drawbacks restrict the synchronization schemes based on each of the two approaches from scaling up. A combined approach may resolve these limitations, while preserving and utilizing potential advantages of each method. However, the schemes proposed so far integrate the two views at the same level, i.e. local to a logical process, and hence may not be able to fully solve the problems. In this paper we propose the Local Time Warp method for parallel discrete-event simulation and present a novel synchronization scheme for it called HCTW. The new scheme hierarchically combines a Conservative Time Window algorithm with Time Warp and aims at reducing cascade rollbacks, sensitivity to lookahead, and the scalability problems. Local Time Warp is believed to be suitable for parallel machines equipped with thousands of processors and thus an appropriate candidate for simulation of large and complex systems.

...read moreread less

83 citations

Journal Article•DOI•

Efficient parallel computing in distributed workstation environments

[...]

Clemens H. Cap¹, Volker Strumpen¹•Institutions (1)

University of Zurich¹

01 Nov 1993

TL;DR: It is shown that heterogeneous partitioning with respect to the load situation at startup and dynamic load balancing throughout the entire computation are essential techniques for obtaining high efficiency with the hypercomputer approach.

...read moreread less

Abstract: The typical workstation in a LAN is idle for long periods of time. Within the concept of a hypercomputer this free, distributed computing power can be placed at the disposal of the user. The main problem with this approach is the permanently changing load situation in the network. We show that heterogeneous partitioning with respect to the load situation at startup and dynamic load balancing throughout the entire computation are essential techniques for obtaining high efficiency with the hypercomputer approach. We describe a parallel programming platform called THE PARFORM, which supports these two features and therefore proves faster than related approaches. Performance measurements and a scalability model for an explicit finite difference solver of a partial differential equation conclude the paper.

...read moreread less

76 citations

Proceedings Article•DOI•

A reconfigurable data-driven multiprocessor architecture for rapid prototyping of high throughput DSP algorithms

[...]

A.K.W. Yeung¹, Jan M. Rabaey¹•Institutions (1)

University of California, Berkeley¹

05 Jan 1993

TL;DR: A data-driven multiprocessor architecture for the rapid prototyping of complex DSP algorithms, based on direct execution of data-flow graphs, is presented, which confirms the performance efficiency and generality of the architecture.

...read moreread less

Abstract: A data-driven multiprocessor architecture for the rapid prototyping of complex DSP algorithms, based on direct execution of data-flow graphs, is presented. High computation bandwidth is achieved by exploiting the fine-grain parallelism inherent in the target algorithms using simple processing elements called nanoprocessors interconnected by a configurable static communication network. The use of distributed control and the data-driven execution approach resulted in a highly scalable and modular architecture. A prototype chip, which is currently being designed, contains 64 nanoprocessors, 1 kByte of memory in four banks and eight 16-bit I/O ports, and provides 3.2 GOPS peak when running at 50 MHz. The benchmark results, based on a variety of DSP algorithms in video processing, digital communication, digital filtering and speech recognition, confirm the performance efficiency and generality of the architecture. >

...read moreread less

74 citations

Proceedings Article•DOI•

Lazy updates for distributed search structure

[...]

Theodore Johnson¹, Padmashree Krishna¹•Institutions (1)

University of Florida¹

01 Jun 1993

TL;DR: This paper presents an approach to maintaining distributed data structures that uses lazy updates, which take advantage of the semantics of the search structure operations to allow for scalable and low-overhead replication.

...read moreread less

Abstract: Very large database systems require distributed storage, which means that they need distributed search structures for fast and efficient access to the data In this paper, we present an approach to maintaining distributed data structures that uses lazy updates, which take advantage of the semantics of the search structure operations to allow for scalable and low-overhead replication Lazy updates can be used to design distributed search structures that support very high levels of concurrency The alternatives to lazy update algorithms (eager updates) use synchronization to ensure consistency, while lazy update algorithms avoid blocking Since lazy updates avoid the use of synchronization, they are much easier to implement than eager update algorithms We demonstrate the application of lazy updates to the dB-tree, which is a distributed B+tree that replicates its interior nodes for highly parallel access We develop a correctness theory for lazy updates so that our algorithms can be applied to other distributed search structures

...read moreread less

Journal Article•DOI•

Hierarchical scalable photonic architectures for high-performance processor interconnection

[...]

Patrick W. Dowd¹, Kalyani Bogineni, K.A. Aly, J.A. Perreault•Institutions (1)

State University of New York System¹

01 Sep 1993-IEEE Transactions on Computers

TL;DR: Introduces two hierarchical optical structures for processor interconnection and compares their performance through analytic models and discrete-event simulation, based on wavelength division multiplexing (WDM).

...read moreread less

Abstract: Introduces two hierarchical optical structures for processor interconnection and compares their performance through analytic models and discrete-event simulation. Both architectures are based on wavelength division multiplexing (WDM) which enables multiple multi-access channels to be realized on a single optical fiber. The objective of the hierarchical architectures is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, both hierarchical architectures are single-hop: a packet remains in the optical form from source to destination and does not require cross dimensional intermediate routing. >

...read moreread less

Book•

Networks, Routers and Transputers: Function, Performance and Applications

[...]

M. D. May, Peter Thompson, P. H. Welch

01 Jan 1993

TL;DR: A generic architecture of ATM systems: an introduction to asynchronous transfer mode ATM systems mapping ATM onto DS-links, and the need for standard inferfaces outline of a multimedia architecture levels of conformance.

...read moreread less

Abstract: Part 1 Transputers and routers - components for concurrent machines: transputers routers message routing addressing universal routing. Part 2 The T9000 communications architecture: the IMS T9000 instruction set basics and processes implementation of communications alternative input shared channels and resources use of resources. Part 3 DS-links and C104 routers: using links between devices levels of link protocol channel communication errors on links network communications - the IMS C104. Part 4 Connecting DS-links: signal properties of transputer links PCB connections cable connections error rates optical interconnections standards. Part 5 Using links for system control: control networks system initialization debugging errors embedded applications control system commands. Part 6 Models of DS-link performance: performance of the DS-link protocol bandwidth effects of latency a model of contention in a single C104. Part 7 Performance of C104 networks: the C104 switch networks and routing algorithms the networks investigated the traffic patterns universal routing results performance predictability. Part 8 General purpose parallel computers: universal message passing machines networks for universal message passing machines building universal parallel computers from T9000s and C104s. Part 9 The implementation of large parallel database machines on T9000 and C104 networks: database machines review of the T8 design an interconnection strategy data storage interconnection strategy relational processing referential integrity processing concurrency management complex data types recovery resource allocation and scalability. Part 10 A generic architecture of ATM systems: an introduction to asynchronous transfer mode ATM systems mapping ATM onto DS-links. Part 11 An enabling infrastructure for a distributed multimedia industry: network requirements for multimedia integration and scaling directions in networking technology convergence of applications, communications and parallel processing a multimedia industry - the need for standard inferfaces outline of a multimedia architecture levels of conformance building stations from components mapping the architecture onto transputer technology. Appendices: new link cable connector link waveforms DS-link electrical specification an equivalent circuit for DS-link output pads.

...read moreread less

Journal Article•DOI•

Depth-first heuristic search on a SIMD machine

[...]

Curt Powley¹, Chris Ferguson¹, Richard E. Korf¹•Institutions (1)

University of California, Los Angeles¹

01 Apr 1993-Artificial Intelligence

TL;DR: This work presents a parallel implementation of Iterative-Deepening-A, a depth-first heuristic search, on the single-instruction, multiple-data (SIMD) Connection Machine, and indicates that work only needs to increase as P log P to maintain constant efficiency.

...read moreread less

Book Chapter•DOI•

Scalability of Sparse Direct Solvers

[...]

Robert Schreiber¹•Institutions (1)

Ames Research Center¹

01 Jan 1993

TL;DR: It is shown that the column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable and by considering message volume, node contention, and bisection width, one may obtain lower bounds on the time required for communication in a distributed algorithm.

...read moreread less

Abstract: We shall say that a scalable algorithm achieves efficiency that is bounded away from zero as the number of processors and the problem size increase in such a way that the size of the data structures increases linearly with the number of processors In this paper we show that the column-oriented approach to sparse Cholesky for distributed-memory machines is not scalable By considering message volume, node contention, and bisection width, one may obtain lower bounds on the time required for communication in a distributed algorithm Applying this technique to distributed, column-oriented, dense Cholesky leads to the conclusion that N (the order of the matrix) must scale with P (the number of processors) so that storage grows like P 2 So the algorithm is not scalable Identical conclusions have previously been obtained by consideration of communication and computation latency on the critical path in the algorithm; these results complement and reinforce that conclusion

...read moreread less

Proceedings Article•DOI•

The architecture and implementation of a distributed hypermedia storage system

[...]

Douglas E. Shackelford¹, John B. Smith¹, F. Donelson Smith•Institutions (1)

University of North Carolina at Chapel Hill¹

01 Dec 1993

TL;DR: The architecture and implementation of the Distributed Graph Storage (DGS) component of ABC, a graph-based data model, conservatively extended to meet hypermedia requirements, is focused on.

...read moreread less

Abstract: Our project is studying the process by which groups of individuals work together to build large, complex structures of ideas and is developing a distributed hypermedia collaboration environment (called ABC) to support that process. This paper focuses on the architecture and implementation of the Distributed Graph Storage (DGS) component of ABC. The DGS supports a graph-based data model, conservatively extended to meet hypermedia requirements. Some important issues addressed in the system include scale, performance, concurrency semantics, access protection, location independence, and replication (for fault tolerance).

...read moreread less

Exploiting the memory hierarchy in sequential and parallel sparse Cholesky factorization

[...]

Edward Eric Rothberg

01 Jan 1993

TL;DR: This thesis provides the first thorough analysis of the interaction between sequential sparse Cholesky factorization methods and memory hierarchies, and shows that panel methods are inappropriate for large-scale parallel machines because they do not expose enough concurrency.

...read moreread less

Abstract: Cholesky factorization of large sparse matrices is an extremely important computation, arising in a wide range of domains including linear programming, finite element analysis, and circuit simulation. This thesis investigates crucial issues for obtaining high performance for this computation on sequential and parallel machines with hierarchical memory systems. The thesis begins by providing the first thorough analysis of the interaction between sequential sparse Cholesky factorization methods and memory hierarchies. We look at popular existing methods and find that they produce relatively poor memory hierarchy performance. The methods are extended, using blocking techniques, to reuse data in the fast levels of the memory hierarchy. This increased reuse is shown to provide a three-fold speedup over popular existing approaches (e.g., SPARSPAK) on modern workstations. The thesis then considers the use of blocking techniques in parallel sparse factorization. We first describe parallel methods we have developed that are natural extensions of the sequential approach described above. These methods distribute panels (sets of contiguous columns with nearly identical non-zero structures) among the processors. The thesis shows that for small parallel machines, the resulting methods again produce substantial performance improvements over existing methods. A framework is provided for understanding the performance of these methods, and also for understanding the limitations inherent in them. Using this framework, the thesis shows that panel methods are inappropriate for large-scale parallel machines because they do not expose enough concurrency. The thesis then considers rectangular block methods, where the sparse matrix is split both vertically and horizontally. These methods address the concurrency problems of panel methods, but they also introduce a number of complications. Primary among these are issues of choosing blocks that can be manipulated efficiently and structuring a parallel computation in terms of these blocks. The thesis describes solutions to these problems and presents performance results from an efficient block method implementation. The contributions of this work come both from its theoretical foundation for understanding the factors that limit the scalability of panel- and block-oriented methods on hierarchical memory multiprocessors, and from its investigation of practical issues related to the implementation of efficient parallel factorization methods.

...read moreread less

Linda on distributed memory multiprocessors

[...]

Robert D. Bjornson

01 Jan 1993

TL;DR: This dissertation shows that Linda can be made efficient on scalable distributed-memory multiprocessors, and presents a design for implementing Linda's Tuple Space on such machines, and claims that the design results in an efficient implementation.

...read moreread less

Abstract: The coordination language Linda (Gel85) is a convenient and powerful model for parallel and distributed computing. By permitting the expression of parallel algorithms in an architecture-independent manner, it supports truly portable programming (BCGL88). In order to be practical, Linda must be efficient. Previous work (Car87), (Lei89) demonstrated that Linda can be made efficient on shared memory multiprocessors, bus-based distributed-memory multiprocessors, and local area networks. In this dissertation, we show that Linda can be made efficient on scalable distributed-memory multiprocessors. We present a design for implementing Linda's Tuple Space on such machines, and claim that the design results in an efficient implementation. As an existence proof, we characterize the performance of the design as implemented on an Intel iPSC/2 hyper-cube. We go on to describe a number of runtime optimizations, and investigate their effect on the performance of several synthetic programs that stress communication, as well as eight non-trivial applications. Finally, we discuss the extensibility of the design to massively parallel machines.

...read moreread less

Caching and memory management in client-server database systems

[...]

Michael J. Franklin¹•Institutions (1)

University of Wisconsin-Madison¹

11 Jan 1993

TL;DR: The study shows that significant performance and scalability gains can be obtained through client caching, and enables the quantification of the design tradeoffs that are identified in the taxonomy.

...read moreread less

Abstract: The widespread adoption of client-server architectures has made distributed computing the conventional mode of operation for many application domains. At the same time, new classes of applications have placed additional demands on database systems, resulting in an emerging generation of Object-Oriented Database Management Systems (OODBMSs). The combination of these factors gives rise to significant challenges and performance opportunities in the design of modern database systems. This thesis proposes and investigates techniques to provide high performance and scalability for these new systems, while maintaining the transaction semantics, reliability, and availability associated with more traditional database architectures. The common theme of the techniques developed here is the utilization of client resources through caching-based data replication. The initial chapters describe the architectural alternatives for client-server database systems, present the arguments for using caching as the basis for constructing page server database systems, and provide an overview of other environments in which caching-related issues arise. The bulk of the thesis is then focused on the development and simulation-based performance analysis of algorithms for data caching and memory management. A taxonomy of transactional cache consistency algorithms is developed, which includes the algorithms proposed in this thesis as well as others that have appeared in the literature. A performance study of seven proposed algorithms is then presented. The study shows that significant performance and scalability gains can be obtained through client caching, and enables the quantification of the design tradeoffs that are identified in the taxonomy. The remainder of the thesis extends the caching-based techniques to further improve system performance and scalability. The first extension is the investigation of algorithms to efficiently manage the "global memory hierarchy" that results from allowing client page requests to be satisfied from the caches of other clients, thus avoiding disk accesses at the server. The second extension investigates algorithms for using local client disks to augment client memory caches. Both extensions are shown to be simple and effective way to reduce dependence on server disk and cpu resources.

...read moreread less

Journal Article•DOI•

A GRG2-Based System for Training Neural Networks: Design and Computational Experience

[...]

Venkat Subramanian, Ming S. Hung

01 Nov 1993-Informs Journal on Computing

TL;DR: Comparisons with back-propagation based upon three benchmark problems suggest not only that the GRG2-based system is much faster, more robust and offers solutions with better quality, but also offers better scalability to larger problems.

...read moreread less

Abstract: Artificial neural networks is a very active area of research in artificial intelligence. They are better than existing methods for many problems in pattern recognition and pattern matching. They are trained through examples rather than being programmed. Training neural networks is a nonlinear minimization problem. The currently popular algorithm for training neural networks is called back-propagation, a form of steepest descent technique. This paper presents a new training system based on GRG2, a widely distributed nonlinear optimization software. Comparisons with back-propagation based upon three benchmark problems suggest not only that the GRG2-based system is much faster, more robust and offers solutions with better quality, but also offers better scalability to larger problems. This paper is thus a clear example of the kinds of contributions that optimization theory can make to artificial intelligence. INFORMS Journal on Computing, ISSN 1091-9856, was published as ORSA Journal on Computing from 1989 t...

...read moreread less

Book•

Advanced computer architecture : parallelism, scalability, programmability /edited by Kai Hwang

[...]

Kai Hwang

01 Jan 1993

TL;DR: This text presents the latest technologies for parallel processing and high performance computing, providing an integrated study of computer hardware and software systems, and is suitable for use on courses found in computer science, computer engineering, or electrical engineering departments.

...read moreread less

Abstract: This text presents the latest technologies for parallel processing and high performance computing. It deals with advanced computer architecture and parallel processing systems and techniques, providing an integrated study of computer hardware and software systems, and the material is suitable for use on courses found in computer science, computer engineering, or electrical engineering departments.

...read moreread less

Proceedings Article•DOI•

A MIMD rendering algorithm for distributed memory architectures

[...]

Thomas W. Crockett¹, Tobias Orloff•Institutions (1)

Langley Research Center¹

01 Nov 1993

TL;DR: A parallel rendering algorithm targeted to MIMD distributed-memory message-passing architectures that exploits both object-level and image level parallelism and Scalability to large numbers of processors is found to be limited primarily by communication overheads.

...read moreread less

Abstract: We present a parallel rendering algorithm targeted to MIMD distributed-memory message-passing architectures. For maximum performance, the algorithm exploits both object-level and image level parallelism. The behavior of the algorithm is examined both analytically and experimentally. The results show that the choice of message size has a significant impact on performance. Scalability to large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 confirms the analytical results and demonstrates increasing performance from 1 to 128 processors across a wide range of scene complexities.

...read moreread less

Book•

The design of scalable software libraries for distributed memory concurrent computers

[...]

Jae-Young Choi, Jack Dongarra, David W. Walker

02 Jan 1993

TL;DR: The block cyclic data distribution is adopted as a simple, yet general purpose, way of decomposing block-partitioned matrices in the design of ScaLAPACK, a scalable software library for performing dense and banded linear algebra computations on distributed memory concurrent computers.

...read moreread less

Abstract: Describes the design of ScaLAPACK, a scalable software library for performing dense and banded linear algebra computations on distributed memory concurrent computers. The specification of the data distribution has important consequences for interprocessor communication and load balance, and hence is a major factor in determining performance and scalability of the library routines. The block cyclic data distribution is adopted as a simple, yet general purpose, way of decomposing block-partitioned matrices. Distributed memory versions of the Level 3 BLAS provide an easy and convenient way of implementing the ScaLAPACK routines. >

...read moreread less

Journal Article•DOI•

Scalable, adaptive load sharing for distributed systems

[...]

O. Kremien¹, Jeff Kramer, Jeff Magee•Institutions (1)

Bar-Ilan University¹

01 Aug 1993-IEEE Parallel & Distributed Technology: Systems & Applications

TL;DR: FLS presents a flexible load-sharing algorithm which achieves scalability by partitioning a system into domains and applies load sharing within each domain, independently of how it is applied within other domains.

...read moreread less

Abstract: Presents a flexible load-sharing algorithm which achieves scalability by partitioning a system into domains. Each node dynamically and adaptively selects other nodes to be included in its domain. FLS applies load sharing within each domain, independently of how it is applied within other domains. >

...read moreread less

Proceedings Article•DOI•

A performance evaluation of tree-based coherent distributed shared memory

[...]

Koichi Wada¹, Obata Motoko¹, Mitsuteru Nakamura¹, T. Yamazaki•Institutions (1)

University of Tsukuba¹

19 May 1993

TL;DR: The system is organized based on the tree-based coherent algorithm, in which the coherence is maintained at the network nodes, in a fully distributed and localized manner, and the proposed coherence protocol relies only on one-to-one message passing.

...read moreread less

Abstract: The system is organized based on the tree-based coherent algorithm, in which the coherence is maintained at the network nodes, in a fully distributed and localized manner. The system has an overlapped tree structure, where the processors are located on the leaves. The proposed coherence protocol relies only on one-to-one message passing. Simulation shows that, in a matrix multiplication program, the 32-processor system can execute 21 times faster than the single processor. How the performance of network nodes and block size affect overall system performance is also discussed. As regards performance, only the copy-request and the invalidate message are thought to be insufficient. >

...read moreread less

Book Chapter•DOI•

Network Support for Dynamically Scaled Multimedia Data Streams

[...]

Don Hoffman¹, Michael F. Speer¹, Gerard Fernando¹•Institutions (1)

Sun Microsystems¹

03 Nov 1993

TL;DR: This paper discusses the networking issues associated with encoding hierarchical streams and mapping them to a multimedia transport service interface in the context of multi-media applications.

...read moreread less

Abstract: As multi-media applications such as video-on-demand and video conferencing become more common, the classes of systems and networks participating in these applications are becoming more diverse. Where several endpoints need to access the same video stream simultaneously, multicast protocols are often employed to reduce the duplication of network traffic across common links. Previous literature has discussed the concept that hierarchical media encodings can be used to achieve some form of stream scalability within a multicast network. This paper discuss the networking issues associated with encoding hierarchical streams and mapping them to a multimedia transport service interface.

...read moreread less

Proceedings Article•DOI•

Communication and computation performance of the CM-5

[...]

T. T. Kwan¹, B. K. Totty¹, Daniel A. Reed¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 Dec 1993

TL;DR: To assess the scalability of the CM-5's computation and interprocessor communication rates, a series of benchmarks was used to measure the performance of theCM-5 data and control networks, the node vector units, and the balance of computation and communication.

...read moreread less

Abstract: The Thinking Machines CM-5 is one of the first of a new generation of massively parallel systems. To assess the scalability of the CM-5's computation and interprocessor communication rates, a series of benchmarks was used to measure the performance of the CM-5 data and control networks, the node vector units, and the balance of computation and communication. At the application level, the achievable communication bandwidth and processing rates were found to be roughly 50% and 40%, respectively, of the corresponding theoretical peak rates. The early assessment is that the CM-5 is scalable but that a better balance of communication and processing rates would increase its effectiveness.

...read moreread less

Proceedings Article•

On distributed system management

[...]

Germán S. Goldszmidt¹•Institutions (1)

Columbia University¹

24 Oct 1993

TL;DR: This work proposes an alternative model, Management by Delegation, and contrasts its properties via an application example, evaluating the health of a Distributed System.

...read moreread less

Abstract: Device failures, performance inefficiencies, improper allocation of resources, security compromises, and accounting are some of the problems associated with the operations of distributed systems. Effective management requires monitoring, interpreting and controlling the behavior of the distributed system resources, both hardware and software. Current management systems pursue a platform-centered paradigm, where agents monitor the system and collect data, which can be accessed by applications via management protocols. Some of the fundamental limitations of this paradigm include limited scalability, micromanagement, and semantic heterogeneity. We propose an alternative model, Management by Delegation, and contrast its properties via an application example, evaluating the health of a Distributed System.

...read moreread less

Proceedings Article•DOI•

Flexible control for program recognition

[...]

Linda M. Wills¹•Institutions (1)

Georgia Institute of Technology¹

21 May 1993

TL;DR: The author reports on the application of recognition to multiple tasks requiring reverse engineering, such as inspecting, maintaining, and reusing software, which requires a flexible, adaptable recognition architecture based on graph parsing.

...read moreread less

Abstract: The author reports on the application of recognition to multiple tasks requiring reverse engineering, such as inspecting, maintaining, and reusing software. This requires a flexible, adaptable recognition architecture. A recognition system based on graph parsing has been developed. It has a flexible, adaptable control structure that can accept advice from external agents. Its flexibility arises from using a chart parsing algorithm. This graph parsing approach is studied to determine what types of advice can enhance its capabilities, performance, and scalability. >

...read moreread less

Proceedings Article•DOI•

Laura: a coordination language for open distributed systems

[...]

R. Tolksdorf

25 May 1993

TL;DR: Laura is a coordination language for open distributed systems following this paradigm by introducing a service space via which agents offer and request services without knowing about each other.

...read moreread less

Abstract: Open distributed systems are an emerging class of distributed systems that have to take into account a number of heterogeneities in the system components and possibly high dynamics in the system structure by unrestrictedly joining and leaving agents. Uncoupled processing is a basis for a solution of the coordination problem that arises. Laura is a coordination language for open distributed systems following this paradigm by introducing a service space via which agents offer and request services without knowing about each other. They place and withdraw forms from the service space describing the type of service offered or requested. Type conformance based on subtyping determines if the forms match. An initial Laura implementation is described and further issues such as scalability and a role model are considered. >

...read moreread less