
Showing papers on "Scalability published in 1994"


Proceedings ArticleDOI
28 Sep 1994
TL;DR: Four per-session guarantees are proposed to aid users and applications of weakly consistent replicated data: "read your writes", "monotonic reads", "writes follow reads", and "monotonic writes".
Abstract: Four per-session guarantees are proposed to aid users and applications of weakly consistent replicated data: "read your writes", "monotonic reads", "writes follow reads", and "monotonic writes". The intent is to present individual applications with a view of the database that is consistent with their own actions, even if they read and write from various, potentially inconsistent servers. The guarantees can be layered on existing systems that employ a read-any/write-any replication scheme while retaining the principal benefits of such a scheme, namely high availability, simplicity, scalability, and support for disconnected operation. These session guarantees were developed in the context of the Bayou project at Xerox PARC in which we are designing and building a replicated storage system to support the needs of mobile computing users who may be only intermittently connected.
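To make the guarantees concrete, here is a minimal sketch of how a session might vet servers before reading or writing (the Server and Session classes are hypothetical illustrations, not the Bayou interface): the session tracks the writes it has issued and the writes behind its reads, and only uses a replica whose database already covers them.

```python
# Minimal sketch of per-session guarantee checks over weakly consistent
# replicas. The Server/Session classes are hypothetical illustrations,
# not the Bayou API. Writes carry unique ids; each replica knows the
# set of write ids it has applied.

class Server:
    def __init__(self, name):
        self.name = name
        self.applied = set()      # ids of writes this replica has seen

    def apply(self, write_id):
        self.applied.add(write_id)

class Session:
    def __init__(self):
        self.read_set = set()     # ids of writes relevant to this session's reads
        self.write_set = set()    # ids of writes issued by this session

    # Read guarantees: "read your writes" needs the session's own writes,
    # "monotonic reads" needs the writes behind its previous reads.
    # (In Bayou a session may request any subset; here we check all of them.)
    def ok_to_read(self, server):
        return (self.write_set | self.read_set) <= server.applied

    # Write guarantees: "monotonic writes" needs the session's earlier writes,
    # "writes follow reads" needs the writes behind its previous reads.
    def ok_to_write(self, server):
        return (self.write_set | self.read_set) <= server.applied

if __name__ == "__main__":
    s1, s2 = Server("s1"), Server("s2")
    session = Session()
    session.write_set.add("w1"); s1.apply("w1")   # write accepted at s1 only
    print(session.ok_to_read(s1))   # True: s1 has seen w1
    print(session.ok_to_read(s2))   # False: reading at s2 would miss w1
```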

476 citations


Journal ArticleDOI
TL;DR: The results show that the decentralized data fusion system described in this paper offers many advantages in terms of robustness, scalability and flexibility over a centralized system.

362 citations


Journal ArticleDOI
01 Nov 1994
TL;DR: This paper outlines the methodology used at the National Center for Supercomputing Applications to build a scalable World Wide Web server, which achieves dynamic scalability by rotating through a pool of http servers that are alternately mapped to the hostname alias of the www server.
Abstract: While the World Wide Web (www) may appear to be intrinsically scalable through the distribution of files across a series of decentralized servers, there are instances where this form of load distribution is both costly and resource intensive. In such cases it may be necessary to administer a centrally located and managed http server. Given the exponential growth of the internet in general, and www in particular, it is increasingly difficult for persons and organizations to properly anticipate their future http server needs, both in human resources and hardware requirements. It is the purpose of this paper to outline the methodology used at the National Center for Supercomputing Applications in building a scalable World Wide Web server. The implementation described in the following pages allows for dynamic scalability by rotating through a pool of http servers that are alternately mapped to the hostname alias of the www server. The key components of this configuration include: (1) a cluster of identically configured http servers; (2) the use of Round-Robin DNS for distributing http requests across the cluster; (3) the use of a distributed file system mechanism for maintaining a synchronized set of documents across the cluster; and (4) a method for administering the cluster. The result of this design is that we are able to add any number of servers to the available pool, dynamically increasing the load capacity of the virtual server. Implementation of this concept has eliminated perceived and real vulnerabilities in our single-server model that had negatively impacted our user community. This particular design has also eliminated the single point of failure inherent in our single-server configuration, increasing the likelihood for continued and sustained availability. While the load is currently distributed in an unpredictable and, at times, deleterious manner, early implementation and maintenance of this configuration have proven promising and effective.
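The rotation scheme can be illustrated in a few lines. The sketch below is a toy model, not NCSA's actual DNS configuration, and the addresses are made up: each lookup of the single hostname alias returns the next server in the pool, spreading requests across the cluster.

```python
from itertools import cycle

# Sketch of Round-Robin DNS over a pool of identically configured http
# servers that all answer for one hostname alias. Addresses are hypothetical.
POOL = ["141.142.3.10", "141.142.3.11", "141.142.3.12"]

class RoundRobinResolver:
    def __init__(self, alias, addresses):
        self.alias = alias
        self._next = cycle(addresses)

    def resolve(self, hostname):
        # Each lookup of the alias returns the next address in the pool.
        if hostname != self.alias:
            raise KeyError(hostname)
        return next(self._next)

resolver = RoundRobinResolver("www.example.edu", POOL)
for _ in range(5):
    print(resolver.resolve("www.example.edu"))   # rotates through the pool
```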

251 citations


ReportDOI
01 Jul 1994
TL;DR: This paper introduces Harvest, a system that provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet.
Abstract: Rapid growth in data volume, user base, and data diversity renders Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with Mosaic and with HTTP, FTP, and Gopher information resources. We discuss the design and implementation of each subsystem and provide measurements indicating that Harvest can reduce server load, network traffic, and index space requirements significantly compared with previous indexing systems. We also discuss a half dozen indexes we have built using Harvest, underscoring both the customizability and scalability of the system.

250 citations


Journal ArticleDOI
TL;DR: This paper analyses the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible to estimate the size of total work at a given processor.

235 citations


Journal ArticleDOI
TL;DR: A detailed performance and scalability analysis of the communication primitives is presented, carried out using a workload generator, kernels from real applications, and a large unstructured adaptive application.

211 citations


Journal ArticleDOI
TL;DR: The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures.

203 citations


Journal ArticleDOI
TL;DR: Theoretical results show that a large class of algorithm-machine combinations is scalable and that the scalability can be predicted through premeasured machine parameters; a harmony between speedup and scalability has also been observed.
Abstract: Scalability has become an important consideration in parallel algorithm and machine design. The word scalable, or scalability, is widely used in the parallel processing community; however, there is no adequate, commonly accepted definition of scalability available. The scalability of computer systems and programs is difficult to quantify, evaluate, and compare. In this paper, scalability is formally defined for algorithm-machine combinations. A practical method is proposed to provide a quantitative measurement of the scalability. The relation between the newly proposed scalability and other existing parallel performance metrics is studied. A harmony between speedup and scalability has been observed. Theoretical results show that a large class of algorithm-machine combinations is scalable and that the scalability can be predicted through premeasured machine parameters. Two algorithms have been studied on an nCUBE 2 multicomputer and on a MasPar MP-1 computer. These case studies have shown how scalability can be measured, computed, and predicted. Performance instrumentation and visualization tools have also been used and developed to understand scalability-related behavior.
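As a rough illustration, the snippet below computes an isospeed-style scalability figure, assuming a definition in which scalability is the ratio of work per processor needed to sustain a fixed average speed as the machine grows; the function names and the exact formulation are assumptions, and the paper should be consulted for the formal definition.

```python
# Sketch of an isospeed-style scalability measure for an algorithm-machine
# combination (an assumed formulation, not necessarily the paper's exact one).
#
# average_speed = W / (N * T): work per processor per unit time.
# Scalability from (N, W) to (N', W') is the ratio of the work-per-processor
# values needed to keep that average speed constant; it is 1 for an ideally
# scalable combination and falls toward 0 otherwise.

def average_speed(work, procs, time):
    return work / (procs * time)

def scalability(n, work_n, n_prime, work_n_prime):
    # work_n_prime is the (measured) work needed on n_prime processors to
    # sustain the average speed achieved with work_n on n processors.
    return (work_n / n) / (work_n_prime / n_prime)

# Example: doubling the processors requires 2.5x the work to hold speed.
print(scalability(n=4, work_n=1.0e9, n_prime=8, work_n_prime=2.5e9))  # -> 0.8
```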

186 citations


Journal ArticleDOI
TL;DR: This article demonstrates that, for the best known visualization algorithms, simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management.
Abstract: Recently, a new class of scalable, shared-address-space multiprocessors has emerged. Like message-passing machines, these multiprocessors have a distributed interconnection network and physically distributed main memory. However, they provide hardware support for efficient implicit communication through a shared address space, and they automatically exploit temporal locality by caching both local and remote data in a processor's hardware cache. In this article, we show that these architectural characteristics make it much easier to obtain very good speedups on the best known visualization algorithms. Simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management. We demonstrate our claims through parallel versions of three state-of-the-art algorithms: a recent hierarchical radiosity algorithm by Hanrahan et al. (1991), a parallelized ray-casting volume renderer by Levoy (1992), and an optimized ray-tracer by Spach and Pulleyblank (1992). We also discuss a new shear-warp volume rendering algorithm that provides the first demonstration of interactive frame rates for a 256×256×256 voxel data set on a general-purpose multiprocessor.

176 citations


Journal ArticleDOI
01 Nov 1994
TL;DR: This paper examines the effects of OS scheduling and page migration policies on the performance of compute servers for multiprogramming and parallel application workloads, and suggests that policies based only on TLB miss information can be quite effective, and useful for addressing the data distribution problems of space-sharing schedulers.
Abstract: Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel application workloads. Process scheduling and memory management, however, remain challenging due to the distributed main memory found on such machines. This paper examines the effects of OS scheduling and page migration policies on the performance of such compute servers. Our experiments are done on the Stanford DASH, a distributed-memory cache-coherent multiprocessor. We show that for our multiprogramming workloads consisting of sequential jobs, the traditional Unix scheduling policy does very poorly. In contrast, a policy incorporating cluster and cache affinity along with a simple page-migration algorithm offers up to two-fold performance improvement. For our workloads consisting of multiple parallel applications, we compare space-sharing policies that divide the processors among the applications to time-slicing policies such as standard Unix or gang scheduling. We show that space-sharing policies can achieve better processor utilization due to the operating point effect, but time-slicing policies benefit strongly from user-level data distribution. Our initial experience with automatic page migration suggests that policies based only on TLB miss information can be quite effective, and useful for addressing the data distribution problems of space-sharing schedulers.
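A toy sketch of the kind of TLB-miss-driven policy suggested above follows; the threshold, counters, and migrate() hook are hypothetical illustrations rather than the DASH kernel's interfaces.

```python
from collections import defaultdict

# Toy page-migration policy driven only by TLB-miss samples, in the spirit
# of the paper's suggestion. Threshold and interfaces are made up.

MIGRATE_THRESHOLD = 64   # misses from one remote node before migrating

class PagePolicy:
    def __init__(self, home_of):
        self.home_of = dict(home_of)                     # page -> home node
        self.misses = defaultdict(lambda: defaultdict(int))

    def record_tlb_miss(self, page, node):
        self.misses[page][node] += 1
        home = self.home_of[page]
        if node != home and self.misses[page][node] >= MIGRATE_THRESHOLD:
            self.migrate(page, node)

    def migrate(self, page, node):
        # A real kernel would copy the page and update the mappings here.
        self.home_of[page] = node
        self.misses[page].clear()

policy = PagePolicy({0x1000: "node0"})
for _ in range(64):
    policy.record_tlb_miss(0x1000, "node1")
print(policy.home_of[0x1000])   # -> "node1": the page moved to its heavy user
```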

175 citations


Proceedings ArticleDOI
23 May 1994
TL;DR: PIOUS employs data declustering, to exploit the combined file I/O and buffer cache capacities of networked computing resources, and transaction-based concurrency control, to guarantee access consistency without explicit synchronization; preliminary results from a prototype PIOUS implementation are presented.
Abstract: PIOUS is a parallel file system architecture that provides cost-effective, scalable bandwidth in a network computing environment. PIOUS employs data declustering, to exploit the combined file I/O and buffer cache capacities of networked computing resources, and transaction-based concurrency control, to guarantee access consistency without explicit synchronization. This paper presents preliminary results from a prototype PIOUS implementation.
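To make the declustering idea concrete, the sketch below (illustrative names and sizes, not the PIOUS layout) spreads fixed-size segments of a file round-robin across data servers so that independent segments can be accessed in parallel.

```python
# Minimal sketch of data declustering: a file is split into fixed-size
# segments assigned round-robin to data servers. Names and sizes are
# illustrative, not the PIOUS on-disk layout.

SEGMENT_SIZE = 64 * 1024          # bytes per segment
SERVERS = ["hostA", "hostB", "hostC", "hostD"]

def locate(offset):
    """Map a byte offset to (server, segment index, offset within segment)."""
    segment = offset // SEGMENT_SIZE
    server = SERVERS[segment % len(SERVERS)]
    return server, segment, offset % SEGMENT_SIZE

# A 1 MB read touches segments on all four servers, which can serve their
# pieces concurrently from their own disks and buffer caches.
touched = {locate(off)[0] for off in range(0, 1 << 20, SEGMENT_SIZE)}
print(sorted(touched))            # -> ['hostA', 'hostB', 'hostC', 'hostD']
```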

Proceedings ArticleDOI
14 Nov 1994
TL;DR: Experience with implementing the Cache Kernel and measurements of its performance on a multiprocessor suggest that the caching model can provide competitive performance with conventional monolithic operating systems, yet provides application-level control of system resources, better modularity, better scalability, smaller size and a basis for fault containment.
Abstract: Operating system research has endeavored to develop micro-kernels that provide modularity, reliability and security improvements over conventional monolithic kernels. However, the resulting kernels have been slower, larger and more error-prone than desired. These efforts have also failed to provide sufficient application control of resource management required by sophisticated applications. This paper describes a caching model of operating system functionality as implemented in the Cache Kernel, the supervisor-mode component of the V++ operating system. The Cache Kernel caches operating system objects such as threads and address spaces just as conventional hardware caches memory data. User-mode application kernels handle the loading and writeback of these objects, implementing application-specific management policies and mechanisms. Experience with implementing the Cache Kernel and measurements of its performance on a multiprocessor suggest that the caching model can provide competitive performance with conventional monolithic operating systems, yet provides application-level control of system resources, better modularity, better scalability, smaller size and a basis for fault containment.

Proceedings ArticleDOI
24 May 1994
TL;DR: This work proposes a distributed search tree that inherits desirable properties from non-distributed trees, and shows that it does indeed combine a guarantee for good storage space utilization with high query efficiency.
Abstract: Databases are growing steadily, and distributed computer systems are more and more easily available. This provides an opportunity to satisfy the increasingly tighter efficiency requirements by means of distributed data structures. The design and analysis of these structures under efficiency aspects, however, has not yet been studied sufficiently. To our knowledge, a single scalable, distributed data structure has been proposed so far. It is a distributed variant of linear hashing with uncontrolled splits, and, as a consequence, performs efficiently for data distributions that are close to uniform, but not necessarily for others. In addition, it does not support queries that refer to the linear order of keys, such as nearest neighbor or range queries. We propose a distributed search tree that avoids these problems, since it inherits desirable properties from non-distributed trees. Our experiments show that our structure does indeed combine a guarantee for good storage space utilization with high query efficiency. Nevertheless, we feel that further research in the area of scalable, distributed data structures is dearly needed; it should eventually lead to a body of knowledge that is comparable with the non-distributed, classical data structures field.

Journal ArticleDOI
TL;DR: The architecture relies on innovative data striping and real-time scheduling to allow a large number of guaranteed concurrent accesses, and uses separation of metadata from real data to achieve a direct flow of the media streams between the storage devices and the network.
Abstract: Large scale multimedia storage servers will be an integral part of the emerging distributed multimedia computing infrastructure. However, given the modest rate of improvements in storage transfer rates, designing servers that meet the demands of multimedia applications is a challenging task that needs significant architectural innovation. Our research project, called the Massively-parallel And Real-time Storage (MARS) architecture, is aimed at the design and prototype implementation of a large scale multimedia storage server. It uses some of the well-known techniques in parallel I/O, such as data striping and Redundant Arrays of Inexpensive Disks (RAID), and an innovative ATM-based interconnect inside the server to achieve a scalable architecture that transparently connects storage devices to an ATM-based broadband network. The ATM interconnect within the server uses a custom ASIC called the ATM Port Interconnect Controller (APIC), currently being developed as part of an ARPA-sponsored gigabit local ATM testbed. Our architecture relies on innovative data striping and real-time scheduling to allow a large number of guaranteed concurrent accesses, and uses separation of metadata from real data to achieve a direct flow of the media streams between the storage devices and the network. This paper presents our system architecture; one that is scalable in terms of the number of supported users and the throughput.

Proceedings ArticleDOI
14 Feb 1994
TL;DR: This paper describes the design of Mariposa, an experimental distributed data management system that provides high performance in an environment of high data mobility and heterogeneous host capabilities, as well as a general, flexible platform for developing new algorithms for distributed query optimization, storage management, and scalable data storage structures.
Abstract: We describe the design of Mariposa, an experimental distributed data management system that provides high performance in an environment of high data mobility and heterogeneous host capabilities. The Mariposa design unifies the approaches taken by distributed file systems and distributed databases. In addition, Mariposa provides a general, flexible platform for the development of new algorithms for distributed query optimization, storage management, and scalable data storage structures. This flexibility is primarily due to a unique rule-based design that permits autonomous, local-knowledge decisions to be made regarding data placement, query execution location, and storage management.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: This paper provides an extensive analysis of several complex scientific algorithms written in SAM on a variety of hardware platforms and finds that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware.
Abstract: This paper describes the design and evaluation of SAM, a shared object system for distributed memory machines. SAM is a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates mechanisms to address the problem of high communication overheads on distributed memory machines; these mechanisms include tying synchronization to data access, chaotic access to data, prefetching of data, and pushing of data to remote processors. SAM has been implemented on the CM-5, Intel iPSC/860 and Paragon, IBM SP1, and networks of workstations running PVM. SAM applications run on all these platforms without modification. This paper provides an extensive analysis of several complex scientific algorithms written in SAM on a variety of hardware platforms. We find that the performance of these SAM applications depends fundamentally on the scalability of the underlying parallel algorithm, and whether the algorithm's communication requirements can be satisfied by the hardware. Our experience suggests that SAM is successful in allowing programmers to use distributed memory machines effectively with much less programming effort than required today.

Proceedings ArticleDOI
01 Jan 1994
TL;DR: The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system.
Abstract: This paper presents the architecture, implementation, and performance results for the SGI Challenge symmetric multiprocessor system. Novel aspects of the architecture are highlighted, as well as key design trade-offs targeted at increasing performance and reducing complexity. Multiprocessor design verification techniques and their impact are also presented. The SGI Challenge system architecture provides a high-bandwidth, low-latency cache-coherent interconnect for several high performance processors, I/O busses, and a scalable memory system. Hardware cache coherence mechanisms maintain a consistent view of shared memory for all processors, with no software overhead and minimal impact on processor performance. HDL simulation with random, self-checking vector generation and a lightweight operating system on full processor models contributed to a concept-to-customer-shipment cycle of 26 months.

Proceedings ArticleDOI
01 Nov 1994
TL;DR: This paper proposes an alternative way of structuring distributed systems that takes advantage of a communication model based on remote network access (reads and writes) to protected memory segments, and demonstrates how separating data transfer and control transfer can eliminate unnecessary control transfers and facilitate tighter coupling of the client and server.
Abstract: Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and greatly increased reliability, when compared to current LANs such as Ethernet. We believe that these new network and processor technologies will permit tighter coupling of distributed systems at the hardware level, and that distributed systems software should be designed to benefit from that tighter coupling. In this paper, we propose an alternative way of structuring distributed systems that takes advantage of a communication model based on remote network access (reads and writes) to protected memory segments. A key feature of the new structure, directly supported by the communication model, is the separation of data transfer and control transfer. This is in contrast to the structure of traditional distributed systems, which are typically organized using message passing or remote procedure call (RPC). In RPC-style systems, data and control are inextricably linked—all RPCs must transfer both data and control, even if the control transfer is unnecessary. We have implemented our model on DECstation hardware connected by an ATM network. We demonstrate how separating data transfer and control transfer can eliminate unnecessary control transfers and facilitate tighter coupling of the client and server. This has the potential to increase performance and reduce server load, which supports scaling in the face of an increasing number of clients. For example, for a small set of file server operations, our analysis shows a 50% decrease in server load when we switched from a communications mechanism requiring both control transfer and data transfer, to an alternative structure based on pure data transfer.
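A rough simulation of the structural difference follows (purely illustrative classes, not the DECstation/ATM prototype): with RPC every append costs the server a control transfer, while with remote writes the client deposits data directly into an exported segment and the server runs code only when it chooses to consume the data.

```python
# Illustrative comparison of RPC-style append (data + control transfer on
# every call) versus remote writes into an exported memory segment (data
# transfer only). This is a simulation of the idea, not the paper's system.

class Server:
    def __init__(self):
        self.segment = []           # memory segment exported to the client
        self.control_transfers = 0  # times the server had to run code

    def rpc_append(self, data):     # RPC: data and control move together
        self.control_transfers += 1
        self.segment.append(data)

    def remote_write(self, data):   # remote write: data only, no server code
        self.segment.append(data)

    def consume(self):              # server pulls accumulated data when ready
        self.control_transfers += 1
        batch, self.segment = self.segment, []
        return batch

rpc_server, mem_server = Server(), Server()
for i in range(100):
    rpc_server.rpc_append(i)
    mem_server.remote_write(i)
mem_server.consume()
print(rpc_server.control_transfers, mem_server.control_transfers)  # 100 vs 1
```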

Journal ArticleDOI
01 Jan 1994
TL;DR: The author presents a taxonomy of dynamic task scheduling schemes, synthesised by treating state estimation and decision making as orthogonal problems; the taxonomy is regular, easily understood, and compact, and its wide applicability is demonstrated by means of examples that encompass solutions proposed in the literature.
Abstract: System state estimation and decision making are the two major components of dynamic task scheduling in a distributed computing system. Combinations of solutions to each individual component constitute solutions to the dynamic task scheduling problem. It is important to consider a solution to the state estimation problem separate from a solution to the decision making problem to understand the similarities and differences between different solutions to dynamic task scheduling. Also, a solution to the state estimation problem has a significant impact on the scalability of a task scheduling solution in large scale distributed systems. The author presents a taxonomy of dynamic task scheduling schemes that is synthesised by treating state estimation and decision making as orthogonal problems. Solutions to estimation and decision making are analysed in detail and the resulting solution space of dynamic task scheduling is clearly shown. The proposed taxonomy is regular, easily understood, compact, and its wide applicability is demonstrated by means of examples that encompass solutions proposed in the literature. The taxonomy illustrates possible solutions that have not been evaluated and those solutions that may have potential in future research.

Journal ArticleDOI
01 Apr 1994
TL;DR: This paper focuses on the expressiveness of the Linda model, on techniques required to build efficient implementations, and on observed performance both on workstation networks and distributed-memory parallel machines.
Abstract: The use of distributed data structures in a logically-shared memory is a natural, readily-understood approach to parallel programming. The principal argument against such an approach for portable software has always been that efficient implementations could not scale to massively-parallel, distributed memory machines. Now, however, there is growing evidence that it is possible to develop efficient and portable implementations of virtual shared memory models on scalable architectures. In this paper we discuss one particular example: Linda. After presenting an introduction to the Linda model, we focus on the expressiveness of the model, on techniques required to build efficient implementations, and on observed performance both on workstation networks and distributed-memory parallel machines. Finally, we conclude by briefly discussing the range of applications developed with Linda and Linda's suitability for the sorts of heterogeneous, dynamically-changing computational environments that are of growing significance.
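For readers unfamiliar with the model, here is a minimal single-process sketch of a Linda-style tuple space with out/rd/in operations; blocking semantics and eval are omitted, and it illustrates the programming model only, not the scalable implementations discussed in the paper.

```python
# Minimal single-process sketch of a Linda-style tuple space.
# out() deposits a tuple, rd() reads a matching tuple, in_() reads and
# removes one. None acts as a wildcard in templates. Blocking semantics
# and eval() are omitted for brevity.

class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, *tup):
        self.tuples.append(tuple(tup))

    def _match(self, template, tup):
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def rd(self, *template):
        for tup in self.tuples:
            if self._match(template, tup):
                return tup
        return None

    def in_(self, *template):          # 'in' is a Python keyword
        tup = self.rd(*template)
        if tup is not None:
            self.tuples.remove(tup)
        return tup

ts = TupleSpace()
ts.out("task", 1, "invert-matrix")
ts.out("task", 2, "render-frame")
print(ts.rd("task", None, None))       # -> ('task', 1, 'invert-matrix')
print(ts.in_("task", 2, None))         # removes ('task', 2, 'render-frame')
```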

Journal ArticleDOI
TL;DR: The authors mathematically analyze the drift of low-resolution images obtained from a smaller IDCT of a subset of the DCT coefficients of the full-resolution images in motion-compensated hybrid predictive/DCT coding such as MPEG-2, which allows for frequency scalability.
Abstract: The authors mathematically analyze the drift of low-resolution images obtained from a smaller IDCT of a subset of the DCT coefficients of the full-resolution images in motion-compensated hybrid predictive/DCT coding such as MPEG-2, which allows for frequency scalability. Using this mathematical structure, they derive a low-resolution decoder that has the theoretically minimum possible drift, and propose techniques for implementation that produce substantial improvement in real sequences. The minimum drift can also be used as a milestone, to be compared with other techniques of drift reduction (of worse performance but lower complexity). For the case where leakage is used to reduce the drift, the authors determine a minimum-energy non-uniform DCT-domain leakage matrix which is no more complex than uniform leakage, but gives a substantial improvement. Finally, they note that DCT-based pyramidal coding is essentially the same as the drift case, and thus they use the same mathematical structure to derive the theoretically-best upward predictor in pyramidal coding.
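As a concrete illustration of frequency-scalable decoding, the sketch below reconstructs a low-resolution block by applying a smaller IDCT to the low-frequency subset of an 8x8 DCT block; it shows only the reduced-size reconstruction, not the drift analysis or leakage matrices derived in the paper.

```python
import numpy as np

# Sketch: reconstruct a low-resolution block from the low-frequency subset
# of an 8x8 DCT block, as in frequency-scalable DCT decoding. Drift analysis
# and leakage are beyond this illustration.

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (rows index frequency)."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def low_res_block(full_block, k=4):
    """k x k spatial block from the k x k low-frequency DCT coefficients."""
    C8, Ck = dct_matrix(8), dct_matrix(k)
    coeffs = C8 @ full_block @ C8.T        # forward 8x8 2-D DCT
    low = coeffs[:k, :k]                   # keep the low-frequency subset
    # Scale by k/8 so intensities match the full-resolution block when
    # switching between orthonormal transforms of different sizes.
    return (k / 8) * (Ck.T @ low @ Ck)     # smaller 2-D IDCT

block = np.outer(np.linspace(0, 255, 8), np.ones(8))   # smooth test block
small = low_res_block(block, k=4)
print(small.shape)                         # -> (4, 4)
```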

Proceedings ArticleDOI
Yan
01 Jan 1994
TL;DR: A software toolkit that facilitates performance evaluation of parallel applications on multiprocessors, the Automated Instrumentation and Monitoring System (AIMS), is described in this paper.
Abstract: Whether a researcher is designing the "next parallel programming paradigm", another "scalable multiprocessor", or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. A software toolkit that facilitates performance evaluation of parallel applications on multiprocessors, the Automated Instrumentation and Monitoring System (AIMS), is described in this paper. It has four major software components: a source-code instrumentor, which automatically inserts event recorders into the application; a run-time performance-monitoring library, which collects performance data; a trace-file animation and analysis toolkit; and a trace post-processor which compensates for the data collection overhead. We illustrate the process of performance tuning using AIMS with two examples. Currently, AIMS accepts FORTRAN and C parallel programs written for TMC's CM-5, Intel's iPSC/860, iPSC/Delta, Paragon, and HP workstations running PVM.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: A set of tools for performance tuning of parallel programs that enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise are described.
Abstract: Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying conditions. Analytic modeling and scalability analysis provide predictive power, but are not widely used in practice, due primarily to their emphasis on asymptotic behavior and the difficulty of developing accurate models that work for real-world programs. In this paper we describe a set of tools for performance tuning of parallel programs that bridges this gap between measurement and modeling. Our approach is based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We first describe a tool for measuring overheads in parallel programs that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1. We then describe a tool that fits these overhead measurements to analytic forms. We illustrate the use of these tools by analyzing the performance tradeoffs among parallel implementations of 2D FFT. These examples show how our tools enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise.
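The accounting behind lost cycles analysis can be sketched as follows (a schematic with made-up numbers; the real tools measure and fit each overhead category separately): total overhead is the gap between p times the parallel runtime and the sequential runtime, which is then fit to a simple analytic form in p.

```python
import numpy as np

# Schematic of lost cycles analysis: total parallel overhead is
# T_o(p) = p * T(p) - T(1), and is fit to a simple analytic form in p.
# The runtimes below are made up for illustration.

T1 = 100.0                                    # sequential runtime (s)
runs = {2: 55.0, 4: 30.0, 8: 18.0, 16: 12.0}  # measured parallel runtimes T(p)

procs = np.array(sorted(runs))
lost = np.array([p * runs[p] - T1 for p in procs])   # lost cycles per run
print(dict(zip(procs.tolist(), lost.tolist())))
# -> {2: 10.0, 4: 20.0, 8: 44.0, 16: 92.0}

# Fit lost cycles to a*p + b by least squares; a richer toolset would fit
# each overhead category (load imbalance, communication, ...) separately.
a, b = np.polyfit(procs, lost, deg=1)
predicted_T = lambda p: (T1 + a * p + b) / p          # resulting model of T(p)
print(round(predicted_T(32), 2))                      # extrapolated runtime
```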

Journal ArticleDOI
TL;DR: This paper presents the OMMH topology, analyzes its architectural properties and potentials for massively parallel computing, and compares it to the hypercube, and presents a three-dimensional optical design methodology based on free-space optics.
Abstract: A new interconnection network for massively parallel computing is introduced. This network is called an optical multi-mesh hypercube (OMMH) network. The OMMH integrates positive features of both hypercube (small diameter, high connectivity, symmetry, simple control and routing, fault tolerance, etc.) and mesh (constant node degree and scalability) topologies and at the same time circumvents their limitations (e.g., the lack of scalability of hypercubes, and the large diameter of meshes). The OMMH can maintain a constant node degree regardless of the increase in the network size. In addition, the flexibility of the OMMH network makes it well suited for optical implementations. This paper presents the OMMH topology, analyzes its architectural properties and potentials for massively parallel computing, and compares it to the hypercube. Moreover, it also presents a three-dimensional optical design methodology based on free-space optics. The proposed optical implementation has totally space-invariant connection patterns at every node, which enables the OMMH to be highly amenable to optical implementation using simple and efficient large space-bandwidth product space-invariant optical elements.

Journal ArticleDOI
TL;DR: This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers, and shows that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor.

Proceedings ArticleDOI
15 Oct 1994
TL;DR: The approach appears to be capable of providing a scalable, high-performance, and economical mechanism to provide a data storage system for several classes of data, and for applications (clients) that operate in a high-speed network environment.
Abstract: We have designed, built, and analyzed a distributed parallel storage system that will supply image streams fast enough to permit multi-user, “real-time”, video-like applications in a wide-area ATM network-based Internet environment. We have based the implementation on user-level code in order to secure portability; we have characterized the performance bottlenecks arising from operating system and hardware issues, and based on this have optimized our design to make the best use of the available performance. Although at this time we have only operated with a few classes of data, the approach appears to be capable of providing a scalable, high-performance, and economical mechanism to provide a data storage system for several classes of data (including mixed multimedia streams), and for applications (clients) that operate in a high-speed network environment.

Proceedings ArticleDOI
14 Nov 1994
TL;DR: It is shown how a general approach to hybrid algorithms yields good performance across the entire range of vector lengths and grid dimensions, including non-power-of-two grids.
Abstract: In this paper, we report on a project to develop a unified approach for building a library of collective communication operations that performs well on a cross-section of problems encountered in real applications. The target architecture is a two-dimensional mesh with worm-hole routing, but the techniques are more general. The approach differs from traditional library implementations in that we address the need for implementations that perform well for various sized vectors and grid dimensions, including non-power-of-two grids. We show how a general approach to hybrid algorithms yields performance across the entire range of vector lengths. Moreover, many scalable implementations of application libraries require collective communication within groups of nodes. Our approach yields the same kind of performance for group collective communication. Results from the Intel Paragon system are included. To obtain this library for Intel systems contact intercom@cs.utexas.edu.
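The flavour of such hybrid algorithms can be sketched as a cost-model-driven switch; the constants and the two candidate broadcast algorithms below are illustrative assumptions, not the library's actual code.

```python
import math

# Sketch of a hybrid broadcast that picks an algorithm from a simple linear
# cost model (alpha = per-message start-up, beta = per-byte cost). The
# constants and crossover behaviour are assumptions for illustration.

ALPHA = 50e-6      # start-up latency per message (s), assumed
BETA = 10e-9       # per-byte transfer cost (s), assumed

def cost_binomial_tree(p, nbytes):
    # log2(p) rounds, full vector sent each round: good for short vectors.
    steps = math.ceil(math.log2(p))
    return steps * (ALPHA + BETA * nbytes)

def cost_scatter_collect(p, nbytes):
    # Scatter then collect (allgather): more start-ups but only about
    # 2*nbytes moved per node overall, which favours long vectors.
    steps = math.ceil(math.log2(p))
    return 2 * steps * ALPHA + 2 * BETA * nbytes * (p - 1) / p

def broadcast_algorithm(p, nbytes):
    return ("binomial-tree"
            if cost_binomial_tree(p, nbytes) <= cost_scatter_collect(p, nbytes)
            else "scatter+collect")

for n in (256, 64 * 1024, 4 * 1024 * 1024):
    print(n, broadcast_algorithm(p=64, nbytes=n))
# short vectors pick the tree; long vectors pick scatter+collect
```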

Journal ArticleDOI
TL;DR: It is shown that network latency forms a major obstacle to improving parallel computing performance and scalability, and an experimental metric, using network latency to measure and evaluate the scalability of parallel programs and architectures is presented.

Proceedings ArticleDOI
Peter F. Corbett, Dror G. Feitelson
23 May 1994
TL;DR: The Vesta parallel file system is designed to provide parallel file access to application programs running on multicomputers with parallel I/O subsystems, and is beginning to be used by application programmers.
Abstract: The Vesta parallel file system is designed to provide parallel file access to application programs running on multicomputers with parallel I/O subsystems. Vesta uses a new abstraction of files: a file is not a sequence of bytes, but rather it can be partitioned into multiple disjoint sequences that are accessed in parallel. The partitioning, which can also be changed dynamically, reduces the need for synchronization and coordination during the access. Some control over the layout of data is also provided, so the layout can be matched with the anticipated access patterns. The system is fully implemented, and is beginning to be used by application programmers. The implementation does not compromise scalability or parallelism. In fact, all data accesses are done directly to the I/O node that contains the requested data, without any indirection or access to shared metadata. There are no centralized control points in the system.
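A small sketch of the file abstraction described above (the parameters and mapping are illustrative simplifications, not Vesta's actual partitioning scheme): the file is a set of cells striped across I/O nodes, and each process opens a disjoint subset of cells that it can access directly, without shared metadata.

```python
# Illustrative sketch of a Vesta-like partitioned file: the file is a set
# of cells striped across I/O nodes, and each process opens a disjoint
# partition (subset of cells). Parameters are made up; Vesta's real
# partitioning scheme is richer.

NUM_CELLS = 8
IO_NODES = ["io0", "io1", "io2", "io3"]

def io_node_of(cell):
    return IO_NODES[cell % len(IO_NODES)]          # cells striped round-robin

def open_partition(process_rank, num_processes):
    """Disjoint set of cells for one process; no two ranks overlap."""
    return [c for c in range(NUM_CELLS) if c % num_processes == process_rank]

for rank in range(4):
    cells = open_partition(rank, 4)
    print(rank, cells, [io_node_of(c) for c in cells])
# Each rank accesses its own cells directly on the I/O nodes that hold
# them, with no central coordination or shared metadata.
```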