
Showing papers on "Distributed memory published in 1999"


Proceedings Article
07 Sep 1999
TL;DR: This paper examines four commercial DBMSs running on an Intel Xeon and NT 4.0 and introduces a framework for analyzing query execution time, and finds that database developers should not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues.
Abstract: Recent high-performance processors employ sophisticated techniques to overlap and simultaneously execute multiple computation and memory operations. Intuitively, these techniques should help database applications, which are becoming increasingly compute and memory bound. Unfortunately, recent studies report that faster processors do not improve database system performance to the same extent as scientific workloads. Recent work on database systems focusing on minimizing memory latencies, such as cache-conscious algorithms for sorting and data placement, is one step toward addressing this problem. However, to best design high performance DBMSs we must carefully evaluate and understand the processor and memory behavior of commercial DBMSs on today’s hardware platforms. In this paper we answer the question “Where does time go when a database system is executed on a modern computer platform?” We examine four commercial DBMSs running on an Intel Xeon and NT 4.0. We introduce a framework for analyzing query execution time on a DBMS running on a server with a modern processor and memory architecture. To focus on processor and memory interactions and exclude effects from the I/O subsystem, we use a memory resident database. Using simple queries we find that database developers should (a) optimize data placement for the second level of data cache, and not the first, (b) optimize instruction placement to reduce first-level instruction cache stalls, but (c) not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues (e.g., branch prediction).
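
Point (a) above is easiest to see in code. The following C sketch (illustrative only, not the paper's benchmark) contrasts a record layout, where scanning one attribute drags whole records through the cache, with a column layout, where every byte of a fetched line is useful; the struct and array names are hypothetical and 64-byte cache lines are assumed:

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

/* Record layout: scanning one attribute drags the whole 64-byte
   record through the cache, so (assuming 64-byte lines) only 4 of
   every 64 fetched bytes do useful work. */
struct row { int key; int pad[15]; };

int main(void) {
    struct row *records = malloc(N * sizeof *records);
    int *keys = malloc(N * sizeof *keys);       /* column layout */
    long sum_rec = 0, sum_col = 0;
    if (!records || !keys) return 1;

    for (int i = 0; i < N; i++) { records[i].key = i; keys[i] = i; }

    for (int i = 0; i < N; i++) sum_rec += records[i].key; /* ~1 useful int per line */
    for (int i = 0; i < N; i++) sum_col += keys[i];        /* 16 useful ints per line */

    printf("%ld %ld\n", sum_rec, sum_col);
    free(records); free(keys);
    return 0;
}
```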

551 citations


Journal ArticleDOI
TL;DR: In this article, an aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described.
Abstract: An aerodynamic shape optimization method that treats the design of complex aircraft configurations subject to high fidelity computational fluid dynamics (CFD), geometric constraints and multiple design points is described. The design process is greatly accelerated through the use of both control theory and distributed memory computer architectures. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is implemented on parallel distributed memory architectures using a domain decomposition approach, an optimized communication schedule, and the MPI (Message Passing Interface) standard for portability and efficiency. The final result achieves very rapid aerodynamic design based on a higher order CFD method. In order to facilitate the integration of these high fidelity CFD approaches into future multi-disciplinary optimization (MDO) applications, new methods must be developed which are capable of simultaneously addressing complex geometries, multiple objective functions, and geometric design constraints. In our earlier studies, we coupled the adjoint based design formulations with unconstrained optimization algorithms and showed that the approach was effective for the aerodynamic design of airfoils, wings, wing-bodies, and complex aircraft configurations. In many of the results presented in these earlier works, geometric constraints were satisfied either by a projection into feasible space or by posing the design space parameterization such that it automatically satisfied constraints. Furthermore, with the exception of reference 9, where the second author initially explored the use of multipoint design in conjunction with adjoint formulations, our earlier works have focused on single point design efforts. Here we demonstrate that the same methodology may be extended to treat complete configuration designs subject to multiple design points and geometric constraints. Examples are presented for both transonic and supersonic configurations ranging from wing-alone designs to complex configuration designs involving wing, fuselage, nacelles and pylons.
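
The parallel implementation described above rests on domain decomposition with MPI message passing. As a minimal hedged sketch of that communication pattern, assuming a 1-D decomposition with one ghost cell per side (buffer names and sizes are illustrative, not the authors' code):

```c
#include <mpi.h>
#include <string.h>

#define LOCAL_N 1024   /* interior cells per rank (illustrative) */

/* Exchange one layer of ghost cells with left/right neighbours;
   u holds LOCAL_N interior cells plus one ghost at each end. */
static void halo_exchange(double *u, int rank, int nprocs) {
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first interior cell left, receive right ghost */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send last interior cell right, receive left ghost */
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    int rank, nprocs;
    double u[LOCAL_N + 2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    memset(u, 0, sizeof u);
    halo_exchange(u, rank, nprocs);
    MPI_Finalize();
    return 0;
}
```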

362 citations


Patent
17 Dec 1999
TL;DR: In this article, the physical address space of the processors in each partition is mapped to the respective exclusive memory windows assigned to each partition, so that the exclusive windows appear to the operating systems executing on those partitions as if they all start at the same base address.
Abstract: A computer system comprises a plurality of processing modules that can be configured into different partitions within the computer system, and a main memory. Each partition operates under the control of a separate operating system. At least one shared memory window is defined within the main memory to which multiple partitions have shared access, and each partition may also be assigned an exclusive memory window. Program code executing on different partitions enables those partitions to communicate with each other through the shared memory window. Means are also provided for mapping the physical address space of the processors in each partition to the respective exclusive memory windows assigned to each partition, so that the exclusive memory windows appear to the respective operating systems executing on those partitions as if they all start at the same base address.
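
A minimal sketch of the relocation idea in this claim, under the assumption of a simple per-partition table of exclusive-window base addresses (all names here are hypothetical): each partition's processors issue addresses starting at the same base, and the mapping adds the partition's window base to form the real main-memory address.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical relocation table: one exclusive-window base per
   partition (4 GB windows here).  Every partition's operating
   system sees its window starting at address 0; hardware adds the
   partition's base to form the real main-memory address. */
#define NPART 4
static const uint64_t window_base[NPART] = {
    0x0ULL, 0x100000000ULL, 0x200000000ULL, 0x300000000ULL
};

static uint64_t relocate(int partition, uint64_t cpu_phys_addr) {
    return window_base[partition] + cpu_phys_addr;
}

int main(void) {
    /* Two partitions issue the same address 0x1000 but reach
       different exclusive windows in main memory. */
    printf("%llx\n", (unsigned long long)relocate(0, 0x1000));
    printf("%llx\n", (unsigned long long)relocate(2, 0x1000));
    return 0;
}
```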

258 citations


Journal ArticleDOI
TL;DR: This text is an in-depth introduction to the concepts of parallel computing, designed for use in university-level computer science courses, and presents the current theories in use in industry today.
Abstract: Kai Hwang and Zhiwei Xu, McGraw-Hill, Boston, 1998, 802 pp. ISBN 0-07-031798-4, $97.30. This text is an in-depth introduction to the concepts of parallel computing. Designed for use in university-level computer science courses, the text covers scalable architecture and parallel programming of symmetric multiprocessors, clusters of workstations, massively parallel processors, and Internet-based metacomputing platforms. Hwang and Xu give an excellent overview of these topics while keeping the text easily comprehensible. The text is organized into four parts. Part I covers scalability and clustering. Part II deals with the technology used to construct a parallel system. Part III pertains to the architecture of scalable systems. Finally, Part IV presents methods of parallel programming on various platforms and languages. The first chapter presents different models of scalability as divided into resources, applications, and technology. It defines three abstract models (PRAM, BSP, and phase parallel models) and five physical models (PVP, SMP, MPP, COW, and DSM systems). Chapter 2 introduces the ideas behind parallel programming, including processes, tasks, threads and environments. Chapter 3 introduces performance issues and metrics. As an introduction to Part II, Chapter 4 introduces the history of microprocessor types and their applications in the architectures of current systems. Chapter 5 deals with the issues of distributed memory. It discusses several models such as UMA, NORMA, CC-NUMA, COMA, and DSM. Chapter 6 presents gigabit networks, switched interconnects, and other high-speed networking architectures used to construct clusters. Chapter 7 discusses the overheads created by parallel computing, such as threads, synchronization, and efficient communication between nodes. In Part III, Chapters 8, 9, and 11 give comparisons between various types of scalable systems (SMP, CC-NUMA, clusters, and MPP). The comparisons are based on hardware architecture, the system software, and special features that make each system unique. Chapter 10 compares various research and commercial clusters with an in-depth study of the Berkeley NOW, IBM SP2, and Digital TruCluster systems. Chapter 12 introduces the concepts of Part IV with details on parallel programming paradigms. Chapter 13 discusses communication between processors using message-passing programming (such as the MPI and PVM libraries). Chapter 14 studies the data-parallel approach with an emphasis on Fortran 90 and HPF. With attention to detail through examples, Hwang and Xu have created a well-written introduction to parallel computing. The authors are distinguished for their contributions in this field. This text is based on cutting-edge research, providing the current theories that are in use in industry today. Bin Cong, Shawn Morrison and Michael Yorg, Department of Computer Science, California Polytechnic State University, San Luis Obispo

208 citations


Book ChapterDOI
21 Sep 1999
TL;DR: This paper explores the possibility of exploiting a distributed-memory execution environment, such as a network of workstations interconnected by a standard LAN, to extend the size of the verification problems that can be successfully handled by SPIN.
Abstract: The main limiting factor of the model checker SPIN is currently the amount of available physical memory. This paper explores the possibility of exploiting a distributed-memory execution environment, such as a network of workstations interconnected by a standard LAN, to extend the size of the verification problems that can be successfully handled by SPIN. A distributed version of the algorithm used by SPIN to verify safety properties is presented, and its compatibility with the main memory and complexity reduction mechanisms of SPIN is discussed. Finally, some preliminary experimental results are presented.
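
Distributed reachability analysis of this kind typically makes state ownership a pure function of the state vector, so any node can decide locally which workstation stores (or should receive) a newly generated state. A small C sketch of that idea, with an FNV-1a hash chosen purely for illustration (not SPIN's actual partition function):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Assign each state vector to an owning workstation by hashing; a
   node that generates a state it does not own ships the state to
   its owner over the LAN instead of storing it locally, so the
   visited-state table is spread over all nodes' memories. */
static unsigned owner_of(const unsigned char *state, size_t len,
                         unsigned nnodes) {
    uint32_t h = 2166136261u;            /* FNV-1a basis */
    for (size_t i = 0; i < len; i++) {
        h ^= state[i];
        h *= 16777619u;                  /* FNV-1a prime */
    }
    return h % nnodes;
}

int main(void) {
    unsigned char s[] = { 3, 1, 4, 1, 5 };   /* toy state vector */
    printf("owner: node %u of 8\n", owner_of(s, sizeof s, 8));
    return 0;
}
```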

179 citations


Patent
Howard Thomas Olnowich
10 Sep 1999
TL;DR: In this article, a shared memory parallel processing system interconnected by a multi-stage network combines new system configuration techniques with special-purpose hardware to provide remote memory accesses across the network, while controlling cache coherency efficiently.
Abstract: A shared memory parallel processing system interconnected by a multi-stage network combines new system configuration techniques with special-purpose hardware to provide remote memory accesses across the network, while controlling cache coherency efficiently across the network. The system configuration techniques include a systematic method for partitioning and controlling the memory in relation to local versus remote accesses and changeable versus unchangeable data. Most of the special-purpose hardware is implemented in the memory controller and network adapter, which implements three send FIFOs and three receive FIFOs at each node to segregate and handle efficiently invalidate functions, remote stores, and remote accesses requiring cache coherency. The segregation of these three functions into different send and receive FIFOs greatly facilitates the cache coherency function over the network. In addition, the network itself is tailored to provide the best efficiency for remote accesses.

160 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: Maps, a compiler managed memory system for Raw architectures, is implemented based on the SUIF infrastructure and it is demonstrated that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for regular applications and about 5-fold speedup on 16 or more tiles for irregular applications.
Abstract: This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to the mostly static memory system.
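
A hedged illustration of modulo unrolling, one of the static-promotion techniques named above: if arrays are low-order interleaved across the tiles' memory banks, unrolling a loop by the number of banks makes the bank of every reference in the body a compile-time constant, which is what allows accesses to be scheduled on the static network (constants and names below are illustrative):

```c
#include <stdio.h>

#define NBANKS 4
#define N 1024            /* multiple of NBANKS */

/* With a[] low-order interleaved across NBANKS banks, unrolling by
   NBANKS fixes the bank of each reference in the body at compile
   time: a[4i] is always bank 0, a[4i+1] bank 1, and so on, so a
   compiler can route each access to a statically known tile. */
int main(void) {
    static int a[N];
    long sum = 0;
    for (int i = 0; i < N; i += NBANKS) {
        sum += a[i];       /* bank 0 */
        sum += a[i + 1];   /* bank 1 */
        sum += a[i + 2];   /* bank 2 */
        sum += a[i + 3];   /* bank 3 */
    }
    printf("%ld\n", sum);
    return 0;
}
```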

130 citations


Journal ArticleDOI
TL;DR: This paper presents the results of a scalability study for a three-dimensional semicoarsening multigrid solver on a distributed memory computer, and examines the scalability of the solver theoretically and experimentally.
Abstract: This paper presents the results of a scalability study for a three-dimensional semicoarsening multigrid solver on a distributed memory computer. In particular, we are interested in the scalability of the solver---how the solution time varies as both problem size and number of processors are increased. For an iterative linear solver, scalability involves both algorithmic issues and implementation issues. We examine the scalability of the solver theoretically by constructing a simple parallel model and experimentally by results obtained on an IBM SP. The results are compared with those obtained for other solvers on the same computer.
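
As generic orientation rather than the paper's actual model, scalability studies of this kind typically write the per-solve time as local computation plus communication and call the solver scalable when the time stays (near-)constant as problem size and processor count grow in proportion:

```latex
T(n,p) \;=\; T_{\mathrm{comp}}\!\left(\tfrac{n}{p}\right) \;+\; T_{\mathrm{comm}}(n,p),
\qquad
T(\alpha p,\, p) \approx \text{const} \ \text{as} \ p \to \infty,
```

where n is the number of unknowns, p the number of processors, and alpha the fixed per-processor problem size.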

123 citations


Journal Article
TL;DR: This paper gives an overview of self-tuning methods for a spectrum of memory management issues, ranging from traditional caching to exploiting distributed memory in a server cluster and speculative prefetching in a Web-based system.
Abstract: Although today's computers provide huge amounts of main memory, the ever-increasing load of large data servers, imposed by resource-intensive decision-support queries and accesses to multimedia and other complex data, often leads to memory contention and may result in severe performance degradation. Therefore, careful tuning of memory management is crucial for heavy-load data servers. This paper gives an overview of self-tuning methods for a spectrum of memory management issues, ranging from traditional caching to exploiting distributed memory in a server cluster and speculative prefetching in a Web-based system. The common, fundamental elements in these methods include on-line load tracking, near-future access prediction based on stochastic models and the available on-line statistics, and dynamic and automatic adjustment of control parameters in a feedback loop.

1 The Need for Memory Tuning

Although memory is relatively inexpensive and modern computer systems are amply equipped with it, memory contention on heavily loaded data servers is a common cause of performance problems. The reasons are threefold. First, servers operate with a multitude of complex software, ranging from the operating system to database systems, object request brokers, and application services; much of this software has been written so as to quickly penetrate the market rather than to optimize memory usage and other resource consumption. Second, the distinctive characteristic and key problem of a data server is that it operates in multi-user mode, serving many clients concurrently or in parallel; a server therefore needs to divide up its resources among the simultaneously active threads for executing queries, transactions, stored procedures, Web applications, etc. Often, multiple data-intensive decision-support queries compete for memory. Third, the data volumes that need to be managed by a server seem to be growing without limits. One part of this trend is that multimedia data types such as images, speech, or video have become more popular and are being merged into conventional-data applications (e.g., images or videos for insurance claims).

123 citations


Journal ArticleDOI
TL;DR: This paper compares different strategies for choosing a priori an approximate sparsity structure of A^{-1} and exactly determines the submatrices that are used in the SPAI algorithm to compute one new column of the sparse approximate inverse M.

114 citations


Journal ArticleDOI
TL;DR: This article examines the problem of the increasing processor-memory performance gap, which is now the primary obstacle to improved computer system performance.
Abstract: The rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM (Dynamic Random Access Memory) speed. So although the disparity between processor and memory speed is already an issue, it will be a much bigger one downstream. Hence computer designers are faced with an increasing processor-memory performance gap [1], which is now the primary obstacle to improved computer system performance. This article examines this problem as well as its various solutions.
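
The gap's effect is usually quantified with the average-memory-access-time identity, AMAT = hit time + miss rate x miss penalty (textbook background, not a formula taken from this article). A tiny C illustration with assumed numbers:

```c
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.  With a 1 ns cache
   hit, a 2% miss rate, and a 100 ns DRAM penalty, misses triple
   the effective access time, and the penalty term keeps growing
   as the processor-memory gap widens. */
int main(void) {
    double hit_ns = 1.0, miss_rate = 0.02, penalty_ns = 100.0;
    printf("AMAT = %.2f ns\n", hit_ns + miss_rate * penalty_ns);
    return 0;
}
```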

Journal ArticleDOI
Yutai Ma
TL;DR: The memory organization of FFT processors is considered and a new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations.
Abstract: The memory organization of FFT processors is considered. A new memory addressing assignment allows simultaneous access to all the data needed for butterfly calculations. The advantage of this memory addressing scheme lies in the fact that it reduces the delay of address generation by nearly half compared to existing ones.
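
The classic way to obtain such conflict-free butterfly access, shown here as background and not necessarily Ma's exact assignment, is to place the datum at address a in bank parity(a): the two operands of a radix-2 butterfly differ in exactly one address bit (the stage bit), so they always fall in different banks and can be fetched in the same cycle.

```c
#include <stdio.h>

/* Bank assignment by address parity: XOR of all address bits.
   Butterfly partners differ in exactly one bit, so their parities
   (and hence their banks) always differ. */
static int bank_of(unsigned addr) {
    int p = 0;
    while (addr) { p ^= addr & 1u; addr >>= 1; }
    return p;
}

int main(void) {
    unsigned a = 5, b = a ^ (1u << 2);   /* butterfly pair, stage 2 */
    printf("bank(a)=%d bank(b)=%d\n", bank_of(a), bank_of(b));
    return 0;
}
```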

Journal ArticleDOI
TL;DR: This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver; its performance is compared with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.
Abstract: We present a new parallel implementation of a divide and conquer algorithm for computing the spectral decomposition of a symmetric tridiagonal matrix on distributed memory architectures. The implementation we develop differs from other implementations in that we use a two-dimensional block cyclic distribution of the data, we use the Löwner theorem approach to compute orthogonal eigenvectors, and we introduce permutations before the back transformation of each rank-one update in order to make good use of deflation. This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver. Numerical results confirm the effectiveness of our algorithm. We compare performance of the algorithm with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.

Patent
15 Nov 1999
TL;DR: In this article, a data packet switching and server load balancing device is provided by a general-purpose multiprocessor computer system (10), which comprises a plurality of symmetrical processors coupled together by a common data bus.
Abstract: A data packet switching and server load balancing device is provided by a general-purpose multiprocessor computer system (10). The general-purpose multiprocessor computer system (10) comprises a plurality of symmetrical processors (24_0 ... 24_n) coupled together by a common data bus (12), a main memory (14) shared by the processors (24_0 ... 24_n), and a plurality of network interfaces (17_i ... 17_m) each adapted to be coupled to respective external networks for receiving and sending data packets via a particular communication protocol, such as Transmission Control Protocol/Internet Protocol (TCP/IP). A first one of the processors (24_0 ... 24_n) is adapted to serve as a control processor and the remaining ones of the processors (24_0 ... 24_n) are adapted to serve as data packet switching processors.

Proceedings ArticleDOI
12 Apr 1999
TL;DR: A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions.
Abstract: In this paper we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed memory system (SDSM). In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water; NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.
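
For concreteness, this is the kind of portable OpenMP code such a translator consumes; the same directive is meant to parallelize within an SMP node (POSIX threads) and across nodes (the modified TreadMarks SDSM). A toy example, not taken from the paper:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

/* A single standard OpenMP directive expresses the parallelism;
   the translator decides how it maps onto intra-node threads and
   inter-node software distributed shared memory. */
int main(void) {
    static double x[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = i * 0.5;
        sum += x[i];
    }
    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```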

Journal ArticleDOI
TL;DR: The BSPRAM model is used to simplify the description of the algorithms, and new memory-efficient BSP algorithms both for standard and for fast matrix multiplication are proposed.
Abstract: The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. Its modification, the BSPRAM model, allows one to combine the advantages of distributed and shared-memory style programming. In this paper we study the BSP memory complexity of matrix multiplication. We propose new memory-efficient BSP algorithms both for standard and for fast matrix multiplication. The BSPRAM model is used to simplify the description of the algorithms. The communication and synchronization complexity of our algorithms is slightly higher than that of known time-efficient BSP algorithms. The current time-efficient and new memory-efficient algorithms are connected by a continuous tradeoff.
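
For context, the standard BSP superstep cost, which is textbook background rather than a result of this paper:

```latex
T_{\mathrm{superstep}} \;=\; w \;+\; g\,h \;+\; l
```

where w is the maximum local work on any processor, h the maximum number of words any processor sends or receives, g the per-word communication cost, and l the barrier synchronization cost; total time is the sum over supersteps. The memory-efficient algorithms above accept slightly larger communication (g·h) and synchronization (l) terms in exchange for lower per-processor memory.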

Journal ArticleDOI
TL;DR: A mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of this algorithm that proves that the parallel algorithm has a provably fast runtime bound and executes correctly for a parallel random access machine (PRAM) computational model.
Abstract: Maintaining good mesh quality during the generation and refinement of unstructured meshes in finite-element applications is an important aspect in obtaining accurate discretizations and well-conditioned linear systems. In this article, we present a mesh-smoothing algorithm based on nonsmooth optimization techniques and a scalable implementation of this algorithm. We prove that the parallel algorithm has a provably fast runtime bound and executes correctly for a parallel random access machine (PRAM) computational model. We extend the PRAM algorithm to distributed memory computers and report results for two- and three-dimensional simplicial meshes that demonstrate the efficiency and scalability of this approach for a number of different test cases. We also examine the effect of different architectures on the parallel algorithm and present results for the IBM SP supercomputer and an ATM-connected network of SPARC Ultras.

Journal ArticleDOI
TL;DR: This work presents an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme, and demonstrates that its estimations closely follow the actual simulated performance at significantly reduced run times.
Abstract: Embedded processor-based systems allow for the tailoring of the on-chip memory architecture based on application specific requirements. We present an analytical strategy for exploring the on-chip memory architecture for a given application, based on a memory performance estimation scheme. The analytical technique has the important advantage of enabling a fast evaluation of candidate memory architectures in the early stages of system design. Many digital signal-processing applications involve array accesses and loop nests that can benefit from such an exploration. Our experiments demonstrate that our estimations closely follow the actual simulated performance at significantly reduced run times.

Patent
15 Sep 1999
TL;DR: In this paper, the authors propose a shared memory architecture for a high-performance shared-memory computer system, where each node has a nodal cache, nodal directory and nodal electronic switches to manage all transfers and data coherence among all cells in the same node and in different nodes.
Abstract: A novel structure for a highly scalable, high-performance shared-memory computer system having simplified manufacturability. The computer system contains a repetition of system cells, in which each cell is comprised of a processor chip and a memory subset (having memory chips such as DRAMs or SRAMs) connected to the processor chip by a local memory bus. A unique type of intra-nodal busing connects each system cell in each node to each other cell in the same node. The memory subsets in the different cells need not have equal sizes, and the different nodes need not have the same number of cells. Each node has a nodal cache, a nodal directory and nodal electronic switches to manage all transfers and data coherence among all cells in the same node and in different nodes. The collection of all memory subsets in the computer system comprises the system shared memory, in which data stored in any memory subset is accessible to the processors on the other processor chips in the system. Each location in the system shared memory has a unique real address, which may be used by any processor in the system. Thus, the same memory addresses may be used in the executable instructions of all processors in the system. The nodal directories automatically manage the coherence of all data being changed in all processor caches in the computer system, regardless of where the data is stored in the shared memory of the system and regardless of which cell contains the processor changing the data, thereby providing data coherence across all nodes in the computer system.

Journal ArticleDOI
TL;DR: For an important subclass of structured method parallelism, a scheduling methodology is presented that takes data redistributions between multiprocessor tasks into account and is designed for integration into a parallel compiler tool.

Journal ArticleDOI
01 Mar 1999
TL;DR: The results indicate that adaptation between single- and multiple-writer protocols and dynamic page aggregation are clearly beneficial; the results for adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
Abstract: We demonstrate the benefits of software shared memory protocols that adapt at run time to the memory access patterns observed in the applications. This adaptation is automatic (no user annotations are required) and does not rely on compiler support or special hardware. We investigate adaptation between single- and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation between invalidate and update. Our results indicate that adaptation between single- and multiple-writer protocols and dynamic page aggregation are clearly beneficial. The results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
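
Multiple-writer protocols of the kind investigated here (e.g., TreadMarks-style systems) typically keep a pristine "twin" of each writable page and, at synchronization time, encode only the words that changed as a "diff", so nodes writing disjoint parts of a page do not invalidate each other. A hedged C sketch of diff construction under that assumption (names and layout are illustrative):

```c
#include <stdio.h>
#include <string.h>

#define PAGE 4096

/* Compare the current page against its twin and record only the
   changed bytes; shipping this diff instead of the whole page is
   what lets concurrent writers of one page coexist. */
static size_t make_diff(const unsigned char *twin,
                        const unsigned char *page,
                        unsigned short *off, unsigned char *val) {
    size_t n = 0;
    for (size_t i = 0; i < PAGE; i++)
        if (twin[i] != page[i]) {
            off[n] = (unsigned short)i;   /* where it changed */
            val[n] = page[i];             /* new value */
            n++;
        }
    return n;                             /* changed-byte count */
}

int main(void) {
    static unsigned char twin[PAGE], page[PAGE];
    static unsigned short off[PAGE];
    static unsigned char val[PAGE];
    memcpy(page, twin, PAGE);
    page[7] = 42; page[99] = 9;           /* local writes */
    printf("diff entries: %zu\n", make_diff(twin, page, off, val));
    return 0;
}
```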

Book ChapterDOI
01 Jan 1999
TL;DR: This paper argues that immunological memory is in the same class of associative memories as Kanerva's Sparse Distributed Memory, Albus's Cerebellar Model Arithmetic Computer, and Marr's Theory of the Cerebellar Cortex.
Abstract: This paper argues that immunological memory is in the same class of associative memories as Kanerva's Sparse Distributed Memory, Albus's Cerebellar Model Arithmetic Computer, and Marr's Theory of the Cerebellar Cortex. This class of memories derives its associative and robust nature from a sparse sampling of a huge input space by recognition units (B and T cells in the immune system) and a distribution of the memory among many independent units (B and T cells in the memory population in the immune system).

Journal ArticleDOI
TL;DR: A new calculation schedule is proposed that reduces this buffer memory size with up to two orders of magnitude, while still ensuring a number of external (off-chip) memory accesses that is very close to the theoretical minimum.
Abstract: This paper addresses the problem of minimizing memory size and memory accesses in multiresolution texture coding architectures for discrete cosine transform (DCT) and wavelet-based schemes used, for example, in virtual-world walk-throughs or facial animation scenes of an MPEG-4 system. The problem of minimizing the memory cost is important since memory accesses, memory bandwidth limitations, and in general the correct handling of the data flows have become the true critical issues in designing high-speed and low-power video-processing architectures and in efficiently using multimedia processors. For instance, the straightforward implementation of a multiresolution texture codec typically needs an extra memory buffer of the same size as the image to be encoded/decoded. We propose a new calculation schedule that reduces this buffer memory size with up to two orders of magnitude, while still ensuring a number of external (off-chip) memory accesses that is very close to the theoretical minimum. The analysis is generic and is therefore useful for both wavelet and multiresolution DCT codecs.

Proceedings ArticleDOI
01 May 1999
TL;DR: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping.
Abstract: This paper presents some techniques for efficient thread forking and joining in parallel execution environments, taking into consideration the physical structure of NUMA machines and the support for multi-level parallelization and processor grouping. Two work generation schemes and one join mechanism are designed, implemented, evaluated and compared with the ones used in the IRIX MP library, an efficient implementation which supports a single level of parallelism. Supporting multiple levels of parallelism is a current research goal, both in shared and distributed memory machines. Our proposals include a first work generation scheme (GWD, or global work descriptor) which supports multiple levels of parallelism, but not processor grouping. The second work generation scheme (LWD, or local work descriptor) has been designed to support multiple levels of parallelism and processor grouping. Processor grouping is needed to distribute processors among different parts of the computation and maintain the working set of each processor across different parallel constructs. The mechanisms are evaluated using synthetic benchmarks, two SPEC95fp applications and one NAS application. The performance evaluation concludes that: i) the overhead of the proposed mechanisms is similar to the overhead of the existing ones when exploiting a single level of parallelism, and ii) a remarkable improvement in performance is obtained for applications that have multiple levels of parallelism. The comparison with the traditional single-level parallelism exploitation gives an improvement in the range of 30-65% for these applications.

Journal ArticleDOI
01 Jan 1999
TL;DR: A survey of the key developments in shared virtual memory research is provided, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework.
Abstract: Shared virtual memory, a technique for supporting a shared address space in software on parallel systems, has undergone a decade of research, with significant maturing of protocols and communication layers having now been achieved. We provide a survey of the key developments in this research, placing the multitrack flow of ideas and results obtained so far in a comprehensive new framework. Four major research tracks are covered: relaxed consistency models; protocol laziness; architectural support; and application-driven research. Several related avenues are also discussed, such as fine grained software coherence, software protocols across multiprocessor nodes, and performance scalability. We summarize comparative performance results from the literature, discuss their limitations, and identify lessons learned so far, key outstanding questions, and important directions for future research in this area.

Patent
26 Aug 1999
TL;DR: In this article, a flexible probe command/response routing scheme is proposed for a computer system with multiple processing nodes coupled to separate memories, which may form a distributed memory system. The scheme handles both read and write transactions; for writes, the target may determine when to commit the write data to memory and may receive any dirty data to be merged with the write data.
Abstract: A computer system may include multiple processing nodes, one or more of which may be coupled to separate memories which may form a distributed memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system. Particularly, the computer system may implement a flexible probe command/response routing scheme. The scheme may employ an indication within the probe command which identifies a receiving node to receive the probe responses. For example, probe commands indicating that the target or the source of the transaction should receive probe responses corresponding to the transaction may be included. Probe commands may specify the source of the transaction as the receiving node for read transactions (such that dirty data is delivered to the source node from the node storing the dirty data). On the other hand, for write transactions (in which data is being updated in memory at the target node of the transaction), the probe commands may specify the target of the transaction as the receiving node. In this manner, the target may determine when to commit the write data to memory and may receive any dirty data to be merged with the write data.

Book
01 Jan 1999
TL;DR: This paper studies the various options available to system designers to transparently decrease the fraction of data misses serviced remotely and proposes a hybrid scheme that combines hardware and software techniques.
Abstract: Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache-coherence and data-locality in compute-server workloads. First, we consider two protocols for providing base cache-coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. These schemes work on the same base platform with both user and kernel references from the workloads. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible.

Patent
Hubertus Franke, Mark E. Giampapa, Joefon Jann, Douglas J. Joseph, Pratap Pattnaik
23 Feb 1999
TL;DR: In this article, the authors propose a method and apparatus for sharing memory in a multiprocessor computing system, which provides a number of system buses with each bus being connected to a respective memory controller which controls a corresponding partition of the memory.
Abstract: A method and apparatus for sharing memory in a multiprocessor computing system. More specifically, this invention provides a number of system buses, with each bus being connected to a respective memory controller which controls a corresponding partition of the memory. Any one of the processors can use any one of the system buses to send real addresses to the connected memory controller, which then converts the real addresses into physical addresses corresponding to the partition of memory that is controlled by the receiving memory controller. The processors can be dynamically assigned to different partitions of the memory via a switching mechanism.

Journal ArticleDOI
01 Mar 1999
TL;DR: Results show that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives, and that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.
Abstract: A variety of models for parallel architectures, such as shared memory, message passing, and data flow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). Alewife, an early prototype of such DSM architectures, uses hybrid software and hardware mechanisms to support coherent shared memory, efficient user-level messaging, fine-grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.

Patent
31 Mar 1999
TL;DR: In this paper, the authors present a software executable on a computer having a working memory with demand-loadable components initially stored outside of the working memory, each component having an entry point including a constructor for an object.
Abstract: The invention is embodied in software executable on a computer having a working memory, with demand-loadable components initially stored outside of the working memory, each component having an entry point including a constructor for an object. Preferably, the demand-loadable components are initially provided in a memory within the computer or at a location external to the computer. A Namespace in the working memory provides access to the components as they become needed by applications running in the computer. The Namespace provides this access by managing demand-loading and unloading of the components in the working memory.
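
On POSIX systems, the closest standard analogue of this demand-loading pattern is dlopen/dlsym: a component is loaded into working memory only when an application first needs it, its constructor entry point is resolved by name, and it is unloaded when no longer needed. A hedged C sketch; the component file name and constructor symbol are hypothetical (link with -ldl):

```c
#include <stdio.h>
#include <dlfcn.h>

/* Demand-load a component into the process's working memory, call
   its constructor entry point, and unload it afterwards. */
typedef void *(*ctor_fn)(void);

int main(void) {
    void *h = dlopen("./libcomponent.so", RTLD_LAZY);  /* hypothetical */
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    ctor_fn ctor = (ctor_fn)dlsym(h, "component_create"); /* hypothetical */
    if (ctor) {
        void *obj = ctor();      /* construct the component's object */
        (void)obj;
    }
    dlclose(h);                  /* unload when no longer needed */
    return 0;
}
```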