
Showing papers by "Kai Li published in 1996"


Journal ArticleDOI
TL;DR: This article presents the design, implementation, and performance of a file system that integrates application-controlled caching, prefetching, and disk scheduling and shows that this combination of techniques greatly improves the performance of the file system.
Abstract: As the performance gap between disks and microprocessors continues to increase, effective utilization of the file cache becomes increasingly important. Application-controlled file caching and prefetching can apply application-specific knowledge to improve file cache management. However, supporting application-controlled file caching and prefetching is nontrivial because caching and prefetching need to be integrated carefully, and the kernel needs to allocate cache blocks among processes appropriately. This article presents the design, implementation, and performance of a file system that integrates application-controlled caching, prefetching, and disk scheduling. We use a two-level cache management strategy. The kernel uses the LRU-SP (Least-Recently-Used with Swapping and Placeholders) policy to allocate blocks to processes, and each process integrates application-specific caching and prefetching based on the controlled-aggressive policy, an algorithm previously shown in a theoretical sense to be nearly optimal. Each process also improves its disk access latency by submitting its prefetches in batches so that the requests can be scheduled to optimize disk access performance. Our measurements show that this combination of techniques greatly improves the performance of the file system. We measured that the running time is reduced by 3% to 49% (average 26%) for single-process workloads and by 5% to 76% (average 32%) for multiprocess workloads.
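
To make the batching idea concrete, here is a minimal Python sketch; the function names, the issue() callback, and the block numbers are made up for illustration and are not the paper's kernel code or API. Hinted prefetches are submitted in batches, each sorted by block number so the disk queue can service it with fewer seeks.

```python
# Minimal sketch only: batch_prefetch, issue, and the block numbers are
# illustrative, not the paper's kernel interfaces.
def batch_prefetch(hinted_blocks, batch_size, issue):
    """Submit hinted future accesses in batches, each sorted by block number
    so the requests can be scheduled to reduce disk seek time."""
    for i in range(0, len(hinted_blocks), batch_size):
        for block in sorted(hinted_blocks[i:i + batch_size]):
            issue(block)   # hand the request to the disk queue

# Example: eight hinted accesses issued as two sorted batches of four.
issued = []
batch_prefetch([17, 3, 42, 8, 25, 1, 30, 9], 4, issued.append)
print(issued)   # [3, 8, 17, 42, 1, 9, 25, 30]
```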

249 citations


Proceedings ArticleDOI
28 Oct 1996
TL;DR: Overlapped Home-based LRC (OHLRC) takes advantage of the communication processor found on each node of the Paragon to offload some of the protocol overhead of HLRC from the critical path followed by the compute processor, and the experiments show that OHLRC provides modest improvements over HLRC.
Abstract: This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large amount of memory it consumes for protocol overhead data, and because of the difficulty of garbage collecting that data. To achieve more scalable performance, we introduce and evaluate two new protocols. The first, Home-based LRC (HLRC), is based on the Automatic Update Release Consistency (AURC) protocol. Like AURC, HLRC maintains a home for each page to which all updates are propagated and from which all copies are derived. Unlike AURC, HLRC requires no specialized hardware support. We find that the use of homes provides substantial improvements in performance and scalability over LRC. Our second protocol, called Overlapped Home-based LRC (OHLRC), takes advantage of the communication processor found on each node of the Paragon to offload some of the protocol overhead of HLRC from the critical path followed by the compute processor. We find that OHLRC provides modest improvements over HLRC. We also apply overlapping to the base LRC protocol, with similar results. Our experiments were done using five of the Splash-2 benchmarks. We report overall execution times, as well as detailed breakdowns of elapsed time, message traffic, and memory use for each of the protocols.
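
The home-based idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HLRC implementation: each page has a fixed home node, a writer twins the page on first write, computes a diff at release and sends it to the home, and a faulting reader fetches the whole page from the home. Class and method names are made up.

```python
# Illustration only, not the HLRC implementation.
PAGE = 8   # words per page (illustrative)

class Home:
    """Holds the master copy of every page it is home for."""
    def __init__(self):
        self.pages = {}                  # page id -> list of words
    def apply_diff(self, pid, diff):
        for off, val in diff:
            self.pages[pid][off] = val
    def fetch(self, pid):
        return list(self.pages[pid])     # readers copy the page from the home

class Writer:
    """A non-home node under a home-based lazy release consistency model."""
    def __init__(self, home):
        self.home, self.copy, self.twin = home, {}, {}
    def read_fault(self, pid):
        self.copy[pid] = self.home.fetch(pid)
    def write(self, pid, off, val):
        if pid not in self.twin:                    # first write: make a twin
            self.twin[pid] = list(self.copy[pid])
        self.copy[pid][off] = val
    def release(self):
        for pid, twin in self.twin.items():         # diff twin vs. current copy
            diff = [(i, new) for i, (old, new) in
                    enumerate(zip(twin, self.copy[pid])) if old != new]
            self.home.apply_diff(pid, diff)          # propagate to the home
        self.twin.clear()

# Tiny usage example.
home = Home(); home.pages[0] = [0] * PAGE
w = Writer(home)
w.read_fault(0); w.write(0, 3, 99); w.release()
print(home.fetch(0))   # [0, 0, 0, 99, 0, 0, 0, 0]
```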

214 citations


Proceedings ArticleDOI
24 Jun 1996
TL;DR: Systems that maintain coherence at large granularity, such as shared virtual memory systems, suffer from false sharing and extra communication and Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity.
Abstract: Systems that maintain coherence at large granularity, such as shared virtual memory systems, suffer from false sharing and extra communication. Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity. Release Consistency (RC) and Lazy Release Consistency (LRC) are accepted to offer a reasonable tradeoff between performance and programming complexity. Entry Consistency (EC) offers a more relaxed consistency model, but it requires explicit association of shared data objects with synchronization variables. The programming burden of providing such associations can be substantial.
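
To make the "explicit association" burden concrete, here is a small hedged sketch; GuardedLock is a hypothetical name and no real Entry Consistency protocol runs here. The point is only that the programmer must tell each synchronization variable which shared objects it guards, so an acquire can bring exactly those objects up to date.

```python
# Illustration only: GuardedLock is hypothetical and no consistency protocol
# is implemented; the extra data/lock association is the programming burden.
import threading

class GuardedLock:
    def __init__(self, *guarded_objects):
        self._lock = threading.Lock()
        self.guards = guarded_objects   # explicit data/synchronization binding
    def __enter__(self):
        self._lock.acquire()
        # An EC-style protocol would now fetch updates only for self.guards.
        return self
    def __exit__(self, *exc):
        # ...and propagate updates to the guarded objects before releasing.
        self._lock.release()
        return False

shared_counter = {"value": 0}
counter_lock = GuardedLock(shared_counter)   # every shared object needs this

with counter_lock:
    shared_counter["value"] += 1
```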

198 citations


01 Jan 1996
TL;DR: In this paper, the effects of several combined prefetching and caching strategies for systems with multiple disks are investigated using disk-accurate trace-driven simulation, exploring the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints.
Abstract: High-performance I/O systems depend on prefetching and caching in order to deliver good performance to applications. These two techniques have generally been considered in isolation, even though there are significant interactions between them; a block prefetched too early reduces the effectiveness of the cache, while a block cached too long reduces the effectiveness of prefetching. In this paper we study the effects of several combined prefetching and caching strategies for systems with multiple disks. Using disk-accurate trace-driven simulation, we explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints. Some of the strategies have been published with theoretical performance bounds, and some are components of systems that have been built. One is a new algorithm that combines the desirable characteristics of the others. We find that when performance is limited by I/O stalls, aggressive prefetching helps to alleviate the problem; that more conservative prefetching is appropriate when significant I/O stalls are not present; and that a single, simple strategy is capable of doing both.
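
A rough sketch of the replacement decision that full hints make possible; this is an assumption-laden illustration, not one of the paper's named algorithms. With the whole access sequence known in advance, the block whose next use lies farthest in the future is the natural eviction victim.

```python
# Illustrative only: the farthest-future-use replacement choice that full
# hints enable, not one of the paper's algorithms.
def farthest_future_victim(cache, hints, now):
    """Return the cached block whose next hinted use is farthest away."""
    def next_use(block):
        for t in range(now, len(hints)):
            if hints[t] == block:
                return t
        return float("inf")              # never referenced again
    return max(cache, key=next_use)

cache = {1, 2, 3}
hints = [1, 4, 2, 1, 3]                  # full advance knowledge of accesses
print(farthest_future_victim(cache, hints, now=1))   # 3 (needed last)
```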

126 citations


Patent
02 Feb 1996
TL;DR: User-level Direct Memory Access (UDMA), as discussed by the authors, reduces the typically high CPU-instruction overhead of operating a conventional direct memory access (DMA) controller to two user-level memory references.
Abstract: In a computer system, the typically high overhead of the CPU instructions required to operate a conventional direct memory access (DMA) controller is reduced to two user-level memory references via User-level Direct Memory Access (UDMA). The UDMA apparatus is located between the CPU and a DMA controller; the UDMA is programmed to use the existing virtual memory translation hardware of the associated computer system to perform permission checking and address translation without kernel involvement, and otherwise uses minimal kernel involvement for other operations.

122 citations


Proceedings ArticleDOI
03 Feb 1996
TL;DR: This paper proposes a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications and shows that the AURC approach can substantially improve the performance of LRC.
Abstract: Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several Splash-2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup has increased from 5.9 under LRC to 8.3 under AURC.
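
A conceptual sketch of the automatic-update idea only; the AURC protocol itself is not reproduced here, and the class below is a made-up stand-in for the SHRIMP hardware that forwards local writes to remote memory transparently.

```python
# Conceptual sketch only: a tiny write-through wrapper stands in for the
# automatic-update hardware; this is not the AURC protocol.
class AutoUpdatePage:
    def __init__(self, size, home_copy):
        self.local = [0] * size
        self.home = home_copy          # the remote home copy of this page
    def store(self, offset, value):
        self.local[offset] = value
        self.home[offset] = value      # "automatic update" to the home

home_copy = [0] * 4
page = AutoUpdatePage(4, home_copy)
page.store(2, 7)
print(home_copy)   # [0, 0, 7, 0] -- the home already reflects the local write
```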

119 citations


Proceedings ArticleDOI
28 Oct 1996
TL;DR: The effects of several combined prefetching and caching strategies for systems with multiple disks are studied using disk-accurate trace-driven simulation to explore the performance characteristics of each of the algorithms in cases in which applications provide full advance knowledge of accesses using hints.

116 citations


Proceedings ArticleDOI
01 Sep 1996
TL;DR: Experiments with several application programs show that the thread scheduling method can improve program performance by reducing second-level cache misses.
Abstract: This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valuable when compiler-directed tiling is not feasible. Experiments with several application programs, on two systems with different cache structures, show that our thread scheduling method can improve program performance by reducing second-level cache misses.
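
A minimal sketch of hint-driven ordering, assuming a hint is simply a key describing the data a fine-grained thread touches (for example a tile index); the names are illustrative and this is not the paper's scheduler.

```python
# Minimal sketch, not the paper's scheduler: run threads in hint order so
# threads that touch the same or nearby data execute back to back.
def schedule_by_hint(threads):
    """threads: list of (hint, callable) pairs supplied at creation time."""
    return [fn for _, fn in sorted(threads, key=lambda t: t[0])]

work = [(2, lambda: "tile 2"), (0, lambda: "tile 0"), (1, lambda: "tile 1")]
for fn in schedule_by_hint(work):
    print(fn())   # tile 0, tile 1, tile 2 -- neighbouring data stays cached
```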

107 citations


Proceedings ArticleDOI
03 Feb 1996
TL;DR: The UDMA mechanism uses existing virtual memory translation hardware to perform permission checking and address translation without kernel involvement to initiate DMA transfers of input/output data, with full protection, at a cost of only two user-level memory references.
Abstract: Traditional DMA requires the operating system to perform many tasks to initiate a transfer, with overhead on the order of hundreds or thousands of CPU instructions. This paper describes a mechanism, called User-level Direct Memory Access (UDMA), for initiating DMA transfers of input/output data, with full protection, at a cost of only two user-level memory references. The UDMA mechanism uses existing virtual memory translation hardware to perform permission checking and address translation without kernel involvement. The implementation of the UDMA mechanism is simple, requiring a small extension to the traditional DMA controller and minimal operating system kernel support. The mechanism can be used with a wide variety of I/O devices including network interfaces, data storage devices such as disks and tape drives, and memory-mapped devices such as graphics frame-buffers. As an illustration, we describe how we used UDMA in building network interface hardware for the SHRIMP multicomputer.
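
A conceptual Python model of the two-reference idea; nothing here is the real UDMA hardware interface. Two "references" into a proxy object start a transfer, and membership in a mapped set stands in for the permission check the virtual memory translation hardware performs when the proxy page is touched.

```python
# Conceptual model only: names and checks are stand-ins, not the UDMA design.
class UDMAProxy:
    def __init__(self, mapped_buffers):
        self.mapped = set(mapped_buffers)   # buffers this process may access
        self.pending = None
        self.transfers = []
    def ref1(self, src, length):             # first user-level memory reference
        assert src in self.mapped, "fault: unmapped source, no permission"
        self.pending = (src, length)
    def ref2(self, dst):                      # second reference starts the DMA
        assert dst in self.mapped, "fault: unmapped destination, no permission"
        self.transfers.append((self.pending, dst))

proxy = UDMAProxy({"user_buf", "net_buf"})
proxy.ref1("user_buf", 4096)
proxy.ref2("net_buf")
print(proxy.transfers)   # [(('user_buf', 4096), 'net_buf')]
```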

93 citations


Proceedings ArticleDOI
15 Apr 1996
TL;DR: System software support for the VMMC model, including its API, operating system support, and software architecture for two network interfaces designed in the SHRIMP project, is described, showing that the VMMC model can indeed expose the available hardware performance to user programs.
Abstract: Virtual memory-mapped communication (VMMC) is a communication model providing direct data transfer between the sender's and receiver's virtual address spaces. This model eliminates operating system involvement in communication, provides full protection, supports user-level buffer management and zero-copy protocols, and minimizes software communication overhead. This paper describes system software support for the model including its API, operating system support, and software architecture, for two network interfaces designed in the SHRIMP project. Our implementations and experiments show that the VMMC model can indeed expose the available hardware performance to user programs. On two Pentium PCs with our prototype network interface hardware over a network, we have achieved user-to-user latency of 4.8 microseconds and sustained bandwidth of 23 MB/s, which is close to the peak hardware bandwidth. Software communication overhead is only a few user-level instructions.
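
A sketch under stated assumptions: ExportedBuffer and vmmc_send below are illustrative names, not the real VMMC API. The point is only that the sender deposits data directly into memory the receiver has exported, with no intermediate system buffer and no copy on the receive side.

```python
# Illustrative names only, not the VMMC API.
class ExportedBuffer:
    """A receive buffer the receiver exports for senders to import."""
    def __init__(self, size):
        self.mem = bytearray(size)

def vmmc_send(imported_buf, offset, data):
    # Direct deposit into the receiver's (virtual) memory: zero-copy receive.
    imported_buf.mem[offset:offset + len(data)] = data

recv_buf = ExportedBuffer(64)        # receiver exports this region
vmmc_send(recv_buf, 0, b"hello")     # sender writes straight into it
print(bytes(recv_buf.mem[:5]))       # b'hello'
```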

67 citations


Proceedings ArticleDOI
01 May 1996
TL;DR: It is found that SVM can indeed perform quite well for systems of at least up to 32 processors for several nontrivial applications; however, performance is much more variable across applications than on CC-NUMA systems, and the problem sizes needed to obtain good parallel performance are substantially larger.
Abstract: Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap, by studying the performance of a range of applications in detail and understanding it in light of application characteristics. We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches---Lazy Release Consistency (LRC) and Automatic Update Release Consistency (AURC)---with each other and with an all-hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less optimized program implementations. We find that SVM can indeed perform quite well for systems of at least up to 32 processors for several nontrivial applications. However, performance is much more variable across applications than on CC-NUMA systems, and the problem sizes needed to obtain good parallel performance are substantially larger. The hardware-assisted AURC system tends to perform significantly better than the all-software LRC under our system assumptions, particularly when realistic cache hierarchies are used.

Proceedings ArticleDOI
01 May 1996
TL;DR: The experience shows that the VMMC mechanism supports these message-passing interfaces well, and when zero-copy protocols are allowed by the semantics of the interface, VMMC can effectively deliver to applications almost all of the raw hardware's communication performance.
Abstract: The SHRIMP multicomputer provides virtual memory-mapped communication (VMMC), which supports protected, user-level message passing, allows user programs to perform their own buffer management, and separates data transfers from control transfers so that a data transfer can be done without the intervention of the receiving node CPU. An important question is whether such a mechanism can indeed deliver all of the available hardware performance to applications which use conventional message-passing libraries. This paper reports our early experience with message-passing on a small, working SHRIMP multicomputer. We have implemented several user-level communication libraries on top of the VMMC mechanism, including the NX message-passing interface, Sun RPC, stream sockets, and specialized RPC. The first three are fully compatible with existing systems. Our experience shows that the VMMC mechanism supports these message-passing interfaces well. When zero-copy protocols are allowed by the semantics of the interface, VMMC can effectively deliver to applications almost all of the raw hardware's communication performance.

Journal ArticleDOI
15 May 1996
TL;DR: Two integrated prefetching and caching strategies for multiple disks can achieve near linear speedup when the load is distributed evenly on the disks, and the best algorithm performs well even when the placement of blocks on disks distributes the load unevenly.
Abstract: Prefetching and caching are widely used approaches for improving the performance of file systems. A recent study showed that it is important to integrate the two, and proposed an algorithm that performs well both in theory and in practice [2, 1]. That study was restricted to the case of a single disk. Here, we study integrated prefetching and caching strategies for multiple disks. The interaction between caching and prefetching is further complicated when a system has multiple disks, not only because it is possible to do multiple prefetches in parallel, but also because appropriate cache replacement strategies can alleviate the load imbalance among the disks. We present two offline algorithms, one of which has provably near-optimal performance. Using trace-driven simulation, we evaluated these algorithms under a variety of data placement alternatives. Our results show that both algorithms can achieve near linear speedup when the load is distributed evenly on the disks, and our best algorithm performs well even when the placement of blocks on disks distributes the load unevenly. Our simulations also show that replicating data, even across all of the disks, offers little performance advantage over a striped layout if prefetching is done well. Finally, we evaluated online variations of the algorithms and show that the online algorithms perform well even with moderate advance knowledge of future file accesses.
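
A toy illustration of why load balance matters; this is not one of the paper's algorithms, and the block counts and placements are made up. When hinted blocks are prefetched in parallel, the busiest disk determines the total fetch time, which is why an even placement gives near-linear speedup and a skewed one does not.

```python
# Toy illustration only: the busiest disk bounds the parallel prefetch time.
from collections import Counter

def parallel_fetch_time(blocks, placement, fetch_time=1.0):
    """Time to prefetch all blocks when each disk works through its own set."""
    per_disk = Counter(placement(b) for b in blocks)
    return max(per_disk.values()) * fetch_time

blocks = list(range(16))
striped = lambda b: b % 4                      # even: 4 blocks on each of 4 disks
skewed = lambda b: 0 if b < 10 else 1 + b % 3  # disk 0 holds most of the blocks
print(parallel_fetch_time(blocks, striped))    # 4.0  (~4x speedup over one disk)
print(parallel_fetch_time(blocks, skewed))     # 10.0 (imbalance limits speedup)
```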

Patent
30 Sep 1996
TL;DR: In this article, the authors propose a method for improving the cache locality of an application executing in a computer system by decomposing the application into one or more threads and subsequently scheduling the execution of the threads such that a next thread to be executed is likely to reside in cache.
Abstract: A method for improving the cache locality of an application executing in a computer system by decomposing the application into one or more threads and subsequently scheduling the execution of the threads such that a next thread to be executed is likely to reside in cache. The method operates by identifying a tour of points through a k-dimensional space such that cache misses are minimized. The space is divided into a plurality of equally sized blocks and may be extended for application to multiple cache levels.
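
A hedged sketch of the general idea only; the patent's actual tour construction is not reproduced. Points in a 2-D space are binned into equally sized blocks, and threads are visited block by block so that consecutive threads touch nearby data.

```python
# Sketch only: block-by-block visiting order, not the patented tour method.
def blocked_tour(points, block):
    """Order 2-D points by the block they fall in, then within the block."""
    return sorted(points, key=lambda p: (p[0] // block, p[1] // block, p))

pts = [(0, 9), (7, 7), (1, 1), (6, 6), (0, 0)]
print(blocked_tour(pts, block=4))
# [(0, 0), (1, 1), (0, 9), (6, 6), (7, 7)] -- points in one block stay adjacent
```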

Proceedings ArticleDOI
12 Aug 1996
TL;DR: The design, implementation, and performance of the NX message-passing interface on the Shrimp multicomputer are described; by exploiting Shrimp's virtual memory-mapped communication facility, the implementation performs buffer management at user level without using a special message-passing processor and requires no CPU intervention upon message arrival in the common cases.
Abstract: This paper describes the design, implementation and performance of the NX message-passing interface on the Shrimp multicomputer. Unlike traditional methods, our implementation, exploiting Shrimp's virtual memory-mapped communication facility, performs buffer management at user level without using a special message-passing processor, and requires no CPU intervention upon message arrival in the common cases. For a four-byte message, our implementation achieves a user-to-user latency of 12 microseconds, about a factor of four smaller than that on the Intel Paragon. For large messages, our implementation quickly approaches the bandwidth limit imposed by the Shrimp hardware.