
Showing papers on "Cache published in 1995"


Patent
13 Oct 1995
TL;DR: In this article, a method and apparatus for reconstructing data in a computer system employing a modified RAID 5 data protection scheme is described; the computer system includes a write-back cache of non-volatile memory that stores outstanding writes together with their metadata.
Abstract: Disclosed is a method and apparatus for reconstructing data in a computer system employing a modified RAID 5 data protection scheme. The computer system includes a write-back cache composed of non-volatile memory for storing (1) writes outstanding to a device and the associated read data, and (2) metadata information. The metadata includes a first field containing the logical block number or address (LBN or LBA) of the data, a second field containing the device ID, and a third field containing the block status. From the metadata information it is determined where the write was intended when the crash occurred. An examination is made to determine whether parity is consistent across the slice; if not, the data in the non-volatile write-back cache is used to reconstruct the write that was occurring when the crash occurred to ensure consistent parity, so that only those blocks affected by the crash have to be reconstructed.
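
A minimal sketch of the recovery idea described above, assuming hypothetical names and a drastically simplified stripe model (single-byte blocks, the device ID used directly as an index into the stripe): the cached metadata names the interrupted write, parity is checked across the slice, and only an inconsistent stripe is repaired from the non-volatile cache.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class CacheMetadata:
    lbn: int        # logical block number / address (LBN or LBA) of the cached write
    device_id: int  # device the write was outstanding to
    status: str     # block status, e.g. "write-pending"

def parity_consistent(data_blocks, parity_block):
    """Check the RAID 5 invariant: XOR of the data blocks equals the parity block."""
    return reduce(lambda a, b: a ^ b, data_blocks, 0) == parity_block

def recover_stripe(meta, data_blocks, parity_block, cached_write):
    """After a crash, repair only a stripe whose parity is inconsistent by
    replaying the write recorded in the non-volatile cache."""
    if parity_consistent(data_blocks, parity_block):
        return data_blocks, parity_block              # nothing to repair
    repaired = list(data_blocks)
    repaired[meta.device_id] = cached_write           # replay the interrupted write
    new_parity = reduce(lambda a, b: a ^ b, repaired, 0)
    return repaired, new_parity

# Device 1's write was interrupted mid-stripe; the write-back cache still holds it.
meta = CacheMetadata(lbn=42, device_id=1, status="write-pending")
print(recover_stripe(meta, [0x0F, 0x00, 0x33], parity_block=0x55, cached_write=0xAA))
```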

1,069 citations


Proceedings ArticleDOI
03 Dec 1995
TL;DR: This paper shows how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses.
Abstract: The underutilization of disk parallelism and file cache buffers by traditional file systems induces I/O stall time that degrades the performance of modern microprocessor-based systems. In this paper, we present aggressive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies a cost-benefit analysis to allocate buffers where they will have the greatest impact. We implemented informed prefetching and caching in DEC's OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks running a range of applications including text search, 3D scientific visualization, relational database queries, speech recognition, and computational chemistry. Informed prefetching reduces the execution time of the first four of these applications by 20% to 87%. Informed caching reduces the execution time of the fifth application by up to 30%.
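
A toy sketch of the cost-benefit allocation idea, not the paper's actual estimator: each competing demand reports an assumed marginal benefit per buffer, and buffers are handed out greedily to whichever use currently promises the largest reduction in stall time. The demand names and benefit curves below are purely illustrative.

```python
def allocate_buffers(total_buffers, demands):
    """demands: name -> function(buffers_already_held) -> estimated stall time
    saved by granting one more buffer to that use."""
    held = {name: 0 for name in demands}
    for _ in range(total_buffers):
        # Give the next buffer to the use with the largest marginal benefit.
        best = max(demands, key=lambda name: demands[name](held[name]))
        held[best] += 1
    return held

demands = {
    # Prefetching hinted blocks: large benefit until the prefetch depth
    # covers roughly one disk latency, then diminishing returns.
    "prefetch_hinted": lambda n: 10.0 / (n + 1),
    # Caching hinted blocks for later reuse: moderate, slowly diminishing benefit.
    "cache_hinted":    lambda n: 4.0 / (n + 1) ** 0.5,
    # LRU caching for unhinted accesses: benefit tied to expected reuse.
    "lru_unhinted":    lambda n: 6.0 / (n + 1),
}

print(allocate_buffers(16, demands))   # split determined by the assumed benefit curves
```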

770 citations


Patent
18 Aug 1995
TL;DR: In this paper, access rights of users of a computer network with respect to data entities are specified by a relational database stored on one or more security servers, and an access rights cache on each application server caches the access rights lists of the users that are connected to the respective application server, so that user access rights to specific data entities can rapidly be determined.
Abstract: Access rights of users of a computer network with respect to data entities are specified by a relational database stored on one or more security servers. Application servers on the network that provide user access to the data entities generate queries to the relational database in order to obtain access rights lists of specific users. An access rights cache on each application server caches the access rights lists of the users that are connected to the respective application server, so that user access rights to specific data entities can rapidly be determined. Each user-specific access rights list includes a series of category identifiers plus a series of access rights values. The category identifiers specify categories of data entities to which the user has access, and the access rights values specify privilege levels of the users with respect to the corresponding data entity categories. The privilege levels are converted into specific access capabilities by application programs running on the application servers.
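
A minimal sketch, with hypothetical names, of the per-application-server cache the patent describes: a user's category identifiers and access rights values are fetched from the security server once on a miss and then answered locally.

```python
class AccessRightsCache:
    def __init__(self, fetch_from_security_server):
        self._fetch = fetch_from_security_server    # query issued on a cache miss
        self._cache = {}                            # user_id -> {category: privilege level}

    def privilege(self, user_id, category):
        if user_id not in self._cache:              # miss: query the security server once
            categories, levels = self._fetch(user_id)
            self._cache[user_id] = dict(zip(categories, levels))
        return self._cache[user_id].get(category, 0)   # 0 = no access to that category

    def invalidate(self, user_id):
        """Drop a user's entry, e.g. when rights change or the user disconnects."""
        self._cache.pop(user_id, None)

# Stand-in for the relational-database query on the security server.
def fake_query(user_id):
    return ["invoices", "reports"], [3, 1]          # category ids, privilege levels

cache = AccessRightsCache(fake_query)
print(cache.privilege("alice", "invoices"))   # 3, fetched from the server then cached
print(cache.privilege("alice", "payroll"))    # 0, no access rights for that category
```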

548 citations


Journal ArticleDOI
TL;DR: The results show that the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, the benefits are greater when the hardware assist augments small on-chip caches, and the lookahead scheme is the preferred one cost-performance wise.
Abstract: Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. The three designs differ mostly on the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency time ahead of the real program counter and that is used as the control mechanism to generate the prefetches. Finally, the correlated scheme uses a more sophisticated design to detect patterns across loop levels. These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one cost-performance-wise.
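
A simplified software model of the RPT idea, covering only the "basic" scheme: entries are indexed by the load's instruction address and record the last data address and observed stride; once the same stride repeats, the next address is predicted and could be prefetched one iteration ahead. The per-entry state machine, the finite cache-like table, and the lookahead/correlated variants are omitted here.

```python
class ReferencePredictionTable:
    def __init__(self):
        # In hardware the RPT is a finite table indexed like an instruction cache;
        # a dict keyed by the load's PC stands in for it here.
        self.table = {}   # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        """Record a data access; return an address worth prefetching, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, stride, confirmed = self.table[pc]
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride          # stride confirmed: predict the next reference
        self.table[pc] = (addr, new_stride, False)
        return None

rpt = ReferencePredictionTable()
for i in range(4):                         # a[i] accessed with an 8-byte stride
    addr = 0x1000 + 8 * i
    prefetch = rpt.access(pc=0x400, addr=addr)
    print(hex(addr), "->", hex(prefetch) if prefetch else None)
```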

543 citations


18 Jul 1995
TL;DR: This work assesses the potential of proxy servers to cache documents retrieved with the HTTP protocol, and finds that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users.
Abstract: As the number of World-Wide Web users grows, so does the number of connections made to servers. This increases both network load and server load. Caching can reduce both loads by migrating copies of server files closer to the clients that use those files. Caching can either be done at a client or in the network (by a proxy server or gateway). We assess the potential of proxy servers to cache documents retrieved with the HTTP protocol. We monitored traffic corresponding to three types of educational workloads over a one-semester period, and used this as input to a cache simulation. Our main findings are (1) that with our workloads a proxy has a 30-50% maximum possible hit rate no matter how it is designed; (2) that when the cache is full and a document is replaced, least recently used (LRU) is a poor policy, but simple variations can dramatically improve hit rate and reduce cache size; (3) that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users; and (4) that certain tuning configuration parameters for a cache may have little benefit.
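
A toy trace-driven proxy-cache simulator in the spirit of this study: plain LRU versus one simple size-aware variation (not caching documents above a size threshold). The threshold policy is only an example of the kind of "simple variation" mentioned, not necessarily the paper's exact one, and the trace is invented.

```python
from collections import OrderedDict

def simulate(trace, capacity_bytes, max_doc_size=None):
    cache, used, hits = OrderedDict(), 0, 0          # url -> size, kept in LRU order
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)                   # refresh LRU position
            continue
        if max_doc_size is not None and size > max_doc_size:
            continue                                 # variation: don't cache huge documents
        while used + size > capacity_bytes and cache:
            _, evicted_size = cache.popitem(last=False)   # evict least recently used
            used -= evicted_size
        if size <= capacity_bytes:
            cache[url] = size
            used += size
    return hits / len(trace)

trace = [("/a", 10), ("/b", 20), ("/big", 90)] * 2 + [("/a", 10), ("/b", 20)]
print("plain LRU      hit rate:", simulate(trace, capacity_bytes=100))                   # 0.0: the big doc thrashes the cache
print("size-threshold hit rate:", simulate(trace, capacity_bytes=100, max_doc_size=50))  # 0.5: small hot documents survive
```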

495 citations


Proceedings ArticleDOI
01 Jun 1995
TL;DR: This paper presents a new algorithm for choosing problem-size-dependent tile sizes based on the cache size and cache line size for a direct-mapped cache that eliminates both capacity and self-interference misses and reduces cross-interference misses.
Abstract: When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.
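
An illustrative brute-force tile-height chooser for a direct-mapped cache, not the paper's (much cheaper) algorithm: it picks the tallest tile of full-cache-line width whose memory lines all map to distinct cache sets, i.e. a tile free of self-interference. All cache parameters below are assumptions.

```python
def tile_is_conflict_free(tile_rows, tile_cols, n, elem_size, cache_size, line_size):
    """True if no two distinct memory lines of the tile map to the same cache set."""
    num_sets = cache_size // line_size
    sets_used = {}
    for i in range(tile_rows):
        for j in range(tile_cols):
            addr = (i * n + j) * elem_size           # row-major offset of a[i][j]
            line = addr // line_size
            s = line % num_sets
            if sets_used.setdefault(s, line) != line:
                return False                         # two different lines, same set
    return True

def choose_tile(n, elem_size=8, cache_size=16 * 1024, line_size=32):
    tile_cols = line_size // elem_size               # one full cache line per tile row
    for tile_rows in range(n, 0, -1):                # try the tallest tile first
        if tile_rows * tile_cols * elem_size <= cache_size and \
           tile_is_conflict_free(tile_rows, tile_cols, n, elem_size, cache_size, line_size):
            return tile_rows, tile_cols
    return 1, tile_cols

print(choose_tile(n=300))    # rows of a 300-wide array map to distinct sets: tall tile
print(choose_tile(n=1024))   # power-of-two leading dimension: severe self-interference
```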

434 citations


Patent
Stevan Charles Allen1
05 Dec 1995
TL;DR: In this paper, a data set preference/requirement parameter hierarchy is established for each data set, listing each parameter from a "most important" parameter to a "least important" one.
Abstract: A method and system for automatically allocating space within a data storage system for multiple data sets which may include units of data, databases, files or objects. Each data set preferably includes a group of associated preference/requirement parameters which are arranged in a hierarchical order and then compared to corresponding data storage system characteristics for available devices. The data set preference/requirement parameters may include performance, size, availability, location, portability, share status and other attributes which affect data storage system selection. Data storage systems may include solid-state memory, disk drives, tape drives, and other peripheral storage systems. Data storage system characteristics may thus represent available space, cache, performance, portability, volatility, location, cost, fragmentation, and other characteristics which address user needs. The data set preference/requirement parameter hierarchy is established for each data set, listing each parameter from a "most important" parameter to a "least important" parameter. Each attempted storage of a data set will result in an analysis of all available data storage systems and the creation of a linked chain of available data storage systems representing an ordered sequence of preferred data storage systems. Data storage system selection is then performed utilizing this preference chain, which includes all candidate storage systems.
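
A compact sketch, with hypothetical attribute names, of the preference-chain idea: preferences are listed from most to least important, each candidate storage system is scored attribute by attribute with importance-weighted credit, and the result is an ordered chain of candidates to try.

```python
def build_preference_chain(data_set_prefs, devices):
    """data_set_prefs: list of (attribute, wanted_value), ordered most -> least important.
    devices: list of dicts describing available storage systems.
    Returns the devices ordered into a preference chain."""
    def score(device):
        # Earlier (more important) preferences carry exponentially larger weight.
        return sum(2 ** (len(data_set_prefs) - i)
                   for i, (attr, wanted) in enumerate(data_set_prefs)
                   if device.get(attr) == wanted)
    return sorted(devices, key=score, reverse=True)

devices = [
    {"name": "ssd-pool", "performance": "high", "cache": True,  "portable": False},
    {"name": "tape-lib", "performance": "low",  "cache": False, "portable": True},
    {"name": "raid-5",   "performance": "high", "cache": True,  "portable": False},
]
prefs = [("performance", "high"), ("cache", True), ("portable", True)]
chain = build_preference_chain(prefs, devices)
print([d["name"] for d in chain])   # ordered sequence of preferred storage systems
```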

433 citations


Proceedings ArticleDOI
04 May 1995
TL;DR: This work presents an architecture that allows a Web server to autonomously replicate HTML pages and proposes geographical push-caching as a way of bringing the server back into the loop.
Abstract: Most wide-area caching schemes are client initiated. Decisions on when and where to cache information are made without the benefit of the server's global knowledge of the situation. We believe that the server should play a role in making these caching decisions, and we propose geographical push-caching as a way of bringing the server back into the loop. The World Wide Web is an excellent example of a wide-area system that will benefit from geographical push-caching, and we present an architecture that allows a Web server to autonomously replicate HTML pages.

371 citations


Book
01 Jan 1995
TL;DR: A new technique for evaluating cache coherent, shared-memory computers and the Wisconsin Wind Tunnel (WWT) is developed, which correctly interleaves target machine events and calculates target program execution time.
Abstract: We have developed a new technique for evaluating cache coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel shared-memory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation to accurately calculate program execution time. WWT is a virtual prototype that exploits similarities between the system under design (the target) and an existing evaluation platform (the host). The host directly executes all target program instructions and memory references that hit in the target cache. WWT's shared memory uses the CM-5 memory's error-correcting code (ECC) as valid bits for a fine-grained extension of shared virtual memory. Only memory references that miss in the target cache trap to WWT, which simulates a cache-coherence protocol. WWT correctly interleaves target machine events and calculates target program execution time. WWT runs on parallel computers with greater speed and memory capacity than uniprocessors. WWT's simulation time decreases as target system size increases for fixed-size problems and holds roughly constant as the target system and problem scale.

338 citations


Proceedings ArticleDOI
23 Apr 1995
TL;DR: This paper examines performance and power trade-offs in cache designs and the effectiveness of energy reduction for several novel cache design techniques targeted for low power.
Abstract: Caches consume a significant amount of energy in modern microprocessors. To design an energy-efficient microprocessor, it is important to optimize cache energy consumption. This paper examines performance and power trade-offs in cache designs and the effectiveness of energy reduction for several novel cache design techniques targeted for low power.

335 citations



Proceedings Article
01 Jan 1995
TL;DR: The results show that tight WCET bounds can be obtained by using the revised timing schema approach, which accurately accounts for the timing effects of pipelined execution and cache memory not only within but also across program constructs.
Abstract: An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipeline hazards, and these factors impose serious problems in analyzing the WCETs of tasks. To analyze the timing effects of RISC's pipelined execution and cache memory, we propose extensions to the original timing schema where the timing information associated with each program construct is a simple time-bound. In our approach, associated with each program construct is a worst case timing abstraction (WCTA), which contains detailed timing information of every execution path that might be the worst case execution path of the program construct. This extension leads to a revised timing schema that is similar to the original timing schema except that concatenation and pruning operations on WCTAs are newly defined to replace the add and max operations on time-bounds in the original timing schema. Our revised timing schema accurately accounts for the timing effects of pipelined execution and cache memory not only within but also across program constructs. This paper also reports on preliminary results of WCET analysis for a RISC processor. Our results show that tight WCET bounds (within a maximum of about 30% overestimation) can be obtained by using the revised timing schema approach.
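
An illustrative toy model of the revised timing schema: each construct carries a set of candidate worst-case path timings (a WCTA), sequencing constructs concatenates path timings pairwise, and pruning discards paths that can no longer become the worst case. Real WCTAs also carry pipeline and cache state at path boundaries, which this sketch ignores, so here only the largest bound truly matters.

```python
def concatenate(wcta_a, wcta_b):
    """Combine every path timing of construct A with every path timing of B."""
    return [a + b for a in wcta_a for b in wcta_b]

def prune(wcta, keep=4):
    """Keep only paths that might still become the worst case. Without pipeline
    and cache state only the single largest bound matters; a small set is kept
    here to mirror the structure of the real analysis."""
    return sorted(wcta, reverse=True)[:keep]

# Two successive constructs, each with several candidate worst-case paths (in cycles).
stmt1 = [12, 9, 15]        # e.g. an if-statement with three feasible paths
stmt2 = [20, 22]           # e.g. a small loop body with two feasible paths
combined = prune(concatenate(stmt1, stmt2))
print(combined)            # [37, 35, 34, 32] -> WCET bound for the sequence is 37
```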

01 Jan 1995
TL;DR: The technology described, known as HP AutoRAID, automatically and transparently manages migration of data blocks between these two levels as access patterns change, resulting in a fully redundant storage system that is extremely easy to use, is suitable for a wide variety of workloads, and is largely insensitive to dynamic workload changes.
Abstract: Configuring redundant disk arrays is a black art. To properly configure an array, a system administrator must understand the details of both the array and the workload it will support; incorrect understanding of either, or changes in the workload over time, can lead to poor performance. We present a solution to this problem: a two-level storage hierarchy implemented inside a single disk-array controller. In the upper level of this hierarchy, two copies of active data are stored to provide full redundancy and excellent performance. In the lower level, RAID 5 parity protection is used to provide excellent storage cost for inactive data, at somewhat lower performance. The technology we describe in this paper, known as HP AutoRAID, automatically and transparently manages migration of data blocks between these two levels as access patterns change. The result is a fully-redundant storage system that is extremely easy to use, suitable for a wide variety of workloads, largely insensitive to dynamic workload changes, and that performs much better than disk arrays with comparable numbers of spindles and much larger amounts of front-end cache. Because the implementation of the HP AutoRAID technology is almost entirely in embedded software, the additional hardware cost for these benefits is very small. We describe the HP AutoRAID technology in detail, and provide performance data for an embodiment of it in a prototype storage array, together with the results of simulation studies used to choose algorithms used in the array.
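
A toy model of the two-level migration idea (all capacities and policies here are hypothetical): recently written blocks stay mirrored for performance, and the coldest blocks are demoted to RAID 5 to save space, transparently to readers. The real array migrates larger units and acts under space pressure rather than on a fixed block count.

```python
class TwoLevelArray:
    def __init__(self, hot_capacity=4):
        self.mirrored = {}          # block -> data: full redundancy, fast writes
        self.raid5 = {}             # block -> data: parity-protected, cheap space
        self.hot_capacity = hot_capacity
        self.lru = []               # least recently written blocks at the front

    def write(self, block, data):
        if block in self.raid5:                        # promote a block on update
            del self.raid5[block]
        self.mirrored[block] = data
        if block in self.lru:
            self.lru.remove(block)
        self.lru.append(block)
        while len(self.mirrored) > self.hot_capacity:  # demote the coldest block
            cold = self.lru.pop(0)
            self.raid5[cold] = self.mirrored.pop(cold)

    def read(self, block):
        return self.mirrored.get(block, self.raid5.get(block))

array = TwoLevelArray()
for b in range(6):
    array.write(b, f"data{b}")
array.write(0, "data0-v2")          # block 0 becomes hot again and is promoted
print(sorted(array.mirrored), sorted(array.raid5))
```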

Patent
14 Jul 1995
TL;DR: In this article, an on-line multiple media viewer system provides an interactive presentation at a client viewing station of multiple media content retrieved over a remote connection from a server at which the content resides using a set of client-initiated and server-driven remote services for anticipatory caching of media content.
Abstract: An on-line multiple media viewer system provides a responsive interactive presentation at a client viewing station of multiple media content retrieved over a remote connection from a server at which the content resides using a set of client-initiated and server-driven remote services for anticipatory caching of media content. In response to an initial request for an item of media content from the server, the remote services predict additional items of media content likely to be requested and transmit these items in advance of their request. Transmitted items are cached by services at the client viewing station in a cache storage. The client checks the cache storage before making additional requests for transfer over the remote connection. The items are transmitted in multi-channel asynchronous operations over the remote connection.

Proceedings ArticleDOI
A. Goldberg1, J. Trotter1
02 Oct 1995
TL;DR: This paper describes how to combine simple hardware support and sampling techniques to obtain empirical data on memory system behavior without appreciably perturbing system performance.
Abstract: Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are recoding their algorithms to exploit memory systems. All of these groups need empirical data on memory system behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. The idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the Sun Sparc 10/41.

Book
01 Apr 1995
TL;DR: The book discusses Cluster Hardware Structures, Symmetric Multiprocessors, "NUMA," and Clusters, and explains why the concept of Cluster is so important.
Abstract: I. WHAT ARE CLUSTERS, AND WHY USE THEM?
1. Introduction. Working Harder. Working Smarter. Getting Help. The Road to Lowly Parallel Processing. A Neglected Paradigm. What is to Come.
2. Examples. Beer & Subpoenas. Serving the Web. The Farm. Fermilab. Other Compute Clusters. Full System Clusters. Cluster Software Products. Basic (Availability) Clusters. Not the End.
3. Why Clusters? The Standard Litany. Why Now? Why Not Now? Commercial Node Performance. The Need for High Availability.
4. Definition, Distinctions, and Initial Comparisons. Definition. Distinction from Parallel Systems. Distinctions from Distributed Systems. Concerning "Single System Image." Other Comparisons. Reactions.
II. HARDWARE.
5. A Cluster Bestiary. Exposed vs. Enclosed. "Glass-House" vs. "Campus-Wide" Cluster. Cluster Hardware Structures. Communication Requirements. Cluster Acceleration Techniques.
6. Symmetric Multiprocessors. What is an SMP? What is a Cache, and Why Is It Necessary? Memory Contention. Cache Coherence. Sequential and Other Consistencies. Input/Output. Summary.
7. NUMA and Friends. UMA, NUMA, NORMA, and CC-NUMA. How CC-NUMA Works. The "N" in CC-NUMA. Software Implications. Other CC-NUMA Implications. Is "NUMA" Inevitable? Great Big CC-NUMA. Simple COMA.
III. SOFTWARE.
8. Workloads. Why Discuss Workloads? Serial: Throughput. Parallel. Amdahl's Law. The Point of All This.
9. Basic Programming Models and Issues. What is a Programming Model? The Sample Problem. Uniprocessor. Shared Memory. Message-Passing. CC-NUMA. SIMD and All That. Importance.
10. Commercial Programming Models. Small N vs. Large N. Small N Programming Models. Large-N I/O Programming Models. Large-N Processor-Memory Models. Shared Disk or not Shared Disk?
11. Single System Image. Single System Image Boundaries. Single System Image Levels. The Application and Subsystem Levels. The Operating System Kernel Levels. Hardware Levels. SSI and System Management.
IV. SYSTEMS.
12. High Availability. What Does "High Availability" Mean? The Basic Idea: Failover. Resources. Failing Over Data. Failing Over Communications. Towards Instant Failover. Failover to Where? Lock Data Reconstruction. Heartbeats, Events, and Failover Processing. System Structure. Related Issues.
13. Symmetric Multiprocessors, "NUMA," and Clusters. Preliminaries. Performance. Cost. High Availability. Other Issues. Partitioning. Conclusion.
14. Why We Need the Concept of Cluster. Benchmarks. Development Directions. Confusion of Issues. The Lure of Large Numbers.
15. Conclusion. Cluster Operating Systems. Exploitation. Standards. Software Pricing. What About 2010?
Coda: The End of Parallel Computer Architecture. Annotated Bibliography. Index. About the Author.

Patent
Hans Hurvig1
07 Jun 1995
TL;DR: In this article, a file allocation and management system for a multi-user network environment is described, where at least one server and two or more clients are disposed along the network and communicate via a request/response transfer protocol.
Abstract: A file allocation and management system for a multi-user network environment is disclosed. At least one server and two or more clients are disposed along the network and communicate via a request/response transfer protocol. Files directed for shared usage among the clients along the network are stored at the server. Each client is adapted to communicate with the server through a plurality of identifier sockets, wherein a first identifier socket is configured for bi-directional communication and a second identifier socket is configured for uni-directional communications initiated by the server. Files normally stored at the server may, under appropriate circumstances, be temporarily stored in an internal cache or other memory at each client location when the file is in use.

Proceedings ArticleDOI
01 Dec 1995
TL;DR: The bare minimum amount of local memories that programs require to run without delay is measured by using the Value Reuse Profile, which contains the dynamic value reuse information of a program's execution, and by assuming the existence of efficient memory systems.
Abstract: As processor performance continues to improve, more emphasis must be placed on the performance of the memory system. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. We show that by selectively applying cache line allocation according to the characteristics of individual load instructions, overall performance can be improved for both the data cache and the memory system. This approach can improve some aspects of memory performance by as much as 60 percent on existing executables.
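
A toy cache model of the selective-allocation idea: loads judged to have little reuse are marked "no-allocate", so their misses return data without installing a line, leaving room for lines that are reused. The marking policy here (a fixed set of load PCs) is purely illustrative.

```python
class SelectiveAllocCache:
    def __init__(self, capacity, no_allocate_pcs):
        self.capacity = capacity
        self.lines = []                       # cached block addresses, LRU at the front
        self.no_allocate_pcs = no_allocate_pcs
        self.hits = self.misses = 0

    def load(self, pc, block):
        if block in self.lines:
            self.hits += 1
            self.lines.remove(block)
            self.lines.append(block)          # move to the MRU position
            return
        self.misses += 1
        if pc in self.no_allocate_pcs:
            return                            # bypass: don't displace useful lines
        if len(self.lines) == self.capacity:
            self.lines.pop(0)                 # evict the LRU line
        self.lines.append(block)

cache = SelectiveAllocCache(capacity=2, no_allocate_pcs={0x500})
for block in (1, 2, 1, 2):                    # reusable working set from load pc 0x400
    cache.load(0x400, block)
cache.load(0x500, 99)                         # streaming load: miss, but no allocation
cache.load(0x400, 1)                          # still a hit thanks to the bypass
print(cache.hits, cache.misses)               # 3 hits, 3 misses; block 99 displaced nothing
```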

Proceedings ArticleDOI
03 Dec 1995
TL;DR: The paper presents a detailed decomposition of execution time for important kernel services in the three workloads, finding that disk I/O is the first-order bottleneck for workloads such as program development and transaction processing and its importance continues to grow over time.
Abstract: Computer systems are rapidly changing. Over the next few years, we will see wide-scale deployment of dynamically-scheduled processors that can issue multiple instructions every clock cycle, execute instructions out of order, and overlap computation and cache misses. We also expect clock-rates to increase, caches to grow, and multiprocessors to replace uniprocessors. Using SimOS, a complete machine simulation environment, this paper explores the impact of the above architectural trends on operating system performance. We present results based on the execution of large and realistic workloads (program development, transaction processing, and engineering compute-server) running on the IRIX 5.3 operating system from Silicon Graphics Inc. Looking at uniprocessor trends, we find that disk I/O is the first-order bottleneck for workloads such as program development and transaction processing. Its importance continues to grow over time. Ignoring I/O, we find that the memory system is the key bottleneck, stalling the CPU for over 50% of the execution time. Surprisingly, however, our results show that this stall fraction is unlikely to increase on future machines due to increased cache sizes and new latency hiding techniques in processors. We also find that the benefits of these architectural trends spread broadly across a majority of the important services provided by the operating system. We find the situation to be much worse for multiprocessors. Most operating systems services consume 30-70% more time than their uniprocessor counterparts. A large fraction of the stalls are due to coherence misses caused by communication between processors. Because larger caches do not reduce coherence misses, the performance gap between uniprocessor and multiprocessor performance will increase unless operating system developers focus on kernel restructuring to reduce unnecessary communication. The paper presents a detailed decomposition of execution time (e.g., instruction execution time, memory stall time separately for instructions and data, synchronization time) for important kernel services in the three workloads.

Book
01 Mar 1995
TL;DR: Non-blocking caches and prefetching caches are evaluated as two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses, and a hybrid design based on the combination of these two hardware-based schemes is proposed.
Abstract: Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data in the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimization to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.

Proceedings ArticleDOI
01 May 1995
TL;DR: The results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%.
Abstract: This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks---which eliminate both invalidation and acknowledgment messages---for a total reduction in messages of up to 26%.
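
A highly simplified model of the self-invalidation mechanism: when a block is handed out with a flag predicting that another processor will soon write it, the caching processor drops that copy at its next synchronization point on its own, so no invalidation message is ever sent for it. The prediction heuristic and protocol details below are placeholders.

```python
class DSICache:
    def __init__(self):
        self.blocks = {}            # addr -> (data, self_invalidate_flag)

    def fill(self, addr, data, self_invalidate):
        """Install a block; the directory sets self_invalidate when it predicts
        a conflicting write by another processor."""
        self.blocks[addr] = (data, self_invalidate)

    def at_sync_point(self):
        """Invalidate flagged blocks locally; no coherence messages are exchanged."""
        dropped = [a for a, (_, flag) in self.blocks.items() if flag]
        for a in dropped:
            del self.blocks[a]
        return dropped

cache = DSICache()
cache.fill(0x100, "shared, read-mostly data", self_invalidate=False)
cache.fill(0x200, "migratory data written by others", self_invalidate=True)
print("dropped at barrier:", [hex(a) for a in cache.at_sync_point()])
print("still cached:", [hex(a) for a in cache.blocks])
```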

Journal ArticleDOI
01 Oct 1995
TL;DR: A taxonomy of different cache invalidation strategies is proposed, and the impact of clients' disconnection times on their performance is studied to improve further the efficiency of the invalidation techniques described.
Abstract: In the mobile wireless computing environment of the future, a large number of users, equipped with low-powered palmtop machines, will query databases over wireless communication channels. Palmtop-based units will often be disconnected for prolonged periods of time, due to battery power saving measures; palmtops also will frequently relocate between different cells, and will connect to different data servers at different times. Caching of frequently accessed data items will be an important technique that will reduce contention on the narrow-bandwidth wireless channel. However, cache invalidation strategies will be severely affected by the disconnection and mobility of the clients. The server may no longer know which clients are currently residing under its cell, and which of them are currently on. We propose a taxonomy of different cache invalidation strategies, and study the impact of clients' disconnection times on their performance. We study ways to improve further the efficiency of the invalidation techniques described. We also describe how our techniques can be implemented over different network environments.
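
A small sketch of one broadcast-invalidation strategy of the kind this taxonomy covers: the server periodically broadcasts the identifiers of items updated during a fixed window, and a client that has been disconnected longer than the window can no longer trust its cache and drops it entirely. The window length and item names are illustrative only.

```python
WINDOW = 10   # an invalidation report covers updates from the last 10 time units

def apply_invalidation_report(cache, last_heard, now, updated_ids):
    """cache: dict item_id -> value held by the mobile client."""
    if now - last_heard > WINDOW:
        cache.clear()                 # slept too long: reports may have been missed
        return "cache dropped"
    for item in updated_ids:          # normal case: invalidate only the changed items
        cache.pop(item, None)
    return "cache filtered"

client_cache = {"stock:IBM": 101.5, "stock:HP": 42.0}
print(apply_invalidation_report(client_cache, last_heard=95, now=100,
                                updated_ids={"stock:HP"}), client_cache)
print(apply_invalidation_report(client_cache, last_heard=80, now=100,
                                updated_ids={"stock:IBM"}), client_cache)
```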

Proceedings ArticleDOI
03 Dec 1995
TL;DR: In this paper, the authors describe how the Coda File System has evolved to exploit weak connectivity, in the form of intermittent, low-bandwidth, or expensive networks, in mobile computing.
Abstract: Weak connectivity, in the form of intermittent, low-bandwidth, or expensive networks is a fact of life in mobile computing. In this paper, we describe how the Coda File System has evolved to exploit such networks. The underlying theme of this evolution has been the systematic introduction of adaptivity to eliminate hidden assumptions about strong connectivity. Many aspects of the system, including communication, cache validation, update propagation and cache miss handling have been modified. As a result, Coda is able to provide good performance even when network bandwidth varies over four orders of magnitude - from modem speeds to LAN speeds.

Journal ArticleDOI
TL;DR: In this paper, the authors propose an extension to the original timing schema where the timing information associated with each program construct is a simple time bound, and they show that tight WCET bounds can be obtained by using the revised timing schema approach.
Abstract: An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipeline hazards, and these factors impose serious problems in analyzing the WCETs of tasks. To analyze the timing effects of RISC's pipelined execution and cache memory, we propose extensions to the original timing schema where the timing information associated with each program construct is a simple time bound. In our approach, associated with each program construct is a worst case timing abstraction (WCTA), which contains detailed timing information of every execution path that might be the worst case execution path of the program construct. This extension leads to a revised timing schema that is similar to the original timing schema except that concatenation and pruning operations on WCTAs are newly defined to replace the add and max operations on time bounds in the original timing schema. Our revised timing schema accurately accounts for the timing effects of pipelined execution and cache memory not only within but also across program constructs. The paper also reports on preliminary results of WCET analysis for a RISC processor. Our results show that tight WCET bounds (within a maximum of about 30% overestimation) can be obtained by using the revised timing schema approach.

Patent
10 Jul 1995
TL;DR: In this article, the superscalar microprocessor is presented, which includes an integer functional unit and a floating-point functional unit that share a high performance main data processing bus.
Abstract: A superscalar microprocessor is provided which includes an integer functional unit and a floating point functional unit that share a high performance main data processing bus. The integer unit and the floating point unit also share a common reorder buffer, register file, branch prediction unit and load/store unit which all reside on the same main data processing bus. Instruction and data caches are coupled to a main memory via an internal address data bus which handles communications therebetween. An instruction decoder is coupled to the instruction cache and is capable of decoding multiple instructions per microprocessor cycle. Instructions are dispatched from the decoder in speculative order, issued out-of-order and completed out-of-order. Instructions are retired from the reorder buffer to the register file in-order. The functional units of the microprocessor desirably accommodate operands exhibiting multiple data widths. High performance and efficient use of the microprocessor die size are achieved by the sharing architecture of the disclosed superscalar microprocessor.

Patent
23 May 1995
TL;DR: An apparatus and method are presented that enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications: a cache-flushing parameter is transferred from the host computer to a controller having a cache memory, and write request data is written from the cache memory to a storage medium in accordance with that parameter.
Abstract: An apparatus and method are disclosed which enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications. The method includes the step of generating a cache-flushing parameter in the host computer. The cache-flushing parameter is then transferred from the host computer to a controller which has a cache memory. Thereafter, a quantity of write request data is written from the cache memory to a storage medium in accordance with the cache-flushing parameter.

Proceedings ArticleDOI
05 Dec 1995
TL;DR: This paper describes an approach for bounding the worst-case performance of large code segments on machines that exploit both pipelining and instruction caching, and a graphical user interface is invoked that allows a user to request timing predictions on portions of the program.
Abstract: Recently designed machines contain pipelines and caches. While both features provide significant performance advantages, they also pose problems for predicting execution time of code segments in real-time systems. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst-case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program.
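
A small illustration of how static instruction-cache categorizations can feed a worst-case loop bound (the categories and the miss penalty are assumptions, and pipeline overlap between instructions is ignored): "always-hit" instructions cost their base cycles, "always-miss" instructions pay the miss penalty every iteration, and "first-miss" instructions pay it only on the first iteration.

```python
MISS_PENALTY = 10   # assumed instruction-cache miss penalty, in cycles

def loop_wcet(instructions, iterations):
    """instructions: list of (base_cycles, category) for one loop iteration."""
    per_iteration = 0
    first_iteration_extra = 0
    for cycles, category in instructions:
        per_iteration += cycles
        if category == "always-miss":
            per_iteration += MISS_PENALTY          # paid on every iteration
        elif category == "first-miss":
            first_iteration_extra += MISS_PENALTY  # paid once, then the line stays cached
    return iterations * per_iteration + first_iteration_extra

body = [(1, "first-miss"), (1, "always-hit"), (2, "always-miss"), (1, "always-hit")]
print(loop_wcet(body, iterations=100))   # 100 * (5 + 10) + 10 = 1510 cycles
```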

Patent
Clark French1, Peter W. White1
11 Dec 1995
TL;DR: In this paper, the authors describe a client/server database system with improved methods for performing database queries, particularly DSS-type queries, which includes one or more Clients (e.g., Terminals or PCs) connected via a Network to a Server.
Abstract: A Client/Server Database System with improved methods for performing database queries, particularly DSS-type queries, is described. The system includes one or more Clients (e.g., Terminals or PCs) connected via a Network to a Server. In general operation, Clients store data in and retrieve data from one or more database tables resident on the Server by submitting SQL commands, some of which specify "queries"--criteria for selecting particular records of a table. The system implements methods for storing data vertically (i.e., by column), instead of horizontally (i.e., by row) as is traditionally done. Each column comprises a plurality of "cells" (i.e., column value for a record), which are arranged on a data page in a contiguous fashion. By storing data in a column-wise basis, the system can process a DSS query by bringing in only those columns of data which are of interest. Instead of retrieving row-based data pages consisting of information which is largely not of interest to a query, column-based pages can be retrieved consisting of information which is mostly, if not completely, of interest to the query. The retrieval itself can be done using more-efficient large block I/O transfers. The system includes data compression which is provided at the level of Cache or Buffer Managers, thus providing on-the-fly data compression in a manner which is transparent to each object. Since vertical storage of data leads to high repetition on a given data page, the system provides improved compression/decompression.
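
A minimal illustration, with a hypothetical schema, of why the column-wise layout helps DSS queries: a query touching two of five columns reads only those columns' pages, and the repetition within a column compresses well. Run-length encoding stands in here for the buffer-manager-level compression the patent describes.

```python
orders = {                                    # each column stored contiguously ("cells")
    "region":  ["east", "east", "east", "west", "west", "west"],
    "status":  ["open", "open", "closed", "open", "closed", "closed"],
    "amount":  [120, 80, 200, 50, 75, 60],
    "item":    ["a", "b", "c", "d", "e", "f"],
    "comment": ["", "", "rush", "", "", ""],
}

def rle(column):
    """Run-length encode one column page as (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# SELECT SUM(amount) WHERE region = 'east' touches only two of the five columns.
total = sum(a for r, a in zip(orders["region"], orders["amount"]) if r == "east")
print("sum(amount) for east:", total)                 # 400
print("region column compressed:", rle(orders["region"]))
```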

Proceedings ArticleDOI
05 Jun 1995
TL;DR: The results suggest that distinguishing between documents produced locally and those produced remotely can provide useful leverage in designing caching policies, because of differences in the potential for sharing these two document types among multiple users.
Abstract: With the increasing demand for document transfer services such as the World Wide Web comes a need for better resource management to reduce the latency of documents in these systems. To address this need, we analyze the potential for document caching at the application level in document transfer services. We have collected traces of actual executions of Mosaic, reflecting over half a million user requests for WWW documents. Using those traces, we study the tradeoffs between caching at three levels in the system, and the potential for use of application-level information in the caching system. Our traces show that while a high hit rate in terms of URLs is achievable, a much lower hit rate is possible in terms of bytes, because most profitably-cached documents are small. We consider the performance of caching when applied at the level of individual user sessions, at the level of individual hosts, and at the level of a collection of hosts on a single LAN. We show that the performance gain achievable by caching at the session level (which is straightforward to implement) is nearly all of that achievable at the LAN level (where caching is more difficult to implement). However, when resource requirements are considered, LAN level caching becomes much more desirable, since it can achieve a given level of caching performance using a much smaller amount of cache space. Finally, we consider the use of organizational boundary information as an example of the potential for use of application-level information in caching. Our results suggest that distinguishing between documents produced locally and those produced remotely can provide useful leverage in designing caching policies, because of differences in the potential for sharing these two document types among multiple users.

Journal ArticleDOI
TL;DR: Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
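
A sketch of the adaptive idea: the prefetch degree (number of consecutive blocks fetched on a miss) is periodically raised or lowered according to a measured usefulness ratio of recent prefetches. The thresholds, interval, and degree limits below are illustrative, not the paper's values.

```python
class AdaptiveSequentialPrefetcher:
    def __init__(self, degree=1, interval=4):
        self.degree = degree        # consecutive blocks prefetched on each miss
        self.interval = interval    # prefetches issued between adjustments
        self.issued = 0
        self.useful = 0             # prefetched blocks later referenced by the processor

    def on_miss(self, block):
        self._maybe_adjust()
        prefetches = [block + k for k in range(1, self.degree + 1)]
        self.issued += len(prefetches)
        return prefetches

    def on_useful_prefetch(self):
        self.useful += 1

    def _maybe_adjust(self):
        if self.issued < self.interval:
            return
        ratio = self.useful / self.issued
        if ratio > 0.75 and self.degree < 8:
            self.degree += 1        # prefetches are being used: be more aggressive
        elif ratio < 0.25 and self.degree > 1:
            self.degree -= 1        # mostly wasted bandwidth: back off
        self.issued = self.useful = 0

pf = AdaptiveSequentialPrefetcher()
for miss in (100, 200, 300, 400, 500):
    for _ in pf.on_miss(miss):      # pretend every prefetched block is later used
        pf.on_useful_prefetch()
print("prefetch degree after a useful phase:", pf.degree)   # grows from 1 to 2
```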